Tidy Data
In a tidy dataset each variable is saved in its own column and each observation is saved in its own row. Common operations to tidy up datasets are: find and drop empty rows, columns or duplicates, impute data, remove unwanted characters.
Drop a row if ANY values are missing
df = df.dropna()
Drop a row only if ALL values are missing
df = df.dropna(how='all')
Drop a column if ANY values are missing
df = df.dropna(axis=1)
Drop a column only if ALL values are missing
df = df.dropna(how='all', axis=1)
Drop duplicate rows keeping first occurrence
df = df.drop_duplicates()
Drop duplicate rows keeping last occurrence
df = df.drop_duplicates(keep='last')
Drop all duplicates
df = df.drop_duplicates(keep=False)
Drop duplicate rows identifying duplicates on the giving columns
df = df.drop_duplicates(subset=['column_x', 'column_y'])
Find duplicates based on giving columns
dup_df = df[df.duplicated(subset=['column_x', 'column_y'], keep=False)]
Count duplicates rows
df.duplicated().sum()
Count duplicates on column
df.column_x.duplicated().sum()
Replace all missing values with 0
df = df.fillna(value = 0)
Replace missing values with 0 on one column
df.column_x = df.column_x.fillna(value=0)
Replace missing values on multiple columns
df = df.fillna({
'col1': 'missing',
'col2': '99.999',
'col3': '999',
'col4': 'missing',
'col5': 'missing',
'col6': '99'
})
Replace missing values with forward fill (propagate non-null values forward)
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df_filled = df.fillna(method='ffill')
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
>>> df_filled
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace missing values with backward fill (propagate non-null values backward)
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
# use NEXT valid observation to fill gap/NAN
df_filled = df.fillna(method='bfill')
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
>>> df_filled
A B C D
0 3.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN 3.0 NaN 5
3 NaN 3.0 NaN 4
Replace all instances of a value in a column
df.column_y = df.column_y.replace('old_string', 'new_string')
Count missing values in data
df.isnull().sum().sum()
Count missing values on each column / return Series
df.isnull().sum()
Count missing values on specific column
df.column_x.isnull().sum()
Percentage of missing values in data
# count of missing diveded by total data cells
(df.isnull().sum().sum()) / (df.shape[0] * df.shape[1]) * 100
Show rows with missing values on column
import pandas as pd
df[df.column_x.isnull()]
# or
df[pd.isnull(df.column_x)]
Show rows with no missing values on column
df[df.column_x.notnull()]
Convert all NaNs to None
df = df.where(pd.notnull(df), None)
Convert column to numeric (will error if column has non-numeric values)
df.column_x = pd.to_numeric(df.column_x)
# to avoid error and convert non-numeric to NaN
pd.to_numeric(df.column_x, errors='coerce')
Change data type of DataFrame column
df.column_x = df.column_x.astype('int64')
#to ignore errors
df.column_x = df.column_x.astype('int64', errors='ignore')
# pandas dtypes: float64 (float), int64 (integer), datetime64 (datetime), object (string)
Convert column to datetime (will error if column has datetime values)
df.column_x = pd.to_datetime(df.column_x)
# to avoid error and convert non-datetime to NaT
df.column_x = pd.to_datetime(df.column_x, errors='coerce')
# specify format of input data
# format can be any code of strftime directive: http://strftime.org/
df.column_x = pd.to_datetime(df.column_x, errors='coerce', format='%Y%m%d')
Split delimited values in a DataFrame column into two new columns
# `col` column may look like `string1:string2`
df['new_c1'], df['new_c2'] = zip(*df['col'].apply(lambda x: x.split(':', 2)))