Tidy Data

In a tidy dataset, each variable is stored in its own column and each observation in its own row. Common operations for tidying a dataset are: finding and dropping empty rows, columns, or duplicates; imputing missing data; and removing unwanted characters.
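
A minimal, hypothetical sketch of these operations on a small example frame (the column names and values below are invented for illustration):

import pandas as pd
import numpy as np

# hypothetical messy frame: a missing value, a duplicate row, stray characters
df = pd.DataFrame({
    'city': ['Rome', 'Rome', ' Milan', np.nan],
    'sales': ['100', '100', '250$', '90']
})

df = df.dropna()               # drop rows with missing values
df = df.drop_duplicates()      # drop duplicate rows
df.city = df.city.str.strip()  # remove unwanted whitespace
df.sales = pd.to_numeric(df.sales.str.replace('$', '', regex=False))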

Drop a row if ANY value is missing

df = df.dropna()

Drop a row only if ALL values are missing

df = df.dropna(how='all')

Drop a column if ANY value is missing

df = df.dropna(axis=1)

Drop a column only if ALL values are missing

df = df.dropna(how='all', axis=1)

Drop duplicate rows keeping first occurrence

df = df.drop_duplicates()

Drop duplicate rows keeping last occurrence

df = df.drop_duplicates(keep='last')

Drop all duplicates

df = df.drop_duplicates(keep=False)

Drop duplicate rows, identifying duplicates based on the given columns

df = df.drop_duplicates(subset=['column_x', 'column_y'])

Find duplicates based on given columns

dup_df = df[df.duplicated(subset=['column_x', 'column_y'], keep=False)]

Count duplicate rows

df.duplicated().sum()

Count duplicates on column

df.column_x.duplicated().sum()

Replace all missing values with 0

df = df.fillna(value=0)

Replace missing values with 0 on one column

df.column_x = df.column_x.fillna(value=0)

Replace missing values on multiple columns

df = df.fillna({
    'col1': 'missing',
    'col2': '99.999',
    'col3': '999',
    'col4': 'missing',
    'col5': 'missing',
    'col6': '99'
})
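# keys are column names; values are the per-column fill values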

Replace missing values with forward fill (propagate non-null values forward)

import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                  [3, 4, np.nan, 1],
                  [np.nan, np.nan, np.nan, 5],
                  [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

df_filled = df.fillna(method='ffill')
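# note: df.ffill() is equivalent; the method= keyword is deprecated in recent pandas versions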

>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

>>> df_filled
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4

Replace missing values with backward fill (propagate non-null values backward)

import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                  [3, 4, np.nan, 1],
                  [np.nan, np.nan, np.nan, 5],
                  [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# use NEXT valid observation to fill gaps/NaN
df_filled = df.fillna(method='bfill')
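# note: df.bfill() is equivalent; the method= keyword is deprecated in recent pandas versions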

>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

>>> df_filled
     A    B   C  D
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  3.0 NaN  5
3  NaN  3.0 NaN  4

Replace all instances of a value in a column

df.column_y = df.column_y.replace('old_string', 'new_string')
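# note: replace() matches whole cell values; for substring replacement use df.column_y.str.replace('old_string', 'new_string')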

Count missing values in data

df.isnull().sum().sum()

Count missing values on each column (returns a Series)

df.isnull().sum()

Count missing values on specific column

df.column_x.isnull().sum()

Percentage of missing values in data

# count of missing divided by total number of data cells
(df.isnull().sum().sum()) / (df.shape[0] * df.shape[1]) * 100
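# df.size gives the same total cell count as df.shape[0] * df.shape[1]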

Show rows with missing values on column

import pandas as pd

df[df.column_x.isnull()]
# or
df[pd.isnull(df.column_x)]

Show rows with no missing values on column

df[df.column_x.notnull()]

Convert all NaNs to None

df = df.where(pd.notnull(df), None)
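# note: columns that end up holding None are upcast to object dtype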

Convert column to numeric (will error if column has non-numeric values)

df.column_x = pd.to_numeric(df.column_x)
# to avoid errors and convert non-numeric values to NaN
df.column_x = pd.to_numeric(df.column_x, errors='coerce')

Change data type of DataFrame column

df.column_x = df.column_x.astype('int64')
# to ignore errors
df.column_x = df.column_x.astype('int64', errors='ignore')
# pandas dtypes: float64 (float), int64 (integer), datetime64 (datetime), object (string)

Convert column to datetime (will error if column has values that cannot be parsed as datetimes)

df.column_x = pd.to_datetime(df.column_x)
# to avoid error and convert non-datetime to NaT
df.column_x = pd.to_datetime(df.column_x, errors='coerce')
# specify the format of the input data
# format accepts any strftime directive: http://strftime.org/
df.column_x = pd.to_datetime(df.column_x, errors='coerce', format='%Y%m%d')

Split delimited values in a DataFrame column into two new columns

# `col` column may look like `string1:string2`
df['new_c1'], df['new_c2'] = zip(*df['col'].apply(lambda x: x.split(':', 1)))
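
An equivalent, arguably more idiomatic sketch uses the vectorised string accessor (same hypothetical `col` column, splitting on the first `:` only):

# split once and expand directly into two new columns
df[['new_c1', 'new_c2']] = df['col'].str.split(':', n=1, expand=True)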