Reshape Data
Common functions to reshape pandas DataFrames are: concat, append, merge, merge_asof, transpose. They are used to combine together rows and columns or bring the dataset to its tidy (long) form.
Concatenate two DataFrames together, rows of df1 followed by rows of df2
df = pd.concat([df1, df2])
Concatenate two DataFrames together, df2 to the right side of df1
df = pd.concat([df1, df2], axis=1)
Concatenate multiple DataFrames together, appending rows horizontally
frames = [process_input_file(f) for f in files ]
df = pd.concat(frames)
Append a single row to a DataFrame by passing a Series
df_new = df.append(s1, ignore_index=True)
Append a single row to a DataFrame by passing a Series
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
s1 = pd.Series(['X0', '100'], index=['Name', 'Age'])
>>> print(s1)
Name X0
Age 100
dtype: object
df_new = df.append(s1, ignore_index=True)
>>> print(df_new)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 X0 100
Append a single row to a DataFrame by passing a dict
df_new = df.append(dict1, ignore_index=True)
Append a single row to a DataFrame by passing a dict
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
dicts = [{'Name': 'X1', 'Age': 100},
{'Name': 'X2', 'Age': 200}]
>>> dicts
[{'Name': 'X1', 'Age': 100}, {'Name': 'X2', 'Age': 200}]
df_new = df.append(dicts, ignore_index=True, sort=False)
>>> df_new
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 X1 100
4 X2 200
Inner Join two DataFrames on common key column
# Use intersection of keys from both frames
df = pd.merge(df_left df_right, on='column_id')
# identical with
df = pd.merge(df_left df_right, left_on='column_id', right_on='column_id')
Left Join two DataFrames
# Use keys from left frame only
df = pd.merge(df_left df_right, left_on='column_x', right_on='column_y', how='left')
Right Join two DataFrames
# Use keys from right frame only
df = pd.merge(df_left df_right, left_on='column_x', right_on='column_y', how='right')
Full Outer Join two DataFrames on common key column
# Use union of keys from both frames
df = pd.merge(df_left df_right, on='column_id', how='outer')
Inner Join two DataFrames on index
df = pd.merge(df_left df_right, left_index=True, right_index=True)
Join two DataFrames on nearest key (asof merge)
# This is similar to a left-join except that we match on nearest key rather than equal keys.
# Both DataFrames must be sorted by the key.
df = pd.merge_asof(df_left df_right, on='column_id', by='column_to_match')
Join two DataFrames on nearest key (asof merge)
# This is similar to a left-join except that we match on nearest key rather than equal keys.
# Both DataFrames must be sorted by the key.
>>> trades
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
>>> quotes
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
df_merge_asof = pd.merge_asof(trades, quotes, on='time', by='ticker')
>>> df_merge_asof
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
Transpose data frame (rows become columns, columns become rows)
df.T
Pivot rows to columns (long format to wide), one value
# giving DataFrame df
>>> df
date ticker open close
0 2020-03-01 AAPL 100 90
1 2020-03-02 AAPL 110 120
2 2020-03-01 IBM 200 205
3 2020-03-02 IBM 205 210
# pivot distinct values of column ticker as new columns
df1 = df.pivot_table(index=['date'], columns='ticker', values='close')
# remove columns name
df1.columns.name = ''
# bring `date` as column
df1 = df1.reset_index()
df1
>>> df1
date AAPL IBM
0 2020-03-01 90 205
1 2020-03-02 120 210
Pivot rows to columns (long format to wide), multiple values
# giving DataFrame df
>>> df
date ticker open close
0 2020-03-01 AAPL 100 90
1 2020-03-02 AAPL 110 120
2 2020-03-01 IBM 200 205
3 2020-03-02 IBM 205 210
df1 = df.pivot_table(index=['date'], columns='ticker', values=['open', 'close'])
# flatten columns multi-index
df1.columns = ['_'.join(col).strip() for col in df1.columns.values]
# bring `date` as column
df1 = df1.reset_index()
>>> df1
date close_AAPL close_IBM open_AAPL open_IBM
0 2020-03-01 90 205 100 200
1 2020-03-02 120 210 110 205
Pivot table with totals
# giving DataFrame df
>>> df
Province City Sales
0 ON Toronto 13
1 ON Toronto 1
2 QC Montreal 6
3 BC Vancouver 16
4 AL Calgary 8
5 AL Edmonton 4
6 MN Winnipeg 3
7 ON Windsor 1
df1 = pd.pivot_table(df, index=['Province'], columns=['City'], values=['Sales'],
aggfunc=np.sum, fill_value=0, margins=True)
# `index` - distinct values of cols provided will go on rows
# `columns` - distinct values of cols provided will go on columns
# `values` = data cells
# `aggfunc` = aggregation function(s)
# `margins=True` = display grand totals
>>> df1
Sales
City Calgary Edmonton Montreal Toronto Vancouver Windsor Winnipeg All
Province
AL 8 4 0 0 0 0 0 12
BC 0 0 0 0 16 0 0 16
MN 0 0 0 0 0 0 3 3
ON 0 0 0 14 0 1 0 15
QC 0 0 6 0 0 0 0 6
All 8 4 6 14 16 1 3 52
Melt columns to rows (wide format to long)
# giving DataFrame df
>>> df
date AAPL IBM
0 2020-03-01 90 205
1 2020-03-02 120 210
df1 = df.melt(id_vars=['date'], var_name='ticker', value_name='close')
>>> df1
date ticker close
0 2020-03-01 AAPL 90
1 2020-03-02 AAPL 120
2 2020-03-01 IBM 205
3 2020-03-02 IBM 210
Melt selected columns to rows (wide format to long)
# giving DataFrame df
>>> df
date AAPL IBM MSFT TSLA
0 2020-03-01 90 205 100 200
1 2020-03-02 120 210 110 205
df1 = df.melt(id_vars=['date'],
var_name='ticker',
value_name='close',
value_vars=['AAPL', 'TSLA'])
>>> df1
date ticker close
0 2020-03-01 AAPL 90
1 2020-03-02 AAPL 120
2 2020-03-01 TSLA 200
3 2020-03-02 TSLA 205