Reshape Data

Common functions to reshape pandas DataFrames are: concat, append, merge, merge_asof, transpose. They are used to combine together rows and columns or bring the dataset to its tidy (long) form.

Concatenate two DataFrames together, rows of df1 followed by rows of df2

df = pd.concat([df1, df2])

Concatenate two DataFrames together, df2 to the right side of df1

df = pd.concat([df1, df2], axis=1)

Concatenate multiple DataFrames together, appending rows horizontally

frames = [process_input_file(f) for f in files ]
df = pd.concat(frames)

Append a single row to a DataFrame by passing a Series

df_new = df.append(s1, ignore_index=True)

Append a single row to a DataFrame by passing a Series

data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> print(df)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

s1 = pd.Series(['X0', '100'], index=['Name', 'Age'])
>>> print(s1)
Name    X0
Age     100
dtype: object

df_new = df.append(s1, ignore_index=True)
>>> print(df_new)
     Name Age
0    Alex  10
1     Bob  12
2  Clarke  13
3      X0  100

Append a single row to a DataFrame by passing a dict

df_new = df.append(dict1, ignore_index=True)

Append a single row to a DataFrame by passing a dict

data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

dicts = [{'Name': 'X1', 'Age': 100},
         {'Name': 'X2', 'Age': 200}]
>>> dicts
[{'Name': 'X1', 'Age': 100}, {'Name': 'X2', 'Age': 200}]

df_new = df.append(dicts, ignore_index=True, sort=False)
>>> df_new
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
3      X1  100
4      X2  200

Inner Join two DataFrames on common key column

# Use intersection of keys from both frames
df = pd.merge(df_left df_right, on='column_id')

# identical with
df = pd.merge(df_left df_right, left_on='column_id', right_on='column_id')

Left Join two DataFrames

# Use keys from left frame only
df = pd.merge(df_left df_right, left_on='column_x', right_on='column_y', how='left')

Right Join two DataFrames

# Use keys from right frame only
df = pd.merge(df_left df_right, left_on='column_x', right_on='column_y', how='right')

Full Outer Join two DataFrames on common key column

# Use union of keys from both frames
df = pd.merge(df_left df_right, on='column_id', how='outer')

Inner Join two DataFrames on index

df = pd.merge(df_left df_right, left_index=True, right_index=True)

Join two DataFrames on nearest key (asof merge)

# This is similar to a left-join except that we match on nearest key rather than equal keys.
# Both DataFrames must be sorted by the key.
df = pd.merge_asof(df_left df_right, on='column_id', by='column_to_match')

Join two DataFrames on nearest key (asof merge)

# This is similar to a left-join except that we match on nearest key rather than equal keys.
# Both DataFrames must be sorted by the key.

>>> trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

>>> quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

df_merge_asof = pd.merge_asof(trades, quotes, on='time', by='ticker')

>>> df_merge_asof
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

Transpose data frame (rows become columns, columns become rows)

df.T

Pivot rows to columns (long format to wide), one value

# giving DataFrame df
>>> df
         date ticker  open  close
0  2020-03-01   AAPL   100     90
1  2020-03-02   AAPL   110    120
2  2020-03-01    IBM   200    205
3  2020-03-02    IBM   205    210

# pivot distinct values of column ticker as new columns
df1 = df.pivot_table(index=['date'], columns='ticker', values='close')
# remove columns name
df1.columns.name = ''
# bring `date` as column
df1 = df1.reset_index()
df1

>>> df1
         date  AAPL  IBM
0  2020-03-01    90  205
1  2020-03-02   120  210

Pivot rows to columns (long format to wide), multiple values

# giving DataFrame df
>>> df
         date ticker  open  close
0  2020-03-01   AAPL   100     90
1  2020-03-02   AAPL   110    120
2  2020-03-01    IBM   200    205
3  2020-03-02    IBM   205    210

df1 = df.pivot_table(index=['date'], columns='ticker', values=['open', 'close'])
# flatten columns multi-index
df1.columns = ['_'.join(col).strip() for col in df1.columns.values]
# bring `date` as column
df1 = df1.reset_index()

>>> df1
         date  close_AAPL  close_IBM  open_AAPL  open_IBM
0  2020-03-01          90        205        100       200
1  2020-03-02         120        210        110       205

Pivot table with totals

# giving DataFrame df
>>> df
  Province       City  Sales
0       ON    Toronto     13
1       ON    Toronto      1
2       QC   Montreal      6
3       BC  Vancouver     16
4       AL    Calgary      8
5       AL   Edmonton      4
6       MN   Winnipeg      3
7       ON    Windsor      1

df1 = pd.pivot_table(df, index=['Province'], columns=['City'], values=['Sales'],
                     aggfunc=np.sum, fill_value=0, margins=True)
# `index` - distinct values of cols provided will go on rows
# `columns` - distinct values of cols provided will go on columns
# `values` = data cells
# `aggfunc` = aggregation function(s)
# `margins=True` = display grand totals
>>> df1
           Sales
City     Calgary Edmonton Montreal Toronto Vancouver Windsor Winnipeg All
Province
AL             8        4        0       0         0       0        0  12
BC             0        0        0       0        16       0        0  16
MN             0        0        0       0         0       0        3   3
ON             0        0        0      14         0       1        0  15
QC             0        0        6       0         0       0        0   6
All            8        4        6      14        16       1        3  52

Melt columns to rows (wide format to long)

# giving DataFrame df
>>> df
         date  AAPL  IBM
0  2020-03-01    90  205
1  2020-03-02   120  210

df1 = df.melt(id_vars=['date'], var_name='ticker', value_name='close')

>>> df1
         date ticker  close
0  2020-03-01   AAPL     90
1  2020-03-02   AAPL    120
2  2020-03-01    IBM    205
3  2020-03-02    IBM    210

Melt selected columns to rows (wide format to long)

# giving DataFrame df
>>> df
         date  AAPL  IBM  MSFT  TSLA
0  2020-03-01    90  205   100   200
1  2020-03-02   120  210   110   205

df1 = df.melt(id_vars=['date'],
              var_name='ticker',
              value_name='close',
              value_vars=['AAPL', 'TSLA'])

>>> df1
         date ticker  close
0  2020-03-01   AAPL     90
1  2020-03-02   AAPL    120
2  2020-03-01   TSLA    200
3  2020-03-02   TSLA    205