Group Data
In pandas group by involves one or more of the following steps:
- splitting the data into groups based on some criteria
- applying a function to each group independently (i.e aggregation, transformation, filtration)
- combining the results into a data structure (based on apply operation type)
Group by one column, one column aggregated, one aggregation
# returns series because we aggregate one column
df.groupby('col_grp_1')['col_agg_1']\
.min()
Group by two columns, one column aggregated, one aggregation
# returns series because we aggregate one column
df.groupby(['col_grp_1', 'col_grp_2'])['col_agg_1']\
.min()
Group by two columns, multiple columns aggregated, one aggregation
# returns data frame because we aggregate multiple columns, one aggregation
df.groupby(['col_grp_1', 'col_grp_2'])['col_agg_1', 'col_agg_2']\
.min()
Group by one column, one column aggregated, multiple aggregations
# returns data frame because of multiple aggregations
df.groupby('col_grp_1')['col_agg_1'].agg(['min', 'max', 'mean'])
Group by one column, multiple columns aggregated, multiple aggregations
# returns data frame with hierarchical index in columns
df1 = df.groupby('col_grp_1')['col_agg_1', 'col_agg_2']\
.agg(['min', 'max', 'mean', 'count'])
# flatten the hierarchical index in columns
df1.columns = ['_'.join(col).strip() for col in df1.columns.values]
Group by named aggregation: group by one column, multiple columns aggregated, multiple aggregations
# from pandas version >= 0.25.0
df.groupby('col_grp_1')\
.agg(new_col1_name=('col_agg_1', 'mean'),
new_col2_name=('col_agg_2', 'count'))
Group by one column, one column aggregated, one aggregation
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns series because we aggregate one column
df.groupby('kind')['weight'].min()
kind
cat 7.9
dog 5.5
Name: weight, dtype: float64
Group by two columns, one column aggregated, one aggregation
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns series because we aggregate one column
df.groupby(['kind', 'breed'])['weight'].min()
kind breed
cat persian 9.9
siamese 7.9
dog chihuahua 5.5
labrador 198.0
Name: weight, dtype: float64
Group by two columns, multiple columns aggregated, one aggregation
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns data frame because we aggregate multiple columns, one aggregation
df.groupby(['kind', 'breed'])['height', 'weight'].min()
height weight
kind breed
cat persian 9.5 9.9
siamese 9.1 7.9
dog chihuahua 3.5 5.5
labrador 34.0 198.0
Group by one column, one column aggregated, multiple aggregations
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
df.groupby('kind')['weight'].agg(['min', 'max', 'mean'])
# returns data frame because of multiple aggregations
min max mean
kind
cat 7.9 9.9 8.900000
dog 5.5 198.0 70.333333
Group by one column, multiple columns aggregated, multiple aggregations
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns data frame with hierarchical index in columns
df1 = df.groupby('kind')['weight', 'height']\
.agg(['min', 'max', 'count'])
# flatten the hierarchical index in columns
df1.columns = ['_'.join(col).strip() for col in df1.columns.values]
print(df1)
weight_min weight_max weight_count height_min height_max height_count
kind
cat 7.9 9.9 2 9.1 9.5 2
dog 5.5 198.0 3 3.5 34.0 3
Filter (within a GroupBy), filters the rows on a property of the group they belong to
# will return a subset DataFrame of rows belonging to the groups satisfying condition
df1 = df.groupby('col_grp').filter(lambda x : x['col'].mean() > 20)
Transform (within a GroupBy), calculates a new value for each row based on a property of the group
df['grp_mean'] = df.groupby('col_grp')['col'].transform(lambda x : x.mean())
Center the data by subtracting the group-wise mean
# will return a new DataFrame with same size as the original
df_new = df.groupby('col_grp').transform(lambda x: x - x.mean())
Apply (within a GroupBy), applies an arbitrary function to the group results
def func_normalize(x):
# x is a DataFrame of group values
x['col1'] = x['col1'] / x['col2'].sum()
return x
# normalizes the first column by the sum of the second:
df = df.groupby('col_grp').apply(func_normalize)
Get top n rows for each group of columns in a sorted dataframe
# sort DataFrame first by some columns
df = df.sort_values(['col_x','col_y'])
df_top_5 = df.groupby('col_x').head(5)
Get bottom n rows for each group of columns in a sorted dataframe
# sort DataFrame first by some columns
df = df.sort_values(['col_x','col_y'])
df_bottom_5 = df.groupby('col_x').tail(5)