Examine Data
Common operations for exploring a pandas DataFrame include finding the number of rows and columns, displaying the first and last rows, showing column types, and sorting values.
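Show the DataFrame (by default, a frame with more than 60 rows is truncated; recent pandas versions print the first 5 and last 5 rows)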
df
Show info on index, data types, memory usage
df.info()
Show type of df object
type(df)
# >>> <class 'pandas.core.frame.DataFrame'>
Show the first 5 rows
df.head()
Show the first 10 rows
df.head(10)
Show the last 5 rows
df.tail()
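As a quick illustration of `head` and `tail`, the sketch below uses a made-up ten-row frame (the column name `n` is hypothetical). It also shows a negative argument, which returns everything except the last `|n|` rows.

```python
import pandas as pd

# Hypothetical ten-row frame, used only for illustration.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)        # rows 0, 1, 2
last_three = df.tail(3)         # rows 7, 8, 9
default_head = df.head()        # n defaults to 5
all_but_last_two = df.head(-2)  # negative n drops rows from the end

print(first_three["n"].tolist())  # [0, 1, 2]
print(last_three["n"].tolist())   # [7, 8, 9]
print(len(all_but_last_two))      # 8
```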
Show “the index” (aka “the labels”)
df.index
type(df.index)
# >>> <class 'pandas.core.indexes.base.Index'>
Show the column names
df.columns
type(df.columns)
# >>> <class 'pandas.core.indexes.base.Index'>
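The reported index type depends on how the frame was built. A minimal sketch, assuming two made-up frames (`default_df` and `labeled_df` are hypothetical names): a frame created without an explicit index gets a `RangeIndex`, while string labels give a plain `Index`.

```python
import pandas as pd

# Hypothetical sample frames, used only for illustration.
default_df = pd.DataFrame({"x": [1, 2, 3]})                         # no index given
labeled_df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])  # string labels

default_index_type = type(default_df.index).__name__
labeled_index_type = type(labeled_df.index).__name__
columns_type = type(default_df.columns).__name__

print(default_index_type)  # RangeIndex
print(labeled_index_type)  # Index
print(columns_type)        # Index
```

`RangeIndex` is a subclass of `Index`, so `isinstance(df.index, pd.Index)` holds in both cases.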
Show data types of each column
df.dtypes
Show number of rows and columns
df.shape
Show number of rows only
df.shape[0]
Show number of columns only
df.shape[1]
Get DataFrame values as a NumPy array
# the pandas documentation recommends to_numpy() over the older df.values
df.to_numpy()
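A small sketch of what the shape and array conversion return, assuming a made-up two-column frame (column names `a` and `b` are hypothetical). When columns have different numeric types, the resulting array uses one common dtype.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
arr = df.to_numpy()        # 2-D NumPy array holding the same values

print(n_rows, n_cols)  # 2 2
print(arr.dtype)       # float64, the common type for int and float columns
```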
Show the row index and the column index together (returned as a list)
df.axes
Get a concise summary of a DataFrame
>>> import pandas as pd
>>> data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'], index=['a', 'b', 'c'])
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 2 columns):
Name    3 non-null object
Age     3 non-null int64
dtypes: int64(1), object(1)
memory usage: 72.0+ bytes
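`df.info()` prints to standard output and returns `None`. If the summary is needed as a string (for logging, say), one option is to pass a text buffer through the `buf` parameter. The sketch below reuses the sample data from the example above; the exact layout of the summary varies between pandas versions.

```python
import io

import pandas as pd

data = [["Alex", 10], ["Bob", 12], ["Clarke", 13]]
df = pd.DataFrame(data, columns=["Name", "Age"], index=["a", "b", "c"])

buffer = io.StringIO()
df.info(buf=buffer)          # write the summary into the buffer instead of stdout
summary = buffer.getvalue()  # the same text info() would have printed
print(summary)
```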
Get memory usage by column
df.memory_usage()
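A minimal sketch of the per-column result, assuming a made-up frame with one text and one integer column. By default the returned Series includes an `Index` entry, and text columns are measured shallowly; `deep=True` asks pandas to inspect the stored objects as well.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"name": ["Alex", "Bob", "Clarke"], "age": [10, 12, 13]})

shallow = df.memory_usage()                   # bytes per column, plus the index
deep = df.memory_usage(deep=True)             # also measures the stored strings
without_index = df.memory_usage(index=False)  # columns only

print(shallow)
print(deep)
print(shallow["age"])  # 24: three int64 values at 8 bytes each
```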
Set maximum number of rows and columns printed to unlimited
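# use the full option names: the short 'max_rows' is ambiguous in recent pandas versions
# default is 60 rows
pd.set_option('display.max_rows', None)
# default is 20 columns (0, meaning auto-detect, when running in a terminal)
pd.set_option('display.max_columns', None)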
Reset maximum number of rows and columns printed to default
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
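To confirm that an option was changed and then restored, `pd.get_option` reads the current value. A short sketch using the full option name `display.max_rows`:

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")    # value before any change

pd.set_option("display.max_rows", None)             # None means no limit
unlimited_rows = pd.get_option("display.max_rows")

pd.reset_option("display.max_rows")                 # back to the default
restored_rows = pd.get_option("display.max_rows")

print(default_rows, unlimited_rows, restored_rows)
```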
Suppress scientific notation
# display all floats with commas and two decimal places
pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x))
# or
pd.options.display.float_format = "{:,.2f}".format
Reset floats display
pd.reset_option('display.float_format')
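Change float format to two decimal places
# using rounding (returns a new Series; values stay numeric)
df['col_x'].round(2)
# using apply (returns a new Series of strings, suitable for display only)
df['col_x'].apply(lambda x: '{:.2f}'.format(x))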
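The two approaches give different kinds of results, which matters if the column is used in later calculations. A minimal sketch, assuming a made-up `col_x` column:

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
df = pd.DataFrame({"col_x": [1.23456, 2.5, 1234.5678]})

rounded = df["col_x"].round(2)                               # still floats
formatted = df["col_x"].apply(lambda x: "{:.2f}".format(x))  # now strings

print(rounded.dtype)       # float64
print(formatted.tolist())  # ['1.23', '2.50', '1234.57']
```

Neither call modifies `df`; assign the result back (for example `df['col_x'] = df['col_x'].round(2)`) to keep it.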
Set maximum number of rows and columns printed to unlimited temporarily
# settings are restored when you exit the 'with' block
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
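The restore-on-exit behaviour can be checked directly. The sketch below assumes a made-up 100-row frame and compares the option value and the printed length inside and outside the `with` block.

```python
import pandas as pd

# Hypothetical 100-row frame, long enough to be truncated by default.
df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    inside = pd.get_option("display.max_rows")
    full_text = repr(df)   # every row is rendered
after = pd.get_option("display.max_rows")
short_text = repr(df)      # truncated again once the block exits

print(inside, after)
print(len(full_text.splitlines()), len(short_text.splitlines()))
```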
Sort the DataFrame by one column, ascending
df = df.sort_values('col_x')
Sort the DataFrame by one column, descending
df = df.sort_values('col_x', ascending=False)
Sort the DataFrame by one column, ascending, in place (the call returns None)
df.sort_values('col_x', inplace=True)
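A common slip with `inplace=True` is assigning the result back to `df`, which replaces the frame with `None`. A short sketch with a made-up `col_x` column shows what the call returns and what happens to the original frame:

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
df = pd.DataFrame({"col_x": [3, 1, 2]})

result = df.sort_values("col_x", inplace=True)  # sorts df itself

print(result)                 # None
print(df["col_x"].tolist())   # [1, 2, 3]
print(df.index.tolist())      # [1, 2, 0], the original labels travel with the rows
```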
Sort the DataFrame by multiple columns, with one sort direction per column
df = df.sort_values(['col_x', 'col_y', 'col_z'], ascending=[True, True, False])
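A worked sketch of a multi-column sort, assuming a made-up frame with hypothetical columns `team`, `score` and `name`. Later columns only break ties left by earlier ones. The `ignore_index=True` variant assumes pandas 1.0 or newer.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "team": ["b", "a", "b", "a"],
    "score": [1, 3, 2, 3],
    "name": ["w", "x", "y", "z"],
})

# team ascending, then score descending, then name descending
sorted_df = df.sort_values(["team", "score", "name"],
                           ascending=[True, False, False])

# same ordering, but with the row labels renumbered 0..n-1
renumbered = df.sort_values(["team", "score", "name"],
                            ascending=[True, False, False],
                            ignore_index=True)

print(sorted_df["name"].tolist())  # ['z', 'x', 'y', 'w']
print(sorted_df.index.tolist())    # [3, 1, 2, 0]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```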
Sort the DataFrame by index labels, ascending
df = df.sort_index()
Sort the DataFrame by column labels, ascending
df = df.sort_index(axis=1)
Example: sort the DataFrame by column labels, ascending
>>> import pandas as pd
>>> data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
>>> df = pd.DataFrame(data)
>>> df
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
>>> df = df.sort_index(axis=1)
>>> # note how the `Age` column moved first
>>> df
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky
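`sort_index` works the same way on row labels. A minimal sketch, assuming a made-up frame with an unordered string index, sorted in both directions:

```python
import pandas as pd

# Hypothetical frame with unordered row labels.
df = pd.DataFrame({"v": [10, 20, 30]}, index=["b", "c", "a"])

by_label = df.sort_index()                      # a, b, c
by_label_desc = df.sort_index(ascending=False)  # c, b, a

print(by_label.index.tolist())       # ['a', 'b', 'c']
print(by_label["v"].tolist())        # [30, 10, 20]
print(by_label_desc.index.tolist())  # ['c', 'b', 'a']
```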