Pandas snippets
- Basics
- Import the pandas library and aliasing as pd
- Create an empty DataFrame
- Create a DataFrame from List of Lists. Specify values for each row
- Create a DataFrame from a Dictionary of Lists. Specify values for each column
- Create a DataFrame with index
- Generate a random DataFrame
- Generate a random DataFrame with date index
- Set a column as the dataframe index
- Reset the dataframe index to the default integer index
- Reset the dataframe index without inserting it into dataframe columns
- Rename several DataFrame columns
- Replace all column names
- Lowercase DataFrame columns names
- Remove/drop one column
- Remove/drop several columns
- Remove/drop rows by index
- Remove/drop first two rows
- Remove/drop last two rows
- Remove/drop rows by names
- Add column with all values a constant
- Add column on specific position (as second column)
- Add column based on values of other columns
- Add column with values from list
- Add new column based on values of other column and where condition
- Add new column based on values of other columns and where condition
- Flatten/collapse the hierarchical index in columns
- Iterate over DataFrame rows (slow operation)
- Axis
- Examine Data
- Show the first 30 and last 30 rows
- Show info on index, data types, memory usage
- Show type of df object
- Show the first 5 rows
- Show the first 10 rows
- Show the last 5 rows
- Show “the index” (aka “the labels”)
- Show the column names
- Show data types of each column
- Show number of rows and columns
- Show number of rows only
- Show number of columns only
- Get DataFrame values as numpy array
- Show info on rows and columns indexes
- Get a concise summary of a DataFrame
- Get memory usage by column
- Set maximum number of rows and columns printed to unlimited
- Reset maximum number of rows and columns printed to default
- Suppress scientific notation
- Reset floats display
- Change float format to two decimal places
- Set maximum number of rows and columns printed to unlimited temporary
- Sort dataframe by one column values ascending
- Sort the dataframe by one column values descending
- Sort the dataframe by one column values ascending, inplace
- Sort the dataframe by multiple columns values
- Sort the dataframe based on the index labels ascending
- Sort the dataframe based on the columns labels ascending
- Sort the dataframe based on the columns labels ascending
- Files
- File formats that can be read in/write to
- Read csv file
- Read csv file with | as fields separator
- Read only 10 rows of csv file
- Read csv file and skip first 3 rows
- Read csv file and parse/treat columns as dates
- Read csv file while specifying column names
- Read csv file and set the index as column_id
- Read csv file and set the index as multi columns
- Read csv file and set “.” as missing values
- Read csv file and set different missing values on columns
- Read csv file and prevent pandas from interpreting “NA” as NaN in a string
- Read csv file and ignore missing value markers
- Read csv file and change the data type of a column
- Write csv file
- Write csv file with | as fields separator
- Write csv file but not save the DataFrame index
- Write csv file but not save the DataFrame header
- Write csv file but only specific columns
- Write csv file; don’t use quotes and use backslash as escape character
- Append DataFrame to csv file
- Tidy Data
- Drop a row if ANY values are missing
- Drop a row only if ALL values are missing
- Drop a column if ANY values are missing
- Drop a column only if ALL values are missing
- Drop duplicate rows keeping first occurrence
- Drop duplicate rows keeping last occurrence
- Drop all duplicates
- Drop duplicate rows identifying duplicates on the giving columns
- Find duplicates based on giving columns
- Count duplicates rows
- Count duplicates on column
- Replace all missing values with 0
- Replace missing values with 0 on one column
- Replace missing values on multiple columns
- Replace missing values with forward fill (propagate non-null values forward)
- Replace missing values with backward fill (propagate non-null values backward)
- Replace all instances of a value in a column
- Count missing values in data
- Count missing values on each column / return Series
- Count missing values on specific column
- Percentage of missing values in data
- Show rows with missing values on column
- Show rows with no missing values on column
- Convert all NaNs to None
- Convert column to numeric (will error if column has non-numeric values)
- Change data type of DataFrame column
- Convert column to datetime (will error if column has datetime values)
- Split delimited values in a DataFrame column into two new columns
- Summarize Data
- Basic descriptive statistics for numeric columns
- Basic descriptive statistics for “object” columns (e.g. strings or timestamps)
- Basic descriptive statistics for all columns
- Basic descriptive statistics for only one column
- Count the number of occurrences of each value (excludes missing values)
- Count the number of occurrences of each value (includes missing values)
- Show the 3 most frequent occurances of column_x
- Count number of rows in a DataFrame
- Count number of distinct values in a column
- Get distinct values in a column
- Randomly select 30% of rows without replacement
- Randomly select 30% of rows with replacement
- Randomly select 10 rows
- Randomly split a DataFrame into train/test
- Get first 7 rows ordered by the given columns in descending order
- Get first 7 rows ordered by the given columns in ascending order
- Select Data With .loc & .iloc
- .loc selection by index/labels and columns
- .loc select all rows and some columns by index/labels
- .loc select some rows and all columns by index/labels
- .loc select some rows and some columns by index/labels
- Select rows whose column values equal some value/scalar
- Select rows whose column values do not equal some_value/scalar
- Select rows whose column values are in a list
- Select rows whose column values are not in a list
- Select rows that satisfy multiple and (&) boolean conditions
- Select rows that satisfy multiple or (|) boolean conditions
- Update column values based on rows condition
- Update multiple columns values based on rows condition
- .loc selection for single label, all columns / returns a Series
- .loc selection for list of label, all columns / returns a DataFrame
- .loc selection for single label, single column / returns one value
- .loc selection for all label, single column / returns a Series
- .loc selection for multiple labels, multiple columns / returns a DataFrame
- .loc selection for slice of labels for row, all columns / returns a DataFrame
- .loc selection for all rows, slice of columns / returns a DataFrame
- .loc selection for rows and columns with boolean conditions
- Select rows 1 and 3, and second column with boolean conditions
- Select rows whose column values equals a scalar
- Select rows that satisfy multiple boolean conditions; keep selected columns
- Update column values based on rows condition
- Update multiple columns values based on rows condition
- .iloc position based selection (from 0 to length-1 of the axis)
- .iloc single row selections, all columns
- .iloc single column selections, all rows
- .iloc multiple row and column selections
- .iloc selection for rows and columns with boolean conditions
- Select the rows whose index label is an even number
- Select slice of rows with .iloc
- Update/set column values based on rows condition with .iloc
- Reshape Data
- Concatenate two DataFrames together, rows of df1 followed by rows of df2
- Concatenate two DataFrames together, df2 to the right side of df1
- Concatenate multiple DataFrames together, appending rows horizontally
- Append a single row to a DataFrame by passing a Series
- Append a single row to a DataFrame by passing a Series
- Append a single row to a DataFrame by passing a dict
- Append a single row to a DataFrame by passing a dict
- Inner Join two DataFrames on common key column
- Left Join two DataFrames
- Right Join two DataFrames
- Full Outer Join two DataFrames on common key column
- Inner Join two DataFrames on index
- Join two DataFrames on nearest key (asof merge)
- Join two DataFrames on nearest key (asof merge)
- Transpose data frame (rows become columns, columns become rows)
- Pivot rows to columns (long format to wide), one value
- Pivot rows to columns (long format to wide), multiple values
- Pivot table with totals
- Melt columns to rows (wide format to long)
- Melt selected columns to rows (wide format to long)
- Group Data
- Group by one column, one column aggregated, one aggregation
- Group by two columns, one column aggregated, one aggregation
- Group by two columns, multiple columns aggregated, one aggregation
- Group by one column, one column aggregated, multiple aggregations
- Group by one column, multiple columns aggregated, multiple aggregations
- Group by named aggregation: group by one column, multiple columns aggregated, multiple aggregations
- Group by one column, one column aggregated, one aggregation
- Group by two columns, one column aggregated, one aggregation
- Group by two columns, multiple columns aggregated, one aggregation
- Group by one column, one column aggregated, multiple aggregations
- Group by one column, multiple columns aggregated, multiple aggregations
- Filter (within a GroupBy), filters the rows on a property of the group they belong to
- Transform (within a GroupBy), calculates a new value for each row based on a property of the group
- Center the data by subtracting the group-wise mean
- Apply (within a GroupBy), applies an arbitrary function to the group results
- Get top n rows for each group of columns in a sorted dataframe
- Get bottom n rows for each group of columns in a sorted dataframe
- Time Series Data
- Create datetime object with python’s native method
- Create datetime object with python’s native method
- Extract date, time parts from datetime
- Create datetime object with numpy (datetime64)
- Create datetime object with pandas (Timestamp)
- Create datetime object with pandas (datetime64)
- Create DataFrame with DatetimeIndex as index
- Slice DataFrame with DatetimeIndex on specific year/month/period
- Resample time series with aggregation
- Resample time series with selection
- Add missing business dates in a time series, backfill them with NaN
- Add missing business dates in a time series, backfill them with previous row’s data
- Add missing business dates in a time series, backfill them with next row’s data
- Find the 3rd Friday of the month i.e 2022-03