Pandas “cheat sheet” for data wrangling
If you like to cook, you know this very well. Turning on the stove and cooking the food is a tiny part of the whole cooking process. Much of your sweat and tears actually go into preparing the right ingredients.
Cliché, but worth saying again: data preparation is 80% of the work in any data science project. Whether it is about making a dashboard, running a simple statistical analysis, or fitting an advanced machine learning model, it all starts with finding the data and transforming it into the right format so the algorithm can take care of the rest.
If you are a Python fan, then pandas is your best friend in your data science journey. Equipped with all the necessary tools, it helps you get through the most difficult parts of a project.
That said, like any new tool you first need to learn its functionalities and how to put them to use. Many beginners in data science still struggle to make the best use of Pandas and instead spend much of their time on Stack Overflow. The principal reason for this, I'd say, is not being able to match Pandas functionalities with their analytics needs.
Much of this struggle can be overcome simply by making an inventory of typical data preparation problems and matching them with appropriate Pandas tools. Below I present a typical data preparation and exploratory analysis workflow and match each step with the necessary Pandas functions. I am not trying to document everything under the sun in Pandas; rather, I am demonstrating the process of creating your own data wrangling cheatsheet.
Setup
Soon after you fire up your favorite Python IDE you might want to get started right away and import the necessary libraries. That's fine, but you still need to set up your environment: set the working directory, locate the data and other files, etc.
# find out your current directory
import os
os.getcwd()
# if you want to set a different working directory
os.chdir("folder-path")
# to get a list of all files in the directory
os.listdir()
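If you prefer treating paths as objects, the standard-library pathlib module is an alternative to os for locating files. The snippet below is only a sketch; the "data" subfolder is a hypothetical example, not a folder this article assumes you have.
# alternative: locate files with pathlib (the "data" subfolder is just an example)
from pathlib import Path
data_dir = Path.cwd() / "data"
csv_files = list(data_dir.glob("*.csv"))  # all csv files in that folder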
Data import
Next up is data import, and this is where you'll be using Pandas for the first time.
Your data may be sitting anywhere in the world: on your local machine, in a SQL database, in the cloud, or even in an online database. And data can be saved in a variety of formats: csv, txt, excel, sav, etc.
Depending on where the data is coming from and its file extension, you'd need different Pandas commands. Below are a couple of examples.
# import pandas and numpy libraries
import pandas as pd
import numpy as np
# import a csv file from your local machine
df = pd.read_csv("file_path")
# import a csv file from an online database
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
Data inspection
After importing data you'd like to inspect it for a number of things, such as the number of columns and rows, column names, etc.
# description of index, entries, columns, data types, memory info
df.info()
# check out the first few rows
df.head(5)
# number of rows & columns
df.shape
# column names
df.columns
# number of unique values of a column
df["sepal_length"].nunique()
# show unique values of a column
df["sepal_length"].unique()
# number of columns (count of unique column names)
df.columns.nunique()
# value counts of a categorical column
df['species'].value_counts()
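On top of the commands above, a quick statistical summary is often part of the same inspection step; the two lines below are a small optional addition.
# summary statistics (count, mean, std, min, quartiles, max) of numeric columns
df.describe()
# data type of each column
df.dtypes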
Dealing with NA values
Next, check for NA, NaN or missing values. Some algorithms can handle missing values, but others require that missing values are taken care of before putting the data to use. Regardless, checking for missing values and understanding how to handle them is an essential part of getting to know your data.
# show null/NA values per column
df.isnull().sum()
# show NA values as % of total observations per column
df.isnull().sum()*100/len(df)
# drop all rows containing null
df.dropna()
# drop all columns containing null
df.dropna(axis=1)
# drop columns with fewer than 5 non-NA values (thresh is the minimum number of non-NA values required to keep a column)
df.dropna(axis=1, thresh=5)
# replace all NA values with -9999
df.fillna(-9999)
# standardize placeholder values (e.g. -9999) back to NaN
df.replace(-9999, np.nan)
# fill NA values with a string
df.fillna("data missing")
# fill missing values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True))
# replace NA values of a specific column with that column's mean
df["columnName"] = df["columnName"].fillna(df["columnName"].mean())
# interpolate missing values (useful in time series)
df["columnName"].interpolate()
Column operations
As is often the case, you may need to perform a wide range of column operations such as renaming or dropping a column, sorting column values, creating new calculated columns, etc.
# select a column
df["sepal_length"]
# select multiple columns and create a new dataframe X
X = df[["sepal_length", "sepal_width", "species"]]
# select columns by column number (position)
df.iloc[:, [1, 3, 4]]
# drop a column from dataframe X
X = X.drop("sepal_length", axis=1)
# save all column names to a list
df.columns.tolist()
# rename columns
df.rename(columns={"old column1": "new column1", "old column2": "new column2"})
# sort values by column "sepal_width" in ascending order
df.sort_values(by="sepal_width", ascending=True)
# add a new calculated column
df['newcol'] = df["sepal_length"]*2
# create a conditional calculated column
df['newcol'] = ["short" if i < 3 else "long" for i in df["sepal_width"]]
Row operations (sort, filter, slice)
Up until the previous section you have mostly been cleaning up your data, but another important part of data preparation is slicing and filtering the data before it goes into the next stage of the analytics pipeline.
# select rows 3 to 9 (the end position is exclusive)
df.iloc[3:10]
# select rows 3 to 49 and columns 1 to 3
df.iloc[3:50, 1:4]
# randomly select 10 rows
df.sample(10)
# find rows with specific strings
df[df["species"].isin(["Iris-setosa"])]
# conditional filtering
df[df.sepal_length >= 5]
# filter rows with multiple values e.g. 0.2, 0.3
df[df["petal_width"].isin([0.2, 0.3])]
# multi-conditional filtering
df[(df.petal_length > 1) & (df.species == "Iris-setosa") | (df.sepal_width < 3)]
# drop a row
df.drop(df.index[1])  # 1 is the positional index of the row to be dropped
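The same multi-conditional filter can also be written with the query() method, which some find more readable; the sketch below assumes the same column names used throughout this article.
# multi-conditional filtering with query() (equivalent to the boolean-mask version above)
df.query("(petal_length > 1 and species == 'Iris-setosa') or sepal_width < 3")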
Grouping
Last but not least, you will often need to group data by different categories. This is especially useful in exploratory data analysis and for getting insights into categorical variables.
# data grouped by the column "species"
X = df.groupby("species")
# mean values of one column ("sepal_length") grouped by the "species" column
df.groupby("species")["sepal_length"].mean()
# mean values of ALL numeric columns grouped by the "species" category
df.groupby("species").mean()
# count unique values of each column within each category
df.groupby("species").nunique()
Summary
The purpose of this article was to show some essential Pandas functions needed for getting data analysis-ready. In this demonstration, I followed a typical analytics process rather than showing commands in a random fashion, which should allow data scientists to find the right tool at the right stage of a project. Of course, I did not intend to show every single command required to deal with every single problem in data preparation; rather, the intention was to show how to create an essential Pandas cheatsheet.