Tips and Tricks for Fast Data Analysis in Python

Rebecca Vickery
February 14, 2020 Big Data, Cloud & DevOps

Quickly summarise and describe datasets with python

Python has a large number of built-in functions and libraries for data analysis. Combining a few of them gives you powerful ways to summarise, describe and filter large amounts of data.

In this article, I want to share some tips on how to combine pandas, matplotlib and some built-in Python functionality to analyse a dataset very quickly.

All libraries in this post can be installed via the package manager pip.


Data

In this article, I am going to use the adult income dataset, which can be downloaded from the UCI Machine Learning Repository. The dataset contains a number of features about each adult and a target variable that tells us whether or not they earn over $50,000 a year.

Here are all the imports for the libraries that I am using.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline

I am using pandas to read in the dataset and return the first few rows.

data = pd.read_csv('adults_data.csv')
data.head()

 

This dataset is usually used to build a machine learning model which predicts the income class from the features. However, before getting to the model building stage it is useful to perform some data analysis first.

Describe

The describe function gives a quick look at basic descriptive statistics for the numerical features in the dataset. Running data.describe() shows that our dataset has 32,561 rows, gives the mean of each numerical feature, and summarises the distribution of each one through its standard deviation, minimum, maximum and quartiles.
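As a rough illustration of what describe returns, here is a minimal sketch on a tiny made-up sample. The rows are hypothetical and only the column names follow the dataset; the real output for the full data will of course differ.

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the column names follow the article.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "hours-per-week": [40, 13, 40, 40, 40],
    "marital-status": ["Never-married", "Married-civ-spouse", "Divorced",
                       "Married-civ-spouse", "Married-civ-spouse"],
})

# On a DataFrame, describe() summarises the numerical columns by default:
# count, mean, std, min, quartiles and max.
numeric_summary = sample.describe()

# On a text column it reports count, unique, top (most frequent value) and freq.
categorical_summary = sample["marital-status"].describe()

print(numeric_summary)
print(categorical_summary)
```

Calling describe on a single text column is a quick way to see its most common value before moving on to a full breakdown.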

Value counts

In this dataset, we also have categorical variables and it will also be useful to get a basic understanding of the distribution for these. The value_counts() function provides a very simple way to do this. Let’s use this to inspect the marital-status feature.

data['marital-status'].value_counts()
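If proportions are more useful than raw counts, value_counts accepts normalize=True. The sketch below uses a small hypothetical Series rather than the real column, so the category labels and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical values standing in for the marital-status column.
status = pd.Series(
    ["Married", "Never-married", "Married", "Divorced", "Married", "Never-married"],
    name="marital-status",
)

# Absolute counts, sorted from most to least frequent.
counts = status.value_counts()

# Share of rows in each category instead of the raw count.
shares = status.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```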

To make this easier to visualise we can quickly create a bar plot for this feature by adding just a small amount of extra code. The title is optional, and you can customise axis labels, colours and other aspects of the chart with the usual matplotlib functionality.

plt.title('Marital Status')
data['marital-status'].value_counts().plot.bar()

Plotting with value counts doesn’t work so well when we have a feature with high cardinality (a large number of unique values).

plt.title('Native Country')
data['native-country'].value_counts().plot.bar()

For a feature like native-country, it is more useful to plot only the top n values, which keeps the chart readable. We can do this by adding just a little more code.

plt.title('Native Country')
data['native-country'].value_counts().nlargest(10).plot.bar()
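One optional extension, which is my own addition rather than part of the analysis above, is to keep the top n values and lump everything else into a single "Other" bucket so the chart still accounts for every row. This is a minimal sketch on hypothetical data with n set to 2.

```python
import pandas as pd

# Hypothetical values standing in for the native-country column.
country = pd.Series(
    ["United-States"] * 6 + ["Mexico"] * 3 + ["Philippines"] * 2 + ["Germany"],
    name="native-country",
)

counts = country.value_counts()

# Keep the n most frequent categories (n=2 here to suit the tiny sample).
top_n = counts.nlargest(2)

# Everything outside the top n is summed into one "Other" entry.
other = counts.sum() - top_n.sum()
summary = pd.concat([top_n, pd.Series({"Other": other})])

print(summary)
```

The resulting Series can be plotted with summary.plot.bar() in the same way as the counts above.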

Pandas groupby

The pandas groupby function is very useful when we want to compare segments of the data. In this dataset, we want to understand how the features differ between the two income classes, and by how much. The groupby function provides a very quick way to do this.

If we run the code below we can analyse the differences in mean, for all numerical values, between the two income groups.

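# numeric_only=True restricts the mean to numerical columns; recent pandas versions raise an error on text columns otherwise
data.groupby('income').mean(numeric_only=True).round(2)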
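To make the grouped mean concrete, here is a small sketch on made-up rows. The values are hypothetical, and the second step, which uses agg to get several statistics at once, is an extra I have added for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
sample = pd.DataFrame({
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
    "age": [25, 45, 35, 55],
    "hours-per-week": [38, 45, 40, 50],
    "gender": ["Female", "Male", "Male", "Female"],
})

# Mean of every numerical column per income class; the text column is skipped.
group_means = sample.groupby("income").mean(numeric_only=True).round(2)

# Several statistics at once for selected columns.
spread = sample.groupby("income")[["age", "hours-per-week"]].agg(["mean", "median"])

print(group_means)
print(spread)
```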

A better way to compare the two groups is to look at the full distribution of each feature rather than only its mean. A boxplot is a useful way to do that, and it can be produced by using the plotting functionality alongside groupby.

data.groupby('income').boxplot(fontsize=20,rot=90,figsize=(20,10),patch_artist=True)

You will notice that, because the values are on different scales, it is difficult to compare the two distributions. To overcome this we can scale the values. To do this I am using the scikit-learn MinMaxScaler, which rescales each feature so that its values lie between 0 and 1. After scaling, we can clearly see substantial differences between the two groups in features such as age and hours-per-week.
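The scaling step could look something like the sketch below. It is one possible way to apply MinMaxScaler to the numerical columns, shown on hypothetical rows, and the variable names are my own.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows; only the column names follow the dataset.
sample = pd.DataFrame({
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
    "age": [20, 40, 30, 60],
    "capital-gain": [0, 5000, 0, 10000],
})

# Scale only the numerical columns; the income label is left untouched.
numeric_cols = sample.select_dtypes(include="number").columns
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(sample[numeric_cols]),
    columns=numeric_cols,
    index=sample.index,
)

# Put the grouping column back so the result can be grouped by income.
scaled["income"] = sample["income"]

print(scaled)
```

The scaled frame can then be passed to the same grouped boxplot call used earlier, for example scaled.groupby('income').boxplot(rot=90), so that all features share the 0 to 1 axis.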

 

We can also use the groupby function to compare categorical features. In the graph below we can quickly see that there are more males than females in the higher income bracket.

data.groupby('income').gender.value_counts().unstack(0).plot.barh()
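To show what the table behind that chart looks like, here is a sketch on hypothetical rows. The pd.crosstab call is an alternative I have added for illustration; with normalize="columns" it gives the share of each gender within an income class, which is easier to compare when the classes differ in size.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
sample = pd.DataFrame({
    "income": ["<=50K"] * 4 + [">50K"] * 4,
    "gender": ["Female", "Female", "Male", "Male",
               "Male", "Male", "Male", "Female"],
})

# Counts per (income, gender) pair; unstack(0) moves income into the columns.
counts = sample.groupby("income")["gender"].value_counts().unstack(0)

# Share of each gender within every income class (each column sums to 1).
shares = pd.crosstab(sample["gender"], sample["income"], normalize="columns")

print(counts)
print(shares)
```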

Pivot tables

Pandas has functionality that enables you to create spreadsheet-style pivot tables in python. Pivot tables allow you to quickly summarise, group and filter data to perform more complex analyses.

We can use the pivot table to explore more complex relationships. Let’s look a little deeper into the relationship between gender and income class. Do females earn less because they work fewer hours per week?

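# Passing the string 'mean' avoids the deprecation warning that newer pandas versions emit for aggfunc=np.mean
pivot_workclass = pd.pivot_table(data, values=['hours-per-week'],
                                 index='gender',
                                 columns='income', aggfunc='mean', fill_value=0)
pivot_workclass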

We can add plotting functionality to make this easier to visualise.

# Plot directly from the pivot table; assigning the result would store the matplotlib axes, not the table
pd.pivot_table(data, values=['hours-per-week'],
               index='gender',
               columns='income', aggfunc='mean', fill_value=0).plot.bar()
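For a sense of the table this produces, here is a minimal sketch on hypothetical rows. The numbers are made up and say nothing about the real data; the final subtraction is just one way I might read the result.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
    "hours-per-week": [30, 40, 40, 50, 34, 46],
})

# Mean hours per week for every gender and income class combination.
pivot_hours = pd.pivot_table(
    sample,
    values="hours-per-week",
    index="gender",
    columns="income",
    aggfunc="mean",
    fill_value=0,
)

# Difference in mean hours between the two income classes, per gender.
gap = pivot_hours[">50K"] - pivot_hours["<=50K"]

print(pivot_hours)
print(gap)
```

Passing values as a single string rather than a list keeps the columns flat, which makes the income classes easier to select by name.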

All the methods described above can be extended to create much richer and more complex analyses. 
