How to measure feature importance in a binary classification model

Gianluca Malato
October 22, 2019 AI & Machine Learning

An example in R language of how to check feature relevance in a binary classification problem

One of the main tasks a data scientist faces when building a machine learning model is selecting the most predictive variables. Including predictors with low predictive power can, in fact, lead to overfitting or poor model performance. In this article, I’ll show you some techniques to better select the predictors of a dataset in a binary classification model.


When a data scientist starts working on a model, they often don’t have a clear idea of what the predictors should be. Maybe the previous business-understanding phase discarded some useless variables, but we often still have to face a giant table with hundreds of variables.

Training a model on such a huge table is not a good idea: you run a real risk of collinearity (i.e. correlations among the predictors). So we have to choose the best set of variables to use, in order to make our model learn properly from the business information we are giving it.

Our goal is to increase the predictive power of our model against our binary target, so we must find those variables that are strongly correlated with it. Remember: information is hidden inside the dataset and we must provide all the necessary conditions to make our model extract it. So we have to prepare data before the training phase in order to make the model work properly.

Numerical and categorical predictors require different approaches, and I’ll show you both.

Numerical variables

Since our problem is a binary classification task, we can consider our outcome as a number which can be either equal to 0 or 1. In order to check if a variable is relevant or not, we can calculate the absolute value of the Pearson linear correlation coefficient between the target and the predictors.

If we have two variables, say x and y, their linear correlation coefficient is given by the formula:

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y}$$

That is, the covariance divided by the product of the standard deviations.

We are not interested in the sign of correlation. We just need to know its intensity. That’s why we use the absolute value.

I have often seen this kind of approach in many AI projects and tools. Honestly, I have to say that it’s not completely correct to calculate the correlation coefficient this way. For a perfect predictor we would expect a Pearson coefficient with absolute value equal to 1, but we may not reach this value when we treat the binary outcome as a number. This is not a problem, though: we are using the Pearson correlation coefficient only to sort our features from the most relevant to the least relevant, so as long as the coefficient is calculated the same way for every feature, we can compare them with one another.
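A quick simulation makes this point concrete (a minimal sketch with made-up data):

# A binary target perfectly determined by a continuous predictor
set.seed(1)
x = runif(200)
y = ifelse(x > 0.5, 1, 0)

# High, but clearly below 1 (around 0.87 here), despite perfect predictability
cor(x, y)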

The Pearson correlation coefficient is not flawless, however. It only measures linear correlation, and our variables may not be linearly related to the target. As a first approximation, though, it’s easy to calculate and serves our purpose well.

Categorical variables

For categorical variables there’s no Pearson correlation coefficient, but we can use another of Pearson’s great discoveries: the chi-square test.

Let’s say we have a histogram of N different categories with observed counts $O_i$ that sum up to n, and let’s say we want to compare it with a theoretical histogram given by probabilities $p_i$. We can build a chi-square variable in this way:

$$\chi^2 = \sum_{i=1}^{N} \frac{(O_i - n\,p_i)^2}{n\,p_i}$$

This variable is asymptotically distributed as a chi-square distribution with N-1 degrees of freedom.
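To make the formula concrete, here is a quick check on made-up counts, comparing the hand computation with R’s built-in goodness-of-fit test:

# Observed counts over N = 3 categories
O = c(30, 50, 20)
n = sum(O)

# Theoretical probabilities to compare against (uniform, as an example)
p = rep(1/3, 3)

# Chi-square statistic: sum of (O_i - n*p_i)^2 / (n*p_i)
chisq = sum((O - n * p)^2 / (n * p))

# p-value from the chi-square distribution with N - 1 degrees of freedom
pchisq(chisq, df = length(O) - 1, lower.tail = FALSE)

# chisq.test performs the same test in one call
chisq.test(O, p = p)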

If our variable is not correlated with the target, we expect each of its values to show roughly the same proportion of zeroes and ones that we see in the whole dataset. This is the theoretical histogram we expect under no correlation, so a chi-square test that checks whether the real histogram is compatible with it should give us a p-value close to 1 (i.e. a low chi-square value) when the variable is not correlated with the target. On the contrary, a perfect predictor will push the p-value towards 0 (i.e. towards higher chi-square values).

Example in R

To better explain the procedure I’ll show you an example in R code. I’ll work with the famous iris dataset.

Remember that R has a powerful function cor that calculates the correlation matrix and the function chisq.test that performs the chi-square test.

First, we create a column named target that is equal to 1 when the species is virginica and 0 otherwise. Then we’ll check the correlations with the other variables.

Let’s start with the numerical features. With this simple code, it’s very easy to find the most correlated ones.

# Load the iris dataset
data("iris")

# Generate a binary target column: 1 for virginica, 0 otherwise
iris$target = ifelse(iris$Species == "virginica", 1, 0)

# All remaining columns, including the target, are numeric
numeric_columns = setdiff(names(iris), "Species")

# Absolute Pearson correlation of each numeric column with the target
target_corr = abs(cor(iris[, numeric_columns])["target", ])

As you can see, the most correlated one is the petal width, then comes the petal length and so on. The correlation of the target with itself is obviously 1.
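To make the ranking explicit, we can sort this vector (a small addition to the original snippet):

# Rank the features by decreasing absolute correlation with the target
sort(target_corr, decreasing = TRUE)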

Let’s take a look at the plot of the target variable against the petal width: 
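This plot can be regenerated with base R graphics (a minimal sketch; the axis labels are my own choice):

# Binary target against petal width
plot(iris$Petal.Width, iris$target, xlab = "Petal width", ylab = "Target")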

As you can see, higher values of petal width lead to 1 and lower values lead to 0. That’s a clear correlation.

Now, let’s take a look at the plot of the target against the sepal length, which ranks among the less correlated variables:
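The same one-liner works here, just with a different column:

# Binary target against sepal length
plot(iris$Sepal.Length, iris$target, xlab = "Sepal length", ylab = "Target")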

It’s clear that there is a wide region, approximately between 5.5 and 7, inside which we find 0s and 1s almost evenly mixed. The lack of a graphical pattern is always a good reason to suspect a lack of correlation.

For the categorical case, we’ll measure the association between the target and the Species variable. Of course, we expect a strong association, because we built the target as a direct function of the species.

I’ll show you the single-line code and the results:
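# Chi-square test of independence between Species and the binary target
chisq.test(table(iris$Species, iris$target))

# With this data, the test reports X-squared = 150 with df = 2,
# i.e. a p-value below 2.2e-16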

The table function generates the contingency table, and the chisq.test function performs exactly the chi-square test we need for our case.

A very low p-value means a very strong departure from the uncorrelated case. As usual in hypothesis testing, you never actually accept the null hypothesis; you can only reject it or fail to reject it.

We can get further confirmation by taking a look at the contingency table:
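# Contingency table of Species vs the binary target
table(iris$Species, iris$target)

#               0  1
#   setosa     50  0
#   versicolor 50  0
#   virginica   0 50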

As you can see, each species falls entirely into one column of the table, which is exactly the kind of strong association we are looking for.

Conclusions

In this article, I’ve shown you two simple techniques in R to measure the importance of numerical and categorical variables against a binary target. There are many more methods that can be used both with a multi-class categorical target and with a numerical target (i.e. regression).

However, this simple procedure can be used as a first check of the most important variables and as the starting point of a deeper analysis aimed at finding the best set of predictors for our model.
