Feature selection and collinearity
I have already written an article about feature selection. It described an unsupervised way to measure feature importance in a binary classification model, using Pearson’s chi-square test and the correlation coefficient.
Generally speaking, an unsupervised approach is often enough for simple feature selection. However, each model has its own way of “thinking about” the features and treating their correlation with the target variable. Moreover, some models do not care much about collinearity (i.e., correlation between the features), while others suffer badly when it occurs (linear models, for example).
Although it’s possible to rank the features by some relevance metric provided by the model (for example, the p-value of the t-test performed on the coefficients of a linear regression), taking only the most relevant variables may not be enough. Think about a feature that is equal to another one multiplied by two. The linear correlation between these two features is 1, and the multiplication doesn’t affect the correlation with the target variable, so if we take only the most relevant variables, we’ll take both the original feature and the multiplied one. This leads to collinearity, which can be quite dangerous for our model.
That’s why we must introduce some way to better select our features.
Random search
Random search is a really useful tool in a data scientist’s toolbox. It’s a simple technique that is very often used, for example, in hyperparameter optimization with cross-validation.
It’s very simple. If you have a multi-dimensional grid and want to look for the point on this grid which maximizes (or minimizes) some objective function, random search works as follows:
- Take a random point on the grid and measure the objective function value
- If the value is better than the best one achieved so far, keep the point in memory.
- Repeat for a pre-defined number of iterations
That’s it. Just generate random points and keep track of the best one.
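As a toy sketch of the idea (the grid, the objective function, and the function name here are invented purely for illustration):

```python
import random

def random_search(grid, objective, n_iter=100):
    # Keep the best point seen so far while drawing random points from the grid
    best_point, best_value = None, float("-inf")
    for _ in range(n_iter):
        point = random.choice(grid)   # take a random point on the grid
        value = objective(point)      # measure the objective function
        if value > best_value:        # keep it if it beats the best so far
            best_point, best_value = point, value
    return best_point, best_value

# Toy usage: maximize a made-up function on a small 1-D grid
grid = list(range(-50, 51))
print(random_search(grid, objective=lambda x: -(x - 7) ** 2))
```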
Is this a good way to find the global maximum (or minimum)? Of course not. The point we are looking for is a single point (if we are lucky) in a very large space, and we have only a limited number of iterations. The probability of hitting that single point with one random draw from an N-point grid is 1/N.
So, why is random search so widely used? Because we never really want to maximize our performance measure; we want a good, reasonably high value that is not the highest possible, in order to avoid overfitting.
That’s why random search works and can be used in feature selection.
How to use random search for feature selection
Random search can be used for feature selection with quite good results. A procedure in the same spirit is the Random Forest model, which randomly selects a subset of the features for each tree.
The idea is pretty simple: choose a random subset of features, measure model performance with k-fold cross-validation, and repeat many times. The feature combination that gives the best performance is the one we are looking for.
More precisely, these are the steps to follow:
- Generate a random integer N between 1 and the number of features.
- Generate a random sequence of N distinct integers between 0 and the total number of features minus 1. This sequence represents the indices of the selected features. Remember that Python indexing starts from 0.
- Train the model on these features and cross-validate it with k-fold cross-validation, saving the average value of some performance measure.
- Repeat from point 1 as many times as you want.
- Finally, get the feature array that gives the best performance according to the chosen performance measure.
A practical example in Python
For this example, I’ll use the breast cancer dataset included in the sklearn module. Our model will be a logistic regression, and we’ll perform 5-fold cross-validation using accuracy as the performance measure.
First of all, we must import the necessary modules.
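A minimal set of imports that supports the steps below might look like this (the original article’s exact import list is not shown, so this is a sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
```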
Then we can load the breast cancer data and split it into input and target.
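For instance, using sklearn’s built-in loader (variable names X and y are my own choice here):

```python
# Load the breast cancer dataset and separate inputs from target
data = load_breast_cancer()
X, y = data.data, data.target
```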
We can now create a logistic regression object.
```python
lr = LogisticRegression()
```

Then, we can measure the average accuracy in k-fold CV with all the features.
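A sketch of this baseline measurement, assuming cross_val_score with 5 folds and accuracy scoring (depending on your sklearn version, you may need to increase max_iter or scale the data for the solver to converge):

```python
# Baseline: average 5-fold CV accuracy using all the features
base_score = cross_val_score(lr, X, y, cv=5, scoring="accuracy").mean()
print(base_score)
```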
Now, we can implement a random search with, for example, 300 iterations.
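One possible implementation of this loop, following the steps listed earlier and reusing the objects defined above (the seed and the variable names are assumptions made for illustration):

```python
# Random search over feature subsets
n_features = X.shape[1]
result = []

np.random.seed(0)  # fixed seed for reproducibility (an assumption, not in the original)
for _ in range(300):
    # 1. Random number of features to keep
    n = np.random.randint(1, n_features + 1)
    # 2. Random subset of feature indices, without repetition
    features = np.random.choice(n_features, size=n, replace=False)
    # 3. Average 5-fold CV accuracy of the model on this feature subset
    score = cross_val_score(lr, X[:, features], y, cv=5,
                            scoring="accuracy").mean()
    result.append((score, features))

# Sort by score, best subsets first
result.sort(key=lambda pair: pair[0], reverse=True)
```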
At the end of the loop, after sorting the results, the first element of the result list is the feature subset we are looking for.
We can use these feature indices to calculate the performance measure again on this subset of the features.
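A sketch of this final check, reusing the names introduced above:

```python
# Take the best-scoring subset and re-evaluate the model on it
best_score, best_features = result[0]
final_score = cross_val_score(lr, X[:, best_features], y, cv=5,
                              scoring="accuracy").mean()
print(best_features, final_score)
```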
As you can see, accuracy has increased.
Conclusions
Random search can be a powerful tool to perform feature selection. It’s not meant to explain why some features are more useful than others (as opposed to other feature selection procedures like Recursive Feature Elimination), but it can be a useful tool to reach good results in less time.