I recently discovered data science practice problems on Analytics Vidhya. These allow you access to simple datasets on which to practice your machine learning skills and benchmark yourself against others. I think they offer a great introduction to approaching these problems before perhaps moving onto something a bit more challenging such as Kaggle competitions.
Over the last few weeks, I spent some time working on a simple practice problem. The problem is to create a model to predict whether or not a person would be approved for a loan, given a number of fields from an online search form. In the following post, I am going to walk through step by step how to get started with building the first model for this task using Python in JupyterLab. I am not going to go into great detail about each step but wanted to show the end-to-end process of how to approach a classification problem.
For this problem, you are given two CSV files, train.csv and test.csv. We are going to use the training file to train and test the model, and then use the model to predict the unknown values in the test file. The first steps are to read in the files and perform some preliminary analysis to determine what preprocessing we will need before attempting to train a model.
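The reading step is a couple of lines with pandas. The sketch below uses a tiny inline sample (via io.StringIO, with column names from this dataset) in place of the real train.csv so that it is self-contained; in the notebook you would simply pass the file names to pd.read_csv.

```python
import io
import pandas as pd

# In the notebook: train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv")
# A small inline sample stands in for the real file here.
sample = io.StringIO(
    "Loan_ID,Gender,Married,Dependents,ApplicantIncome,LoanAmount,Loan_Status\n"
    "LP001002,Male,No,0,5849,,Y\n"
    "LP001003,Male,Yes,1,4583,128,N\n"
    "LP001005,Male,Yes,0,3000,66,Y\n"
)
train = pd.read_csv(sample)

print(train.shape)   # (rows, columns)
print(train.dtypes)  # a first look at which columns are numeric vs object
```

Note that the empty LoanAmount field in the first row is read in as NaN, which is exactly the kind of missing value the next steps deal with.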
Most machine learning models are unable to handle non-numeric columns or missing values in data, so it is necessary to perform a number of pre-processing steps.
This line of code will quickly indicate the percentage of missing values you have in each column.
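A one-liner along these lines does the job; the small frame below is a hypothetical stand-in for the real training data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some gaps, standing in for the real training data
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "No", None, None],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
})

# Mean of the boolean null mask gives the fraction missing per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```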
There are a number of methods you can use to fill these, each with various tradeoffs, and to reach the best-performing model you may well have to try several of them. However, I don’t have space to cover them all here, so for simplicity we are going to fill with the most common value, as this works for all data types.
There are a number of other pre-processing steps we have to perform before we begin to create the model. It is important that we apply exactly the same steps to both the test and train data so that we make predictions from identical features. When doing this, I found it easier to create a single function that carries out these pre-processing steps, which can then be applied to both data sets.
This function fills missing values with the most common values.
It also creates dummy values from the categorical columns. This is one way of converting non-numeric data types into numeric. It takes each distinct value in the column and turns it into a new Boolean column where 1 indicates that the value is present and 0 indicates that it is not. The below indicates how this works with the ‘Married’ column.
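With pandas this is a single call to get_dummies; applied to a few example values of the ‘Married’ column, it looks like this:

```python
import pandas as pd

# A few example values for the 'Married' column
married = pd.DataFrame({"Married": ["Yes", "No", "Yes"]})

# get_dummies turns each distinct value into its own 0/1 column
dummies = pd.get_dummies(married["Married"], prefix="Married")
print(dummies)
```

Each row now has a 1 in exactly one of the Married_No / Married_Yes columns.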
There are also a number of columns that should be numeric but are in fact objects. You can view the data types for each column using the dtypes attribute in pandas. The function above simply converts these columns to numeric using the pandas to_numeric function.
And finally, the function drops the “Loan_ID” column since this is not a feature we would use in the model.
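Pulled together, a pre-processing function along these lines might look like the sketch below. The column lists are assumptions for illustration, based on this dataset’s fields, and a tiny sample frame demonstrates the call.

```python
import pandas as pd

# Column names assumed from the loan-prediction dataset; adjust to match your data
CATEGORICAL = ["Gender", "Married", "Education"]
NUMERIC_AS_OBJECT = ["Dependents"]

def preprocess(df):
    """Apply identical pre-processing to the train and test frames."""
    df = df.copy()
    # Fill missing values with each column's most common value
    for col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0])
    # Convert object columns that should be numeric
    for col in NUMERIC_AS_OBJECT:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # One-hot encode the categorical columns
    df = pd.get_dummies(df, columns=CATEGORICAL)
    # Loan_ID is an identifier, not a feature
    df = df.drop(columns=["Loan_ID"])
    return df

# Tiny stand-in frame to show the function in action
sample = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2"],
    "Gender": ["Male", None],
    "Married": ["Yes", "No"],
    "Education": ["Graduate", "Not Graduate"],
    "Dependents": ["0", "1"],
    "ApplicantIncome": [5849, 4583],
})
clean = preprocess(sample)
print(clean.columns.tolist())
```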
The final piece of necessary pre-processing is to convert the “Loan_Status” column in the training data into a numeric field. We do this using the scikit-learn LabelEncoder, which automatically encodes labels into numbers.
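On the ‘Y’/‘N’ labels in this column, the encoder works like this (LabelEncoder sorts the classes, so ‘N’ maps to 0 and ‘Y’ to 1):

```python
from sklearn.preprocessing import LabelEncoder

# 'Y'/'N' values as found in the Loan_Status column
labels = ["Y", "N", "Y", "Y", "N"]

le = LabelEncoder()
encoded = le.fit_transform(labels)
print(encoded)       # [1 0 1 1 0]
print(le.classes_)   # ['N' 'Y']
```

Keeping the fitted encoder around also lets you map the numbers back to labels later with inverse_transform, which we need for the submission file.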
The data is now almost ready to feed into a model. But we also want to be able to determine the accuracy of the model, and the best way to do this is to reserve some of the data to make predictions on. We can then compare how many of these predictions the model got right.
The following code assigns the list of features we will be using in the model to the variable features, and the column we are trying to predict to target.
We are going to split the training data into train and test sets using the scikit-learn train_test_split function. In the code below we have chosen 0.20 for the test_size, which will randomly divide the data so that 20% is reserved for the test set. We set the random_state to 1 so that the results will be reproducible.
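A sketch of the split, with a stand-in feature matrix and target in place of the pre-processed training frame:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and target; in the walkthrough these would be
# train[features] and train[target] from the pre-processed frame.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```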
Creating a baseline model
Now that the data is prepared we need to find the most accurate model and set of parameters to make our predictions. In order to determine how well the models are performing it is useful to first create a baseline score so that we know if the model we are selecting is better than if we were simply guessing the predictions.
One way to do this is to use a dummy classifier. The code below trains a Dummy Classifier using the training data, and performs predictions on the reserved test set. This model, with the strategy variable set to most_frequent, always uses the most frequently occurring label to make predictions. Using this model we get an initial accuracy of 0.68.
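On a small synthetic set where label 1 is the majority class, the pattern looks like this; the real 0.68 baseline simply reflects the share of the most common Loan_Status value in the reserved test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy data: label 1 is the majority class in the training set
X_train = np.zeros((10, 2))
y_train = np.array([1] * 7 + [0] * 3)
X_test = np.zeros((4, 2))
y_test = np.array([1, 1, 0, 1])

# most_frequent: always predict the majority training label
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
preds = dummy.predict(X_test)
print(accuracy_score(y_test, preds))  # 0.75
```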
Next, we see if we can improve this score by using the Random Forest algorithm. To begin with I am using default parameters. This gives an improvement in our baseline, and the accuracy is now 0.72.
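The fit-and-score step with default parameters follows the same pattern; the sketch below substitutes a synthetic binary dataset for the loan features, so the printed accuracy is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the pre-processed loan features
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

rf = RandomForestClassifier(random_state=1)  # default parameters
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print(acc)
```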
Random forest has a number of parameters, and one way to optimize your model is to try to find the optimal set of parameters to use. This is called hyperparameter optimization, and again there are a number of different approaches you can use for this. For this problem, I chose to use grid search. As you can see, optimization yielded a substantial improvement in the performance of the model.
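A grid search with scikit-learn’s GridSearchCV might look like the following sketch; the grid here is deliberately small and illustrative, and the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# A small illustrative grid; a real search would cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

search.best_estimator_ then gives you the refitted model to use for the final predictions.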
Making a submission
The final step is to create a submission file to upload to the competition site. The code below uses the model to predict the Loan_Status for each row in the test file, encodes the numeric Loan_Status values back into the original labels, and then creates a new data frame with just two columns: Loan_ID and the predictions. Finally, this is exported as a CSV file ready to upload.
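A sketch of that last step, with stand-ins for the fitted encoder and the model’s numeric predictions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-ins: the encoder fitted earlier and the model's numeric predictions
le = LabelEncoder().fit(["N", "Y"])
loan_ids = ["LP001015", "LP001022", "LP001031"]
numeric_preds = [1, 0, 1]

# Decode back to 'Y'/'N' and build the two-column submission frame
submission = pd.DataFrame({
    "Loan_ID": loan_ids,
    "Loan_Status": le.inverse_transform(numeric_preds),
})
submission.to_csv("submission.csv", index=False)
print(submission)
```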
I hope that this has provided a useful walkthrough of how to get started with these sorts of competitions. Once you have made the first submission you can go back and try other ways to improve the accuracy. These might include selecting a different model, engineering new features from the data or handling the missing information differently.