A Common Data Science Mistake: Prediction/Recommendation by Manipulating Model Inputs

“We trained a machine learning model with high performance. However, it did not work and was not useful in practice.” I have heard this sentence several times, and each time I was eager to find out the reason. There could be different reasons that a model failed to work in practice. As these issues are not usually addressed in data science courses, in this article I address one of the common mistakes in designing and deploying a machine learning model.

In the rest of this article, first, I will discuss the confusion between Correlation and Causation that leads to the misuse of machine learning models. I will illustrate the discussion with an example. After that, different possibilities between inputs and outputs of the model are shown. Finally, I provide some suggestions to avoid this mistake.

Correlation not Causation

Mistaking Correlation with Causation can lead to wrong results. An example of confusion between Correlation and Causation is the analysis of Freakonomics in which Illinois sent books to students because the analysis revealed that books available at home are directly correlated to high test marks. However, the reality is that houses wherein parents usually buy books have an exhilarated learning environment. Further analysis revealed students from homes which have several books performed better in their academics even if they have never read the books. In fact, getting higher marks was not an effect of the books, but they both result from the environment.

Back to our topic, after you develop a model, you cannot manipulate the input parameters (features) to see the effect on the output. The reason is that an input feature could be an effect of the output and it is not necessarily the cause of the output. What a high-performance machine learning model tells you is that there is a correlation between the input and output. You cannot adjust the inputs to get the desired output and then provide the recommendations based on the adjusted inputs.

Example

Here is an example in which we develop a regression model but the model provides a false prediction/recommendation. Assume we have the outside temperature and temperature of a room. We can develop a linear regression model to estimate the outside temperature based on the temperature of the room.

T(Outside)= C1*T(Inside)+C2

where C1 and C2 are the constant coefficients derived from the data. Assume this model has very high performance (e.g. more than 99%).

Working with the model, we find out that if the inside temperature increase by 5C, the outside temperature will increase by 10C. Can we buy a heater for the room and increase the inside temperature to enjoy a warm day??!! Of course not. The reason is that the inside temperature is the effect, not the cause. The same thing can happen when a data scientist manipulates the inputs of a model (e.g. inside temperature) to get the desired output (e.g. outside temperature). The recommendations based on manipulating the inputs are usually useless in practice.

Input and Output Relationship

Now, let’s see different cases when there is a correlation between one of the features A and the output B. The following figures show different cases.

It is clear that in cases 2, 3, and 4, the output of the model for a manipulated value of A is different from what we see in the real world. It should be noted that even in case 1, the output might be different because A may have some correlation with other inputs of the model. This means when the value of Achanges the other inputs will also change. Therefore, it is not correct to change only one of the input features, and investigate its effect.

How to Avoid?

First, be aware of this issue. You should be aware that by manipulating the inputs, you cannot predict the output. Keeping this in your mind would affect how you design your model and how to choose the futures.

Second, if you would like to design a prediction model, you need to have the historical data that tells your model the effect of changing the inputs. By having snapshots, you cannot predict what will happen if an input changes. In this case, you can train the model based on the historical data. In our example, when we want to see the effect of the room temperature on the outside temperature, we need to have some samples that include changes of inside temperature and their effects on the outside temperature (e.g. after 1 hour). In this case, the model learns that the room temperature has no effect on the outside temperature.

Third, use your domain knowledge or talk to experts and see if your prediction/recommendation results make sense or not. This leads to avoiding not only this mistake but other logical mistakes. For example, there might be some bugs in your coding that you are not aware of. Sense check can help you validate the model in general.

Conclusion

Designing a machine learning model is a tricky task. A model may not work in practice although it has high performance on the training data. In this article, I discussed the misuse of a machine learning model that causes the predictions not to work in the real world situation. The other reasons could be overfitting, duplicated samples, and unbiased data. It is always good to use your domain knowledge or talk to some experts and see if your prediction/recommendation results make sense or not.