How to combine weak learners into a stronger learner to reduce bias and variance in your ML model
The bias and variance tradeoff is one of the key concerns when working with machine learning algorithms. Fortunately, there are Ensemble Learning techniques that machine learning practitioners can take advantage of to tackle it: bagging and boosting. In this blog we are going to explain how bagging and boosting work, what their components are, and how you can implement them in your ML problem. The blog is divided into the following sections:
- What is Bagging?
- What is Boosting?
- AdaBoost
- Gradient Boosting
What is Bagging?
Bagging, or Bootstrap Aggregation, was formally introduced by Leo Breiman in 1996 [3]. Bagging is an Ensemble Learning technique that aims to reduce the learning error through a set of homogeneous machine learning algorithms. The key idea of bagging is to use multiple base learners, each trained separately on a random sample from the training set, and to combine them through a voting or averaging approach to produce a more stable and accurate model.
The two main components of the bagging technique are random sampling with replacement (bootstrapping) and the set of homogeneous machine learning algorithms (ensemble learning). The bagging process is quite easy to understand: first, “n” subsets are extracted from the training set, then these subsets are used to train “n” base learners of the same type. To make a prediction, each of the “n” learners is fed with the test sample, and their outputs are averaged (for regression) or voted (for classification). Figure 2 shows an overview of the bagging architecture.
It is important to notice that the number of subsets, as well as the number of items per subset, will be determined by the nature of your ML problem, and the same goes for the type of ML algorithm to be used. In addition, Leo Breiman mentions in his paper that classification problems tend to require more subsets than regression problems.
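To make the procedure concrete, here is a minimal from-scratch sketch of bagging (bootstrap sampling plus majority voting) for a binary classification problem. It is purely illustrative, and the dataset, the split, and the number of subsets are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

# illustrative values: 5 bootstrap subsets of 50 samples each
n_estimators, n_samples = 5, 50
rng = np.random.default_rng(23)
learners = []

for _ in range(n_estimators):
    # bootstrap: draw sample indices with replacement
    idx = rng.choice(len(X_train), size=n_samples, replace=True)
    learners.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# each learner votes; the majority class (labels are 0/1) is the ensemble prediction
votes = np.array([learner.predict(X_test) for learner in learners])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Accuracy:", np.mean(y_pred == y_test))
```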
For implementing bagging, scikit-learn provides a class that does it easily, BaggingClassifier. For a basic execution we only need to provide some parameters, such as the base learner, the number of estimators, and the maximum number of samples per subset.
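A minimal sketch of such an implementation could look like the following; the train/test split and the random_state are assumptions here, so the exact scores may differ slightly from the ones reported below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

# 5 decision trees, each trained on 50 samples drawn with replacement
# (older scikit-learn versions use base_estimator instead of estimator)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=5,
    max_samples=50,
    bootstrap=True,
    random_state=23,
)
bagging.fit(X_train, y_train)

print("Train score:", bagging.score(X_train, y_train))
print("Test score:", bagging.score(X_test, y_test))
```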
In the previous code snippet, a bagging-based model was created for the well-known breast cancer dataset. A Decision Tree was used as the base learner, and 5 subsets of 50 items each were drawn randomly with replacement from the training set (to train 5 decision tree models). By running it we will get something like:
Train score: 0.9583568075117371
Test score: 0.941048951048951
One of the key advantages of bagging is that it can be executed in parallel, since there is no dependency between the estimators. For small datasets a few estimators are enough (as in the example above), while larger datasets may require more.
Great, so far we have seen what bagging is and how it works. Now let’s see what boosting is, what its components are, and how it relates to bagging. Let’s go for it!
What is Boosting?
Boosting is an Ensemble Learning technique that, like bagging, makes use of a set of base learners to improve the stability and effectiveness of an ML model. The idea behind a boosting architecture is the generation of sequential hypotheses, where each hypothesis tries to improve or correct the mistakes made by the previous one [4]. In other words, boosting applies homogeneous ML algorithms sequentially, where each algorithm tries to improve the stability of the model by focusing on the errors made by the previous one. How the errors of each base learner are taken into account by the next base learner in the sequence is the key differentiator between the variations of the boosting technique.
The boosting technique has been studied and improved over the years, and several variations have been added to its core idea. Some of the most popular are AdaBoost (Adaptive Boosting), Gradient Boosting and XGBoost (Extreme Gradient Boosting). As mentioned above, the key differentiator between boosting-based techniques is the way in which errors are penalized (by modifying weights or minimizing a loss function), as well as how the data is sampled.
For a better understanding of the differences between some of the boosting techniques, let’s see in a general way how AdaBoost and Gradient Boosting, two of the most common variations of boosting, work. Let’s go for it!
AdaBoost
AdaBoost is an algorithm based on the boosting technique; it was introduced in 1995 by Freund and Schapire [5]. AdaBoost uses a vector of weights to penalize the samples that were incorrectly inferred (by increasing their weight) and reward those that were correctly inferred (by decreasing their weight). Updating this weight vector generates a distribution under which it is more likely to draw the samples with higher weight (that is, those that were incorrectly inferred); this sample is then fed to the next base learner in the sequence. This is repeated until a stopping criterion is met. Likewise, each base learner in the sequence is assigned a weight: the higher its performance, the higher the weight and the greater its impact on the final decision. Finally, to make a prediction, each base learner in the sequence is fed with the test data, and the predictions of all models are voted (for classification) or averaged (for regression). In Figure 3 we can observe the descriptive architecture of how AdaBoost operates.
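To illustrate the weight-update mechanism, here is a minimal sketch of one round of the classic discrete AdaBoost update for binary labels in {-1, +1}. It is a simplified illustration of the idea described above, not the exact implementation used by scikit-learn:

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One weight update of discrete AdaBoost (labels in {-1, +1})."""
    # weighted error of the current base learner
    error = np.sum(weights * (y_pred != y_true)) / np.sum(weights)
    # learner weight: the better the learner, the larger alpha
    alpha = 0.5 * np.log((1 - error) / error)
    # increase the weight of misclassified samples, decrease the rest
    weights = weights * np.exp(-alpha * y_true * y_pred)
    return weights / np.sum(weights), alpha

# toy example: 5 samples, uniform initial weights, one misclassification
w = np.full(5, 1 / 5)
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])  # the third sample is wrong
w, alpha = adaboost_round(w, y_true, y_pred)
print(w, alpha)  # the misclassified sample ends up with the largest weight
```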
Scikit-learn also provides an implementation of the AdaBoost technique; let’s see how to perform a basic implementation:
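Again, a minimal sketch of such an implementation could look like this; the train/test split, the random_state and the use of depth-1 trees (decision stumps) are assumptions, so the exact scores may differ from the ones reported below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

# 5 shallow decision trees trained sequentially, each one focusing on
# the samples the previous one got wrong
# (older scikit-learn versions use base_estimator instead of estimator)
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=5,
    random_state=23,
)
adaboost.fit(X_train, y_train)

print("Train score:", adaboost.score(X_train, y_train))
print("Test score:", adaboost.score(X_test, y_test))
```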
As we can see, the base learner that we are using is a decision tree (a decision tree is the suggested choice, but you can try some other ML algorithm). We are also defining only 5 estimators for the sequence of base learners, which is enough for the toy dataset that we are trying to predict. Running this, we would obtain the following results:
Train score: 0.9694835680751174
Test score: 0.958041958041958
Great, we have already seen in a general way how AdaBoost works. Now let’s see how Gradient Boosting works and how we can implement it.
Gradient Boosting
The Gradient Boosting method does not implement a vector of weights the way AdaBoost does. As its name implies, it relies on the calculation of the gradient to optimize a given loss function. The core idea of Gradient Boosting is to minimize the residuals of each base learner sequentially; this minimization is carried out through the gradient of a specific loss function (either for classification or regression). Each base learner added to the sequence thus minimizes the residuals determined by the previous base learner. This is repeated until the loss is as close to zero as possible or until a specified number of base learners is reached. Finally, to make a prediction, each of the base learners is fed with the test data, and their outputs are scaled (by the learning rate) and summed to generate the final prediction.
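As a rough illustration of this loop, here is a minimal sketch of gradient boosting for regression with squared loss, where the negative gradient is simply the residual y - F(x). It is a simplified sketch of the idea, not scikit-learn’s actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_estimators=5, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for regression with squared loss."""
    # start from a constant prediction (the mean of the targets)
    f0 = np.mean(y)
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        # for squared loss, the negative gradient is just the residual
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # the new tree's scaled output is added to the running prediction
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boosting_predict(X, f0, trees, learning_rate=0.1):
    # final prediction: initial constant plus the scaled sum of all trees
    return f0 + learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```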
Just like bagging and AdaBoost, scikit-learn provides an implementation of Gradient Boosting; let’s see how to make a basic implementation:
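Once more, a minimal sketch of such an implementation could look like the following; the train/test split and the random_state are assumptions, so the exact scores may differ from the ones reported below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

# 5 trees of depth 2, each one fitted to the residual errors
# of the ensemble built so far
gradient_boosting = GradientBoostingClassifier(
    n_estimators=5,
    max_depth=2,
    learning_rate=0.1,
    random_state=23,
)
gradient_boosting.fit(X_train, y_train)

print("Train score:", gradient_boosting.score(X_train, y_train))
print("Test score:", gradient_boosting.score(X_test, y_test))
```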
Gradient Boosting works with decision trees by default, which is why the implementation does not define a specific base learner. We are specifying that each tree in the sequence will have a maximum depth of 2, that the number of trees will be 5, and that the learning rate will be 0.1. Running this, we obtain:
Train score: 0.9906103286384976
Test score: 0.965034965034965
Fantastic, with this we finish this exploration of bagging, boosting and some boosting implementations. That’s it!
Conclusion
In this blog we have seen two of the most widely implemented Ensemble Learning techniques.
As we have seen, bagging is a technique that draws random samples with replacement to train “n” base learners, which allows the model to be trained in parallel. It is because of this random sampling that bagging mostly helps to reduce variance. On the other hand, boosting is built sequentially, where each model in the sequence focuses on the errors of the previous base learner; boosting therefore mainly helps to reduce bias, although it is more prone to over-fitting the model.
References
[1] https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
[2] https://arxiv.org/pdf/0804.2752.pdf
[3] https://link.springer.com/article/10.1023/A:1018054314350
[4] https://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/boosting-survey.pdf
[5] https://www.face-rec.org/algorithms/Boosting-Ensemble/decision-theoretic_generalization.pdf