This article gives a high-level idea of Boosting and how it is implemented, in layman's terms, as a starting point before reading more formal theory about Boosting and Gradient Boosting.
What is Boosting
Boosting can be thought of as a methodology or framework in which we take 'n' models that individually do not perform very well (called weak learners, e.g. a decision tree of depth 1) and combine them to produce a strong model. How many models should be combined? That can be decided using cross-validation.
Why Decision Trees
These weak models can be anything: linear regression, logistic regression, neural networks, decision trees, etc. Boosting is almost always explained through decision trees because that makes it easy to understand how it works.
How are Models Combined
As discussed earlier, the idea of boosting is to combine many models. The first model is a weak model that just returns the average value. For example, say we are predicting the age of a person: the first model then always predicts the average of the ages in the given dataset, whatever datapoint we pass in. Such a model is called a Weak Learner.
Now, the second model is created so that its features are the same as the first model's, but its labels are the errors made by the first model.
The third model is created the same way: the features are unchanged, but the labels are the errors made by the second model.
And so on, until the errors are small enough. We then sum up all the models to get the final model.
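The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `fit_stump`, `fit_boosted` and `squared_error` are hypothetical helper names, the feature values in `xs` are made up, and a one-split tree on a single feature stands in for the weak learner.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree (one split) on a single numeric feature."""
    mean = sum(ys) / len(ys)
    best = None  # (squared error, threshold, left mean, right mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to a constant
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_models=3):
    """Model 1 is the average; each later model is fit on the remaining errors."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_models - 1):
        residuals = [y - p for y, p in zip(ys, preds)]  # labels for the next model
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + stump(x) for p, x in zip(preds, xs)]
    # Final model = sum of all the individual models.
    return lambda x: base + sum(s(x) for s in stumps)


def squared_error(model, xs, ys):
    """Total squared error of a model on the given datapoints."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


xs = [5, 7, 9, 1]        # hypothetical feature values, made up for this sketch
ys = [28, 30, 32, 18]    # toy target values (ages)

# The training error should shrink as more models are added.
for n in (1, 2, 3):
    model = fit_boosted(xs, ys, n_models=n)
    print(n, squared_error(model, xs, ys))
```

Each new stump only ever sees the errors left over by the models before it, which is why adding its output to the running prediction moves the prediction closer to the target.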
The question is why we need to create a model that is based on the error.
Why the Model is Created with Errors (Residuals) as Labels
The idea is that when we have the predicted value as well as the size of the error, we can add the two together to arrive at the expected value.
This will be very clear when we test the algorithm step by step using a simple dataset.
To illustrate, consider a small dataset of four people whose ages are 28, 30, 32 and 18.
The first model will predict the average of all the ages: (28+30+32+18)/4 = 27
The errors made by the first model, i.e. Age minus MODEL 1, are 28-27 = 1, 30-27 = 3, 32-27 = 5 and 18-27 = -9.
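The average and the errors made by Model 1 can be checked with a few lines of Python. The variable names here are illustrative only; the ages are the four values used in the average above.

```python
ages = [28, 30, 32, 18]  # the four target values from the example

# Model 1: always predicts the mean of the training targets.
model1 = sum(ages) / len(ages)

# Errors (residuals): actual age minus Model 1's prediction.
# These become the labels that Model 2 is trained on.
errors_model1 = [age - model1 for age in ages]

print(model1)         # 27.0
print(errors_model1)  # [1.0, 3.0, 5.0, -9.0]
```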
We now create another model, say Model 2, by training on the same datapoints with "ERROR MADE BY MODEL 1" as the label, and let's say Model 2 gives us a prediction for each datapoint (0.8 for the first one).
Calculate the error made by Model 2, i.e. the error made by Model 1 minus Model 2's prediction.
We now create another model, say Model 3, by training on the same datapoints with "ERROR MADE BY MODEL 2" as the label, and let's say Model 3 gives us a prediction for each datapoint (0.19 for the first one).
Calculate the error made by Model 3 in the same way.
Let's stop here. At this point we have created 3 models sequentially. The number of models is a hyperparameter that we can tune.
Now let's test the theory of Boosting, which says FINALMODEL = MODEL1 + MODEL2 + MODEL3
For the first datapoint, let's take MODEL1 + MODEL2 + MODEL3 and see how well the model performs with respect to the actuals. PREDICTION = 27 + 0.8 + 0.19 = 27.99, which is very close to the actual value 28. Let's check the other datapoints as well.
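The sum for the first datapoint can be reproduced directly. The three numbers are the ones quoted in the text. The variable names are illustrative, and the results are rounded only to hide floating-point noise.

```python
# Outputs of the three models for the first datapoint, as quoted in the text.
model1_pred = 27     # the average age
model2_pred = 0.8    # Model 2's estimate of Model 1's error
model3_pred = 0.19   # Model 3's estimate of Model 2's error

actual = 28
prediction = model1_pred + model2_pred + model3_pred

print(round(prediction, 2))           # 27.99
print(round(actual - prediction, 2))  # 0.01
```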
If we compare the actuals with the predictions, we can see the power of Boosting.
So, to generalize, the task of boosting is to predict how far above or below the current prediction the target label is, and to add or subtract that amount when we predict.
What is our objective?
Now that we understand how Boosting works, the objective is to train Model 2, Model 3 and so on such that they predict how far the target is from the average value. The better Model 2 and Model 3 perform, the higher the accuracy of the final prediction.
One boosting framework, Gradient Boosting, is widely used in Kaggle competitions because of its high predictive power. There are other variations as well, such as weighting the individual models based on how well they perform, which improve on the above setup.