Gradient Boosting in Machine Learning
Gradient boosting is one of the most powerful techniques for building predictive models. Gradient boosting classifiers combine the AdaBoost method with weighted minimization, after which the classifiers and weighted inputs are recalculated. The objective of a gradient boosting classifier is to minimize the loss, that is, the difference between the actual class value of a training example and the predicted class value. You don’t need to understand the loss-reduction procedure in detail to use the classifier, but it operates similarly to gradient descent in a neural network.
Refinements to this process led to Gradient Boosting Machines.
In a Gradient Boosting Machine, every time a new weak learner is added to the model, the weights of the previous learners are frozen in place and left unchanged as the new learners are introduced. This is distinct from AdaBoost, where the weights on the training examples are adjusted as new learners are added.
The power of gradient boosting machines comes from the fact that they are not limited to binary classification: they can also be used on multi-class classification problems and even regression problems.
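As a rough illustration of that flexibility, the sketch below fits one model to a three-class problem and another to a regression problem. It assumes Scikit-learn's GradientBoostingClassifier and GradientBoostingRegressor; the datasets (the bundled iris data and a synthetic regression set) and all parameter values are illustrative choices, not part of the original discussion.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Multi-class classification: the iris dataset has three classes.
X_cls, y_cls = load_iris(return_X_y=True)
multi_clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
multi_clf.fit(X_cls, y_cls)
class_predictions = multi_clf.predict(X_cls[:5])  # predicted class labels

# Regression: a synthetic dataset with a continuous target (assumed for illustration).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = GradientBoostingRegressor(n_estimators=50, random_state=0)
reg.fit(X_reg, y_reg)
value_predictions = reg.predict(X_reg[:5])  # predicted real values
```

The classifier returns class labels while the regressor returns real numbers, but both are trained and used in the same way.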
Theory Behind Gradient Boosting
The Gradient Boosting Classifier depends on a loss function. A custom loss function can be used, and many standardized loss functions are supported by gradient boosting classifiers, but the loss function has to be differentiable.
Classification algorithms frequently use logarithmic loss, while regression algorithms can use squared error. A new boosting algorithm does not have to be derived for each loss function; rather, any differentiable loss function can be applied to the system.
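To make the differentiability requirement concrete, here is a minimal sketch of the two losses mentioned above together with their derivatives with respect to the prediction. The function names are hypothetical, the squared error is halved so that its derivative is tidy, and the logarithmic loss is written for a single binary example whose raw score is assumed to be a log-odds value.

```python
import math


def squared_error(y_true, y_pred):
    """Halved squared-error loss for one regression example."""
    return 0.5 * (y_true - y_pred) ** 2


def squared_error_gradient(y_true, y_pred):
    """Derivative of the halved squared error with respect to the prediction."""
    return y_pred - y_true


def log_loss(y_true, raw_score):
    """Logarithmic loss for one binary example (y_true is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid turns log-odds into a probability
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def log_loss_gradient(y_true, raw_score):
    """Derivative of the logarithmic loss with respect to the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true
```

In both cases the negative of the gradient points from the current prediction towards the true value, which is the signal the next weak learner is trained on.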
Gradient boosting systems have two other necessary parts: a weak learner and an additive component. They use decision trees as their weak learners, specifically regression trees, which output real values. Because the outputs are real values, the outputs of the regression trees can be added together as new learners join the model, so each new tree can correct errors in the predictions made so far.
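A small piece of arithmetic shows why real-valued outputs matter. The numbers below are made up for illustration: the first tree's output and the second tree's output are hypothetical values, not the result of fitting real trees.

```python
targets = [3.0, 5.0, 8.0]

# Hypothetical output of a first regression tree: the same value for every example.
first_output = [5.0, 5.0, 5.0]
residuals = [t - p for t, p in zip(targets, first_output)]  # [-2.0, 0.0, 3.0]

# Hypothetical output of a second tree trained to approximate those residuals.
second_output = [-1.5, 0.0, 2.5]

# Because both outputs are real numbers, they can simply be summed.
combined = [a + b for a, b in zip(first_output, second_output)]  # [3.5, 5.0, 7.5]

error_before = sum((t - p) ** 2 for t, p in zip(targets, first_output))  # 13.0
error_after = sum((t - p) ** 2 for t, p in zip(targets, combined))       # 0.5
```

The second tree never has to predict the targets themselves; it only has to predict what the first tree got wrong.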
The additive component of a gradient boosting model comes from the fact that trees are added to the model one at a time; when this occurs, the existing trees are not changed and their values remain fixed.
A procedure similar to gradient descent is used to minimize the loss as trees are added. The loss of the current model is calculated, and a new tree is then fitted, with its parameters chosen to reduce that residual loss, which corresponds to taking a step along the negative gradient of the loss.
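One common way to write this procedure down is given below as a simplified sketch. The notation is an assumption introduced here and does not come from the text above: L is the differentiable loss, F_m is the model after m trees, h_m is the m-th tree, r_{i,m} is the negative gradient for example i, and \nu is the learning rate. The optional per-leaf line search used in some formulations is left out.

```latex
\begin{align}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{i,m} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}} \\
h_m &= \text{regression tree fitted to } \{(x_i, r_{i,m})\}_{i=1}^{n} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x)
\end{align}
```

For the halved squared error L(y, F) = \tfrac{1}{2}(y - F)^2, the negative gradient is r_{i,m} = y_i - F_{m-1}(x_i), which is simply the residual of the current model.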
The new tree’s output is then added to the output of the previous trees in the model. This process is repeated until a previously specified number of trees is reached or the loss falls below a chosen threshold.
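The whole loop can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions, not how any particular library implements it: the loss is squared error, the weak learner is a one-split regression tree (a stump) over a single feature, the input must contain at least two distinct feature values, and every name (fit_stump, stump_predict, boost, predict) is hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.1, tolerance=1e-6):
    """Squared-error gradient boosting with stumps as weak learners."""
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):  # stop after a fixed number of trees...
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
        if loss < tolerance:  # ...or once the loss is small enough
            break
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        stump = fit_stump(xs, residuals)
        stumps.append(stump)  # earlier stumps are never modified
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys)
fitted = [predict(model, x) for x in xs]
```

Each pass computes the residuals of the current model, fits one new stump to them, and adds its scaled output to the running prediction, leaving every earlier stump untouched.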
Steps to Gradient Boosting
In order to implement a gradient boosting classifier, we’ll need to carry out a number of different steps. We’ll need to:
Fit the model
Tune the model’s parameters and hyperparameters
Make predictions
Interpret the results
Fitting models with Scikit-learn is fairly easy, as we typically just have to call the fit() method after setting up the model.
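A minimal fitting sketch might look like the following. The synthetic dataset from make_classification, the 80/20 split, and the hyperparameter values are all assumptions chosen for illustration; in practice you would substitute your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used here only as a stand-in.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Set up the model, then fit it to the training data.
clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
clf.fit(X_train, y_train)
```

After fit() returns, the classifier holds one regression tree per boosting stage in its estimators_ attribute.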
However, tuning the model’s hyperparameters requires some active decision-making on our part. There are various hyperparameters we can tune to try to get the best accuracy for the model. One of the ways we can do this is by altering the learning rate of the model. We’ll want to check the performance of the model at different learning rates, comparing accuracy on the training set with accuracy on held-out validation data, and then use the best learning rate to make predictions.
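One simple way to compare learning rates is a loop like the sketch below. The candidate rates, the synthetic data, and the other hyperparameters are illustrative assumptions. The loop reports accuracy on the training set and also on a held-out validation split, which is an addition made here because training accuracy alone tends to favour overfitted models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data and a train/validation split, assumed for illustration.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

learning_rates = [0.05, 0.1, 0.25, 0.5, 1.0]
results = []
for lr in learning_rates:
    clf = GradientBoostingClassifier(
        n_estimators=50, learning_rate=lr, max_depth=2, random_state=0
    )
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)  # mean accuracy on the training set
    val_acc = clf.score(X_val, y_val)        # mean accuracy on held-out data
    results.append((lr, train_acc, val_acc))
    print(f"learning_rate={lr}: train={train_acc:.3f}, validation={val_acc:.3f}")

# Pick the rate with the highest validation accuracy.
best_lr, _, best_val_acc = max(results, key=lambda row: row[2])
```

A large gap between the training and validation columns for a given rate is a sign that the model at that rate is overfitting.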
Predictions can be made in Scikit-learn very simply by calling the predict() method after fitting the classifier. You’ll want to predict on the features of the testing dataset, and then compare the predictions to the actual labels. The process of evaluating a classifier typically involves checking the accuracy of the classifier and then tweaking the hyperparameters of the model until the classifier reaches an accuracy that the user is satisfied with.
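Prediction and evaluation can be sketched as follows. As before, the synthetic dataset, the split, and the hyperparameter values are assumptions for illustration; accuracy_score from sklearn.metrics is used to compare the predictions with the actual test labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = GradientBoostingClassifier(learning_rate=0.1, random_state=0)
clf.fit(X_train, y_train)

# Predict on the test features, then compare with the actual labels.
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```

If the accuracy is not satisfactory, the hyperparameters passed to GradientBoostingClassifier can be adjusted and the fit, predict, and score cycle repeated.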