Decision Trees, Random Forests and XGBoost

Decision trees are one of the most intuitive models in a world of perplexing and obscure ML models, because of the similarity between the human decision making process and a decision tree.

A decision tree can be visualized, and we can actually see how the computer arrived at a decision, which is rather difficult with most other models. Hence, it is also called a white box model.

The purpose of this post is to explore some of the intuition behind building a standalone decision tree and its ensemble variants, \( Random Forests (RF) \) and \( Extreme Gradient Boosting (XGB) \).

Decision Trees:

What is a Decision Tree?
A decision tree is a tree in which each node denotes a decision, and each branch the path to take depending on the decision made.
Decision trees are versatile and are widely used for both classification and regression, and are hence called CART (Classification and Regression Trees).
One of the main advantages of a decision tree is its ability to handle missing data gracefully.
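For concreteness, here is a minimal scikit-learn sketch of the two CART flavours; the toy data and the max_depth setting are made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy classification: predict a label from a single feature.
X_cls = [[1], [2], [8], [9]]
y_cls = ["low", "low", "high", "high"]
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[3]]))          # -> ['low']

# Toy regression: predict a numeric target from the same kind of feature.
X_reg = [[1], [2], [8], [9]]
y_reg = [1.1, 1.9, 8.2, 9.1]
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[8.5]]))        # -> mean of the targets in the matching leaf
```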

How is a decision tree built?
A decision tree is built using one of two metrics: \( Information Gain \) or \( Standard Deviation \). Information gain is used to build a classification tree, and the reduction in standard deviation is used to build a regression tree.

In this post, I will be using a regression tree as an example.

High level steps in building a regression tree are as follows (a small sketch of the split criterion is given after the list):

1. Compute the standard deviation of the target variable for the data points at the current node.
2. For each candidate split on each feature, compute the weighted standard deviation of the target in the resulting subsets.
3. Choose the split that produces the largest drop in standard deviation (the standard deviation reduction).
4. Repeat the process on each subset until a stopping criterion is met (for example, the standard deviation falls below a threshold or too few data points remain); each leaf then predicts the mean of its target values.
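As a rough illustration of step 3, here is a minimal NumPy sketch of the standard deviation reduction for a single candidate split; the data and the threshold are made up for the example:

```python
import numpy as np

def std_reduction(y, mask):
    """Drop in standard deviation when y is split into y[mask] and y[~mask]."""
    n = len(y)
    left, right = y[mask], y[~mask]
    weighted = (len(left) / n) * left.std() + (len(right) / n) * right.std()
    return y.std() - weighted

# Toy target and a single numeric feature.
y = np.array([10.0, 12.0, 11.5, 30.0, 32.0, 29.0])
feature = np.array([1.0, 2.0, 2.5, 7.0, 8.0, 9.0])

# Candidate split "feature < 5": the larger the reduction, the better the split.
print(std_reduction(y, feature < 5.0))
```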
How does a decision tree handle missing data?
During the process of building the tree, a decision is made at each decision node for data points with missing values. All such data points are first grouped with the left subtree and the drop in standard deviation is calculated; the same is then done by grouping them with the right subtree. The branch with the higher drop in standard deviation is assigned as the path to be followed by missing data points.
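Continuing the toy sketch above (std_reduction is the helper defined earlier, not a library function), routing missing values could look roughly like this:

```python
import numpy as np

def std_reduction(y, mask):
    """Same helper as in the previous sketch."""
    n = len(y)
    left, right = y[mask], y[~mask]
    return y.std() - ((len(left) / n) * left.std() + (len(right) / n) * right.std())

def route_missing(y, feature, threshold):
    """Send rows with a missing feature value left or right,
    whichever gives the larger drop in standard deviation."""
    missing = np.isnan(feature)
    goes_left = np.zeros(len(y), dtype=bool)
    goes_left[~missing] = feature[~missing] < threshold
    drop_if_left = std_reduction(y, goes_left | missing)   # missing rows join the left subtree
    drop_if_right = std_reduction(y, goes_left)             # missing rows join the right subtree
    return "left" if drop_if_left >= drop_if_right else "right"

y = np.array([10.0, 12.0, 11.5, 30.0, 32.0, 29.0])
feature = np.array([1.0, np.nan, 2.5, 7.0, np.nan, 9.0])
print(route_missing(y, feature, threshold=5.0))
```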

What are ensembles and why do we need them?
Ensembles are combinations of several learning models, and are observed in practice to perform better than standalone models. When the individual models are trained on random resamples of the data and their predictions are averaged, the practice is called bagging (bootstrap aggregating).

One of the main disadvantages of a standalone model like a decision tree, which is addressed by an ensemble, is that it is prone to overfitting (high variance). Ensembles average out many unbiased (or low-bias) models fit to noisy data, producing a combined model with low variance.
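As a rough sketch of the bagging idea, assuming a synthetic 1-D regression problem and arbitrary hyper-parameters, one can fit several trees on bootstrap resamples and average their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: a noisy sine wave.
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Bagging by hand: each tree sees a bootstrap resample (sampling with replacement).
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Averaging the individual (high-variance) predictions gives a lower-variance estimate.
X_test = np.linspace(0, 6, 5).reshape(-1, 1)
print(np.mean([t.predict(X_test) for t in trees], axis=0))
```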

Two such ensembles for decision trees are Random Forest and XGBoost.

The fundamental issue that both RF and XGB try to address is that a single decision tree is a weak learner: it is prone to overfitting and depends heavily on the particular training sample it was given. Hence, by combining a number of weak learners, we can build a strong learner.

The other ensemble, XGBoost, uses gradient boosting to build its trees. XGB models are often used in cases where the data contains high collinearity. This is called multicollinearity, where two or more features are highly correlated and one can be predicted with reasonable accuracy given the others.

Unlike RF, where the trees are built in parallel with no dependence between them, an XGB model builds its trees sequentially (and is hence more computationally expensive). Each new tree learns from the ones before it and is fit so that the model better captures the distribution of the target variable, i.e. the errors are propagated from one tree to the next.
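A minimal side-by-side sketch of the two ensembles, using scikit-learn's RandomForestRegressor and the xgboost package's XGBRegressor; the synthetic data and the hyper-parameters are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Random Forest: independent trees on bootstrap samples, predictions averaged.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# XGBoost: trees built sequentially, each new tree correcting the errors so far.
xgb = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=0).fit(X, y)

X_test = np.array([[1.0], [3.0], [5.0]])
print("RF :", rf.predict(X_test))
print("XGB:", xgb.predict(X_test))
```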

How are these models validated?
RF models are usually validated using Out-Of-Bag (OOB) validation, and XGB models using k-fold cross validation, which is explored in another post.
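A rough sketch of both validation approaches, reusing the synthetic data from the earlier sketches (the hyper-parameters and k = 5 are arbitrary choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Out-of-bag validation: each tree is scored on the samples left out of its
# bootstrap draw, so no separate hold-out set is needed.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("RF OOB R^2:", rf.oob_score_)

# k-fold cross-validation (k = 5) for the boosted model.
xgb = XGBRegressor(n_estimators=200, learning_rate=0.1)
print("XGB 5-fold R^2:", cross_val_score(xgb, X, y, cv=5).mean())
```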

Given the simplicity and the intuitive nature of these models, they are among the most widely used models in competitive ML, for example on Kaggle. In fact, XGBoost models have won 65% of the competitions on Kaggle.