Decision Trees, Random Forests and XGBoost
26 Feb 2017 Decision Trees are one of the most intuitive models in a world of otherwise perplexing and obscure ML models. This is because of the similarity between the human decision-making process and a decision tree. A decision tree can be visualized, so we can actually see how the computer arrived at a decision, which is rather difficult with other models. Hence, it is also called a white box model.
The purpose of this post is to explore some of the intuition behind building a stand-alone decision tree and its ensemble variants, \( Random Forests (RF) \) and \( Extreme Gradient Boosting (XGB) \).
Decision Trees:
What is a Decision Tree?
A decision tree is a tree in which each internal node denotes a decision on an attribute, each branch the path to take depending on the outcome of that decision, and each leaf the resulting output.
Decision trees are versatile: they are widely used for both classification and regression, and in that form are called CART (Classification and Regression Trees).
One of the main advantages of a decision tree is its ability to handle missing data gracefully.
How is a decision tree built?
There are two metrics on which a decision tree is built, \( Information Gain \) and \( Standard Deviation \). Information gain is used to build a classification tree and standard deviation is used to build a regression tree.
In this post, I will be using a regression tree as an example.
- A decision tree is built in a top-down approach.
- The goal of a tree is to partition the data into subsets such that each subset contains homogeneous values (the same target class in the case of classification, and minimum standard deviation in the case of regression). $$S(T) = \sqrt{\frac{\sum (x-\mu)^{2}}{N}},\quad \mu \rightarrow \text{mean}$$ $$\text{Entropy: } H(X) = -\sum p(X)\log p(X)\\ \text{Information Gain: } I(X,Y) = H(X) - H(X \mid Y)$$
- In the case of regression, if the sample is homogeneous, i.e. all values are equal, the standard deviation is 0.
- In the case of regression, the goal is to decrease the standard deviation after the data is split into subsets. Hence, the attribute that provides the largest decrease in standard deviation is chosen.
- In the case of classification, the attribute that provides the highest information gain is chosen.
- In simple words, a decision tree can be viewed as a decision table. If the tree is allowed to grow to its full depth, each leaf node will have only one sample. If the training data contains all possible combinations of the independent variables, the decision tree becomes a representation of a truth table.
- Step 1: Calculate the standard deviation of the target (dependent) variable \( S(T) \).
- Step 2:
- Split the data on all the attributes \( (X) \).
- Calculate the size-weighted standard deviation of the resulting subsets, \( S(T,X) \).
- Choose the attribute with the highest reduction (a minimal sketch of this split-selection step is shown after this list).
- \( SDR(T,X) = S(T) - S(T,X) \)
- Step 3: Split the data according to the decision attribute selected in the previous step.
- Step 4: Repeat the process recursively.
- Step 5: We can repeat the process recursively until each subset of the data has 0 standard deviation. This is not feasible at a large scale, and it produces a model with very high variance. Hence, we need a termination criterion. For example, the standard deviation at each leaf node must fall below a small threshold (say, 5% of the standard deviation of the full data set), or each node must contain a minimum number of data points.
- Step 6: When the number of data points at a leaf node is more than one, the average of the target variable is taken as the output at that node.
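To make Steps 1 to 3 concrete, here is a minimal sketch in Python (using NumPy, with a hypothetical toy data set and column names) that computes the standard deviation reduction for each candidate attribute and picks the best split:

```python
import numpy as np

def std_reduction(target, groups):
    """SDR(T, X): S(T) minus the size-weighted standard deviation
    of the subsets produced by splitting on an attribute X."""
    total_std = np.std(target)
    n = len(target)
    weighted_std = sum(len(g) / n * np.std(g) for g in groups if len(g) > 0)
    return total_std - weighted_std

def best_split(data, target, attributes):
    """Pick the attribute whose split gives the highest SDR.
    `data` maps attribute name -> array of (categorical) values."""
    best_attr, best_sdr = None, -np.inf
    for attr in attributes:
        values = data[attr]
        groups = [target[values == v] for v in np.unique(values)]
        sdr = std_reduction(target, groups)
        if sdr > best_sdr:
            best_attr, best_sdr = attr, sdr
    return best_attr, best_sdr

# Hypothetical toy data: two categorical attributes, one numeric target
data = {
    "outlook": np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"]),
    "windy":   np.array(["yes", "no", "yes", "no", "yes", "no"]),
}
target = np.array([25.0, 30.0, 46.0, 45.0, 52.0, 23.0])
print(best_split(data, target, ["outlook", "windy"]))
```

The recursion in Steps 4 to 6 would then call best_split again on each subset until the termination criterion is met.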
How does a decision tree handle missing data?
During the process of building a tree, at each decision node, a decision is made for the missing data. All the data points with a missing value are first clubbed with the left subtree and the drop in standard deviation is calculated. The same is done by combining the missing data points with the right subtree. The branch with the higher drop in standard deviation is assigned as the path to be followed by missing data points.
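A rough, self-contained sketch of that idea (the split is simplified to a fixed left/right partition given by hypothetical index arrays):

```python
import numpy as np

def weighted_std(target, groups):
    # Size-weighted standard deviation of the subsets, i.e. S(T, X)
    n = len(target)
    return sum(len(g) / n * np.std(g) for g in groups if len(g) > 0)

def branch_for_missing(target, left_idx, right_idx, missing_idx):
    """Club the rows with missing values with each branch in turn and
    keep the assignment that leaves the lower weighted standard
    deviation, i.e. the higher drop from S(T)."""
    with_left = weighted_std(target, [target[np.r_[left_idx, missing_idx]],
                                      target[right_idx]])
    with_right = weighted_std(target, [target[left_idx],
                                       target[np.r_[right_idx, missing_idx]]])
    return "left" if with_left <= with_right else "right"

# Hypothetical example: rows 0-1 go left, 2-3 go right, row 4 has a missing value
target = np.array([10.0, 12.0, 30.0, 31.0, 11.0])
print(branch_for_missing(target, left_idx=[0, 1], right_idx=[2, 3], missing_idx=[4]))
```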
What are ensembles and why do we need them?
Ensembles are combinations of several learning models, which in practice often perform better than any single stand-alone model. When the individual models are trained on random (bootstrapped) samples of the data and their outputs are aggregated, this practice is called bagging.
One of the main disadvantages of a stand-alone model like a decision tree, which is addressed by an ensemble, is that it is prone to overfitting (high variance). Ensembles average out the noise in the data across many unbiased (or low-bias) models to create a low-variance model.
Two such ensembles for decision trees are Random Forest and XGBoost.
The fundamental issue that both RF and XGB try to address is that decision trees are weak learners (prone to overfitting and heavily dependent on the training distribution). Hence, by combining a number of weak learners, we can build a strong learner.
- RF consists of a large number of decorrelated decision trees.
- Given a training data set, we create a number of subsets randomly. These subsets can be based on a random selection of data points (rows) or of features (variables), and they can overlap.
- Build a decision tree for each of these subsets.
- In order to get a prediction from a RF, the outputs of the individual trees are pooled to arrive at the decision (a minimal usage sketch follows this list).
- There can be a number of pooling mechanisms, the most common being majority voting: the class predicted by the majority of the trees is chosen (for regression, the outputs are averaged). We can also use a weighted average, where certain trees are given more weight than others.
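As a usage example, here is a minimal sketch with scikit-learn's RandomForestRegressor on synthetic data (scikit-learn is assumed to be installed; the parameters are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real data set
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is grown on a bootstrap sample; max_features decorrelates the trees
# by restricting the attributes considered at each split.
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# For regression, the "poll" is simply the average of the individual tree outputs.
print("R^2 on held-out data:", rf.score(X_test, y_test))
```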
Unlike RF, where the trees are built in parallel with no correlation between them, the XGB model builds the trees sequentially (and is hence more computationally expensive). It learns from the errors of the trees built so far and fits the next tree so that the model better captures the distribution of the target variable, i.e. the errors are propagated from one tree to the next.
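To make the sequential idea concrete, here is a stripped-down boosting loop with squared-error loss, built from plain scikit-learn decision trees. It is only a caricature of what XGBoost actually does (no regularisation, no second-order gradients), but it shows how each new tree is fit to the residual errors of the ensemble built so far:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean(), dtype=float)  # start from the mean
trees = []

for _ in range(n_rounds):
    residuals = y - prediction              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                  # the next tree learns the errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```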
How are these models validated?
RF models are usually validated using Out-Of-Bag (OOB) validation, while XGB models are validated using k-fold cross validation, which is explored in another post.
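A brief sketch of both validation styles, assuming scikit-learn and the xgboost package are installed (parameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)

# Out-of-bag validation: each tree is scored on the samples left out of its bootstrap.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=1)
rf.fit(X, y)
print("RF OOB R^2:", rf.oob_score_)

# k-fold cross validation for the boosted model
xgb = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
print("XGB 5-fold R^2:", cross_val_score(xgb, X, y, cv=5).mean())
```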
Given the simplicity and intuitive nature of these models, they are among the most widely used models in competitive ML, for example on Kaggle. In fact, XGBoost models have reportedly won around 65% of the competitions on Kaggle.