Random forests consist of many individual decision trees, each built on a random sample of the training data. They are typically more accurate than a single decision tree. The following figure shows how the decision boundary becomes more accurate as more trees are added.
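As a minimal sketch of this accuracy gap, we can compare a single unpruned tree with a forest using scikit-learn. The dataset here is a synthetic two-moons problem chosen for illustration, not the data behind the figure.

```python
# Compare a single decision tree with a random forest on synthetic data.
# The dataset and parameters are illustrative, not from the figure above.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The forest of 100 trees typically scores higher on held-out data.
print("single tree test accuracy:", tree.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```

On noisy data like this, the ensemble usually recovers a few percentage points of test accuracy that the single tree loses to overfitting.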
Here we’ll provide two intuitive reasons why random forests outperform single decision trees.
Higher resolution in the feature space
Trees are unpruned. While a single decision tree (e.g., CART) is often pruned, each random forest tree is fully grown and unpruned, so the feature space is naturally split into more, smaller regions.
Trees are diverse. Each random forest tree is learned on a random sample, and at each node, a random subset of the features is considered for splitting. Both mechanisms create diversity among the trees.
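The two randomization mechanisms can be sketched directly with NumPy; the sizes below are illustrative.

```python
# Sketch of the two sources of tree diversity in a random forest:
# (1) each tree sees a bootstrap sample of the rows, and
# (2) each split considers only a random subset of the features.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 8, 4

# 1) Bootstrap: sample row indices with replacement.
#    Some rows appear twice, others not at all.
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)
print("bootstrap rows:", sorted(bootstrap_rows.tolist()))

# 2) Feature subsampling: at each node, consider only
#    sqrt(n_features) randomly chosen features.
k = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=k, replace=False)
print("features considered at this node:", candidate_features)
```

Because both draws differ from tree to tree (and from node to node, for the feature subset), no two trees see quite the same view of the data.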
Two random trees, each with one split, are illustrated below. Each tree splits the space into two regions that can be labeled differently. Combining the two trees yields four regions, each of which can receive its own label.
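The four-region claim can be checked with a tiny sketch. The two depth-1 trees ("stumps") and their thresholds below are illustrative choices, not the ones in the figure.

```python
# Two one-split trees jointly partition the plane into four regions.
# The thresholds (x1 >= 3, x2 >= 3) and test points are illustrative.
import numpy as np

def stump_x1(p):
    """Region id from a single split on x1."""
    return int(p[0] >= 3)

def stump_x2(p):
    """Region id from a single split on x2."""
    return int(p[1] >= 3)

points = np.array([[1, 1], [1, 5], [5, 1], [5, 5]])
regions = {(stump_x1(p), stump_x2(p)) for p in points}
print("distinct combined regions:", len(regions))  # 4
```

Each tree alone can only distinguish two regions; the pair of region ids (0 or 1 from each stump) distinguishes four.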
Unpruned and diverse trees lead to a high resolution in the feature space. For continuous features, it means a smoother decision boundary, as shown in the following.
Handling Overfitting
A single decision tree needs pruning to avoid overfitting. The following shows the decision boundary from an unpruned single tree. The boundary fits the training points closely but makes obvious mistakes (overfitting).
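The overfitting of an unpruned single tree shows up as perfect training accuracy with weaker test accuracy. A minimal sketch, with an illustrative dataset and a depth limit standing in for pruning:

```python
# An unpruned tree memorizes the training data (train accuracy 1.0)
# but generalizes worse; limiting depth (a simple form of pruning)
# trades training fit for a simpler boundary. All settings illustrative.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

print("unpruned train/test:", unpruned.score(X_tr, y_tr), unpruned.score(X_te, y_te))
print("pruned   train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

The unpruned tree scores 1.0 on training data because it keeps splitting until every leaf is pure, carving out a region around each noisy point.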
So how can random forests build unpruned trees without overfitting? Let’s provide an explanation below.
For the two-class (blue and red) problem below, both splits x1=3 and x2=3 can fully separate the two classes.
The two splits, however, result in very different decision boundaries. These boundaries conflict with each other in some regions, so neither one alone is reliable there.
Now consider random forests. Each tree is trained on a bootstrap sample of size n drawn with replacement, so the probability that the red point is missing from a given sample is (1 - 1/n)^n, which is about e^-1 ≈ 0.37, or roughly 1/3.
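We can verify numerically that (1 - 1/n)^n settles near e^-1 as n grows:

```python
# Probability that a given point is absent from a bootstrap sample
# of size n is (1 - 1/n)^n, which approaches e^-1 as n grows.
import math

for n in (10, 100, 1000):
    p_missing = (1 - 1 / n) ** n
    print(f"n={n}: P(point missing) = {p_missing:.4f}")

print("limit e^-1 =", round(math.exp(-1), 4))  # 0.3679
```

Even for modest n, the value is already close to 0.37, which justifies the "roughly 1 out of 3 trees" figure used next.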
So roughly 1 out of 3 trees is built with all blue data and always predicts class blue. The other 2/3 of the trees have the red point in their training data. Since at each node a random subset of features is considered, we expect roughly half of those trees, i.e., 1/3 of all trees, to split on x1 and the remaining 1/3 to split on x2. The splits from the two types of trees are illustrated below.