#012 Machine Learning – Introduction to Random Forest
Highlights: Hello and welcome. In the previous post, we talked about the Decision Tree algorithm, an intuitive algorithm used to classify objects. In this post, we will talk about the Random Forest algorithm, an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees. We will provide an overview of the random forest algorithm and explain how it works. Furthermore, we will present the algorithm’s features and how it is employed in real-life applications. Finally, we will also point out the advantages and disadvantages of this algorithm. So, let’s begin.
Tutorial overview:
1. What is a random forest?
2. Classification in random forests
3. Random Forests algorithm in Python
1. What is a random forest?
A random forest is a machine learning algorithm that’s used to solve regression and classification problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.
A random forest algorithm is built on decision trees. It consists of many decision trees. The ‘forest’ generated by the random forest algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.
The random forest algorithm establishes the outcome based on the predictions of the individual decision trees: for regression it averages the outputs of the trees, while for classification it takes a majority vote. Increasing the number of trees generally makes the prediction more stable and accurate.
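To make the averaging idea concrete, here is a minimal sketch for the regression case, using scikit-learn’s RandomForestRegressor on synthetic data (the dataset and parameter values are illustrative assumptions, not part of the original post):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic regression data (not from the post)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each tree in the forest produces its own prediction;
# the forest's prediction is the average over all trees.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_reg, y_reg)
per_tree = np.stack([t.predict(X_reg[:1]) for t in forest.estimators_])
print(per_tree.mean())                # manual average over the individual trees
print(forest.predict(X_reg[:1])[0])   # matches the forest's own prediction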
So, the random forest algorithm can be used to improve the performance of the decision tree-based model. Now, let’s highlight some of the most important features of the random forest algorithm.
- It is typically more accurate than a single decision tree.
- It provides an effective way of handling missing data.
- It can produce a reasonable prediction without extensive hyperparameter tuning.
- It reduces the overfitting problem that single decision trees are prone to.
- In every random forest tree, a subset of features is selected randomly at the node’s splitting point.
To better understand how the random forest algorithm improves on the decision tree, let’s remind ourselves how the decision tree algorithm works.
A decision tree consists of three components: decision nodes, leaf nodes, and a root node. The algorithm divides a training dataset into branches, which are further segregated into other branches. This sequence continues until a leaf node is attained. The leaf node cannot be segregated further.
The nodes in the decision tree represent attributes that are used for predicting the outcome. Decision nodes provide a link to the leaves. The following diagram shows the three types of nodes in a decision tree.
The process of building a decision tree from a training set has a few steps. Let’s take a look at the overall process of what we need to do to build a decision tree (a minimal code sketch follows the list).
- Select the best feature
- Make that feature a decision node and break the dataset into smaller subsets.
- Start tree building by repeating this process recursively for each child until one of the following conditions is met:
- All the tuples belong to the same class (target value).
- There are no more remaining attributes.
- There are no more instances.
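The following is a minimal, hedged sketch of this recursive procedure: a toy Gini-based splitter written purely for illustration (the function names best_split and build_tree are our own, not from any library, and integer class labels are assumed):
import numpy as np

def gini(y):
    # Gini impurity of a set of integer class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Step 1: select the best feature/threshold by impurity reduction
    best = (None, None, gini(y))
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best[0], best[1]

def build_tree(X, y, depth=0, max_depth=5):
    # Stop when all samples share one class, no split helps, or the depth limit is reached
    if len(np.unique(y)) == 1 or depth == max_depth:
        return {'leaf': np.bincount(y).argmax()}
    f, t = best_split(X, y)
    if f is None:
        return {'leaf': np.bincount(y).argmax()}
    # Step 2: make the chosen feature a decision node and split the data
    mask = X[:, f] <= t
    # Step 3: recurse on each child subset
    return {'feature': f, 'threshold': t,
            'left': build_tree(X[mask], y[mask], depth + 1, max_depth),
            'right': build_tree(X[~mask], y[~mask], depth + 1, max_depth)}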
Now that we have reminded ourselves how the decision tree algorithm works, let’s see how we can use it to construct the random forest algorithm.
The main difference between the decision tree algorithm and the random forest algorithm is that in a random forest each tree is grown on a random sample of the data and considers only a random subset of the features when splitting a node, rather than always searching for the single best split on the full dataset. The random forest employs the bagging method to generate the required prediction.
The bagging method
The bagging method involves training each tree on a different bootstrap sample of the training data rather than on the single original dataset. A training dataset comprises observations and features that are used for making predictions. Because each tree sees different training data, the decision trees produce different outputs. These outputs are then aggregated: for classification the most frequent prediction (the majority vote) is selected as the final output, while for regression the outputs are averaged.
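As a quick illustration of bootstrap sampling (a hedged NumPy sketch with made-up toy data, not the post’s code), each tree would be trained on a sample of the same size drawn with replacement:
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
X_toy = np.arange(n_samples).reshape(-1, 1)        # toy data: one feature
y_toy = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])   # toy labels

# Draw one bootstrap sample: same size, sampled with replacement,
# so some rows repeat and others are left out ("out-of-bag").
idx = rng.integers(0, n_samples, size=n_samples)
X_boot, y_boot = X_toy[idx], y_toy[idx]
print(sorted(idx))   # note the duplicated and missing indices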
Now, let’s explain and illustrate the classification process in the random forests algorithm.
2. Classification in random forests
Classification in random forests employs an ensemble methodology to attain the outcome. The training data is fed to train various decision trees. This dataset consists of observations and features that will be selected randomly during the splitting of nodes.
A random forest system relies on various decision trees. Every decision tree consists of decision nodes, leaf nodes, and a root node. The leaf node of each tree is the final output produced by that specific decision tree. The selection of the final output follows the majority-voting system. In this case, the output chosen by the majority of the decision trees becomes the final output of the random forest system.
Let’s take an example of a training dataset consisting of various fruits such as bananas, apples, pineapples, and peaches.
The random forest classifier divides this dataset into subsets. These subsets are given to every decision tree in the random forest system. Each decision tree produces its specific output. For example, the prediction for trees 1 and 2 is apple.
However, the third tree has predicted banana as the outcome. The random forest classifier collects these votes and uses majority voting to provide the final prediction. The majority of the decision trees have chosen apple as their prediction. This makes the classifier choose apple as the final prediction.
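The majority vote over the individual tree predictions can be sketched in a few lines (a hedged toy example mirroring the fruit scenario; the list of predictions is our own assumption):
from collections import Counter

# Hypothetical predictions from three trees for one sample
tree_predictions = ['apple', 'apple', 'banana']

# The class predicted by most trees becomes the forest's final prediction
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)   # 'apple'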
Now, let’s see how we can apply the random forest algorithm in Python.
3. Random Forests algorithm in Python
The major problem of the decision tree algorithm is that it is prone to overfitting, especially when a tree is particularly deep. This is because deeper splits rely on ever more specific conditions, so each leaf ends up describing only a small sample of the training data, and such small samples can lead to unsound conclusions.
To avoid overfitting we can use the random forest algorithm. Let’s see how we can do that using the Scikit-Learn library.
First, let’s import the necessary libraries.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
Next, let’s create a decision tree classifier. Consider the following two-dimensional data, where each point carries one of four class labels:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');
Fitting a decision tree to our data can be done in Scikit-Learn with the DecisionTreeClassifier estimator. First, we will import the DecisionTreeClassifier estimator, and then we will use its fit() method to train the model on our data.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)
Now, let’s visualize our prediction. For that, we will create the function visualize_classifier() to examine what the decision tree classification looks like.
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()

    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Fit the estimator and predict on a dense grid over the plot area
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)
    ax.set(xlim=xlim, ylim=ylim)

visualize_classifier(DecisionTreeClassifier(), X, y)
If we examine the result of the decision tree classifier we can notice very strangely shaped classification regions. For example, we can see a tall and skinny purple region between the yellow and green regions. It’s clear that this is less a result of the true data distribution, and more a result of the particular sampling or noise properties of the data. That means that the decision tree is clearly overfitting our data.
Now, let’s see how we can avoid this overfitting using the method called bagging.
Bagging is an ensemble method that combines multiple overfitting estimators to reduce the effect of this overfitting. It makes use of an ensemble of parallel estimators, each of which over-fits the data, and averages the results to find a better classification. An ensemble of randomized decision trees trained in this way is known as a random forest.
This type of bagging classification can be done manually using the BaggingClassifier meta-estimator.
from sklearn.ensemble import BaggingClassifier
tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,
                        random_state=1)
bag.fit(X, y)
visualize_classifier(bag, X, y)
As you can see, the tall and skinny purple region is gone, which means that the bagging method does not overfit the data the way a single decision tree did.
Note, that in this example, we have randomized the data by fitting each estimator with a random subset of 80% of the training points. In practice, decision trees are more effectively randomized by injecting some stochasticity in how the splits are chosen. In this way, all the data contributes to the fit each time, but the results of the fit still have the desired randomness. For example, when determining which feature to split on, the randomized tree might select from among the top several features.
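For example, continuing the session above, this kind of split randomization can be approximated by limiting how many features each split may consider via the max_features parameter (a hedged sketch; with only two features here the particular value is purely illustrative):
# Limiting the features considered at each split injects extra randomness
random_tree = DecisionTreeClassifier(max_features=1, random_state=0)
bag_rand = BaggingClassifier(random_tree, n_estimators=100,
                             max_samples=0.8, random_state=1)
visualize_classifier(bag_rand, X, y)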
Such an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator, which takes care of all the randomization automatically. All you need to do is select the number of estimators, and it will very quickly fit the ensemble of trees:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0)
visualize_classifier(model, X, y);
Now, let’s see how we can use the random forest classifier to classify the hand-written digits.
Classifying Digits using random forest
First, let’s import the dataset of hand-written digits.
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
With the following code, we’ll visualize the first few images of the hand-written digits.
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# Plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
We can quickly classify the digits using a RandomForestClassifier().
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                random_state=0)
model = RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
To check the results we can look at the classification report for this classifier.
from sklearn import metrics
print(metrics.classification_report(ypred, ytest))
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        38
           1       0.98      0.98      0.98        43
           2       0.95      1.00      0.98        42
           3       0.98      0.98      0.98        45
           4       0.97      1.00      0.99        37
           5       0.98      0.96      0.97        49
           6       1.00      1.00      1.00        52
           7       1.00      0.96      0.98        50
           8       0.96      0.98      0.97        47
           9       0.98      0.98      0.98        47

    accuracy                           0.98       450
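We could also compute the overall accuracy directly with a one-line check (reusing the ytest and ypred arrays from above):
print(metrics.accuracy_score(ytest, ypred))   # should match the accuracy of about 0.98 reported above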
Also, we can create a confusion matrix.
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
As you can see, the random forest gives us very accurate results in the classification of the digits data.
Summary
In this post, we talked about the random forest algorithm. We learned that it is a machine learning algorithm that is very flexible and easy to use. It uses ensemble learning to obtain better predictive performance than performance obtained from any of the constituent learning algorithms alone.
This is an ideal algorithm for developers because it solves the problem of overfitting that usually occurs when we use the decision tree algorithm.