Random forest algorithm in Python

The random forest algorithm is considered one of the most powerful algorithms in machine learning. It is an ensemble learning method because it combines the predictions of multiple decision trees.

We can use this algorithm for solving classification problems (categorical outputs) as well as regression problems (continuous outputs).

To understand random forest in detail, we first need to understand bagged decision trees. The term bagged decision trees, in turn, requires knowledge of decision trees as well as the bootstrap. Let us go through these terms one by one.

Bootstrap Aggregation

Before learning bootstrap aggregation, we will first learn about sampling. Sampling is a procedure in which a small sample is drawn from the population.

Population and Sample

There are two types of sampling:

  • Sampling without replacement
  • Sampling with replacement

Sampling without replacement

Sampling without replacement means that an item (say, a ball drawn from a bag) can be selected only once in a given sample.

Sampling without replacement

Sampling with replacement

Sampling with replacement means that the same ball can be selected more than once in a given sample. In general terms, bootstrap means “subsampling with replacement”.

Sampling with replacement
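As a quick illustration, here is a minimal sketch of both sampling modes using NumPy; the array of “balls” is just an assumed toy example, and the replace flag of np.random.choice switches between the two modes:

import numpy as np

balls = np.arange(10)              # a toy "population" of 10 balls, labelled 0-9
rng = np.random.default_rng(0)

# Sampling without replacement: each ball can be picked at most once
print(rng.choice(balls, size=5, replace=False))

# Sampling with replacement: the same ball may be picked several times
print(rng.choice(balls, size=5, replace=True))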

(1) Suppose we are given 100 integer samples and our goal is to find the mean. Normally we add the 100 samples and divide by the total number of samples to get the mean.

(2) However, in bootstrapping we pick a subsample size, let us say 20 samples, compute the mean of these 20 samples, and call it μ1. We then put these 20 samples back into the pile, pick another random 20 samples, and call their mean μ2.

(3) We repeat this process (subsampling with replacement) 30 times and end up with 30 means. Finally, we take the mean of these means, and that is the final answer.

Now the question arises: why do we use the bootstrapped mean instead of the plain average?

The answer is that taking an average of averages decreases the variance of the estimated mean. This means there is less chance of overfitting, which is a very desirable property of the bootstrap.
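Here is a minimal sketch of the procedure described above, assuming a made-up array of 100 integers (the subsample size of 20 and the 30 repetitions follow the example):

import numpy as np

rng = np.random.default_rng(42)
data = rng.integers(0, 100, size=100)    # 100 integer samples (made-up data)

boot_means = []
for _ in range(30):                       # repeat the subsampling 30 times
    subsample = rng.choice(data, size=20, replace=True)   # 20 samples with replacement
    boot_means.append(subsample.mean())

print("Plain mean:       ", data.mean())
print("Bootstrapped mean:", np.mean(boot_means))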

Out of Bag Sample

The samples that are not picked in a given bootstrap subsample are called out-of-bag (OOB) samples. Because a tree never sees them during training, they can be used to estimate its error without a separate validation set.
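As a small sketch of how this is used in practice (via scikit-learn's oob_score option; the toy dataset is assumed):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # toy data

# oob_score=True makes the forest evaluate each tree on its out-of-bag samples
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print("Out-of-bag accuracy estimate:", rf.oob_score_)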

Now let us understand the second term, which is decision trees.

Decision Trees

A decision tree is a flowchart-like structure. Let us understand the tree structure in detail.

1- The root node takes an input sample. The tree then asks the sample a series of questions at the non-leaf nodes and outputs a value at the leaf nodes.

Random forest architecture

Depending on the nature of the output values, we have different types of decision trees.

If the output variable is categorical, it is a classification tree, and if the output variable is continuous, it is a regression tree. This is why decision trees are also called CART (classification and regression trees).
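Here is a quick sketch of the two flavours, using scikit-learn's standard tree estimators on assumed toy data:

from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical output
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xc, yc)
print(clf_tree.predict(Xc[:3]))   # predicted class labels

# Regression tree: continuous output
Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
reg_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xr, yr)
print(reg_tree.predict(Xr[:3]))   # predicted continuous values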

Bagging

The general definition of bagging is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is used to train its own decision tree. Bagging is also called bootstrap aggregation.

The main goal of bagging is to reduce the variance of our decision tree classifier. This is achieved by combining several decision tree classifiers.

So the question arises: why do we usually use decision trees for bagging?

The answer to that question is:

1- A single decision tree is unstable.

2- A minor change in the training data can strongly affect the tree structure. Hence, there is high variance in the predictions.

Some high-variance algorithms that benefit from bagging are decision trees, support vector machines, and k-nearest neighbors. The random forest algorithm is also a bagging algorithm.
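A minimal sketch of bagging decision trees with scikit-learn's BaggingClassifier (the toy dataset is assumed; by default the base estimator is a decision tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # toy data

single_tree = DecisionTreeClassifier(random_state=0)
# BaggingClassifier uses a decision tree as its base estimator by default
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0)

print("Single tree accuracy :", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())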

Classification

Bagging (Classification)

Regression

Bagging (Regression)

What is Random forest algorithm

A random forest is essentially a collection of decision trees, where each tree asks a random subset of questions (features) and splits the data based on those questions.

“How do we determine the optimal value at which to split the decision trees?”

The answer is that by using the Gini index we can find that optimal value.

Let us dig deeper and understand what the Gini index is.

The Gini index is a cost function that is used to evaluate splits; the split that minimizes it is chosen.

The mathematical formula for the Gini index is:

Gini = 1 − Σ (p_i)²   (summing over i = 1 … n)

where

  • p_i is the proportion of the samples that belong to class i for a particular node, and
  • n is the total number of classes.

Example: Loan Default Dataset

Let’s say we have a feature: Credit Score and a target: Loan Default (Yes = 1, No = 0)

We want to split on:

  • Credit Score < 600

We have 10 samples:

Decision Tree split using GINI index

Before Split (root):

  • 6 people defaulted (class = 1)
  • 4 did not default (class = 0)

Gini (root) = 1 − (0.6² + 0.4²) = 1 − 0.36 − 0.16 = 0.48

Now split into two groups:

Left Node (Credit Score < 600):

  • 4 defaulted
  • 1 did not

Gini (left) = 1 − (0.8² + 0.2²) = 0.32

Right Node (Credit Score ≥ 600):

  • 2 defaulted
  • 3 did not

Gini (right) = 1 − (0.4² + 0.6²) = 0.48

Weighted Gini after split:

(5/10) × 0.32 + (5/10) × 0.48 = 0.40

Gini Gain = Before Split Gini − After Split Gini

0.48 − 0.40 = 0.08

The split reduces impurity, so it is considered a “good” split.
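The same calculation can be done in a few lines of Python. This is a small sketch; the gini helper below is a hypothetical function written for this example, not part of any library:

def gini(counts):
    # Gini impurity of a node given its class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

root = gini([6, 4])      # 6 defaulted, 4 did not  -> 0.48
left = gini([4, 1])      # Credit Score < 600      -> 0.32
right = gini([2, 3])     # Credit Score >= 600     -> 0.48

weighted = (5 / 10) * left + (5 / 10) * right       # 0.40
print("Gini gain:", round(root - weighted, 2))      # 0.08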

Now that we have understood the concepts in detail, let us move to the coding part.

Random forest algorithm (Classification) implementation

The dataset that we are going to use is the Fashion MNIST dataset. You can download this data from Kaggle.

The dataset contains 10 labels:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

Step 1- Import all the required libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

Step 2- Load the training and test datasets using the pandas library and then shuffle the training data. You need a Kaggle username and API key, which you can get from the settings page of the Kaggle website.


import os
import kagglehub

# Download the latest version of the dataset
path = kagglehub.dataset_download("zalando-research/fashionmnist")
print("Path to dataset files:", path)

df_train = pd.read_csv(os.path.join(path, 'fashion-mnist_train.csv'))

# Shuffle the training data
shuffle_index = np.random.permutation(df_train.shape[0])
df_train = df_train.iloc[shuffle_index]

df_test = pd.read_csv(os.path.join(path, 'fashion-mnist_test.csv'))

Step 3- Take a look at our data using the matplotlib library

samples = np.random.randint(0, df_train.shape[0], 3)
for i, idx in enumerate(samples):
    sample = np.reshape(df_train.iloc[idx, 1:].values/255, (28, 28))
    plt.subplot(2, 3, i+1)
    plt.title('category {}'.format(df_train.iloc[idx, 0]))
    plt.subplots_adjust(top=1, bottom=0.1)
    plt.imshow(sample, 'gray')
Sample Fashion MNIST images

Step 4- Split the dataset into training and validation sets and then normalize the pixel values. After that, convert the labels to categorical (one-hot) format with the help of the to_categorical function.


# Split features and labels into training and validation sets in one call
train_data, val_data, train_label, val_label = train_test_split(
    df_train.iloc[:, 1:], df_train.iloc[:, 0], test_size=0.2, random_state=42)

train_data, val_data = train_data / 255, val_data / 255  # normalize training and validation data

test_data = df_test.iloc[:, 1:] / 255  # normalize test data
test_label = df_test.iloc[:, 0]

# Convert the integer labels to one-hot (categorical) format
train_label_cat = to_categorical(train_label)
val_label_cat = to_categorical(val_label)
test_label_cat = to_categorical(test_label)

Step 5- Apply Random forest classifier

clf = RandomForestClassifier(n_estimators=100, max_depth=50,
                             criterion='entropy', n_jobs=-1)
clf.fit(train_data, train_label_cat)  # train on the one-hot encoded labels

Step 6- Test the classifier on testing data and then calculate accuracy.

label_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Predict on the test data; because the forest was trained on one-hot labels,
# it returns a one-hot style matrix, so take the argmax to get class indices
y_pred_rf = clf.predict(test_data)
y_pred_rf_vec = y_pred_rf.argmax(axis=1)

print(classification_report(test_label, y_pred_rf_vec, target_names=label_names))
print(accuracy_score(test_label, y_pred_rf_vec))

If you want to further enhance the performance, you can apply hyperparameter tuning. It is a technique for finding the parameter values that give the best results.

There are various ways to perform hyperparameter tuning, such as GridSearchCV, RandomizedSearchCV, and Optuna. The most widely known method is GridSearchCV.
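Here is a minimal sketch of tuning the random forest with GridSearchCV; the parameter grid below is only an illustrative assumption, and the search reuses the train_data and train_label defined earlier:

from sklearn.model_selection import GridSearchCV

# An illustrative (assumed) grid of candidate hyperparameters
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [20, 50],
    'criterion': ['gini', 'entropy'],
}

grid = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, cv=3, scoring='accuracy')
grid.fit(train_data, train_label)   # search over the grid using 3-fold cross-validation

print("Best parameters :", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)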

Wrap up the Session

In this tutorial, we learned about the random forest algorithm in detail. We learned about the bootstrap, decision trees, and bagging. We saw how decision trees make splits based on the Gini index, and we also learned how to implement a random forest classifier using sklearn.

You’ve come a long way in understanding one of the most important areas of machine learning! If you have questions or comments, then please put them in the comments section below.
