
Understanding the Random Forest Algorithm and Its Implementation in Python


Introduction

The random forest algorithm is considered one of the most powerful algorithms in machine learning. It is an ensemble learning method because it combines the predictions of multiple decision trees.

We can use this algorithm to solve classification problems (categorical outputs) as well as regression problems (continuous outputs), as the short sketch below illustrates.
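As a quick illustration (this snippet is not from the original tutorial and uses a synthetic dataset), scikit-learn exposes the same algorithm for both problem types through RandomForestClassifier and RandomForestRegressor:

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: categorical outputs
X_cls, y_cls = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))   # predicted class labels

# Regression: continuous outputs
X_reg, y_reg = make_regression(n_samples=200, n_features=10, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))   # predicted continuous values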

To understand random forest in detail, we first need to understand bagged decision trees. The term bagged decision trees, in turn, requires knowledge of decision trees as well as the bootstrap. Let us understand these terms one by one.


Bootstrap

In general terms, bootstrapping means "subsampling with replacement".

A subsample is said to be drawn with replacement if the selected samples are put back into the population before the next subsample is drawn.

(1) Suppose we are given 100 integer samples and our goal is to find the mean. Normally, we add up the 100 samples and divide by the total number of samples to get the mean.

(2) In bootstrapping, however, we pick a subsample size, say 20 samples, compute the mean of these 20 samples, and call it μ1. We then put these 20 samples back into the pile, pick another random 20 samples, and call their mean μ2.

(3) We repeat this subsampling-with-replacement process, say 30 times, and end up with 30 means. Finally, we take the mean of these means, and that is our bootstrapped estimate.

The question that arises now is: why use the bootstrapped mean instead of the ordinary mean?

The answer is that averaging over several averages decreases the variance of the estimated mean. Lower variance means less chance of overfitting, which is exactly the property that makes the bootstrap so useful. A minimal sketch of this procedure is shown below.
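Here is a minimal NumPy sketch of the procedure described above (the population values, the subsample size of 20, and the 30 repetitions are just the illustrative numbers from the example):

import numpy as np

rng = np.random.default_rng(0)
population = rng.integers(0, 100, size=100)   # 100 integer samples

# Ordinary mean
print("plain mean:", population.mean())

# Bootstrapped mean: 30 subsamples of size 20, drawn with replacement
boot_means = [rng.choice(population, size=20, replace=True).mean()
              for _ in range(30)]
print("bootstrapped mean:", np.mean(boot_means))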

Now let us understand the second term, which is decision trees.

Decision Trees

A decision tree is a flowchart-like structure. Let us understand the tree structure in detail.

1- The root node takes an input sample. The tree then asks a series of questions about the sample at each non-leaf node and outputs a value at the leaf nodes.

(Figure: random forest architecture)

Depending on the nature of the output variable, we have different types of decision trees.

If the output variable is categorical, it is a classification tree, and if the output variable is continuous, it is a regression tree. This is why decision trees are also called CART (classification and regression trees). A small example of a single tree is shown below.
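As a small illustrative sketch (not part of the original post; it uses scikit-learn's built-in iris dataset), here is how a single CART-style classification tree can be fit and its series of questions printed:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A single classification tree, kept shallow so the printout stays readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the questions asked at the non-leaf nodes and the values at the leaves
print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))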

Bagging

Bagging, short for bootstrap aggregation, means creating several subsets of the training data by sampling randomly with replacement and training a separate decision tree on each subset.

The main goal of bagging is to reduce the variance of the decision tree classifier, which is achieved by combining the predictions of several trees. A sketch of bagged decision trees is shown below.
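As an illustrative sketch (not from the original post; the synthetic dataset and the 50 trees are assumptions for the example), scikit-learn's BaggingClassifier trains each tree on a bootstrap sample and combines their votes, which typically lowers the variance compared with a single tree:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 50 decision trees, each trained on a bootstrap sample of the data
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           bootstrap=True, random_state=0)

print("single tree :", cross_val_score(DecisionTreeClassifier(random_state=0), X, y).mean())
print("bagged trees:", cross_val_score(bagged, X, y).mean())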

So why don't we just use a single decision tree?

The answer is that:

1- A single decision tree is unstable.

2- A minor change in the training data can drastically change the tree structure, and hence the predictions have high variance.

What is a Random Forest?

A random forest is essentially a collection of decision trees in which each tree asks a random subset of questions, i.e., considers only a random subset of the features, and makes its splits based on those questions.

So how do we know at which value to split a node?

The answer is that we can find this value by using the Gini index.

Let us dig deeper and understand what the Gini index is.

The Gini index is a cost function used to evaluate candidate splits; the split that minimizes it is chosen.

The mathematical formula is

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the proportion of samples at a node that belong to class i, and the sum runs over all classes. A small worked example is shown below.
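As a quick worked sketch (not part of the original post), the Gini index of a node can be computed directly from the class labels that reach it:

import numpy as np

def gini_index(labels):
    """Gini impurity of one node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_index([0, 0, 0, 0]))   # 0.0 -> pure node, the ideal split
print(gini_index([0, 0, 1, 1]))   # 0.5 -> maximally mixed with two classes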

Now that we understand the concepts in detail, let us move on to the coding part.

Coding a random forest classifier using sklearn

The dataset that we are going to use is the Fashion MNIST dataset. You can download this data from Kaggle.

The dataset contains 10 labels:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

Step 1- Import all the required libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

Step 2- Load the training and test dataset using the pandas library and then shuffle the training data.

df_train = pd.read_csv('fashion-mnist_train.csv')

shuffle_index = np.random.permutation(df_train.shape[0])
df_train = df_train.iloc[shuffle_index]

df_test = pd.read_csv('fashion-mnist_test.csv')

Step 3- Take a look at our data using the matplotlib library

samples = np.random.randint(0, df_train.shape[0], 3)
for i, idx in enumerate(samples):
    sample = np.reshape(df_train.iloc[idx, 1:].values/255, (28, 28))
    plt.subplot(2, 3, i+1)
    plt.title('category {}'.format(df_train.iloc[idx, 0]))
    plt.subplots_adjust(top=1, bottom=0.1)
    plt.imshow(sample, 'gray')
(Figure: sample Fashion-MNIST images)

Step 4- Visualize the 10 classes using the t-SNE algorithm. t-SNE stands for t-distributed stochastic neighbor embedding. It reduces the high-dimensional pixel data to two dimensions while preserving local neighborhoods, so points from the same class tend to form clusters in the plot.

from sklearn.manifold import TSNE
X = df_train.iloc[:800, 1:]
y = df_train.iloc[:800, 0]

tsne = TSNE(n_components=2, random_state=0,
            perplexity=5, learning_rate=10, n_iter=5000)
X_2d = tsne.fit_transform(X)

target_ids = range(10)
colors = ['#67001f','#b2182b','#d6604d','#f4a582','#fddbc7',
          '#d1e5f0','#92c5de','#4393c3','#2166ac','#053061']
          
fig, ax = plt.subplots(figsize=(6, 6/1.2))
for i, c, label in zip(target_ids, colors, range(10)):
    ax.scatter(X_2d[y == i, 0], X_2d[y == i, 1],
               color=c, label=label, s=15)  # one colour per class
ax.legend()
plt.show()
(Figure: t-SNE embedding of the Fashion-MNIST classes)

Step 5- Split the training data into training and validation sets and then normalize it

from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

# split pixels and labels together so the rows stay aligned
train_data, val_data, train_label, val_label = train_test_split(
    df_train.iloc[:, 1:], df_train.iloc[:, 0], test_size=0.2, random_state=42)
train_data, val_data = train_data/255, val_data/255 # normalize training and validation data

test_data = df_test.iloc[:, 1:]/255 # normalize test data
test_label = df_test.iloc[:, 0]


train_label_cat = to_categorical(train_label)
val_label_cat = to_categorical(val_label)
test_label_cat = to_categorical(test_label)

Step 6- Apply Random forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# fitting on one-hot labels makes this a multi-output classifier,
# so predictions come back as a one-hot matrix
clf = RandomForestClassifier(n_estimators=100, max_depth=50,
                             criterion='entropy', n_jobs=-1)
clf.fit(train_data, train_label_cat)

Step 7- Test the classifier on the test data and then calculate the accuracy.

label_names = ['Class-{}'.format(i) for i in range(10)]  # class labels run from 0 to 9

y_pred_rf = clf.predict(test_data)

def sparse_matrix_vec(mat):
    """Convert a one-hot prediction matrix back to class indices
    (equivalent to np.argmax(mat, axis=1))."""
    n, m = mat.shape
    vec = np.zeros(n)
    for i in range(n):
        for j in range(m):
            if mat[i, j]:
                vec[i] = j
    return vec

y_pred_rf_vec = sparse_matrix_vec(y_pred_rf)
correct = np.nonzero(test_label == y_pred_rf_vec)[0]
wrong = np.nonzero(test_label != y_pred_rf_vec)[0]

print(classification_report(test_label, y_pred_rf_vec, target_names=label_names))
print(accuracy_score(test_label, y_pred_rf_vec))

The final accuracy that we are getting is 83.8%.

We can further improve this result by using XGBoost or by tuning the hyperparameters (a hyperparameter-search sketch is shown at the end of this section).

import xgboost as xgb
from xgboost import XGBClassifier

params = {"loss":"deviance",
          "max_depth":10,
          "n_estimators":100}
xgb_clf = XGBClassifier(**params) 
xgb_clf.fit(train_data, train_label)

# For reference, an XGBClassifier exposes many more hyperparameters than the
# two we set above; its defaults look roughly like this:
params2 = {'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'multi:softprob',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': True,
 'subsample': 1}

# Inspect the parameters our classifier is actually using
xgb_clf.get_params()
y_pred_xgb = xgb_clf.predict(test_data)
y_pred_xgb[:5]

correct = np.nonzero(test_label == y_pred_xgb)[0]
wrong = np.nonzero(test_label != y_pred_xgb)[0]

print(classification_report(test_label, y_pred_xgb, target_names=label_names))
print(accuracy_score(test_label, y_pred_xgb))

After applying XGBoost, we get an accuracy of 90%.
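Finally, as promised earlier, here is a hedged sketch of hyperparameter tuning for the random forest (not part of the original post; the parameter grid is an illustrative assumption, and the search runs on a subsample of the training data to keep the run time manageable):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [10, 30, 50, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions=param_grid,
    n_iter=10, cv=3, random_state=42)

# fit on a subsample so the search finishes quickly; refit on all data afterwards
search.fit(train_data.iloc[:5000], train_label.iloc[:5000])
print(search.best_params_, search.best_score_)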

Wrap up the Session

In this tutorial, we learned about the random forest algorithm in detail. We covered the terms bootstrap, decision trees, and bagging, and saw how decision trees make splits based on the Gini index. We also learned how to implement a random forest classifier using sklearn and how to improve the model using XGBoost.

You’ve come a long way in understanding one of the most important areas of machine learning! If you have questions or comments, then please put them in the comments section below.

If you liked this blog post, please like it and subscribe to our Data Spoof community to get real-time updates. You can also follow our Facebook page to get a notification whenever we upload a post, so you never miss an update from us.