Decision trees are one of the most basic and widely used machine learning algorithms, which fall under supervised machine learning techniques. Decision trees can handle both regression and classification tasks, and therefore learning decision trees is a must for those who aspire to be data scientists.
In this article, we will learn about decision trees, how to work with decision trees, and how to implement decision trees in R. We will also discuss the applications of decision trees along with their advantages and disadvantages
What is a decision tree and how does it work?
Decision trees are a non-parametric form of supervised machine learning algorithm used for both classification and regression. As the name suggests, decision trees work by asking a Boolean form of question and, based on the answer, make a decision that goes further in the form of a tree, thus the name decision tree. The model further asks the questions until the prediction is made. The goal of decision trees is to form a model that can predict the value of a target variable by learning through simple questions inferred from the data features.
The decision trees are used to predict the class labels, and prediction starts from the root node of the decision tree. It is straightforward to calculate which attribute must represent the root node and can be done by figuring out the attributes which best separate the training records. This calculation can be done by using the Gini impurity formula, which is a simple formula to distinguish the attributes. However, it becomes complex with a greater number of attributes. Once the root node is determined, the tree forms branches through the questions and decisions made by the tree. This process of branching continues until all the impurities found in the root node are classified.
Decision trees work in the format of a multiple if-else statement analogy where the trees grow through Boolean questions and stop only when the prediction is made.
Various types of Decision tree algorithms in R
The decision tree algorithm in R is of various types, which differ in the way they function and form the tree while making the prediction. The four different forms of decision tree algorithms are ID3, C4.5, C5.0, and CART.
ID3
ID3, also known as Iterative Dichotomiser 3, which was developed by Ross Quinlan in 1986, is the first form of the decision tree that creates a multiway tree and works in a greedy manner. Each node of this multiway tree is found by getting the categorical feature that yields the largest information gain for categorical targets. The trees in ID3 are grown to their maximum size. Pruning is applied to improve the ability of trees and to generalization of the unseen data.
C4.5
C4.5 is a successor to the ID3 model, which has removed the restriction that features need to be categorical. This is done by the dynamic definition of a discrete attribute partitioning the continuous attribute value into intervals of discrete sets. The c4.5 model converts the trees into a set of if-then rules. The accuracy of each if-then rule is evaluated, which determines the order in which they are applied. Pruning is performed by removing the precondition of the rule and checking if the accuracy of this rule improves without it.
C5.0
C5.0 is a modification to C4.5 and uses comparatively less memory to build smaller rulesets which are more accurate than C4.5.
CART
The logic of CART (Classification and Regression trees) is similar to C4.5 and differs in that it does not perform the computation of rule sets and supports numerical target variables. CART uses features and thresholds to construct the binary trees and yields the largest information gain at every node.
The depth of the decision tree is an important factor as the depth of the tree matters a lot in building a tree. This is done through the concepts of entropy and information gain.
Entropy
Entropy measures the uncertainty or impurity within the dataset and determines how a dataset is split through the decision tree. Entropy is also defined as the measure of disorder within the dataset, and the mathematical formula of entropy is:
Entropy is sometimes also denoted through ‘H’.
In the equation above, ‘pi’ is the probability of a class ‘i’ in the dataset.
‘i’ defined in the above equation can either be positive or negative depending on the class it belongs to.
Entropy is measured as a value between 0 and 1, where a value closer to 0 determines that the dataset is pure and contains fewer impurities. In contrast, a value closer to 1 means that the dataset is impure and has a high level of disorder. Sometimes the value of entropy can be greater than 1, which means that it has a very high level of disorder, and therefore, for the sake of simplicity, we can remember it as entropy between 0 and 1.
Entropy is calculated to measure the uncertainty or disorder present in the dataset, and the goal of finding entropy is to remove these disorders and reduce the amount of uncertainty present in the dataset.
Information Gain
If the uncertainty present in the dataset is measured, then we need to measure the reduction of uncertainty in the target class of the dataset with the features available to us.
Information gain is used for measuring the amount of information a feature provides about a class and helps in determining the order of attributes.
Information gain is also known as Kullback-Leibler divergence and is denoted as IG(S, A), where S denotes the set and is the effective change in entropy after the decision has been made regarding a particular attribute denoted by A. Hence, information gain measures the relative change occurring in entropy as per the independent variable. The equation for information gain is given by:
IG (S, A) = H(S) – H(S , A)
This equation can be further extended as:
In this equation, IG (S, A) is the information gain for the set S, and H(S) is the entropy for the whole set, whereas the other term in this equation calculates the entropy after feature A is applied, and P(x) is the probability of x, which is an event.
How to calculate Entropy and Information gain in Decision trees
Let us now take an example to understand how we can calculate entropy and information gain using the formula that we discussed earlier.
Given a set of data and a decision tree needs to be drawn for this dataset, the very first thing that needs to be understood is how many attributes are there, what is the target class and whether it is a binary or a multi-valued classification. The dataset is given here:
Day | Outlook | Temp | Humidity | Wind | Play-Cricket |
---|---|---|---|---|---|
D1 | Sunny | Hot | High | Weak | No |
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
The dataset present here consists of 4 attributes, and we need to decide which is the first attribute that needs to be selected as the root node. Then, once a root node is selected, which is the next attribute that can be selected as a root node. For this, we need to calculate the information gained from every attribute. Once the information gain of every attribute is calculated, we can decide the attribute which has more importance based on the value of information gain, and that attribute can be selected as a root node for the given point in time
The first thing that we need to calculate is entropy for calculating information gain. Entropy measures the homogeneity of examples and depends on the distribution of the random variable. First, we will consider the wind column.
Entropy is calculated by using the above formula. Then the entropy of the wind column is:
E([9+, 5-]_ = -9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Here in the dataset, we have five negative and nine positive values.
Information gain will measure the expected reduction in entropy or the improvement in the data. The attribute with the minimum amount of impurity will be considered as a root node.
Therefore, the information gain can be calculated using the formula mentioned above as:
IG(S, Wind) = E(S) – (8/14) E(Sweak) – (6/14) E(Sstrong)
= 0.94 – (8/14) 0.811 – (6/14) 1.00
= 0.048
Therefore, the information gain of the wind attribute is calculated as 0.048. Similarly, we need to calculate the information gain of all the attributes, and then we can decide on the root node.
Decision tree implementation in R
There are various ways to implement the decision tree algorithm. One of the ways to implement a decision tree is using the rpart library.
The dataste is collected from the Kaggle website. It contains information about the patients who are having coronary heart disease.
The link to the dataset is https://www.kaggle.com/code/captainozlem/framingham-chd-preprossing-data/data
Prerequisite to run this code
You have to install two packages from the cran repository.
- install.packages(rpart)– This library is used to implement a decision tree
- install.packages(rpart.plot)– This library is used to plot the decision tree graph.
- install.packages(caTools)– This library is used to split the dataset into training and testing
Import the required libraries
library(rpart)
library(rpart.plot)
library(caTools)
Load the CSV dataset
data <- read.csv('CHD_preprocessed.csv')
Basic data exploration
#checking for the shape of the data
ndim(data)
# checking for the column in the dataset
str(data)
# summary statistics of the dataset
summary(data)
Splitting the dataset
set.seed(123)
#split the dataset into training and testing
split = sample.split(df$TenYearCHD, SplitRatio = 0.75)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)'
Applying decision tree algorithms
tree <- rpart(TenYearCHD ~ ., data = training_set,method = 'class')
rpart.plot(tree)
Evaluating the model on the test set
prediction <- predict(tree, newdata = test_set,type = 'class')
cm <- table(test_set$TenYearCHD, prediction)
cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy on test data is ', accuracy))
>> Accuracy on the test data is 86%
Applications of Decision tree
The decision algorithms are suitable for both regression and classification purposes and, therefore, have a lot of applications in the real world. Some of the applications of decision trees are:
- Decision trees are easy to understand and implement; therefore, they are one of the best ways to start a machine learning career for beginners, and the fact it requires less data preparation makes them perfect for those who are new to data science.
- Decision trees can assess future growth opportunities as they can use historical data to evaluate the growth opportunities of several businesses.
- Decision trees can effectively use demographic data for finding prospective clients, which helps streamline a market budget and make informed decisions that are appropriate for the growth of the business.
- Decision trees can be used as an excellent support tool in various fields. Lenders can use it to predict the probability of a customer returning the loan amount through the predictive model generation using the historical data of characteristics of previous customers.
Advantages and disadvantages of decision trees
Some advantages of implementing decision trees are:
- Decision trees are simple to visualize and can be interpreted and understood easily.
- Decision trees require a less amount of data preparation as compared to other machine learning algorithms. However, decision trees are unable to handle the missing values.
- The cost of implementing a decision tree for prediction is logarithmic as per the number of data points used to train the decision tree.
- Decision trees are able to handle data in both the numerical and categorical forms, with exceptions in some of the implementations.
- Decision trees can work with problems concerning multi-output.
- Decision trees use a white box model
- The validation of models in decision trees is done through statistical tests, and thus, the reliability of the model is guaranteed.
- The decision tree model performs well even if the models’ assumptions are violated by the true model from where the data generation takes place.
Some disadvantages of implementing decision trees are:
- Decision trees are prone to overfitting when the learners create over-complex trees, which are not able to generalize data well. To avoid the overfitting of decision trees, several mechanisms such as pruning, setting the maximum depth of the decision tree, or setting a minimum number of samples at the leaf node are required to be applied.
- Decision trees are not good at extrapolation, which means that the predictions made through decision trees are neither continuous nor smooth.
- Decision trees are unstable as a small variation in the data set might result in the generation of a completely different tree. This problem can be dealt with through the ensemble technique of decision trees.
- Decision trees can create biased trees if some classes of the data are set to dominate. Therefore, the balancing of the dataset is required to fit with the decision trees.
- Decision trees are unable to express some of the concepts such as parity, multiplexer problems, or XOR, which makes it hard for decision trees to learn these concepts.
Conclusion
So in this blog, we have learned about what is decision tree and how to implement it in R programming languages. We have also talked about how to calculate entropy and information gain through a example.
If you like this blog, you can share it with your friends or colleague. You can connect with me on social media profiles like Linkedin, Twitter, and Instagram.
Linkedin – https://www.linkedin.com/in/abhishek-kumar-singh-8a6326148
Twitter- https://twitter.com/Abhi007si
Instagram- www.instagram.com/dataspoof
I loá´ e it when folks come together and share opinions.
Great site, stick with it!
Thanks for appreciating.