In this tutorial, you will learn about the statistical modelling technique called linear regression in detail and implement it using the R programming language.
What is linear regression?
Linear regression is one of the most commonly used and well-known algorithms in machine learning and statistics, and it serves prediction tasks very well. It helps us predict the value of one variable from the value of another. The variable to be predicted is called the dependent variable, while the variable used to make the prediction is known as the independent variable. Thanks to this ability, linear regression has a wide range of uses. It is also one of the first algorithms to learn when starting a career in machine learning and data science, because it is simple to understand and easier to implement than most alternatives. Several other types of regression exist in machine learning, chosen according to the data at hand, but linear regression is the simplest of them all due to the straightforward mathematics involved. In this article we discuss linear regression in detail: how it works, its advantages and disadvantages, and much more.
Linear regression is, at its core, a statistical model that analyzes the relationship between a response variable and a set of one or more explanatory variables, along with their interactions. These are also known as the dependent and independent variables. The relationship between the dependent and independent variables is assumed to be linear, so it is represented by a straight line on the graph: the best fit line. This line is obtained by taking the data points into account and drawing the line that fits them as closely as possible. With one explanatory variable the model is called simple linear regression; with more than one explanatory variable it is called multiple linear regression.
In mathematical terms, a linear regression model is expressed as:
Y = b0 + b1x1 + … + bpxp + ε
With a single explanatory variable, this reduces to the equation of the best fit line:
Y = b0 + b1x1 + ε,
Where,
Y: the dependent variable.
b0: the Y intercept.
b1: the slope of the line.
x1: the independent variable.
ε: the error term.
In this linear equation, the dependent variable to be predicted is denoted by Y. The intercept b0 is the point where the line crosses the Y-axis. The independent variable that drives the prediction of Y is represented by x1, while the error in the prediction is denoted by ε.
Terminology associated with linear regression
Before we move further into the algorithm, let us look at some basic terminology related to linear regression.
Cost function: The cost function provides the best possible values for b0 and b1, which in turn define the best fit line for the available data points. This is done by turning the task into a minimization problem: the cost function measures the error between the actual and predicted values, and we search for the b0 and b1 that minimize it.
The mean squared error (MSE) is the usual choice of cost function: we adjust the values of b0 and b1 until the MSE settles at its minimum.
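As a quick illustration, here is a minimal R sketch of the MSE cost; the (x, y) values are made-up numbers, not from any real dataset:
# Minimal sketch of MSE as a cost function, using made-up (x, y) data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
mse = function(b0, b1, x, y) {
  predictions = b0 + b1 * x
  mean((y - predictions)^2)  # mean squared error for this choice of b0 and b1
}
mse(0, 2, x, y)  # cost of the candidate line Y = 0 + 2x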
Gradient descent: Gradient descent is an important concept for understanding linear regression; it is a method of updating the values of b0 and b1 to reduce the mean squared error. The idea is to keep adjusting b0 and b1 until the MSE reaches its minimum. Each update follows the gradients of the cost function, which are found by taking its partial derivatives with respect to b0 and b1.
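To make this concrete, here is a bare-bones gradient descent sketch in R for simple linear regression; the learning rate and iteration count are arbitrary illustration choices, and the data is the same made-up set as above:
# Bare-bones gradient descent for simple linear regression (illustrative values)
x = c(1, 2, 3, 4, 5)  # same made-up data as the MSE sketch above
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
b0 = 0; b1 = 0  # arbitrary starting values
learning_rate = 0.01  # arbitrary illustration choice
for (i in 1:10000) {
  error = (b0 + b1 * x) - y
  b0 = b0 - learning_rate * 2 * mean(error)  # partial derivative of MSE w.r.t. b0
  b1 = b1 - learning_rate * 2 * mean(error * x)  # partial derivative of MSE w.r.t. b1
}
c(b0, b1)  # should closely match coef(lm(y ~ x))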
Linear regression models are usually fitted using the least squares approach; however, the model can also be fitted in other ways, such as by minimizing a lack of fit in some other norm, or by minimizing a penalized version of the least squares cost function, as is done in ridge and lasso regression.
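For reference, penalized fits of this kind are available in R through the glmnet package. The snippet below is only a sketch: the data is made up, and glmnet expects the predictors as a matrix with at least two columns.
# Sketch: ridge and lasso fits with the glmnet package (made-up data)
# install.packages('glmnet')  # install once if not already available
library(glmnet)
set.seed(42)
X = cbind(x1 = rnorm(30), x2 = rnorm(30))  # glmnet needs a predictor matrix with >= 2 columns
y = 1 + 2 * X[, "x1"] - X[, "x2"] + rnorm(30)
ridge_fit = glmnet(X, y, alpha = 0)  # alpha = 0: ridge penalty
lasso_fit = glmnet(X, y, alpha = 1)  # alpha = 1: lasso penalty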
Let us take a very simple example of linear regression to understand how it is useful and what kinds of problems it can be used to solve.
Suppose we need to estimate the ages of five children based on their heights. Using linear regression, we can plot height against age on a graph and draw a best fit line through the plotted points. Reading a particular child's height off against the best fit line then gives an estimate of that child's age. If we supply the algorithm with real-world data on the relationship between height and age for many children, it will be able to predict ages consistent with that dataset. We can also provide other variables, such as gender, weight, or the parents' heights, so that the algorithm can make more generalized predictions. A small code sketch of this idea follows below.
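Translated into R, the idea looks like the following; the heights and ages are made-up numbers, purely for illustration:
# Made-up heights (in cm) and ages (in years) for five children, purely illustrative
heights = c(95, 104, 112, 118, 125)
ages = c(3, 4, 5, 6, 7)
age_model = lm(ages ~ heights)  # fit the best fit line: age as a function of height
predict(age_model, newdata = data.frame(heights = 110))  # estimated age of a 110 cm child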
Assumptions of Linear regression
Linear regression makes four assumptions:
1- Linearity
There should be a linear relationship between the target variable and the input variable. A linear relationship means that when we increase the value of x, the value of y also increases proportionally (and likewise when we decrease it). To check for a linear relationship, make a scatter plot of the target variable against the input variable; a combined R sketch covering this and the other three checks is given after the fourth assumption.
If the data violates this assumption, we have to fix it. Some options are adding another feature to the model or taking the log of the dependent variable.
2- Homoscedasticity
This means the variance of the error term is the same across all values of the input variables. To check this assumption, create a scatter plot of the fitted values against the residuals.
If this assumption is violated, one fix is to take the log of the dependent variable.
3- Independence
The residual errors should be independent of each other. To check this assumption, plot the residuals as a time series (i.e., a plot of the residuals against time or observation order).
If this assumption is violated, we can add lags of the input and target variables, or add some dummy variables to the data.
4- Normality
This assumes that the residuals are normally distributed. You can check this assumption by creating a Q-Q plot.
If this assumption is violated, you can fix it by removing outliers from the dataset or taking the log of the input and target variables.
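The sketch below shows one way to run all four checks in R with base plotting functions; `model` here is fitted on the made-up heights and ages from the earlier example, but the same calls work on any lm() fit:
# Sketch: checking the four assumptions for a fitted lm() model
heights = c(95, 104, 112, 118, 125)  # same made-up data as the earlier sketch
ages = c(3, 4, 5, 6, 7)
model = lm(ages ~ heights)
plot(heights, ages)  # 1. linearity: scatter plot of target vs input
plot(fitted(model), resid(model))  # 2. homoscedasticity: fitted values vs residuals
plot(resid(model), type = "l")  # 3. independence: residuals in observation order
qqnorm(resid(model)); qqline(resid(model))  # 4. normality: Q-Q plot of the residuals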
Advantages of using linear regression:
- Linear regression is simple to implement, and the output coefficients are easy to interpret.
- When we already know that the dependent and independent variables have a linear relationship, this algorithm is the best choice, since it is less complex than the alternatives.
- Linear regression is susceptible to over-fitting, but this can be mitigated through regularization techniques such as L1 (lasso) and L2 (ridge) penalties, and through cross validation.
Disadvantages of using linear regression:
- Although linear regression is simple to implement, outliers in the dataset can have an adverse effect on the fitted model, and the technique is limited to linear relationships.
- Linear regression assumes linearity between the dependent and independent variables, which real data rarely satisfies exactly.
- Linear regression is prone to overfitting, and we often need to deal with noise in the data.
- The problem of multicollinearity remains in linear regression: correlated independent variables can distort the estimated coefficients. One way to detect it is shown in the sketch below.
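A common multicollinearity check in R is the variance inflation factor from the car package; the data in this sketch is made up, with x2 deliberately constructed to be nearly identical to x1:
# Sketch: detecting multicollinearity with variance inflation factors (made-up data)
# install.packages('car')  # install once if not already available
library(car)
set.seed(1)
x1 = rnorm(50)
x2 = x1 + rnorm(50, sd = 0.1)  # deliberately correlated with x1
y = 3 + 2 * x1 + rnorm(50)
vif(lm(y ~ x1 + x2))  # values well above 5-10 suggest multicollinearity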
Use cases of linear regression
Linear regression works by modelling the relationship between the dependent and independent variables and deriving a best fit line for predicting outcomes. It can be used to solve simple real-life prediction problems and is applicable across many fields of data science. Some of the fields where linear regression is commonly used are:
- Risk analysis
- Housing and other price prediction problems.
- Finance related applications like prediction of stocks, investment, banking, etc.
- Sales forecasting and other similar prediction tasks.
Linear regression is widely used whenever the required task involves prediction, forecasting, or error reduction. It is applied to behavioral, biological, and environmental problems, in the social sciences, and in finance-related business settings. It can also be used to fit a predictive model to an observed dataset of variable values, and the resulting analysis can quantify the strength of the relationship between those variables.
How to implement linear regression in R:
Implementing an algorithm in R on an available dataset involves a number of steps. They are listed briefly here and shown with code in the later part of this tutorial:
- Load the required modules and libraries necessary for dealing with this problem.
- Load the required dataset according to the problem statement.
- Explore the dataset thoroughly by looking at its rows, columns, variables and other features.
- Clean the dataset by removing null values and any columns that are not required.
- Split the dataset into training and testing sets.
- Build the linear regression model by training it on the training set created in the previous step.
- Evaluate the model on the test set using regression evaluation metrics; these describe how well the trained model performs.
You can download the dataset, named bonds.txt, from here.
Let us start coding linear regression in the R programming language.
Step 1- The first step is to load the dataset
dataset = read.delim("bonds.txt", row.names = 1)  # read the tab-delimited file; first column used as row names
head(dataset)  # preview the first few rows
Step 2- Split the dataset into training and testing
install.packages('caTools')  # install once if not already available
library(caTools)
set.seed(123)  # makes the random split reproducible
split = sample.split(dataset$BidPrice, SplitRatio = 2/3)  # 2/3 of the rows go to training
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Step 3- Now we create the linear regression model and fit it
regressor = lm(formula = BidPrice ~ CouponRate, data = training_set)  # BidPrice modelled as a linear function of CouponRate
Step 4- Now we make predictions on the test set
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
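Since this is a regression task rather than classification, the natural evaluation metrics are error-based. Before visualizing, we can quantify the fit with a short sketch computing the root mean squared error and R-squared on the test set; these are standard formulas, though the choice of metric is up to you:
# Evaluating the fit on the test set with RMSE and R-squared
residuals_test = test_set$BidPrice - y_pred
rmse = sqrt(mean(residuals_test^2))  # root mean squared error
ss_res = sum(residuals_test^2)  # residual sum of squares
ss_tot = sum((test_set$BidPrice - mean(test_set$BidPrice))^2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot  # proportion of variance explained
c(RMSE = rmse, R2 = r_squared)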
Step 5- We visualize the training and test set results
# Visualizing the Training set results
install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$CouponRate, y = training_set$BidPrice),
             colour = 'red') +
  geom_line(aes(x = training_set$CouponRate, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('CouponRate vs Bid Price (Training set)') +
  xlab('CouponRate') +
  ylab('Bid Price')
# Visualising the Test set results (the fitted line from the training set is reused)
ggplot() +
  geom_point(aes(x = test_set$CouponRate, y = test_set$BidPrice),
             colour = 'red') +
  geom_line(aes(x = training_set$CouponRate, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('CouponRate vs Bid Price (Test set)') +
  xlab('CouponRate') +
  ylab('Bid Price')
Summary
Now you have completed this tutorial. You have learned what linear regression is, its assumptions, the technical terminology associated with it, its advantages and disadvantages, and how to implement it using the R programming language.