Hello and welcome to our journey into multiple linear regression with Python! Ever wondered how to make predictions based on several variables at once? This guide walks you through everything from deciphering intricate relationships in your data to writing practical Python code. This step-by-step lesson is ideal for anybody interested in data science, regardless of experience level or coding knowledge. Prepare to open up new possibilities for data interpretation and analysis. Let’s begin this journey together!
What is Multiple linear regression?
In simple linear regression, we deal with a single independent variable, whereas in multiple linear regression, we deal with two or more independent variables.
For example, rainfall depends on many parameters, including pressure, temperature, wind speed, humidity, and more.
The mathematical equation that represents MLR is:

y = b0 + b1x1 + b2x2 + … + bnxn

Here y is the dependent variable; b0, b1, b2, …, bn are the coefficients of the regression model; and x1, x2, …, xn represent the independent variables.
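To make the formula concrete, here is a minimal sketch of how a prediction is computed with NumPy. The coefficient and feature values below are made up purely for illustration:

import numpy as np

# Hypothetical coefficients: intercept b0 and slopes b1, b2, b3 (illustrative values only)
b0 = 2.0
b = np.array([0.5, -1.2, 3.0])

# One observation with three independent variables x1, x2, x3
x = np.array([10.0, 4.0, 1.5])

# y = b0 + b1*x1 + b2*x2 + b3*x3
y = b0 + np.dot(b, x)
print(y)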
How to select the best features in the dataset
Suppose a dataset has hundreds of features. The question then arises: how do we decide which features have the strongest impact on the target variable?
To answer this question, we use feature selection.
Feature selection is a technique used to find the variables that have the most impact on the target variable.
Backward elimination
This method starts with all candidate variables and iteratively removes the least significant ones from the model.
Steps to perform backward elimination
- The first step is to fit the model with all independent variables.
- The second step is to choose a significance threshold, let’s say p = 5% (0.05).
- The next step is to find the independent variable with the highest p-value. If that p-value is greater than the threshold, remove the variable; otherwise finish.
- Fit the model with the remaining variables and repeat the previous step (see the sketch after this list).
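Here is a minimal sketch of that loop using statsmodels. It assumes X is a pandas DataFrame of predictors and y is the target Series; the function name backward_elimination is just an illustrative choice:

import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    """Iteratively drop the predictor with the highest p-value above the threshold."""
    X = sm.add_constant(X)                     # add the intercept column
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop('const')  # never drop the intercept
        worst = pvalues.idxmax()               # predictor with the highest p-value
        if pvalues[worst] > threshold:
            X = X.drop(columns=[worst])        # remove the least significant predictor
        else:
            return model, X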
Implementation of multiple linear regression in Python
The dataset is taken from Kaggle. It contains sales records for 7 different fish species from a fish market. The columns are the fish species, weight, three length measurements, height, and width.
Step 1– The first step is to load all the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
- NumPy: It is used for mathematical operations
- pandas: It is used to load the dataset
- Matplotlib and Seaborn: They are visualization tools used to plot graphs
- statsmodels: It is used to build the ordinary least squares (OLS) model for backward elimination
- train_test_split: It is used to separate our data into training and testing subsets
Step 2– The second step is to load the dataset using the pandas read_csv function.
# Load the dataset
data = pd.read_csv('Fish.csv')
# Displaying first five rows
data.head()
Step 3- The third step is to check the dataset's shape and data types, and to count any missing values. In our case, there are no missing values in the dataset.
# Displaying the no of rows and columns
data.shape
# Print datatypes of each column
data.dtypes
#checking for missing values
data.isnull().sum()
Step 4- The next step is to count all the unique species in the column and then make a bar plot showing the number of fish in each class.
#Count no of unique species
data['Species'].value_counts()
#plotting the counts of each species
species_counts = data['Species'].value_counts()
sns.barplot(x=species_counts.index, y=species_counts.values);
plt.xlabel('Species')
plt.ylabel('Counts of Species')
plt.show()
Step 5– Now we encode all the categorical columns of the dataset and make a correlation table. After that, we use a heatmap to find the relationship between features.
# Encoding categorical columns
for col_name in data.columns:
    if data[col_name].dtype == 'object':
        data[col_name] = data[col_name].astype('category')
        data[col_name] = data[col_name].cat.codes
Next, the heatmap is used to find the correlation between the variables.
# Correlation of the Variables:
data.corr()
sns.heatmap(data.corr(), annot=True);
The heatmap is shown down below,
Here is a brief interpretation of correlation coefficients.
- All other variables are negatively correlated with the encoded Species column. This implies that the measurements tend to decrease as the species code increases (note that the code values themselves are arbitrary labels).
- Weight, Length1, Length2, Length3, and Width all have a strong positive association. This shows that heavier fish also tend to be longer and broader.
- Length1, Length2, and Length3 are strongly positively correlated with one another. In other words, these three variables, which reflect different measurements of fish length, tend to rise or fall together.
- Height and Width are positively associated with Weight, Length1, Length2, and Length3, indicating that taller and broader fish also tend to be heavier and longer.
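To back up these observations, you can also print the correlation of every column with the target variable directly. This quick check is an addition to the original walkthrough:

# Correlation of each feature with the target variable, sorted from strongest to weakest
print(data.corr()['Weight'].sort_values(ascending=False))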
Step 6– Now we check for any outliers (an outlier is a point that lies far away from the rest of the data) in the dataset and remove them. Removing outliers gives us a more accurate model, so it's a good idea to deal with them.
We are going to use the interquartile range (IQR) to detect outliers, and then visualize the column with a box plot. As you can see in the figure, some points fall outside the box; those points are termed outliers.
# Visualize the Weight column with a box plot
sns.boxplot(x=data['Weight'])

# Compute the interquartile range (IQR) for Weight
dfw = data['Weight']
dfw_Q1 = dfw.quantile(0.25)
dfw_Q3 = dfw.quantile(0.75)
dfw_IQR = dfw_Q3 - dfw_Q1

# Any point outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier
dfw_lowerend = dfw_Q1 - (1.5 * dfw_IQR)
dfw_upperend = dfw_Q3 + (1.5 * dfw_IQR)
dfw_outliers = dfw[(dfw < dfw_lowerend) | (dfw > dfw_upperend)]
dfw_outliers
Similarly, check for outliers in the other columns using the same technique; a helper that applies it to every measurement column is sketched below.
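Here is a minimal sketch of such a helper. The function name remove_iqr_outliers is an illustrative choice, and the column list assumes the standard columns of the Kaggle Fish.csv file:

def remove_iqr_outliers(df, columns):
    """Drop rows that fall outside the 1.5*IQR fences for any of the given columns."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        mask &= df[col].between(lower, upper)
    return df[mask]

# One possible way to apply it to all numeric measurement columns
numeric_cols = ['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width']
data = remove_iqr_outliers(data, numeric_cols)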
Step 7- Now we define our input and target variables, and then we split the dataset into training and testing sets. After that, we fit the multiple linear regression model on the training set and predict the test set results.
#defining input and target variables
X= data.loc[:, data.columns != 'Weight']
y= data['Weight']
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting the Multiple Linear Regression in the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_Train, Y_Train)
# Predicting the Test set results
Y_Pred = regressor.predict(X_Test)
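To get a quick feel for how well this first model fits, you can score the test-set predictions. This evaluation step is an addition to the original walkthrough and uses standard scikit-learn metrics:

from sklearn.metrics import r2_score, mean_squared_error

print('R-squared on the test set:', r2_score(Y_Test, Y_Pred))
print('RMSE on the test set:', mean_squared_error(Y_Test, Y_Pred) ** 0.5)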
Step 8- Now we perform feature selection. We are going to use backward elimination. After that, we fit the multiple linear regression model on the optimal training set and predict the test set results.
# Building the optimal model using Backward Elimination
import statsmodels.api as sm
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)
X_Optimal = X[:, [0,1,2,3,4,5,6]]
regressor_OLS = sm.OLS(endog = y, exog = X_Optimal).fit()
regressor_OLS.summary()
Next, the predictor with the highest p-value (the column at index 4) is removed, the model is refit on the remaining features, and the ordinary least squares summary is displayed again.
X_Optimal = X[:, [0,1,2,3,5,6]]
regressor_OLS = sm.OLS(endog = y, exog = X_Optimal).fit()
regressor_OLS.summary()
Now fit the multiple linear regression model on the optimal features and make predictions on the test set.
# Fitting the Multiple Linear Regression in the Optimal Training set
X_Optimal_Train, X_Optimal_Test, Y_Train, Y_Test = train_test_split(X_Optimal, y, test_size = 0.2, random_state = 0)
regressor.fit(X_Optimal_Train, Y_Train)
# Predicting the Optimal Test set results
Y_Optimal_Pred = regressor.predict(X_Optimal_Test)
print(Y_Optimal_Pred)
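For a quick numerical comparison against the first model, you can score these predictions with the same metric; again, this check is an addition to the original code:

from sklearn.metrics import r2_score

print('R-squared on the test set (optimal model):', r2_score(Y_Test, Y_Optimal_Pred))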
Interpreting the results
Here is the interpretation of the two regression models. Six predictor variables (x1 to x6) are used in the first model, while the fourth predictor (x4) is removed from the second model based on its p-value.
First Model:
- R-squared is 0.895, which indicates that the independent variables can account for around 89.5% of the variation in the dependent variable (Weight).
- The reasonably high F-statistic of 214.9 indicates that at least some of the predictors are meaningful.
- The change in the dependent variable caused by a one-unit change in the predictor is represented by a coefficient (coef), assuming that all other predictors remain constant.
- Given that all other predictors are included in the model, the p-value (P>|t|) for each predictor tests the null hypothesis that the predictor’s coefficient is zero. The p-values for variables x2, x3, and x6 are higher than the usual cutoff of 0.05, indicating that they are not statistically significant.
Second Model:
- R-squared is somewhat lower (0.889), indicating that the remaining variables can account for around 88.9% of the variance in Weight.
- Since the F-statistic is now 244.7, at least some predictors are still likely to be significant.
- Variables x2 and x3 now exhibit high p-values, indicating they aren’t statistically significant predictors after variable x4 was eliminated.
By comparing the two models, we can observe that the second model employs one fewer predictor yet has almost the same explanatory power (R-squared). This is in line with Occam’s Razor’s basic tenet that the simpler the model, the better. However, domain expertise and the model’s goal should also be taken into account while choosing variables. For instance, a variable may still be crucial to include even though it has little statistical significance for theoretical reasons.
Applications
Here are 5 major applications of the multiple linear regression model:
- Economics: Multiple linear regression is useful in modelling economic patterns, such as how income, wealth, and interest rates affect consumer spending.
- Healthcare: It’s utilized to research how therapies and lifestyle variables affect patient outcomes. For example, estimating life expectancy depends on illnesses, food, and activity.
- Real estate: Based on factors including location, size, and age, it helps forecast property values.
- Marketing: It helps to comprehend how spending on advertising affects sales across various channels.
- Finance: By taking into account factors like earnings, GDP growth, and other market indicators, MLR aids in the prediction of stock values.
Pros and cons
Next are the pros and cons of using the multiple linear regression model.
| Pros | Cons |
| --- | --- |
| 1. It can help identify the relative influence of predictors. | 1. Assumes a linear relationship between the dependent and independent variables. |
| 2. It can be extended with regularization techniques (e.g., ridge or lasso) to reduce overfitting. | 2. Sensitive to outliers, which can lead to a poor model. |
| 3. It can analyze multiple predictors simultaneously. | 3. Multicollinearity can be a serious issue, requiring careful correlation analysis and potentially variable removal. |
| 4. It provides a global model of the dataset, which can be advantageous in understanding relationships between variables. | 4. Does not handle non-numerical (categorical) variables without encoding them first. |
| 5. Easy to implement and interpret the output. | 5. Assumes no autocorrelation (i.e., the residuals are independent). |
| 6. Can support reasoning about causal relationships when the study design justifies it. | 6. Assumes the residuals are normally distributed and homoscedastic. |
Conclusion
In conclusion, multiple linear regression is a powerful statistical technique that enables us to analyze the relationship between several independent variables and a dependent variable. Thanks to this lesson, you should now have a solid understanding of the concepts underlying multiple linear regression and how to apply it in Python. Although it has applications in a wide range of domains, the assumptions of linearity, normality, and homoscedasticity must be satisfied for the findings to be reliable. We also covered feature selection and elimination methods that help in building more precise models. Always keep in mind that mastering multiple linear regression takes practice and a deeper understanding of its statistical foundations. So keep exploring and have fun analyzing!
If you like the article and would like to support me, make sure to:
- 👏 Like for this article and subscribe to our newsletter
- 📰 View more content on my DataSpoof website
- 🔔 Follow Me: LinkedIn| Youtube | Instagram | Twitter