
Introduction to Regression


In this article, we will learn about regression, the terminologies associated with it, and the types of regression models. This blog is an introduction to regression; each of the individual types of regression can be studied in much more depth in the other blogs on this same platform, and we link a detailed view of the other regression algorithms as we introduce the models here. We have also discussed a few use cases of regression at the end of this article, which will give us a closer look at the regression technique in practice.

Regression

Regression is a machine learning technique that investigates the relationship between one or more independent variables and a dependent variable. The independent variables are also known as features, and the dependent variable is the outcome. In machine learning, regression is used as a method for predictive modeling, where the algorithm predicts continuous outcomes.

Regression analysis mostly deals with predictive or forecasting models and is a common use of supervised machine learning. The technique generally involves fitting a line through the data points such that the total distance between the line and the individual data points is minimized; this line is called the best-fit line. Regression is a statistical model of the relationship between the dependent variable and one or more independent variables, so regression analysis helps us understand how the value of the dependent variable changes as an independent variable changes.

Terminologies associated with regression

While dealing with regression analysis and algorithms, several terms come up repeatedly and must be discussed to understand the concept of regression. Some of the key regression terminologies are:

Underfitting and overfitting

Underfitting occurs when the model fails to capture the relationship between the input variables and the output variable accurately, which leads to a high error rate on both the training set and unseen data. High bias and low variance are the conditions that indicate underfitting. Overfitting is the opposite case: it occurs when the model has been trained longer than necessary or is overly complex, so it fits the training data too closely and produces a higher error rate on the test dataset.
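To make this concrete, here is a minimal sketch (using scikit-learn and synthetic data made up for illustration) that fits polynomials of increasing degree to the same noisy data. A degree that is too low underfits, and a degree that is too high overfits, which shows up as a gap between training and test error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy non-linear data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```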

Bias and Variance

Bias and variance are two of the most important factors in regression, and the trade-off between them plays a major role in determining the performance of a model. Bias is the set of simplifying assumptions made by the model so that the target function is easier to learn; low bias means fewer assumptions, and high bias means more assumptions. Variance is an estimate of how much the learned target function would change if a different training dataset were used; low variance means small changes, while high variance means large changes in the estimate of the target function when the training dataset changes.
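One way to see variance in action is to refit the same model on different resamples of the training data and measure how much its prediction at a fixed point moves around. The sketch below (synthetic data, bootstrap resampling) is just an illustration of the idea, not a formal bias-variance decomposition:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=2.0, size=100)

# Refit the same model on different resamples of the training data and
# watch how much the prediction at a fixed point changes (the variance).
preds_at_5 = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    model = LinearRegression().fit(X[idx], y[idx])
    preds_at_5.append(model.predict([[5.0]])[0])

print("mean prediction at x=5:", np.mean(preds_at_5))
print("std of predictions (a variance proxy):", np.std(preds_at_5))
```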

Regression coefficients

These are estimates of the unknown population parameters that describe the relationship between a predictor variable and the response variable. In linear regression, each coefficient multiplies its predictor value in the regression equation. The sign of a coefficient gives the direction of the relationship between the predictor and response variables: a positive coefficient indicates that the response variable increases as the predictor variable increases, while a negative coefficient indicates that the response variable decreases as the predictor variable increases.
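As a small illustration (with made-up data in which we control the true coefficients), the sketch below fits a linear regression and prints the estimated coefficients, whose signs match the directions described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
# True relationship: y rises with the first predictor, falls with the second.
y = 2.5 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)   # signs show the direction of each relationship
print("intercept:", model.intercept_)
```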

Residuals

A residual is the vertical distance of a data point from the regression line and measures the error between the predicted value and the actual observed value. A residual can be calculated using the formula:

residual = observed value (y) − predicted value (ŷ)

Based on the residual values, a typical residual plot is made with the residual values on the Y-axis and the independent variable on the X-axis. Linear regression assumes that the errors are normally distributed and independent. A good residual plot has more points close to the zero line and fewer points far away from it, and it is roughly symmetric about that line.
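The following sketch (made-up data) computes the residuals as observed minus predicted values and draws the residual plot described above; for a well-behaved linear fit, the points should scatter symmetrically around the zero line:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=1.5, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # observed minus predicted

plt.scatter(X, residuals, s=12)
plt.axhline(0.0, color="red", linewidth=1)
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.title("Residual plot (should be symmetric about zero)")
plt.show()
```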

Types of Regression

Regression models are very efficient and can be used for making predictions based on statistical calculations. In this section, we briefly discuss several of the regression models; a detailed explanation of these algorithms is also available on this platform. Some of the most popular regression algorithms are discussed below:

Linear regression

Linear regression is a machine learning algorithm based on the regression technique and is used for regression tasks. It is one of the most basic algorithms and performs forecasting by determining the relationship between an independent variable and a dependent variable. It predicts the dependent variable (y) on the basis of the independent variable (x); because it models a linear relationship, it is called linear regression. The hypothesis function of linear regression is given by the equation:

y = θ₀ + θ₁x

where θ₀ is the intercept of the line and θ₁ is its slope, the coefficient of x.

Fitting the model means choosing θ₀ and θ₁ to achieve the best-fit regression line, which is then used for predicting the y value in a way such that the error difference between the true value and the predicted value is minimum.
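As a quick illustration, here is a minimal sketch with a tiny made-up dataset (years of experience versus salary in thousands, purely hypothetical numbers) that fits this line with scikit-learn and reads off the estimated intercept and slope:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x: years of experience (made up), y: salary in thousands (made up)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

model = LinearRegression().fit(x, y)
print(f"best-fit line: y = {model.intercept_:.2f} + {model.coef_[0]:.2f} * x")
print("prediction for x=6:", model.predict([[6.0]])[0])
```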

Multiple regression

Multiple regression is also known as multiple linear regression, as it works in a similar manner to linear regression. It is a statistical technique in which several explanatory variables are used for predicting the outcome of a response variable. The goal of this algorithm is to model a linear relationship between the explanatory (independent) variables and the response variable; the difference from simple linear regression is that there are multiple independent variables instead of one. Multiple regression can be used for making predictions about a single response variable on the basis of information known about several other variables.
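A minimal sketch of the idea, using synthetic data with three explanatory variables; the fitted model produces one coefficient per explanatory variable plus an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
# Three hypothetical explanatory variables per observation.
X = rng.normal(size=(150, 3))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.3, size=150)

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("one coefficient per explanatory variable:", model.coef_)
```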

Polynomial regression

Polynomial regression is a special form of linear (or multiple) regression in which the relationship between the variables is modeled as an nth-degree polynomial. The algorithm is sensitive to outliers, so its performance can suffer badly in the presence of even one or two of them. Polynomial regression is able to deal with non-linear data and helps in identifying curvilinear relationships between the dependent and independent variables. It is applied by converting the input into polynomial terms of the nth degree and then fitting a linear regression on those terms, where the value of n is usually chosen by trial and error.
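Here is a sketch of that workflow using scikit-learn: the input is converted to polynomial terms (degree 2 here, an arbitrary choice for the example) and an ordinary linear regression is then fitted on those terms:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(100, 1))
y = 1.0 - 2.0 * X.ravel() + 3.0 * X.ravel() ** 2 + rng.normal(scale=0.4, size=100)

# Convert the input to degree-2 polynomial terms, then fit a linear model on them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R^2 on the training data:", model.score(X, y))
```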

Regularized regression

Lasso

LASSO stands for least absolute shrinkage and selection operator and is a regularized regression algorithm. Lasso regression uses the shrinkage method to produce simple, sparse models, where a sparse model is one with fewer parameters; in shrinkage, the coefficient values are shrunk towards a central point, such as the mean, and some may be shrunk exactly to zero.
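The sparsity effect is easy to see in a small sketch: with synthetic data in which only two of ten features actually matter, lasso shrinks most of the remaining coefficients exactly to zero (the alpha value below is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
# Only the first two of ten features actually matter here.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(model.coef_, 3))  # most are shrunk exactly to zero
```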

Ridge

Ridge is a regularized regression algorithm that is similar to LASSO regression but performs L2 regularization by adding an L2 penalty. This penalty equals the square of the magnitude of the coefficients, and it shrinks all coefficients by a similar factor while eliminating none of them. Ridge regression therefore does not produce sparse models, and it introduces a small amount of bias into the estimates in exchange for lower variance.
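A small sketch of the shrinkage effect, using synthetic data: compared with ordinary least squares, ridge produces coefficients with a smaller overall magnitude, but none of them are exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# The L2 penalty shrinks every coefficient toward zero but eliminates none.
print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```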

Elastic net regression

Elastic net is a very efficient and widely used form of regularized linear regression, which linearly combines the L1 and L2 penalties of the LASSO and ridge methods. This model overcomes limitations of both: for example, with high-dimensional data, lasso can select at most as many variables as there are samples, whereas elastic net can keep including variables up to the point of saturation. The model can also be reduced to a linear support vector machine. It works by subjecting the coefficients to both types of shrinkage at once.
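A minimal sketch with scikit-learn's ElasticNet on synthetic data; the l1_ratio parameter controls the blend between the two penalties (1.0 would be pure lasso, 0.0 pure ridge), and the alpha value here is arbitrary:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio blends the two penalties: 1.0 is pure lasso, 0.0 is pure ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```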

Logistic regression

Logistic regression is a machine learning algorithm that models the probability of a discrete outcome given the input variables. It comes in three types, based on the kind of outcome it produces: a binary logistic regression model has two possible outcomes; a multinomial logistic regression model produces an output of more than two categories without ordering; and an ordinal logistic regression model produces an output of more than two categories with ordering. In logistic regression, a threshold is set on the predicted probability to decide which class a data point belongs to, and this implicitly defines a decision boundary, which can be linear or non-linear.
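As a small illustration with made-up, linearly separable data, the sketch below fits a binary logistic regression, reads the predicted class probabilities for one point, and applies the usual 0.5 threshold to turn the probability into a class label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary outcome

model = LogisticRegression().fit(X, y)
probs = model.predict_proba([[0.5, 0.5]])[0]
print("class probabilities:", probs)
# Apply a threshold (0.5 here) to turn the probability into a class.
print("predicted class:", int(probs[1] >= 0.5))
```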

Use cases of regression in real world

Regression models are used widely in the real world and are very helpful in many situations because of the statistical calculations they apply. Some of the real-world use cases of regression models are:

  1. Regression analysis can be used very effectively in business, as it supports several statistical methods. Predictive analytics is the best-known use of regression models in this setting, forecasting future opportunities and risks. Supporting decisions and correcting errors are also valuable applications in business intelligence that can help in growing and understanding a business.
  2. Time series analysis, such as stock forecasting and trading, is another popular field where regression is applied, since regression models provide a principled statistical way to analyze how a quantity evolves over time.
  3. In medical science, linear regression is used for understanding the relationship between drug dosage and the blood pressure of patients, which makes it a natural choice in medical research.

Conclusion

In this article, we have discussed an introduction to regression and the algorithms that work on the basis of regression techniques. Regression is a form of supervised machine learning and is used in various fields, some of which we have discussed here. We have also covered some of the terminologies associated with regression and their importance in the regression technique, along with the most popular regression algorithms, each of which is powerful and useful depending on the use case. We hope that this article proves helpful and deepens your knowledge in the field of regression.
