
Introduction to Regression


In this article, we will learn about regression, the terminologies associated with it, and the types of regression models. This blog is an introduction to regression; each of the individual types of regression can be studied in much more depth in the other blogs on this same platform, and we link a detailed view of the other regression algorithms as we introduce the models here. We have also discussed a few use cases of regression at the end of this article, which will give us a closer look at the regression technique in practice.

Regression

Regression is a machine learning technique that investigates the relationship between one or more independent variables and a dependent variable. The independent variables are also known as features, and the dependent variable is the outcome. In machine learning, regression is used as a method for predictive modeling, where the algorithm predicts continuous outcomes.

Regression analysis mostly deals with predictive or forecasting models and is a common use of supervised machine learning. The technique generally involves fitting a line through the data points such that the total distance between the line and the individual data points is minimized; this line is called the best-fit line. Regression is a statistical model of the relationship between the dependent variable and one or more independent variables, so regression analysis helps us understand how the value of the dependent variable changes as an independent variable changes.

Terminologies associated with regression

While dealing with regression analysis and algorithms, several terms come up repeatedly and must be discussed to understand the concept of regression. Some of the key regression terminologies are:

Underfitting and overfitting

Underfitting occurs when the model fails to capture the relationship between the input variables and the output variable accurately, which leads to a high error rate on both the training set and unseen data. High bias and low variance are the conditions that indicate underfitting. Overfitting is the opposite case: it occurs when the model has been trained longer than necessary or is overly complex, so it fits the training data too closely and produces a higher error rate on the test dataset.
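To make this concrete, here is a minimal sketch (using scikit-learn and synthetic data made up for illustration) that fits polynomials of increasing degree to the same noisy data. A degree that is too low underfits, and a degree that is too high overfits, which shows up as a gap between training and test error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy non-linear data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```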

Bias and Variance

Bias and variance are two of the most important factors in regression, and the trade-off between them plays a major role in determining the performance of a model. Bias is the set of simplifying assumptions made by the model so that the target function is easier to learn; low bias means fewer assumptions, and high bias means more assumptions. Variance is an estimate of how much the learned target function would change if a different training dataset were used; low variance means small changes, while high variance means large changes in the estimate of the target function when the training dataset changes.
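One way to see variance in action is to refit the same model on different resamples of the training data and measure how much its prediction at a fixed point moves around. The sketch below (synthetic data, bootstrap resampling) is just an illustration of the idea, not a formal bias-variance decomposition:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=2.0, size=100)

# Refit the same model on different resamples of the training data and
# watch how much the prediction at a fixed point changes (the variance).
preds_at_5 = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    model = LinearRegression().fit(X[idx], y[idx])
    preds_at_5.append(model.predict([[5.0]])[0])

print("mean prediction at x=5:", np.mean(preds_at_5))
print("std of predictions (a variance proxy):", np.std(preds_at_5))
```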

Regression coefficients

These are estimates of the unknown population parameters that describe the relationship between a predictor variable and the response variable. In linear regression, each coefficient multiplies its predictor value in the regression equation. The sign of a coefficient gives the direction of the relationship between the predictor and response variables: a positive coefficient indicates that the response variable increases as the predictor variable increases, while a negative coefficient indicates that the response variable decreases as the predictor variable increases.
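As a small illustration (with made-up data in which we control the true coefficients), the sketch below fits a linear regression and prints the estimated coefficients, whose signs match the directions described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
# True relationship: y rises with the first predictor, falls with the second.
y = 2.5 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)   # signs show the direction of each relationship
print("intercept:", model.intercept_)
```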

Residuals

A residual is the vertical distance of a data point from the regression line and measures the error between the predicted value and the actual observed value. A residual can be calculated using the formula:

residual = observed value (y) − predicted value (ŷ)

Based on the residual values, a typical residual plot is made with the residual values on the Y-axis and the independent variable on the X-axis. Linear regression assumes that the errors are normally distributed and independent. A good residual plot has more points close to the zero line and fewer points far away from it, and it is roughly symmetric about that line.
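The following sketch (made-up data) computes the residuals as observed minus predicted values and draws the residual plot described above; for a well-behaved linear fit, the points should scatter symmetrically around the zero line:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=1.5, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # observed minus predicted

plt.scatter(X, residuals, s=12)
plt.axhline(0.0, color="red", linewidth=1)
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.title("Residual plot (should be symmetric about zero)")
plt.show()
```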

Types of Regression

Regression models are very efficient and can be used for making predictions based on statistical calculations. In this section, we briefly discuss several of the regression models; a detailed explanation of these algorithms is also available on this platform. Some of the most popular regression algorithms are discussed below:

Linear regression

Linear regression is a machine learning algorithm based on the regression technique and is used for regression tasks. It is one of the most basic algorithms and performs forecasting by determining the relationship between an independent variable and a dependent variable. It predicts the dependent variable (y) on the basis of the independent variable (x); because it models a linear relationship, it is called linear regression. The hypothesis function of linear regression is given by the equation:

y = θ₀ + θ₁x

where θ₀ is the intercept of the line and θ₁ is its slope, the coefficient of x.

Fitting the model means choosing θ₀ and θ₁ to achieve the best-fit regression line, which is then used for predicting the y value in a way such that the error difference between the true value and the predicted value is minimum.
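As a quick illustration, here is a minimal sketch with a tiny made-up dataset (years of experience versus salary in thousands, purely hypothetical numbers) that fits this line with scikit-learn and reads off the estimated intercept and slope:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x: years of experience (made up), y: salary in thousands (made up)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

model = LinearRegression().fit(x, y)
print(f"best-fit line: y = {model.intercept_:.2f} + {model.coef_[0]:.2f} * x")
print("prediction for x=6:", model.predict([[6.0]])[0])
```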

Multiple regression

Multiple regression is also known as multiple linear regression, as it works in a similar manner to linear regression. It is a statistical technique in which several explanatory variables are used for predicting the outcome of a response variable. The goal of this algorithm is to model a linear relationship between the explanatory (independent) variables and the response variable; the difference from simple linear regression is that there are multiple independent variables instead of one. Multiple regression can be used for making predictions about a single response variable on the basis of information known about several other variables.
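A minimal sketch of the idea, using synthetic data with three explanatory variables; the fitted model produces one coefficient per explanatory variable plus an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
# Three hypothetical explanatory variables per observation.
X = rng.normal(size=(150, 3))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.3, size=150)

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("one coefficient per explanatory variable:", model.coef_)
```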

Polynomial regression

Polynomial regression is a special form of linear (or multiple) regression in which the relationship between the variables is modeled as an nth-degree polynomial. The algorithm is sensitive to outliers, so its performance can suffer badly in the presence of even one or two of them. Polynomial regression is able to deal with non-linear data and helps in identifying curvilinear relationships between the dependent and independent variables. It is applied by converting the input into polynomial terms of the nth degree and then fitting a linear regression on those terms, where the value of n is usually chosen by trial and error.
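Here is a sketch of that workflow using scikit-learn: the input is converted to polynomial terms (degree 2 here, an arbitrary choice for the example) and an ordinary linear regression is then fitted on those terms:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(100, 1))
y = 1.0 - 2.0 * X.ravel() + 3.0 * X.ravel() ** 2 + rng.normal(scale=0.4, size=100)

# Convert the input to degree-2 polynomial terms, then fit a linear model on them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R^2 on the training data:", model.score(X, y))
```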

Regularized regression

Lasso

LASSO stands for least absolute shrinkage and selection operator and is a regularized regression algorithm. Lasso regression uses the shrinkage method to produce simple, sparse models, where a sparse model is one with fewer parameters; in shrinkage, the coefficient values are shrunk towards a central point, such as the mean, and some may be shrunk exactly to zero.
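The sparsity effect is easy to see in a small sketch: with synthetic data in which only two of ten features actually matter, lasso shrinks most of the remaining coefficients exactly to zero (the alpha value below is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
# Only the first two of ten features actually matter here.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(model.coef_, 3))  # most are shrunk exactly to zero
```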

Ridge

Ridge is a regularized regression algorithm that is similar to LASSO regression but performs L2 regularization by adding an L2 penalty. This penalty equals the square of the magnitude of the coefficients, and it shrinks all coefficients by a similar factor while eliminating none of them. Ridge regression therefore does not produce sparse models, and it introduces a small amount of bias into the estimates in exchange for lower variance.
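A small sketch of the shrinkage effect, using synthetic data: compared with ordinary least squares, ridge produces coefficients with a smaller overall magnitude, but none of them are exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# The L2 penalty shrinks every coefficient toward zero but eliminates none.
print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```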

Elastic net regression

Elastic net is a very efficient and widely used form of regularized linear regression, which linearly combines the L1 and L2 penalties of the LASSO and ridge methods. This model overcomes limitations of both: for example, with high-dimensional data, lasso can select at most as many variables as there are samples, whereas elastic net can keep including variables up to the point of saturation. The model can also be reduced to a linear support vector machine. It works by subjecting the coefficients to both types of shrinkage at once.
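A minimal sketch with scikit-learn's ElasticNet on synthetic data; the l1_ratio parameter controls the blend between the two penalties (1.0 would be pure lasso, 0.0 pure ridge), and the alpha value here is arbitrary:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio blends the two penalties: 1.0 is pure lasso, 0.0 is pure ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```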

Logistic regression

Logistic regression is a machine learning algorithm that models the probability of a discrete outcome given the input variables. It comes in three types, based on the kind of outcome it produces: a binary logistic regression model has two possible outcomes; a multinomial logistic regression model produces an output of more than two categories without ordering; and an ordinal logistic regression model produces an output of more than two categories with ordering. In logistic regression, a threshold is set on the predicted probability to decide which class a data point belongs to, and this implicitly defines a decision boundary, which can be linear or non-linear.
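As a small illustration with made-up, linearly separable data, the sketch below fits a binary logistic regression, reads the predicted class probabilities for one point, and applies the usual 0.5 threshold to turn the probability into a class label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary outcome

model = LogisticRegression().fit(X, y)
probs = model.predict_proba([[0.5, 0.5]])[0]
print("class probabilities:", probs)
# Apply a threshold (0.5 here) to turn the probability into a class.
print("predicted class:", int(probs[1] >= 0.5))
```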

Use cases of regression in real world

Regression models are used widely in the real world and are very helpful in many situations because of the statistical calculations they apply. Some of the real-world use cases of regression models are:

  1. Regression analysis can be used very effectively in business, as it supports several statistical methods. Predictive analytics is the best-known use of regression models in this setting, forecasting future opportunities and risks. Supporting decisions and correcting errors are also valuable applications in business intelligence that can help in growing and understanding a business.
  2. Time series analysis, such as stock forecasting and trading, is another popular field where regression is applied, since regression models provide a principled statistical way to analyze how a quantity evolves over time.
  3. In medical science, linear regression is used for understanding the relationship between drug dosage and the blood pressure of patients, which makes it a natural choice in medical research.

Conclusion

In this article, we have discussed an introduction to regression and the algorithms that work on the basis of regression techniques. Regression is a form of supervised machine learning and is used in various fields, some of which we have discussed here. We have also covered some of the terminologies associated with regression and their importance in the regression technique, along with the most popular regression algorithms, each of which is powerful and useful depending on the use case. We hope that this article proves helpful and deepens your knowledge in the field of regression.
