Regularization is a technique in machine learning that is used to control overfitting, which occurs when a model fits the training data so closely that it performs poorly on the test data. Regularization is therefore added to algorithms such as linear regression to reduce overfitting.
Two forms of regularization are especially popular: L1 regularization and L2 regularization. In a previous post we studied the lasso algorithm, which uses L1 regularization, and discussed lasso regression and the general idea of regularization.
Ridge regression and lasso regression are both ways of applying regularization; the two techniques differ in the kind of penalty they add.
In this blog, we will learn about ridge regression and how to implement it in R. We will also see how ridge regression relates to feature selection, along with its applications and benefits.
Ridge regression
The ridge regression algorithm works on the principle of L2 regularization. In the ridge algorithm, a penalty term is added to the loss, which helps in deriving a good fit on the training dataset while keeping the variance on the testing data limited.
The cost function of ridge regression can be written as:
Cost = summation ( observed value - predicted value )^2 + lambda * summation ( squared coefficient values )
The second term is the L2 regularization penalty. If the value of lambda is taken as zero, the cost reduces to the ordinary least squares (OLS) objective. If the value of lambda is very high, the penalty gets too much weight and underfitting can occur. Therefore, by choosing a suitable value of lambda, we can use ridge regression to solve the issue of overfitting.
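As a quick illustration of the effect of lambda, the sketch below uses the built-in mtcars dataset (an assumption made purely for illustration, not part of the original example); with lambda = 0 the ridge fit essentially reproduces the OLS coefficients, while a very large lambda shrinks them heavily. Note that glmnet may warn when a single lambda value is supplied, but the comparison still makes the point:
library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp", "disp")])   # predictor matrix
y <- mtcars$mpg                                   # response
coef(glmnet(x, y, alpha = 0, lambda = 0))         # lambda = 0: close to the OLS coefficients
coef(glmnet(x, y, alpha = 0, lambda = 1000))      # very large lambda: coefficients shrunk towards zero (underfitting)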
Bias and variance are two important quantities to keep in mind when working with regression problems in statistics. Bias is the difference between the expected value of an estimator and the true population parameter, while variance measures the spread or uncertainty of the estimates. For a model to work well, we need both the bias and the variance to be low; if either one is high, the model will perform poorly.
Relationship with linear regression
Ridge regression is an extension of the linear regression algorithm: in ridge regression, the loss function is modified to reduce the complexity of the model. As we saw with lasso regression, this modification is made by adding a penalty parameter.
Similarly, to modify the loss function in ridge regression, we add a penalty that is proportional to the squared magnitude of the coefficients.
The mathematical formula for calculating the loss function can be provided through this equation:
Loss = OLS + alpha * summation ( squared coefficient values )
Here OLS is the ordinary least-squares term and alpha is the weight given to the penalty (it plays the same role as lambda in the cost function above, and should not be confused with the alpha argument of glmnet, which selects between ridge and lasso). When alpha is zero, the loss reduces to ordinary least squares; larger values of alpha shrink the coefficients more strongly.
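To make the formula concrete, here is a minimal sketch that computes this loss by hand; the data and the candidate coefficients are made up purely for illustration:
set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)        # 100 observations, 3 predictors
y <- x %*% c(2, -1, 0.5) + rnorm(100)        # synthetic response
beta  <- c(1.5, -0.8, 0.3)                   # some candidate coefficient values
alpha <- 0.1                                 # weight of the penalty
ols_part <- sum((y - x %*% beta)^2)          # ordinary least-squares term
penalty  <- alpha * sum(beta^2)              # L2 penalty: sum of squared coefficient values
loss <- ols_part + penalty
loss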
Relationship with multiple regression
Multiple regression is a statistical technique for analyzing the relationship between a single dependent variable and multiple independent variables.
Since ridge regression is a form of multiple regression, it works according to the bias-variance trade-off: it accepts a small amount of bias in exchange for lower variance, which is one of the advantages of applying ridge regression. To apply the L2 regularization method, ridge regression adds the squared magnitude of the coefficients as the penalty in the loss function, whereas lasso regression adds the absolute value of the magnitude.
The OLS estimator used in the loss function is unbiased, but its variance can be high when there are many predictors and the predictor variables are correlated with each other. This can result in poor model performance, and to solve the problem we reduce the variance by introducing some bias. This approach of introducing bias to reduce variance is called regularization and is very beneficial for predictive modelling. Ridge regression can therefore be used to build a model with lower complexity and more stable coefficient estimates.
Properties of Ridge regression
Ridge regression is an algorithm used for tuning a model and analyzing data that follow a multiple regression equation and are multicollinear in nature. Ridge regression is not as widely used as lasso regression because of the extra complexity behind it; however, it can be understood easily through the concept of multiple regression. The idea behind ridge regression is to fit a line that does not follow the training data perfectly: a small amount of bias is introduced so that the model generalizes better to new data.
As we discussed earlier, ridge regression handles multicollinear data, so let us look at what multicollinearity means. Multicollinearity occurs when one independent variable in a multiple regression model can be linearly predicted from the other independent variables with a high degree of accuracy, in other words, when the correlation between predictors is high. Multicollinearity can make the regression coefficient estimates inaccurate and unstable, because it indicates redundancy among the independent variables in the model.
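One quick way to spot multicollinearity in practice is to look at the pairwise correlations between the predictors; the short sketch below uses the built-in mtcars data as an assumed example:
round(cor(mtcars[, c("disp", "hp", "wt")]), 2)   # pairwise correlations between predictors
# Correlations well above 0.7 suggest the predictors carry redundant information,
# which is exactly the situation where OLS estimates become unstable and ridge helps.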
Difference between Ridge and Lasso regression
In some ways, and under some conditions, the ridge and lasso regression algorithms are similar and serve the same goals; the key difference between them lies in the technique each one uses to add a penalty:
- Ridge regression uses the L2 regularization method to add its penalty, while lasso regression uses the L1 regularization method.
- The main practical difference is that lasso shrinks the coefficients of less important features all the way to zero, and thus removes some features from the model, whereas ridge only shrinks coefficients towards zero without removing any of them.
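This difference is easy to see from the fitted coefficients. In the sketch below (again using mtcars purely as an assumed example, with an arbitrary lambda of 1), the lasso fit typically sets some coefficients exactly to zero, while the ridge fit only shrinks them:
library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")])
y <- mtcars$mpg
coef(glmnet(x, y, alpha = 0, lambda = 1))   # ridge: all coefficients shrunk, none exactly zero
coef(glmnet(x, y, alpha = 1, lambda = 1))   # lasso: some coefficients set exactly to zero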
Uses of the Ridge regression algorithm:
Ridge regression is similar to lasso regression: it is used for tuning a model and is a form of multiple regression. It is therefore better suited to prediction than to inference.
- Ridge regression shrinks the coefficients, pushing the estimated coefficients towards zero, which improves the model's predictions when it is applied to new datasets.
- When many of the predictors have only small effects (coefficients close to zero), ridge regression helps avoid overfitting while still allowing a relatively complex model, which makes the fitted model more reliable on new datasets.
- Since ridge regression applies L2 regularization, it is useful for controlling overfitting, and the shrunken coefficients give an indication of the relative importance of the features, even though ridge does not remove any feature outright. To obtain the best-fit regression line, we minimize the cost (loss) function described above.
Implementation of ridge regression in R
To apply the ridge regression algorithm to solve the machine learning problem, we can do it with the help of some simple steps:
Step 1- Load the libraries and the dataset.
The first step is to load the required libraries with the help of the library function.
library(glmnet)   # fits ridge and lasso models
library(Metrics)  # evaluation metrics such as RMSE
library(caTools)  # sample.split for the train/test split
Step 2- Then load the data set and define the dependent and independent variables.
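What this step looks like depends on the dataset. The sketch below assumes a CSV file named housing.csv with a numeric target column price; both names are hypothetical placeholders:
data <- read.csv("housing.csv")                        # hypothetical dataset
y <- data$price                                        # dependent (target) variable
x <- as.matrix(data[, setdiff(names(data), "price")])  # independent variables as a matrix, as glmnet expects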
Step 3- The third step is to perform data preprocessing, in which we clean the data and deal with missing values and duplicated rows. After that, we split the dataset into training and testing sets, as sketched below.
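A minimal version of this step, continuing the hypothetical housing.csv example above, could look like this (sample.split comes from the caTools package loaded in Step 1):
data <- unique(na.omit(data))                         # drop rows with missing values and duplicated rows (assumed strategy)
set.seed(42)                                          # make the split reproducible
split <- sample.split(data$price, SplitRatio = 0.7)   # 70% training, 30% testing
train <- subset(data, split == TRUE)
test  <- subset(data, split == FALSE)
x_train <- as.matrix(train[, setdiff(names(train), "price")])
y_train <- train$price
x_test  <- as.matrix(test[, setdiff(names(test), "price")])
y_test  <- test$price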
Step 4– Implement the Ridge regression
Then we call the glmnet function, which takes the following arguments:
- The first argument is the matrix of input variables (x_train in the code below).
- The second argument is the target variable (y_train).
- The third argument is the family, which determines the distribution of the response. For a linear regression model, it is the Gaussian distribution.
- The fourth argument is nlambda, which decides how many regularization parameters are tried.
- The fifth argument is alpha: a value of 0 gives the ridge penalty and a value of 1 gives the lasso penalty.
- The sixth argument is lambda, which specifies the sequence of lambda values to be tested.
After that, we fit the glmnet model and print the summary of the model.
lambdas <- 10^seq(2, -3, by = -0.1)   # grid of lambda values to try
ridge_regression = glmnet(x_train, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
summary(ridge_regression)
Step 5– Making the prediction on the test set
At last, we make predictions on the test dataset and evaluate the model using regression metrics such as the sum of squared errors and the root mean squared error.
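A sketch of this step, assuming the x_test and y_test objects from the hypothetical split in Step 3; here the predictions are simply made at the smallest lambda in the grid, although in practice cv.glmnet would normally be used to choose lambda:
predictions <- predict(ridge_regression, s = min(lambdas), newx = x_test)  # predictions at a chosen lambda
sse_value  <- sum((y_test - predictions)^2)            # sum of squared errors
rmse_value <- rmse(y_test, as.numeric(predictions))    # root mean squared error from the Metrics package
sse_value
rmse_value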
Advantages and disadvantages of using the Ridge Regression
Advantages:
- Applying ridge regression can help us prevent the model from overfitting.
- Ridge regression performs well when the number of predictors in the data is larger than the number of observations (see the sketch after this list).
- The complexity of the model is reduced while applying the ridge regression.
- Ridge regression does not require the estimator to be unbiased; it deliberately works with biased estimators.
- When we need to improve the least-squares estimate in the presence of multicollinearity, the ridge estimator is one of the most effective ways to do so.
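To illustrate the point about having more predictors than observations, here is a small synthetic sketch; the data are made up purely for illustration. Ordinary lm() cannot estimate all of the coefficients in this setting, while ridge regression still produces a well-defined fit:
set.seed(7)
x <- matrix(rnorm(20 * 50), nrow = 20)         # 20 observations, 50 predictors (p > n)
y <- rnorm(20)
fit <- glmnet(x, y, alpha = 0, lambda = 0.5)   # ridge fit is still well defined
coef(fit)[1:6, ]                               # first few coefficient estimates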
Disadvantages:
- The ridge regression fails at performing feature selection.
- While creating a final model, the ridge regression will include all the predictors in this model.
- Ridge regression shrinks the coefficients towards zero, which biases the estimates and makes the individual coefficients harder to interpret.
- Ridge regression performs a bias-variance trade-off: it accepts some bias in order to reduce variance.
Conclusion
In this blog, we have learned what ridge regression is, how it works, and how to implement it in the R programming language.
If you enjoy this site, tell your coworkers or friends about it. You may find me on social networking sites like Twitter, Instagram, and Linkedin.
LinkedIn – https://www.linkedin.com/in/abhishek-kumar-singh-8a6326148
Twitter- https://twitter.com/Abhi007si
Instagram- www.instagram.com/dataspoof