Linear regression regularization (penalty terms): principles, types of regularization, and Python implementation

Article directory

  • 1 The meaning of the regularization term
  • 2 The difference between L1 and L2 regularization
  • 3 Python implementation of regularization
    • 3.1 Lasso regularization
    • 3.2 Ridge regularization
    • 3.3 Elastic Net regularization
  • 4 Case example

Prerequisite knowledge for this post:
Linear regression: derivation of the least-squares solution and a from-scratch Python implementation
Feature expansion for linear regression: principles and Python implementation

1 The meaning of the regularization term

In linear regression, the regularization term is a technique used to control the complexity of the model: it limits complexity by adding a penalty on the size of the coefficients to the loss function. Usually L1 regularization or L2 regularization is used. The regularization terms can be expressed as:

L1 regularization term (Lasso):

$$L_{1} = \lambda \sum_{i=1}^{p} \left| w_i \right|$$

L2 regularization term (Ridge):

$$L_{2} = \lambda \sum_{i=1}^{p} w_i^2$$

where $p$ is the number of coefficients, $w_i$ is the $i$-th coefficient, and $\lambda$ is the regularization parameter that controls the strength of regularization.

The L1 regularization term uses the sum of the absolute values of the coefficients as the penalty. It can drive some coefficients of the model to exactly 0, thereby performing feature selection. In other words, the L1 term can pick out the most important features in the model and thus improve its generalization ability.

The L2 regularization term uses the sum of the squares of the coefficients as the penalty. It prevents the coefficients of the model from becoming too large, thereby reducing overfitting. The L2 term is widely used in many machine learning tasks.
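As a quick worked example: for a coefficient vector $w = (3, -0.5, 0)$, the L1 penalty is $\lambda(|3| + |-0.5| + |0|) = 3.5\lambda$, while the L2 penalty is $\lambda(3^2 + 0.5^2 + 0^2) = 9.25\lambda$. The L2 penalty weighs the large coefficient much more heavily, while near zero the L1 penalty still decreases at a constant rate as a coefficient shrinks, which is part of why L1 tends to drive small coefficients exactly to 0.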

Taking the L2 regularization term as an example, before adding the regularization term the loss function is:

$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2$$
After adding the regularization term, the loss function becomes:

$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
where $h_w(x^{(i)})$ is the predicted value, $y^{(i)}$ is the actual value, $w_j$ is the weight of the $j$-th feature, and $\lambda$ is the regularization parameter that controls the strength of regularization.
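As a minimal sketch of this formula (the function name and variables here are illustrative, not from any library), the regularized loss can be computed directly with NumPy:

import numpy as np

def ridge_loss(w, X, y, lam):
    # A minimal sketch of the L2-regularized loss J(w) defined above.
    # X: (m, n) feature matrix, y: target vector, w: weight vector, lam: lambda.
    # The intercept term is omitted for simplicity.
    m = X.shape[0]
    residuals = X @ w - y                        # h_w(x^(i)) - y^(i) for all samples
    mse_term = np.sum(residuals ** 2) / (2 * m)  # (1/2m) * sum of squared errors
    reg_term = lam * np.sum(w ** 2) / (2 * m)    # (lambda/2m) * sum of w_j^2
    return mse_term + reg_term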

This means that when fitting the model, a balance is struck between the model's performance on the training set and its complexity. In practice, cross-validation is often used to choose the regularization parameter $\lambda$ that gives the best performance and generalization ability. Regularization is a very effective technique for alleviating overfitting and improving the generalization ability of a model.
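As a hedged sketch of that selection step (the candidate alpha values below are arbitrary examples), scikit-learn's GridSearchCV can pick the regularization strength by cross-validation:

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

boston = load_boston()
X, y = boston.data, boston.target

# Try several regularization strengths and keep the best cross-validated one
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(Ridge(), param_grid, scoring='r2', cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the alpha with the highest cross-validated R^2
print(grid.best_score_)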

2 The difference between L1 and L2 regularization

Besides the algebraic difference between L1 and L2 described above, another important difference is that the solution under the L1 penalty is not necessarily unique, while the solution under the L2 penalty is unique. This is because the constraint region of the L1 penalty is a rhombus whose corners lie on the coordinate axes, while the constraint region of the L2 penalty is a circle with a smooth boundary. As a result, the L1 solution may fail to be unique in certain cases. See the figure below:
[Figure: constraint regions of the L1 penalty (rhombus) and the L2 penalty (circle)]
The same figure also explains why L1 regularization can select the most important features: the solution is more likely to land on a coordinate axis, and a point on a coordinate axis means that some $\theta$ is exactly 0.

It is usually hard to say whether L1 or L2 regularization is better, which is why Elastic Net regularization emerged: it combines L1 regularization and L2 regularization.

The regularization term of Elastic Net can be expressed as:

$$L_{1,2} = \lambda_1 \sum_{i=1}^{p} \left| w_i \right| + \lambda_2 \sum_{i=1}^{p} w_i^2$$

where $\lambda_1$ and $\lambda_2$ are the regularization parameters of the L1 and L2 terms respectively, $p$ is the number of coefficients, and $w_i$ is the $i$-th coefficient.

The main advantage of Elastic Net is that it overcomes the respective shortcomings of the L1 and L2 penalties while keeping their strengths. By adjusting the weights of the L1 and L2 terms, the trade-off between feature selection and coefficient shrinkage can be controlled, often leading to better performance and generalization.

3 Python implementation of regularization

3.1 Lasso regularization

Import the Lasso regression class with from sklearn.linear_model import Lasso; the subsequent modeling steps are as follows:

  1. Import the required libraries and data (the same Boston house price data as in the previous post)
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
  2. Create the Lasso model object
lasso = Lasso(alpha=1.0)
  3. Fit the model
lasso.fit(X, y)
  4. Inspect the model coefficients
print(lasso.coef_)
  5. Predict
y_pred = lasso.predict(X)
  6. Evaluate model performance
from sklearn.metrics import r2_score
print(r2_score(y, y_pred))

The output of this code is:
[Screenshot: the printed Lasso coefficients and R² score]
You can see that some of the $\theta$ values are exactly 0: adding Lasso regularization successfully performs feature selection.

3.2 Ridge regularization

Ridge regression is available via from sklearn.linear_model import Ridge. The subsequent model training and usage are exactly the same as for Lasso and are very simple, so only a brief sketch is given below.
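A minimal sketch, mirroring the Lasso steps above:

from sklearn.linear_model import Ridge
from sklearn.datasets import load_boston
from sklearn.metrics import r2_score

boston = load_boston()
X = boston.data
y = boston.target

ridge = Ridge(alpha=1.0)    # create the Ridge model object
ridge.fit(X, y)             # fit the model
print(ridge.coef_)          # inspect the coefficients
y_pred = ridge.predict(X)   # predict
print(r2_score(y, y_pred))  # evaluate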
Also using the Boston dataset, the result of the Ridge code is:
[Screenshot: the printed Ridge coefficients and R² score]
As you can see, this time no $\theta$ value is exactly zero, but the $\theta$ values corresponding to unimportant features become very small. This is the effect mentioned above: preventing the model coefficients from becoming too large, thereby reducing overfitting.

3.3 Elastic Net regularization

Elastic Net regression is available via from sklearn.linear_model import ElasticNet. Compared with the previous two methods, the ElasticNet constructor takes one additional parameter:

elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)

The l1_ratio parameter controls the mix between L1 and L2 regularization and takes a value between 0 and 1: a value of 1 is equivalent to Lasso, and 0 is equivalent to Ridge. Everything else works exactly as above; a short usage sketch follows.
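A minimal usage sketch, assuming the same X and y loaded in section 3.1:

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)  # equal mix of L1 and L2
elastic_net.fit(X, y)                              # X, y: the Boston data from section 3.1
print(elastic_net.coef_)                           # some coefficients may be exactly 0 (the L1 effect)
y_pred = elastic_net.predict(X)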

4 Case example

Below we use the Boston house price dataset to compare the performance of the three regularization methods.
First, we import the data and apply a second-order polynomial expansion to it. The polynomial expansion is used here, first, to review the content of the previous post and, second, to simulate the situation, common in real projects, where the data contain many useless features.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score, KFold
  
# load the boston house price dataset
boston = load_boston()
X = boston.data
y = boston.target
  
# Perform 2nd order polynomial expansion
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

Next, define the LinearRegression, Lasso, Ridge, and ElasticNet models. The penalty coefficient for Ridge regression is set much higher than for the other two models because the author has already experimented with this dataset and found a suitable hyperparameter range. In real projects it is recommended to perform a grid search (GridSearch) as appropriate.

# Define LinearRegression, Lasso, Ridge, ElasticNet models
lr = LinearRegression()
lasso = Lasso(alpha=1.0)
ridge = Ridge(alpha=100.0)
elastic_net = ElasticNet()

In order to evaluate the model more objectively, we use 5-fold cross-validation to evaluate the performance of the model.

# Define 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=1)
  
# Evaluate model performance using cross-validation
scores_lr = cross_val_score(lr, X_poly, y, scoring='r2', cv=cv)
scores_lasso = cross_val_score(lasso, X_poly, y, scoring='r2', cv=cv)
scores_ridge = cross_val_score(ridge, X_poly, y, scoring='r2', cv=cv)
scores_elastic_net = cross_val_score(elastic_net, X_poly, y, scoring='r2', cv=cv)

Finally, the scores of each model are calculated and output, and the best model is selected.

# Compute the R-squared score for cross-validation
r2_lr = np.mean(scores_lr)
r2_lasso = np.mean(scores_lasso)
r2_ridge = np.mean(scores_ridge)
r2_elastic_net = np.mean(scores_elastic_net)
  
# Output the R-squared score for each model
print('Linear Regression R2:', r2_lr)
print('Lasso R2:', r2_lasso)
print('Ridge R2:', r2_ridge)
print('ElasticNet R2:', r2_elastic_net)

The overall output of this code is:
[Screenshot: the cross-validated R² score of each model]

It can be seen that the models' performance does not differ much. This is mainly because the Boston dataset is sufficiently large and relatively well-behaved, so every model performs fairly well; in addition, we did not tune the hyperparameters of each model.

Finally, a reminder: if you want to actually use the model, you should take the best model and hyperparameters and retrain. During cross-validation, 5 separate models are fitted per estimator and none of them can be singled out as the final model, so it is best to retrain once more on the full dataset, as sketched below.
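As a hedged sketch of that final step (reusing the names defined in the case-study code above), one could pick the model with the highest mean cross-validated R² and refit it on the full expanded dataset:

# Pick the best-scoring model and retrain it on all the data
scores = {'LinearRegression': r2_lr, 'Lasso': r2_lasso,
          'Ridge': r2_ridge, 'ElasticNet': r2_elastic_net}
models = {'LinearRegression': lr, 'Lasso': lasso,
          'Ridge': ridge, 'ElasticNet': elastic_net}

best_name = max(scores, key=scores.get)
best_model = models[best_name]
best_model.fit(X_poly, y)          # retrain on the full dataset before deployment
print('Best model:', best_name)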
