Implementing stepwise regression in Python

@Created on: 2023.09.17
@Modified on: 2023.09.17

This article is reproduced from Ali Yiyang's article on implementing stepwise regression in Python.
It is recorded here for reference only and not for any commercial use.

Article directory

    • Implementing stepwise regression in Python
      • 1 What is stepwise regression?
      • 2 Detailed explanation of function parameters for implementing stepwise regression
      • 3 Implementing stepwise regression in Python
        • 3.1 Reading the data
        • 3.2 Bidirectional screening stepwise regression implementation
        • 3.3 Forward screening stepwise regression implementation
        • 3.4 Backward screening stepwise regression implementation
        • 3.5 Bidirectional stepwise regression with feature screening criterion ks
        • 3.6 Bidirectional stepwise regression with feature screening criterion auc
      • References

Implementing stepwise regression in Python

Stepwise Regression is a regression method that selects variables step by step to determine the best prediction model. It optimizes the model’s predictive power by gradually adding and removing variables.

This article focuses on explaining what stepwise regression is and how to implement stepwise regression using Python.

1 What is stepwise regression?

Stepwise regression is a variable-screening procedure used in regression analysis. From a set of candidate variables, it selects the variables that contribute to the model, or eliminates those that do not, in order to build the final model.

Stepwise regression offers three ways of screening variables.

  1. Forward selection: First select the independent variable that alone explains the largest share of variation in the dependent variable, then introduce the remaining independent variables into the model one by one. After each introduction, test (F test) whether the new variable significantly changes the model; if it does, keep the variable in the model, otherwise discard it, until all variables have been considered.

Features: Once an independent variable is selected, it stays in the model permanently.

  2. Backward elimination: The opposite of forward selection. All variables are put into the model at the start, then variables are tentatively removed one at a time to test (F test) whether the removal significantly changes the model. If there is no significant change, the variable is deleted; if there is, it is retained, until every remaining variable has a significant effect on the model.

Features: Once an independent variable is eliminated, it never re-enters the model; and since all variables enter at the start, the method can be computationally expensive.

  3. Bidirectional elimination: A combination of the two methods above. When a variable is introduced, first check (F test) whether it significantly changes the model. If it does, run a t test on all variables already in the model; any previously introduced variable that is no longer significant after the new variable enters is removed. This ensures that only significant variables remain in the regression equation before each new variable is introduced. The process stops when no significant explanatory variable can be added and no insignificant one can be removed, leaving an optimal set of variables.
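As a rough illustration of forward selection under an AIC-style criterion, here is a self-contained toy sketch on NumPy arrays. The helper names `aic_ols` and `forward_select` are invented for this example; this is not toad's implementation.

```python
import numpy as np

def aic_ols(X, y):
    """AIC (up to an additive constant) of an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    rss = float(resid @ resid)
    # Gaussian log-likelihood term plus 2 * (number of parameters)
    return n * np.log(rss / n) + 2 * X.shape[1]

def forward_select(X, y, names):
    """Greedy forward selection: repeatedly add the feature that lowers AIC most."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = np.inf
    while remaining:
        # score every candidate model that adds one more feature
        aic, j = min((aic_ols(X[:, selected + [j]], y), j) for j in remaining)
        if aic >= best_aic:  # no candidate improves AIC, so stop
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return [names[i] for i in selected]
```

Backward elimination can be sketched analogously, starting from all columns and dropping the one whose removal lowers AIC most; bidirectional screening alternates the two moves.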

2 Detailed explanation of function parameters for implementing stepwise regression

To implement stepwise regression, you can use the toad.selection.stepwise function in the toad library. The calling method, main parameters and explanation of this function are as follows:

import toad

toad.selection.stepwise(frame, target='target', estimator='ols', direction='both', criterion='aic', p_enter=0.01, p_remove=0.01, p_value_enter=0.2, intercept=False, max_iter=None, return_drop=False, exclude=None)

frame: input data frame containing the independent variables and the target variable.
target: column name of the target variable in the data frame. The default is 'target'; adjust it to match your data.
estimator: model used for fitting; supports 'ols' (default), 'lr', 'lasso', and 'ridge'.
direction: direction of stepwise regression; supports 'forward' (forward selection), 'backward' (backward elimination), and 'both' (bidirectional, default).
criterion: criterion for selecting features; can be 'aic' (Akaike information criterion, default), 'bic' (Bayesian information criterion), 'ks', or 'auc'.
p_enter: significance level for adding features; the default is 0.01.
p_remove: significance level for removing features; the default is 0.01.
p_value_enter: p-value threshold for adding features; the default is 0.2.
intercept: whether to fit an intercept term; the default is False.
max_iter: maximum number of iterations. The default is None, meaning no limit.
return_drop: whether to also return the names of the dropped features; the default is False.
exclude: list of feature column names to exclude from training, such as ID and timestamp columns. The default is None.

Practical experience:

  1. direction='both' generally works best.
  2. estimator='ols' with criterion='aic' runs quickly, and the results are a good proxy for logistic regression modeling.

These two points are common rules of thumb; the concrete analysis still needs to be based on the modeling data.

3 Implementing stepwise regression in Python

3.1 Reading the data

First, import the modeling data and perform data preprocessing. Since the focus of this article is the implementation of stepwise regression, and the earlier article Enterprise Fraud Identification has already covered this module in detail, it is not repeated here.
The specific code is as follows:

import os
import toad
import numpy as np
import pandas as pd

os.chdir(r'F:\Official Account\3.Enterprise Fraud Identification\audit_data')  # set the folder to read data from
qz_date = pd.read_csv('audit_risk.csv')  # read the data
qz_date.LOCATION_ID = pd.to_numeric(qz_date.LOCATION_ID, errors='coerce')  # convert text entries to numeric
qz_date = qz_date.fillna(0)  # fill missing values in the data frame with 0
qz_date.head(5)
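As a quick illustration of the preprocessing step above (with made-up values, since the real LOCATION_ID contents are not shown here), `pd.to_numeric(..., errors='coerce')` turns unparseable entries into NaN, which `fillna(0)` then replaces:

```python
import pandas as pd

s = pd.Series(['12', 'abc', '3.5'])        # a column with mixed text/numeric values
s_num = pd.to_numeric(s, errors='coerce')  # unparseable entries become NaN
s_clean = s_num.fillna(0)                  # NaN filled with 0
print(s_clean.tolist())                    # [12.0, 0.0, 3.5]
```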

The output:

It can be found that this data contains 27 columns.

3.2 Bidirectional screening stepwise regression implementation

Then use bidirectional screening to select stepwise regression variables. The specific code is as follows:

final_data = toad.selection.stepwise(qz_date,
                                     target='Risk',
                                     estimator='ols',
                                     direction='both',
                                     criterion='aic')
final_data

The output:

It can be found that the two-way stepwise regression selected 12 variables into the model.

3.3 Forward screening stepwise regression implementation

Then use the forward screening method to select stepwise regression variables. The specific code is as follows:

final_data = toad.selection.stepwise(qz_date,
                                     target='Risk',
                                     estimator='ols',
                                     direction='forward',
                                     criterion='aic')
final_data

The output:

It can be seen that forward stepwise regression selected 13 variables into the model, one more than bidirectional stepwise regression (the RiSk_E variable); the remaining variables were the same.

3.4 Backward screening stepwise regression implementation

Then use the backward screening method to select stepwise regression variables. The specific code is as follows:

final_data = toad.selection.stepwise(qz_date,
                                     target='Risk',
                                     estimator='ols',
                                     direction='backward',
                                     criterion='aic')
final_data

The output:

It can be seen that backward stepwise regression selected 16 variables into the model, which differs somewhat from the bidirectional and forward results.

3.5 Bidirectional stepwise regression with feature screening criterion ks

To analyze the impact of different feature screening criteria on variable selection, next set the criterion to ks in bidirectional stepwise regression and look at the results. The specific code is as follows:

final_data = toad.selection.stepwise(qz_date,
                                     target='Risk',
                                     estimator='ols',
                                     direction='both',
                                     criterion='ks')
final_data

The output:

It can be seen that when the feature screening criterion is set to ks in bidirectional stepwise regression, only one variable is selected into the model, which clearly does not meet modeling requirements.
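For context, the ks criterion scores candidate models by the Kolmogorov-Smirnov statistic of their predictions, i.e. the largest gap between the cumulative score distributions of the two classes. A minimal sketch of that statistic (an illustration with a made-up helper name, not toad's internal code):

```python
import numpy as np

def ks_stat(score, target):
    """KS statistic: the largest gap between the cumulative score
    distributions of the positive (target=1) and negative (target=0) classes."""
    order = np.argsort(score)
    y = np.asarray(target)[order]
    cum_pos = np.cumsum(y) / y.sum()            # cumulative share of positives
    cum_neg = np.cumsum(1 - y) / (1 - y).sum()  # cumulative share of negatives
    return float(np.max(np.abs(cum_pos - cum_neg)))

# A score that separates the classes perfectly has KS = 1
print(ks_stat([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1]))  # 1.0
```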

3.6 Bidirectional stepwise regression with feature screening criterion auc

Then set the feature screening criterion to auc in bidirectional stepwise regression. The specific code is as follows:

final_data = toad.selection.stepwise(qz_date,
                                     target='Risk',
                                     estimator='ols',
                                     direction='both',
                                     criterion='auc')
final_data

The output:

It can be seen that when the feature screening criterion is set to auc in bidirectional stepwise regression, only one variable is selected into the model, which clearly does not meet modeling requirements either.
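Similarly, the auc criterion scores candidate models by the area under the ROC curve of their predictions. A minimal sketch using the rank-sum (Mann-Whitney) formulation (illustrative only, with a made-up helper name; ties are not handled):

```python
import numpy as np

def auc_stat(score, target):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties are not handled."""
    score = np.asarray(score, dtype=float)
    target = np.asarray(target)
    ranks = score.argsort().argsort() + 1  # 1-based rank of each score
    n_pos = int(target.sum())
    n_neg = len(target) - n_pos
    rank_sum_pos = ranks[target == 1].sum()
    return float((rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

print(auc_stat([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```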

In summary, when modeling with stepwise regression we can start from the empirical parameter settings above.

References

Python implements stepwise regression