Python statistics 13 – regression tests for multicollinearity, heteroscedasticity, and autocorrelation

In basic statistics or econometrics, regression problems require tests for violations of the classical assumptions, such as multicollinearity, heteroscedasticity, and autocorrelation. These tests are straightforward in Stata, R, and EViews, but many people do not know how to run them in Python. Let's walk through a complete regression case to see how it is done.

Regression case

Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
pd.set_option('display.float_format', lambda x: '{:.4f}'.format(x))

Read data, set time index

df=pd.read_excel('data.xlsx').dropna(axis=1).set_index('year')
df

The dataset is small.

Descriptive Statistics

df.describe().T

Draw a scatterplot

y = df['agriculture']
x_cols = ['primium', 'agrloans', 'product', 'employ', 'pay']
x = df[x_cols]
fig = plt.figure(figsize=(8,4), dpi=128)
for i in range(5):
    plt.subplot(2,3,i+1)  # subplots arranged in 2 rows and 3 columns
    sns.scatterplot(x=x.iloc[:, i], y=y)
    #plt.ylabel(x_cols[i], fontsize=12)
    #plt.xticks([])
plt.tight_layout()
plt.savefig('scatterplot.jpg', dpi=128)
plt.show()

Among the five independent variables, all except employ are positively correlated with agriculture.

column = df.columns.tolist()  # column names
fig = plt.figure(figsize=(6,6), dpi=128)  # specify the width and height of the figure
for i in range(6):
    plt.subplot(3,2,i+1)  # subplots arranged in 3 rows and 2 columns
    sns.boxplot(data=df.iloc[:,i].to_numpy(), orient="v", width=0.5)  # box plot
    plt.ylabel(column[i], fontsize=12)
    plt.xticks([])
plt.tight_layout()
plt.savefig('box plot.jpg', dpi=128)
plt.show()

Except for y itself, the distributions of the x variables are fairly symmetric, which is roughly consistent with the normality assumption.
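If you want a numerical check on top of the box plots, a quick normality test for each column works. Below is a minimal sketch using SciPy's Shapiro-Wilk test (the import and loop are additions for illustration, not part of the original workflow):

from scipy import stats

# Shapiro-Wilk normality test for every column; suited to small samples
for col in df.columns:
    stat, p = stats.shapiro(df[col])
    # p > 0.05: no evidence against normality at the 5% level
    print(f'{col}: W={stat:.3f}, p-value={p:.4f}')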

Correlation Analysis

Calculate the correlation coefficient

df.corr()

plt.figure(figsize=(7,5),dpi=128)
sns.heatmap(df.corr().round(2), cmap='coolwarm', annot=True, annot_kws={"size": 10})
plt.savefig('correlation coefficient.jpg')

Employ is negatively correlated with the other variables. Agriculture is highly correlated with the x variables, but the x variables are also highly correlated with each other; for example, the correlation coefficients between primium and the other four x variables are as high as 0.9 or above, which suggests the model may suffer from multicollinearity.

OLS, multiple linear regression

model = smf.ols('agriculture~primium + agrloans + product + employ + pay', data=df)
model = model.fit()
model.summary()

The R-squared of the model is as high as 0.96 and the adjusted R-squared is 0.945, showing that these explanatory variables explain the variation in y very well. The overall F statistic is 45.30, far above the critical value, indicating that the model as a whole is highly significant.

At the 0.05 significance level, all variables except pay are significant, indicating that their changes have a significant effect on y.

Looking at the individual coefficients, the employ coefficient is positive and the agrloans coefficient is negative, which does not match theoretical expectations. This is likely caused by multicollinearity in the model.
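To read the coefficient signs and significance programmatically instead of eyeballing the summary table, the fitted results object exposes params, pvalues, and conf_int(). A small sketch (the 0.05 cutoff is just the conventional choice):

# coefficients, p-values and 95% confidence intervals side by side
coef_table = pd.concat(
    [model.params.rename('coef'),
     model.pvalues.rename('p-value'),
     model.conf_int().rename(columns={0: 'ci_low', 1: 'ci_high'})],
    axis=1)
print(coef_table)
print('Significant at 5%:', list(coef_table.index[coef_table['p-value'] < 0.05]))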

Calculate the variance inflation factors (VIF) to check for multicollinearity.

VIF calculation

def VIF_calculate(df_all, y_name):
    x_cols = df_all.columns.to_list()
    x_cols.remove(y_name)

    def vif(df_exog, exog_name):
        # regress one regressor on the others and compute 1/(1 - R^2)
        exog_use = list(df_exog.columns)
        exog_use.remove(exog_name)
        model = smf.ols(f"{exog_name}~{' + '.join(exog_use)}", data=df_exog).fit()
        return 1. / (1. - model.rsquared)

    df_vif = pd.DataFrame()
    for x in x_cols:
        df_vif.loc['VIF', x] = vif(df_all[x_cols], x)

    df_vif.loc['tolerance'] = 1 / df_vif.loc['VIF']
    df_vif = df_vif.T.sort_values('VIF', ascending=False)
    df_vif.loc['mean_vif'] = df_vif.mean()
    return df_vif

VIF_calculate(df, 'agriculture')

A VIF above 10 is generally taken to indicate multicollinearity. Except for the employ variable, the other four variables all show signs of multicollinearity.
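As a cross-check of the hand-rolled function above, statsmodels also ships variance_inflation_factor. Note that the design matrix passed to it should include a constant column, otherwise the values do not match the textbook definition. A sketch using the x_cols list defined earlier:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# add a constant so each auxiliary regression has an intercept
X_const = sm.add_constant(df[x_cols])
vif_check = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=x_cols, name='VIF')
print(vif_check.sort_values(ascending=False))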

Residual Tests

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

x = model.fittedvalues; y = model.resid

plt.subplots(1,2,figsize=(8,3),dpi=128)
plt.subplot(121)
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel('fitted value')
plt.ylabel('residual')
plt.title('(a) Residuals vs fitted values', fontsize=15)
plt.axhline(0, ls='--')

ax2 = plt.subplot(122)
pplot = sm.ProbPlot(model.resid, fit=True)
pplot.qqplot(line='r', ax=ax2, xlabel='expected normal value', ylabel='standardized observed value')
ax2.set_title('(b) Normal Q-Q plot of residuals', fontsize=15)
plt.show()

The residuals are evenly scattered around the zero line, suggesting the model does not have obvious problems such as heteroscedasticity.

White test for heteroscedasticity

from statsmodels.stats.diagnostic import het_white
wh = het_white(model.resid, model.model.exog)
print('LM Statistic: {:.3f}'.format(wh[0]))
print('LM p-value: {:.3f}'.format(wh[1]))
#print('F Statistic: {:.3f}'.format(wh[2]))
#print('F p-value: {:.3f}'.format(wh[3]))
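If the p-values are above 0.05, the null hypothesis of homoscedasticity is not rejected. As an extra cross-check (not in the original workflow), the Breusch-Pagan test from the same module can be run in the same way:

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan: regress squared residuals on the original regressors
bp = het_breuschpagan(model.resid, model.model.exog)
print('BP LM statistic: {:.3f}, p-value: {:.4f}'.format(bp[0], bp[1]))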
 
DW test for autocorrelation
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(model.resid)
print('DW statistic: {:.4f}'.format(dw))
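A DW statistic close to 2 suggests no first-order autocorrelation, while values toward 0 or 4 point to positive or negative autocorrelation, respectively. For higher-order autocorrelation, statsmodels also provides the Breusch-Godfrey test; a sketch (the lag order of 2 is an arbitrary illustration):

from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Breusch-Godfrey test on the fitted OLS results, up to lag 2
bg = acorr_breusch_godfrey(model, nlags=2)
print('BG LM statistic: {:.3f}, p-value: {:.4f}'.format(bg[0], bg[1]))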

Handling multicollinearity

When the problem of multicollinearity occurs, the following methods can be adopted to deal with it:

1. Delete the variables with strong correlation. Multicollinearity problems can arise if two or more independent variables are highly correlated. The influence of collinearity can be reduced by calculating the correlation coefficient matrix between independent variables and then deleting the variables with strong correlation.

2. Principal Component Analysis (PCA). PCA converts several highly correlated independent variables into a few uncorrelated principal components, thereby reducing the impact of collinearity. It can be performed with the sklearn.decomposition.PCA class in Python (see the sketch after this list).

3. Ridge Regression. Ridge regression is a regularization method that avoids overfitting by adding a penalty term to the loss function. In the presence of multicollinearity, using ridge regression can reduce the variance of the model and improve the generalization performance of the model.

4. Lasso Regression. Lasso is also a regularization method; by penalizing the L1 norm of the regression coefficients it performs feature selection and dimensionality reduction. It can reduce the influence of multicollinearity by shrinking the coefficients of highly correlated independent variables.

5. Collect more data. If the multicollinearity problem comes from the limitation of the data sample, then the problem can be alleviated by collecting more data.
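As an illustration of option 2 above, here is a minimal principal-component regression sketch: standardize the regressors, keep enough components to explain most of the variance, and run OLS on the component scores. The 95% variance threshold is only an example, not a recommendation:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize the regressors, then keep components explaining 95% of the variance
X_std = StandardScaler().fit_transform(df[['primium', 'agrloans', 'product', 'employ', 'pay']])
pca = PCA(n_components=0.95)
pcs = pca.fit_transform(X_std)
print('components kept:', pca.n_components_)
print('explained variance ratio:', pca.explained_variance_ratio_)

# regress y on the mutually uncorrelated principal components
pc_df = pd.DataFrame(pcs, columns=[f'PC{i+1}' for i in range(pcs.shape[1])], index=df.index)
pcr = sm.OLS(df['agriculture'], sm.add_constant(pc_df)).fit()
print(pcr.summary())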

(For econometrics coursework, ridge regression is usually the method to learn, so that is what we use here.)

Ridge regression

y = df['agriculture']
X = df[['primium', 'agrloans', 'product', 'employ', 'pay']]
from sklearn.linear_model import Ridge
model_ride = Ridge(alpha=10)
# fit model
model_ride.fit(X, y)
# goodness of fit (R^2) on the full sample; there is no separate test set here
model_ride.score(X, y)

# model intercept
model_ride.intercept_

# model coefficients
# display the coefficients in a data frame
pd.DataFrame(model_ride.coef_, index=X.columns, columns=['Coefficient'])
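The penalty strength alpha=10 above is set by hand. A common refinement is to let cross-validation choose it via RidgeCV; a sketch with an arbitrary log-spaced alpha grid:

from sklearn.linear_model import RidgeCV

# search a log-spaced grid of penalty strengths (default leave-one-out style CV)
alphas = np.logspace(-2, 3, 60)
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)
print('chosen alpha:', ridge_cv.alpha_)
print(pd.DataFrame(ridge_cv.coef_, index=X.columns, columns=['Coefficient']))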

(PS: exporting and saving regression results)

How do we save and export the OLS regression results table above to an Excel file?

I wrote a function that saves the results of a regression model directly and can also save several regression models in batches.

def store(writer, model, model_name='model_SH1'):
    # the three summary tables are written one below another in a single sheet
    df1 = pd.DataFrame(model.summary().tables[0])
    df2 = pd.DataFrame(model.summary().tables[1])
    df3 = pd.DataFrame(model.summary().tables[2])
    df1.to_excel(writer, sheet_name=model_name, startrow=0, startcol=0, header=False, index=False)
    df2.to_excel(writer, sheet_name=model_name, startrow=df1.shape[0] + 1, startcol=0, header=False, index=False)
    df3.to_excel(writer, sheet_name=model_name, startrow=df1.shape[0] + df2.shape[0] + 2, startcol=0, header=False, index=False)

It takes three parameters: a pandas ExcelWriter object, the fitted model variable, and the model name (used as the sheet name).

Use as follows:

model_list=['model']
with pd.ExcelWriter('regression table.xlsx') as writer:
    for model_name in model_list:
        model=eval(model_name)
        store(writer,model=model,model_name=model_name)

Pass the variable names of your models into the list as strings, then run it to generate an Excel workbook. Each model gets its own sheet, so several models produce several sheets.

There is only one model here, so there is only one sheet; its contents match the OLS summary above.