Table of Contents
1. Data sets and analysis objects
2. Purpose and analysis tasks
3. Methods and Tools
4. Data reading
5. Data understanding
6. Data preparation
7. Model training
8. Model evaluation
9. Model parameter adjustment
10. Model prediction
Commonly used Python toolkits for implementing regression analysis include statsmodels, scikit-learn, and the standard-library statistics module. Below we mainly use statsmodels.
1. Data sets and analysis objects
CSV file – “women.csv”.
Dataset link: https://download.csdn.net/download/m0_70452407/88519967
This data set gives the height and weight data of 15 women aged 30 to 39 years old. The main attributes are as follows:
(1) height: height
(2) weight: weight
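The file behind this link appears to be the classic `women` dataset that ships with R (heights in inches, weights in pounds). If the download is unavailable, an equivalent data frame can be constructed directly — a sketch, assuming the CSV matches R's dataset:

```python
import pandas as pd

# Heights (inches) and weights (lbs) of 15 women, as in R's built-in
# `women` dataset, which this CSV appears to reproduce.
df_women = pd.DataFrame({
    "height": list(range(58, 73)),  # 58, 59, ..., 72
    "weight": [115, 117, 120, 123, 126, 129, 132, 135,
               139, 142, 146, 150, 154, 159, 164],
})
print(df_women.shape)  # (15, 2)
```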
2. Purpose and Analysis Tasks
Understand the application of machine learning methods in data analysis – using simple linear regression and polynomial regression methods for regression analysis.
(1) Training model.
(2) Conduct goodness-of-fit evaluation and visualization processing on the model to verify the effectiveness of simple linear regression modeling.
(3) Use polynomial regression for model optimization.
(4) Predict weight data according to polynomial regression model.
3. Methods and Tools
Python language and third-party tool packages pandas, matplotlib and statsmodels.
4. Data reading
import pandas as pd
df_women = pd.read_csv(r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Machinery Industry Press\Chapter 3 Regression Analysis\women.csv", index_col=0)
5. Data Understanding
Perform exploratory analysis on the data frame df_women.
df_women.describe()
df_women.shape
(15, 2)
Next, perform data visualization analysis on the data frame df_women, drawing a scatter plot by calling the scatter() function in the matplotlib.pyplot package.
import matplotlib.pyplot as plt
plt.scatter(df_women["height"], df_women["weight"])
The scatter plot shows an approximately linear relationship between female height and weight, so the data can be analyzed by linear regression; further data preparation work is required next.
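A quantitative complement to the scatter plot is the Pearson correlation coefficient; a value close to 1 confirms a strong positive linear association. A minimal sketch, assuming a data frame with the same 15 observations:

```python
import pandas as pd

df_women = pd.DataFrame({
    "height": list(range(58, 73)),
    "weight": [115, 117, 120, 123, 126, 129, 132, 135,
               139, 142, 146, 150, 154, 159, 164],
})

# Pearson correlation between height and weight; near 1 means a
# strong positive linear relationship.
r = df_women["height"].corr(df_women["weight"])
```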
6. Data preparation
Before performing linear regression analysis, the feature matrix (X) and target vector (y) required by the model should be prepared. Python's statistical analysis package statsmodels accepts pandas objects directly, so no manual type conversion is needed.
X = df_women["height"]
y = df_women["weight"]
7. Model Training
A simple linear regression model is fitted to the data, with female height as the independent variable and weight as the dependent variable. The OLS function in Python's statistical analysis package statsmodels is used for the modeling analysis.
import statsmodels.api as sm
The statsmodels.OLS() method takes four inputs (endog, exog, missing, hasconst). endog is the dependent variable of the regression, which is weight in the model above; exog holds the values of the independent variable, which is height in the model above.
By default, the statsmodels.OLS() method does not include an intercept term; the intercept is treated as the coefficient of an extra feature column that is identically 1. Therefore, the leftmost column of the exog input should contain all 1s. statsmodels provides a method that solves exactly this problem — sm.add_constant() — which adds a new column named const to X with the value 1.0 in every row.
X_add_const = sm.add_constant(X)
X_add_const
Perform a simple linear regression using the OLS() method on the independent variable X_add_const and the dependent variable y.
myModel=sm.OLS(y,X_add_const)
Then obtain the fitting results and call the summary() method to display the regression fitting results.
results = myModel.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     1433.
Date:                Thu, 09 Nov 2023   Prob (F-statistic):           1.09e-14
Time:                        18:28:09   Log-Likelihood:                -26.541
No. Observations:                  15   AIC:                             57.08
Df Residuals:                      13   BIC:                             58.50
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -87.5167      5.937    -14.741      0.000    -100.343     -74.691
height         3.4500      0.091     37.855      0.000       3.253       3.647
==============================================================================
Omnibus:                        2.396   Durbin-Watson:                   0.315
Prob(Omnibus):                  0.302   Jarque-Bera (JB):                1.660
Skew:                           0.789   Prob(JB):                        0.436
Kurtosis:                       2.596   Cond. No.                         982.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1769: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
The const and height rows of the coef column in the middle table of the output above give the intercept term and slope of the fitted regression model.
In addition to reading the regression summary, you can also access the params attribute to view the slope and intercept of the fitting results.
results.params
const -87.516667 height 3.450000 dtype: float64
As can be seen from the output results, the intercept term and slope in the regression model are -87.516667 and 3.450000 respectively.
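As a cross-check, the same slope and intercept can be recovered with an ordinary least-squares line fit from numpy — a sketch using the dataset's 15 observations:

```python
import numpy as np

heights = np.arange(58, 73)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164])

# Degree-1 least-squares fit: returns (slope, intercept).
slope, intercept = np.polyfit(heights, weights, 1)
# slope ≈ 3.45, intercept ≈ -87.5167, matching results.params
```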
8. Model Evaluation
R^2 (the coefficient of determination) is used as an index to measure how well the regression line fits the observed values. Its range is [0, 1]: the closer to 1, the better the goodness of fit of the regression line. You can read the rsquared attribute of the fitting results to view R^2.
results.rsquared
0.9910098326857505
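The same value can be computed by hand from the definition R^2 = 1 - SS_res/SS_tot — a sketch using the fitted coefficients reported above:

```python
import numpy as np

heights = np.arange(58, 73)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164])

# Fitted values from the estimated intercept and slope.
y_hat = -87.516667 + 3.45 * heights
ss_res = np.sum((weights - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((weights - weights.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot                   # ≈ 0.991
```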
In addition to statistics such as the coefficient of determination, the regression effect can also be viewed more intuitively through visualization methods. Here we call the plot() method in the matplotlib.pyplot package to draw the regression line and the real data in a graph for comparison.
y_predict = results.params[0] + results.params[1]*df_women["height"]
plt.rcParams['font.family'] = "SimHei"  # font setting for displaying Chinese characters
plt.plot(df_women["height"], df_women["weight"], "o")
plt.plot(df_women["height"], y_predict)
plt.title("Linear regression analysis of female height and weight")
plt.xlabel("height")
plt.ylabel("weight")
The plot shows that the fit of the simple linear regression model can be further improved. For this reason, the polynomial regression method is used for regression analysis.
9. Model parameter adjustment
Call the OLS() method in Python’s statistical analysis package statsmodels to perform polynomial regression modeling on the independent variable female height and the dependent variable weight.
It is assumed that the dependent variable y has a cubic polynomial relationship with the independent variable, i.e., y depends linearly on X, X^2, and X^3. Therefore, in the polynomial analysis, the feature matrix consists of three columns: X, X^2, and X^3. Create the feature matrix X by calling the column_stack() function of the numpy library.
import numpy as np
X = np.column_stack((X, np.power(X, 2), np.power(X, 3)))
Add the intercept term to the polynomial regression with the sm.add_constant() method, then use the OLS() method to perform polynomial regression on the independent variable X_add_const and the dependent variable y.
X_add_const = sm.add_constant(X)
myModel_updated = sm.OLS(y, X_add_const)
results = myModel_updated.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.679e+04
Date:                Thu, 09 Nov 2023   Prob (F-statistic):           2.07e-20
Time:                        18:46:54   Log-Likelihood:                 1.3441
No. Observations:                  15   AIC:                             5.312
Df Residuals:                      11   BIC:                             8.144
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -896.7476    294.575     -3.044      0.011   -1545.102    -248.393
x1            46.4108     13.655      3.399      0.006      16.356      76.466
x2            -0.7462      0.211     -3.544      0.005      -1.210      -0.283
x3             0.0043      0.001      3.940      0.002       0.002       0.007
==============================================================================
Omnibus:                        0.028   Durbin-Watson:                   2.388
Prob(Omnibus):                  0.986   Jarque-Bera (JB):                0.127
Skew:                           0.049   Prob(JB):                        0.939
Kurtosis:                       2.561   Cond. No.                     1.25e+09
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.25e+09. This might indicate that there are strong multicollinearity or other numerical problems.
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1769: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
It can be seen from the output results that the intercept term in the polynomial regression model is -896.7476, and the corresponding slopes of X, X^2, and X^3 are 46.4108, -0.7462, and 0.0043 respectively.
Read the rsquared attribute to view the R^2 of the fitting result:
results.rsquared
0.9997816939979361
It can be seen from the results of the coefficient of determination that the polynomial regression model performs better than the simple linear regression model.
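The improvement can also be verified outside statsmodels by comparing the R^2 of a degree-1 and a degree-3 least-squares fit — a sketch using numpy on the same 15 observations:

```python
import numpy as np

heights = np.arange(58, 73)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164])

def r_squared(degree):
    """R^2 of a polynomial least-squares fit of the given degree."""
    coeffs = np.polyfit(heights, weights, degree)
    residuals = weights - np.polyval(coeffs, heights)
    ss_tot = np.sum((weights - weights.mean()) ** 2)
    return 1 - np.sum(residuals ** 2) / ss_tot

# r_squared(1) ≈ 0.9910, r_squared(3) ≈ 0.9998
```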
10. Model Prediction
Use this polynomial regression model to predict weight and output the prediction results.
y_predict_updated = results.predict()
y_predict_updated
array([114.63856209, 117.40676937, 120.18801264, 123.00780722, 125.89166846, 128.86511168, 131.95365223, 135.18280543, 138.57808662, 142.16501113, 145.9690943, 150.01585147, 154.33079796, 158.93944911, 163.86732026])
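Called with no argument, results.predict() returns the fitted values for the training heights. To predict the weight for a height not in the data, the same cubic fit can be evaluated at a new point; a numpy sketch of the equivalent computation, with 69.5 inches as a hypothetical new input:

```python
import numpy as np

heights = np.arange(58, 73)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164])

# Cubic least-squares fit, equivalent to the statsmodels model above.
coeffs = np.polyfit(heights, weights, 3)

# Fitted value at an observed height (65 in) reproduces results.predict().
fitted_65 = np.polyval(coeffs, 65)   # ≈ 135.18, matching the array above

# Prediction for a hypothetical new height of 69.5 in.
predicted = np.polyval(coeffs, 69.5)
```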
Visualization of polynomial regression model:
y_predict = (results.params[0]
             + results.params[1]*df_women["height"]
             + results.params[2]*df_women["height"]**2
             + results.params[3]*df_women["height"]**3)
plt.plot(df_women["height"], df_women["weight"], "o")
plt.plot(df_women["height"], y_predict)
plt.title("Polynomial regression analysis of female height and weight")
plt.xlabel("height")
plt.ylabel("weight")
It can be seen from the results that the fitting effect is significantly improved after using polynomial regression, and the results are more satisfactory.