Practical data analysis | Linear regression – female height and weight data analysis

Table of Contents

1. Data sets and analysis objects

2. Purpose and analysis tasks

3. Methods and Tools

4. Data reading

5. Data understanding

6. Data preparation

7. Model training

8. Model evaluation

9. Model parameter adjustment

10. Model prediction


Commonly used Python toolkits for implementing regression analysis algorithms include statsmodels, scikit-learn, and the standard-library statistics module. Below we mainly use statsmodels.
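
For comparison, here is a minimal sketch of the same kind of fit in scikit-learn (assuming it is installed; the tiny arrays are hypothetical and only illustrate the API — the rest of this walkthrough uses statsmodels):

from sklearn.linear_model import LinearRegression
import numpy as np

heights = np.array([[58], [60], [62]])   # hypothetical heights (2-D feature matrix)
weights = np.array([115, 120, 126])      # hypothetical weights
model = LinearRegression().fit(heights, weights)
print(model.intercept_, model.coef_)     # intercept and slope of the fitted line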

1. Data sets and analysis objects

CSV file – “women.csv”.

Dataset link: https://download.csdn.net/download/m0_70452407/88519967

This data set gives the height and weight of 15 women aged 30 to 39. The main attributes are as follows:

(1) height: height (in inches)

(2) weight: weight (in pounds)

2. Purpose and Analysis Tasks

Understand the application of machine learning methods in data analysis by using simple linear regression and polynomial regression for regression analysis.

(1) Train a simple linear regression model.

(2) Evaluate the model's goodness of fit and visualize the fit to verify the effectiveness of simple linear regression modeling.

(3) Optimize the model using polynomial regression.

(4) Predict weight with the polynomial regression model.

3. Methods and Tools

The Python language and the third-party packages pandas, matplotlib, and statsmodels.

4. Data reading

import pandas as pd
df_women=pd.read_csv(
    r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Machinery Industry Press\Chapter 3 Regression Analysis\women.csv",
    index_col=0)
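
The path above is specific to the author's machine. If the CSV is not available, the same data can be reconstructed directly, since "women.csv" is R's built-in women data set (a sketch; the values below are the standard ones shipped with R):

import pandas as pd
df_women = pd.DataFrame({
    "height": list(range(58, 73)),                      # 58-72 inches
    "weight": [115, 117, 120, 123, 126, 129, 132, 135,
               139, 142, 146, 150, 154, 159, 164]       # pounds
})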

5. Data Understanding

Perform exploratory analysis on the data frame df_women.

df_women.describe()

df_women.shape
(15, 2)

Next, perform visual analysis on the data frame df_women by calling the scatter() function in the matplotlib.pyplot package to draw a scatter plot.

import matplotlib.pyplot as plt
plt.scatter(df_women["height"],df_women["weight"])

The scatter plot shows an approximately linear relationship between female height and weight, so the data are suitable for linear regression analysis; further data preparation is required next.

6. Data preparation

Before performing linear regression analysis, the feature matrix (X) and target vector (y) required by the model should be prepared. Since Python's statistical analysis package statsmodels accepts pandas objects directly, no manual type conversion is needed.

X=df_women['height']
y=df_women['weight']

7. Model training

A simple linear regression model is fitted to the data, using female height as the independent variable and weight as the dependent variable. The OLS function in Python's statistical analysis package statsmodels is used for the modeling analysis.

import statsmodels.api as sm

The statsmodels.OLS() method takes four inputs (endog, exog, missing, hasconst). endog is the dependent variable of the regression, which is weight in this model; exog holds the values of the independent variable, which is height here.

By default, the statsmodels.OLS() method does not include an intercept term; instead, the intercept is treated as the coefficient of an extra feature whose value is always 1. Therefore, the leftmost column of the exog input should be all 1s. statsmodels provides a method that handles this directly: sm.add_constant() adds a new column named const to X, with the value 1.0 in every row.

X_add_const=sm.add_constant(X)
X_add_const

Perform a simple linear regression using the OLS() method on the independent variable X_add_const and the dependent variable y.

myModel=sm.OLS(y,X_add_const)

Then fit the model and call the summary() method to display the regression results.

results=myModel.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     1433.
Date:                Thu, 09 Nov 2023   Prob (F-statistic):           1.09e-14
Time:                        18:28:09   Log-Likelihood:                -26.541
No. Observations:                  15   AIC:                             57.08
Df Residuals:                      13   BIC:                             58.50
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -87.5167      5.937    -14.741      0.000    -100.343     -74.691
height         3.4500      0.091     37.855      0.000       3.253       3.647
==============================================================================
Omnibus:                        2.396   Durbin-Watson:                   0.315
Prob(Omnibus):                  0.302   Jarque-Bera (JB):                1.660
Skew:                           0.789   Prob(JB):                        0.436
Kurtosis:                       2.596   Cond. No.                         982.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1769: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

In the coefficient table of the summary above, the coef values in the const and height rows are the intercept term and the slope of the fitted regression model.

In addition to reading the regression summary, you can also inspect the params attribute to view the intercept and slope of the fitting results.

results.params
const    -87.516667
height     3.450000
dtype: float64

As can be seen from the output results, the intercept term and slope in the regression model are -87.516667 and 3.450000 respectively.
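
As a quick cross-check (a sketch, not part of the original workflow), the same least-squares line can be obtained with numpy's polyfit, which returns the coefficients from the highest degree down:

import numpy as np
np.polyfit(df_women["height"], df_women["weight"], 1)  # expect roughly [3.45, -87.5167]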

8. Model Evaluation

R^2 (the coefficient of determination) is used as an index to measure how well the regression line fits the observed values. Its value ranges over [0, 1]; the closer to 1, the better the goodness of fit of the regression line. You can read the rsquared attribute to view the R^2 of the fitting results.

results.rsquared
0.9910098326857505
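
For reference, the same value can be reproduced from the definition R^2 = 1 - SS_res / SS_tot using the residuals exposed by the results object (a minimal sketch):

import numpy as np
ss_res = np.sum(results.resid ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
1 - ss_res / ss_tot                      # should match results.rsquared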

In addition to statistics such as the coefficient of determination, the regression effect can also be viewed more intuitively through visualization. Here we call the plot() method in the matplotlib.pyplot package to draw the regression line and the observed data in the same graph for comparison.

y_predict=results.params["const"] + results.params["height"]*df_women["height"]
plt.rcParams['font.family']="SimHei"  # font setting, needed only for Chinese labels
plt.plot(df_women["height"],df_women["weight"],"o")
plt.plot(df_women["height"],y_predict)
plt.title("Linear regression analysis of female height and weight")
plt.xlabel("height")
plt.ylabel("weight")

It can be seen from the output that the fit of the simple linear regression model can be further improved. For this reason, the polynomial regression method is used for the next round of analysis.

9. Model parameter adjustment

Call the OLS() method in Python’s statistical analysis package statsmodels to perform polynomial regression modeling on the independent variable female height and the dependent variable weight.

Assume that the dependent variable y has a cubic polynomial relationship with the independent variable, i.e., y is linear in X, X^2, and X^3. In the polynomial analysis, the feature matrix therefore consists of three columns: X, X^2, and X^3. Create the feature matrix X by calling the column_stack() method of the numpy library.

import numpy as np
X=np.column_stack((X,np.power(X,2),np.power(X,3)))
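
A quick sanity check (optional): the stacked feature matrix should now contain one column per power term.

X.shape  # (15, 3): the columns are X, X**2, X**3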

Preserve the intercept term in polynomial regression through the sm.add_constant() method. Use the OLS() method to perform polynomial regression on the independent variable X_add_const and the dependent variable y.

X_add_const=sm.add_constant(X)
myModel_updated=sm.OLS(y,X_add_const)
results=myModel_updated.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.679e+04
Date:                Thu, 09 Nov 2023   Prob (F-statistic):           2.07e-20
Time:                        18:46:54   Log-Likelihood:                 1.3441
No. Observations:                  15   AIC:                             5.312
Df Residuals:                      11   BIC:                             8.144
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -896.7476    294.575     -3.044      0.011   -1545.102    -248.393
x1            46.4108     13.655      3.399      0.006      16.356      76.466
x2            -0.7462      0.211     -3.544      0.005      -1.210      -0.283
x3             0.0043      0.001      3.940      0.002       0.002       0.007
==============================================================================
Omnibus:                        0.028   Durbin-Watson:                   2.388
Prob(Omnibus):                  0.986   Jarque-Bera (JB):                0.127
Skew:                           0.049   Prob(JB):                        0.939
Kurtosis:                       2.561   Cond. No.                     1.25e+09
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.25e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1769: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

It can be seen from the output that the intercept term of the polynomial regression model is -896.7476, and the coefficients of X, X^2, and X^3 are 46.4108, -0.7462, and 0.0043 respectively.

Read the rsquared attribute to view the R^2 of the fitting result:

results.rsquared
0.9997816939979361

It can be seen from the results of the coefficient of determination that the polynomial regression model performs better than the simple linear regression model.
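
One caveat: adding polynomial terms can only increase the plain R^2, so the adjusted R^2, which penalizes extra parameters, is the fairer basis for comparing the two models. statsmodels exposes it directly:

results.rsquared_adj  # per the summary above, also approximately 1.000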

10. Model Prediction

Use this polynomial regression model to predict weight and output the prediction results.

y_predict_updated=results.predict()
y_predict_updated
array([114.63856209, 117.40676937, 120.18801264, 123.00780722,
       125.89166846, 128.86511168, 131.95365223, 135.18280543,
       138.57808662, 142.16501113, 145.9690943, 150.01585147,
       154.33079796, 158.93944911, 163.86732026])

Visualization of polynomial regression model:

y_predict=(results.params["const"] + results.params["x1"]*df_women["height"] +
           results.params["x2"]*df_women["height"]**2 +
           results.params["x3"]*df_women["height"]**3)

plt.plot(df_women["height"],df_women["weight"],"o")
plt.plot(df_women["height"],y_predict)
plt.title("Polynomial regression analysis of female height and weight")
plt.xlabel("height")
plt.ylabel("weight")

It can be seen from the results that the fitting effect is significantly improved after using polynomial regression, and the results are more satisfactory.