Statistics | Python | Calculating principal component score coefficients in principal component analysis

Foreword:

Because SPSS cannot directly produce the principal component score coefficients, this post draws on articles by other bloggers on CSDN and collects the code used to calculate them.

Principle of principal component analysis

Skipped for now; to be filled in later.

Principal component analysis code

First, the required libraries and file reading. The following uses a CSV file as an example; pandas can also read Excel, SAV (the data set format commonly used in SPSS), and other formats.

Case data:

Analysis of the economic benefits of national key cement enterprises in a certain year. The evaluation indicators are: X1, the profit rate of fixed assets; X2, the profit rate of capital; X3, the profit rate of sales income; X4, the number of days of working capital turnover; and X5 through X8.

Data:

| Enterprise | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 16.68 | 26.75 | 31.84 | 18.4 | 53.25 | 55 | 28.83 | 1.75 |
| B | 19.7 | 27.56 | 32.94 | 19.2 | 59.82 | 55 | 32.92 | 2.87 |
| C | 15.2 | 23.4 | 32.98 | 16.24 | 46.78 | 65 | 41.69 | 1.53 |
| D | 7.29 | 8.97 | 21.3 | 4.76 | 34.39 | 62 | 39.28 | 1.63 |
| E | 29.45 | 56.49 | 40.74 | 43.68 | 75.32 | 69 | 26.68 | 2.14 |
| F | 32.93 | 42.78 | 47.98 | 33.87 | 66.46 | 50 | 32.87 | 2.6 |
| G | 25.39 | 37.85 | 36.76 | 27.56 | 68.18 | 63 | 35.79 | 2.43 |
| H | 15.05 | 19.49 | 27.21 | 14.21 | 56.13 | 76 | 35.76 | 1.75 |
| I | 19.82 | 28.78 | 33.41 | 20.17 | 59.25 | 71 | 39.13 | 1.83 |
| J | 21.13 | 35.2 | 39.16 | 26.52 | 52.47 | 62 | 35.08 | 1.73 |
| K | 16.75 | 28.72 | 29.62 | 19.23 | 55.76 | 58 | 30.08 | 1.52 |
| L | 15.83 | 28.03 | 26.4 | 17.43 | 61.19 | 61 | 32.75 | 1.6 |
| M | 16.53 | 29.73 | 32.49 | 20.63 | 50.14 | 69 | 37.57 | 1.31 |
| N | 22.24 | 54.59 | 31.05 | 37 | 67.95 | 63 | 32.33 | 1.57 |
| O | 12.92 | 20.82 | 25.12 | 12.54 | 51.07 | 66 | 39.18 | 1.83 |

1. Importing libraries and reading the data set file

# Data processing
import pandas as pd
import numpy as np

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data set file
df = pd.read_csv("hw91.csv")
'''How to convert data into CSV format?
pd.read_csv reads UTF-8 encoding by default.
You can copy the data into Excel and save it as CSV UTF-8 (comma-delimited).'''
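For readers who do not have hw91.csv, the case data above can also be entered directly as a DataFrame (a sketch; the values are copied from the table, and the row labels follow the enterprise names):

```python
import pandas as pd

# Case data from the table above: 15 enterprises, indicators X1-X8
data = {
    "X1": [16.68, 19.7, 15.2, 7.29, 29.45, 32.93, 25.39, 15.05, 19.82, 21.13, 16.75, 15.83, 16.53, 22.24, 12.92],
    "X2": [26.75, 27.56, 23.4, 8.97, 56.49, 42.78, 37.85, 19.49, 28.78, 35.2, 28.72, 28.03, 29.73, 54.59, 20.82],
    "X3": [31.84, 32.94, 32.98, 21.3, 40.74, 47.98, 36.76, 27.21, 33.41, 39.16, 29.62, 26.4, 32.49, 31.05, 25.12],
    "X4": [18.4, 19.2, 16.24, 4.76, 43.68, 33.87, 27.56, 14.21, 20.17, 26.52, 19.23, 17.43, 20.63, 37.0, 12.54],
    "X5": [53.25, 59.82, 46.78, 34.39, 75.32, 66.46, 68.18, 56.13, 59.25, 52.47, 55.76, 61.19, 50.14, 67.95, 51.07],
    "X6": [55, 55, 65, 62, 69, 50, 63, 76, 71, 62, 58, 61, 69, 63, 66],
    "X7": [28.83, 32.92, 41.69, 39.28, 26.68, 32.87, 35.79, 35.76, 39.13, 35.08, 30.08, 32.75, 37.57, 32.33, 39.18],
    "X8": [1.75, 2.87, 1.53, 1.63, 2.14, 2.6, 2.43, 1.75, 1.83, 1.73, 1.52, 1.6, 1.31, 1.57, 1.83],
}
df = pd.DataFrame(data, index=[f"Enterprise {c}" for c in "ABCDEFGHIJKLMNO"])
print(df.shape)
```

The rest of the code works the same whether df comes from the CSV file or from this dictionary.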

2. Correlation test

Before performing principal component analysis, the data needs to be tested for correlation.

The usual tests are:

(1) The KMO test

(2) Bartlett's test of sphericity

# KMO test
''' Measures the correlation and partial correlation between variables; the value lies between 0 and 1.
    The closer the KMO statistic is to 1, the stronger the correlation between variables,
    the weaker the partial correlation, and the better factor analysis will work.
    '''
# Make sure the factor_analyzer library is installed
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(df)
print(f'KMO test result: {kmo_model}')

# Bartlett's test of sphericity
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(df)
print(f'Chi-square = {chi_square_value}, p-value = {p_value}')

Output:


Attachment: SPSS results
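If factor_analyzer is not installed, Bartlett's test of sphericity can also be computed directly from the determinant of the correlation matrix. The sketch below uses synthetic correlated data only to exercise the function; `bartlett_sphericity` is a hypothetical helper, not a library API:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(data):
    """Bartlett's test of sphericity computed from the correlation matrix
    (a sketch equivalent in spirit to calculate_bartlett_sphericity)."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data.T)
    # Test statistic: -(n - 1 - (2p + 5) / 6) * ln|R|
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi2, dof)
    return chi2, p_value

# Synthetic correlated data just to demonstrate the call
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = base + 0.3 * rng.normal(size=(50, 4))   # four strongly correlated columns
chi2, p = bartlett_sphericity(X)
print(f'Chi-square = {chi2:.2f}, p-value = {p:.4g}')
```

A small p-value (as here, since the columns are strongly correlated) means the correlation matrix differs significantly from the identity, so factor/principal component analysis is appropriate.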

(3) Standardize the data and find the correlation coefficient matrix

# Data standardization
from sklearn import preprocessing
dfz = preprocessing.scale(df)
print(f'The standardized data are:\n{dfz}')

Standardization results

Use the standardized data to find the correlation coefficient matrix

# Correlation coefficient matrix
covX = np.around(np.corrcoef(dfz.T), decimals=3)
print(f'The correlation coefficient matrix is:\n{covX}')

Output result:

Attachment: SPSS results

3. Principal component selection

(1) Solution of eigenvalues and eigenvectors

# Solve for the eigenvalues and eigenvectors (covX is symmetric, so no transpose is needed)
featValue, featVec = np.linalg.eig(covX)
# Sort the eigenvalues in descending order and reorder the eigenvectors to match,
# so that each column of featVec still corresponds to the eigenvalue at the same position
idx = np.argsort(featValue)[::-1]
featValue = featValue[idx]
featVec = featVec[:, idx]
print(f'The eigenvalues are:\n{featValue}\nThe eigenvectors are:\n{featVec}')

Output result:
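As a sanity check on the decomposition, every eigenpair should satisfy R v = λ v, and the eigenvalues of a correlation matrix sum to its trace (the number of variables). A minimal sketch with a small symmetric matrix standing in for covX:

```python
import numpy as np

# A small symmetric "correlation-like" matrix standing in for covX
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
featValue, featVec = np.linalg.eig(R)

# Each column of featVec is the eigenvector of the matching eigenvalue: R v = lambda v
for i in range(len(featValue)):
    assert np.allclose(R @ featVec[:, i], featValue[i] * featVec[:, i])

# Sort eigenvalues and eigenvectors together, largest eigenvalue first
idx = np.argsort(featValue)[::-1]
featValue, featVec = featValue[idx], featVec[:, idx]
print(featValue)
```

Sorting the values and vectors together is what keeps the later principal component selection consistent.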

Draw scatter plots and line charts

# Draw scatter plots and line charts
plt.scatter(range(1, df.shape[1] + 1),featValue)
plt.plot(range(1, df.shape[1] + 1), featValue)

# Set the title and the x/y axis labels
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')

plt.grid()
plt.show()

Output: the scatter plot (also known as a scree plot)

Calculate eigenvalue contribution and cumulative contribution

# Contribution of each eigenvalue
gx = featValue/np.sum(featValue)

# Cumulative contribution
lg = np.cumsum(gx)
print(f'Cumulative contribution of the eigenvalues:\n{lg}')

Output results

A plot of the cumulative contribution could also be drawn here.
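One way to sketch such a cumulative-contribution plot (the eigenvalue array below is illustrative, not the case data; with the real data, reuse the featValue computed above):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; remove this line to display the figure
import matplotlib.pyplot as plt

# Illustrative eigenvalues standing in for featValue (eight indicators)
featValue = np.array([4.9, 1.6, 0.6, 0.4, 0.25, 0.15, 0.07, 0.03])
gx = featValue / np.sum(featValue)   # contribution of each component
lg = np.cumsum(gx)                   # cumulative contribution

plt.plot(range(1, len(lg) + 1), lg, marker='o')
plt.axhline(0.85, linestyle='--', color='gray')   # common 85% threshold
plt.title('Cumulative Contribution')
plt.xlabel('Number of components')
plt.ylabel('Cumulative variance explained')
plt.grid()
plt.savefig('cumulative_contribution.png')
```

The dashed line marks the 85% cumulative-contribution threshold used later for component selection.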

Select principal components:

Principal components are generally selected either by the cumulative contribution of the eigenvalues or by whether each eigenvalue is greater than 1. This example follows the SPSS criterion and keeps the components whose eigenvalue is greater than 1.

# Principal component selection
'''Following the SPSS criterion: keep the principal components with eigenvalue > 1'''
k = [i for i in range(len(featValue)) if featValue[i] > 1]
print(f'The selected principal components are: {k}')
'''Indices are zero-based, so component 0 is the 1st principal component'''

Output result:

Attachment: SPSS principal component selection results

Of course, other criteria can also be used to select the principal components:

'''The following selects by cumulative contribution instead; which criterion is most common in practice depends on the field.'''
k = [i for i in range(len(lg)) if lg[i] < 0.85]


'''The following keeps every component, which suits principal component regression without dimensionality reduction:
with as many principal components as original variables, no information is discarded.'''
k = [i for i in range(len(featValue))]
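The three selection criteria can be compared side by side on an illustrative eigenvalue array (not the case data; the variable names mirror the code above):

```python
import numpy as np

# Illustrative eigenvalues in descending order (their sum equals the number of variables, 8)
featValue = np.array([4.9, 1.6, 0.6, 0.4, 0.25, 0.15, 0.07, 0.03])
lg = np.cumsum(featValue / np.sum(featValue))

# Criterion 1: keep components with eigenvalue > 1 (the SPSS-style rule)
k_eig = [i for i in range(len(featValue)) if featValue[i] > 1]

# Criterion 2: keep components whose cumulative contribution stays below 85%
k_cum = [i for i in range(len(lg)) if lg[i] < 0.85]

# Criterion 3: keep every component (no dimensionality reduction)
k_all = list(range(len(featValue)))

print(k_eig, k_cum, k_all)   # [0, 1] [0, 1] [0, 1, 2, 3, 4, 5, 6, 7]
```

For this array the first two rules agree; with other eigenvalue profiles they can differ, which is why the choice of criterion should be stated explicitly.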

4. Principal component score coefficient calculation

Select the eigenvector matrix corresponding to the chosen principal components; these columns serve as the principal component score coefficients, and they are consistent with the result obtained in SPSS.

# Select the eigenvector matrix corresponding to the chosen principal components
selectVec = featVec[:, k]
selectVec = selectVec * (-1)   # flip the sign to match the SPSS output (eigenvector signs are arbitrary)
print(f'The eigenvector matrix corresponding to the principal components is:\n{selectVec}')

The following is the result calculated by SPSS. Note that SPSS reports the common factor scores rather than the principal component scores; the factor scores still need to be multiplied by the corresponding variance to obtain the principal component scores. The calculation results are shown in the table below.

The results obtained with Python are consistent with those produced by SPSS.
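To connect the score coefficients to actual scores: multiplying the standardized data by the eigenvector matrix yields the principal component scores, and the variance of each score column equals its eigenvalue. A minimal self-contained sketch, with illustrative random data standing in for dfz:

```python
import numpy as np

# Illustrative standardized data (5 samples, 3 variables) standing in for dfz
rng = np.random.default_rng(1)
dfz = rng.normal(size=(5, 3))
dfz = (dfz - dfz.mean(axis=0)) / dfz.std(axis=0)

# Correlation matrix and its sorted eigen-decomposition
R = np.corrcoef(dfz.T)
featValue, featVec = np.linalg.eigh(R)    # eigh: for symmetric matrices, guaranteed real output
idx = np.argsort(featValue)[::-1]         # largest eigenvalue first
featValue, featVec = featValue[idx], featVec[:, idx]

# Principal component scores: standardized data times the eigenvector matrix
scores = dfz @ featVec

# Sanity check: the variance of each score column equals its eigenvalue
print(np.allclose(np.var(scores, axis=0), featValue))   # True
```

This check is a quick way to confirm that the eigenvectors were kept aligned with their eigenvalues after sorting.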