Principal component analysis (PCA) and wine data set example code

1. Main steps of principal component analysis

Principal component analysis (PCA) is an unsupervised linear transformation technique that is widely used in many fields, especially for feature extraction and dimensionality reduction. Its application areas include exploratory data analysis and denoising for stock market trading, as well as the analysis of genomic data and gene expression levels in bioinformatics.

PCA helps us identify patterns in the data based on correlations between features. PCA aims to find the directions of maximum variance in high-dimensional data and project the data onto a new subspace whose dimensionality is smaller than or equal to that of the original data. Given that the new feature axes are constrained to be orthogonal to each other, the orthogonal axes (principal components) of this subspace can be interpreted as the directions of maximum variance, as shown in the figure below, where x1 and x2 are the original feature axes and PC1 and PC2 are the principal component directions.

If we use PCA for dimensionality reduction, we construct a d\times k-dimensional transformation matrix W that maps the feature vector x of a training sample to a new k-dimensional feature subspace, which has fewer dimensions than the original d-dimensional feature space.
For example, suppose we have a feature vector x:

x=[x_{1},x_{2},...,x_{d}], x\in \mathbb{R}^{d}
Then transform it with a transformation matrix W\in \mathbb{R}^{d\times k}:

xW=z

The results are expressed in vector form as follows:

z=[z_{1},z_{2},...,z_{k}], z\in \mathbb{R}^{k}
After transforming the original d-dimensional data into the new k-dimensional subspace (typically k ≤ d), the first principal component has the largest variance. In addition, the PCA directions are highly sensitive to the scale of the data, so features need to be standardized before performing PCA.
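As a quick illustration of the mapping xW = z, here is a minimal NumPy sketch; the random values and the demo names are purely hypothetical and only meant to show the shapes involved:

import numpy as np

# Minimal shape check for the mapping xW = z (random values, illustration only)
d, k = 13, 2                                      # original and reduced dimensionality
x_demo = np.random.randn(1, d)                    # one sample as a 1 x d row vector
W_demo = np.random.randn(d, k)                    # d x k transformation matrix
z_demo = x_demo.dot(W_demo)                       # 1 x k projected sample
print(x_demo.shape, W_demo.shape, z_demo.shape)   # (1, 13) (13, 2) (1, 2)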

Suppose we take only one principal component. Then w\in \mathbb{R}^{d\times 1}, and the principal component is z=xw with x\in \mathbb{R}^{1\times d}. Its variance is:

var(z)=var(xw)=w^{T}cov(x)w=w^{T}\Sigma w

Because \Sigma is a real symmetric positive semidefinite matrix, it has the eigendecomposition \Sigma =Q\Lambda Q^{T}, where Q is an orthogonal matrix whose columns are the unit-length eigenvectors of \Sigma, and \Lambda is a diagonal matrix whose diagonal entries are the eigenvalues arranged in descending order. So:

var(z)=w^{T}\Sigma w=w^{T}Q\Lambda Q^{T}w

Take w to be q_{1}, the eigenvector of \Sigma corresponding to the largest eigenvalue \lambda _{1}, with q_{1}\in \mathbb{R}^{d\times 1}. Then:

var(z)=w^{T}Q\Lambda Q^{T}w=q_{1}^{T}Q\Lambda Q^{T}q_{1}=\lambda _{1} \quad (\text{since } q_{1}^{T}q_{1}=1 \text{ and } q_{1}^{T}q_{i}=0, i=2,...,d)

In the same way, if two principal components are extracted, then w_{2}=q_{2} and the second principal component is z_{2}=xq_{2}.

At this point, var(z)+var(z_{2}) is maximized, and var(z)+var(z_{2})=\lambda _{1}+\lambda _{2}.

Based on this, we can construct the projection matrix W=[q_{1},q_{2}].
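The derivation can be checked numerically. The sketch below uses a small synthetic data set (not the wine data) and NumPy's eigh, and verifies that projecting onto the leading eigenvectors reproduces \lambda _{1} and \lambda _{1}+\lambda _{2}:

import numpy as np

# Numeric check of the derivation on synthetic data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 5)                    # 200 samples, d = 5 features
Sigma = np.cov(X_demo.T)                      # sample covariance matrix
lam, Q = np.linalg.eigh(Sigma)                # eigh returns ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]                # reorder to descending

q1, q2 = Q[:, 0], Q[:, 1]                     # leading eigenvectors
z1, z2 = X_demo.dot(q1), X_demo.dot(q2)       # first and second principal components

print(np.allclose(z1.var(ddof=1), lam[0]))                      # var(z1) == lambda_1
print(np.allclose(z1.var(ddof=1) + z2.var(ddof=1),
                  lam[0] + lam[1]))                              # == lambda_1 + lambda_2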

The PCA steps are briefly summarized as follows (a compact NumPy sketch implementing them follows the list):
1) Standardize the d-dimensional data set.
2) Construct the covariance matrix.
3) Decompose the covariance matrix into eigenvectors and eigenvalues.
4) Sort the eigenvalues in descending order, thereby sorting the corresponding eigenvectors.
5) Select k eigenvectors corresponding to the k largest eigenvalues, where k is the dimension of the new feature subspace (k ≤ d).
6) Construct the projection matrix W from the first k eigenvectors.
7) Use the projection matrix W to transform the d-dimensional input data set X to obtain a new k-dimensional feature subspace.
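To make the procedure concrete, here is a compact NumPy sketch of these seven steps (illustrative only; the sections below walk through the same steps on the wine data set, and section 5 uses scikit-learn's implementation):

import numpy as np

def pca_fit_transform(X, k):
    """Minimal NumPy sketch of the seven PCA steps above (illustrative only)."""
    # 1) standardize the d-dimensional data set
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2) construct the covariance matrix
    cov = np.cov(X_std.T)
    # 3) decompose it into eigenvalues and eigenvectors
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 4) sort eigenvalues (and the corresponding eigenvectors) in descending order
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    # 5)-6) keep the top-k eigenvectors as the projection matrix W (d x k)
    W = eig_vecs[:, :k]
    # 7) project the data onto the new k-dimensional subspace
    return X_std.dot(W)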

2. Extract principal components

The first four steps of PCA are discussed below:
1) Standardize the data set.
2) Construct the covariance matrix.
3) Obtain the eigenvalues and eigenvectors of the covariance matrix.
4) Sort the eigenvalues in descending order, thereby ordering the corresponding eigenvectors.

First, load the wine dataset.

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()

Split the data into training and test sets at a 7:3 ratio, then standardize the features to zero mean and unit variance.

from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3,
                     stratify=y,
                     random_state=0)
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() #Standardize
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

Then, construct the covariance matrix.

The population covariance between features x_{j} and x_{k} is calculated as follows:

\Sigma_{jk}=\frac{1}{n}\sum_{i=1}^{n}(x_{j}^{(i)}-\mu_{j})(x_{k}^{(i)}-\mu_{k})

Each eigenvector v satisfies the condition:

\Sigma v=\lambda v

where \lambda is the corresponding eigenvalue.
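As a side note, the formula above uses 1/n, while NumPy's np.cov defaults to the unbiased 1/(n-1) estimate; the small sketch below (synthetic data, hypothetical X_demo) shows how to reproduce the 1/n version with bias=True:

import numpy as np

# Check the covariance formula on synthetic data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.randn(50, 3)

n = X_demo.shape[0]
mu = X_demo.mean(axis=0)
Sigma_manual = (X_demo - mu).T.dot(X_demo - mu) / n              # 1/n, as in the formula above
print(np.allclose(Sigma_manual, np.cov(X_demo.T, bias=True)))    # True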

Call NumPy's linalg.eig function to obtain the eigenvalues and eigenvectors of the covariance matrix of the wine data set.

import numpy as np
cov_mat = np.cov(X_train_std.T) #Calculate the covariance matrix of the standardized training set
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat) #Use np.linalg.eig to complete eigendecomposition

print('\nEigenvalues \n%s' % eigen_vals)  #eigen_vals holds the 13 eigenvalues

This produces 13 eigenvalues in eigen_vals, and the corresponding eigenvectors are stored in the columns of the 13\times 13 matrix eigen_vecs.
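As an optional sanity check (using the cov_mat, eigen_vals and eigen_vecs computed above), each column of eigen_vecs satisfies the eigenvector condition; for symmetric matrices such as a covariance matrix, np.linalg.eigh is a common alternative that returns real eigenvalues in ascending order:

# Each column i of eigen_vecs is an eigenvector for eigen_vals[i]
for i in range(len(eigen_vals)):
    assert np.allclose(cov_mat.dot(eigen_vecs[:, i]),
                       eigen_vals[i] * eigen_vecs[:, i])

# Alternative for symmetric matrices: real eigenvalues, returned in ascending order
eigen_vals_h, eigen_vecs_h = np.linalg.eigh(cov_mat)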

3. Total variance and explained variance

Because we want to reduce the dimensionality, we only keep the k most informative eigenvectors (principal components). Before doing so, we first calculate the explained variance ratio of each eigenvalue. The explained variance ratio of an eigenvalue \lambda _{j} is the ratio of \lambda _{j} to the sum of all eigenvalues:

Explained variance ratio = \frac{\lambda_{j}}{\sum_{j=1}^{d}\lambda_{j}}

Call NumPy's cumsum function to calculate the cumulative sum of explained variances, and then plot it with Matplotlib's step function.

tot = sum(eigen_vals) #Calculate the sum of eigenvalues
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)] #Calculate the proportion of variance explained by each eigenvalue
cum_var_exp = np.cumsum(var_exp) #Cumulative summation of variance explanation ratio
import matplotlib.pyplot as plt


plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
        label='Individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The results show that the first principal component itself accounts for about 40% of the variance, and the first two principal components combined can explain almost 60% of the variance in the data set.

4. Feature transformation

After decomposing the covariance matrix into eigenvalues and eigenvectors, we complete the final three steps to transform the wine data set onto the new principal component axes.

The rest of the steps are as follows:
5) Select k eigenvectors corresponding to the first k largest eigenvalues, where k is the dimension of the new feature subspace (k ≤ d).
6) Construct the projection matrix W using the first k eigenvectors.
7) Use the projection matrix W to transform the d-dimensional input data set X to obtain a new k-dimensional feature subspace.

Arrange the eigenvectors in descending order of eigenvalues, construct a projection matrix from the selected eigenvectors, and use the projection matrix to transform the data into a low-dimensional subspace.
Sort the eigenvectors in descending order of eigenvalues:

eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))] #Pair the absolute value of each eigenvalue with its corresponding eigenvector as a tuple.

eigen_pairs.sort(key=lambda k: k[0], reverse=True)
#lambda k: k[0] sorts the pairs by their first element (the eigenvalue); reverse=True sorts in descending order.

Collecting the eigenvectors corresponding to the two largest eigenvalues captures approximately 60% of the variance in the data set. Only two eigenvectors are selected here for the sake of illustration.

w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
#eigen_pairs[0] represents the first eigenvalue-eigenvector pair, and eigen_pairs[0][1] represents the eigenvector corresponding to the eigenvalue. Convert it to a column vector via [:, np.newaxis].
#Use the np.hstack() function to splice these two column vectors horizontally to form the projection matrix W.
print('Matrix W:\n', w)
#Each column is a principal component

This creates a 13\times 2-dimensional projection matrix W from the top two eigenvectors.

Use the projection matrix to transform x (representing a 13-dimensional row vector) into the PCA subspace (principal components 1 and 2) to obtain x’, a two-dimensional instance vector composed of two new features:

x' = xW

X_train_std[0].dot(w) #The first sample data of X_train_std is linearly transformed using the projection matrix W. 

The entire training set is transformed into two principal components by calculating the matrix dot product:

X' = XW

X_train_pca = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0], #Projection of the samples with label l onto the first principal component
                X_train_pca[y_train == l, 1],
                c=c, label=l, marker=m)

plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

The scatter plot shows that the data spreads more along the x-axis (first principal component) than along the second principal component (y-axis), which is consistent with the explained variance ratios computed earlier.

5. Implement using sklearn

Calculate 13 explained variance ratios.

from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_ #Each variance ratio

Visualize the explained variance ratios.

plt.bar(range(1, 14), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.step(range(1, 14), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')

plt.show()

Set the number of principal components to 2, transform the standardized training and test data, and visualize the first two principal components of the training set.

pca = PCA(n_components=2) #The number of features after dimensionality reduction is 2
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()

PCA is a transformer class in scikit-learn. We first fit the model with the training data, and then use the same model parameters to transform both the training data and the test data. Below, we apply the PCA class to the wine training data set, classify the transformed samples with logistic regression, and call the plot_decision_regions function to visualize the decision regions.

from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))]) #colors[:len(np.unique(y))] means to select the corresponding number of colors according to the number of categories.

    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 #X[:, 0] extracts the first feature (here the first principal component) of all samples
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), #x1_min to x1_max is a one-dimensional array with a step size of resolution
                           np.arange(x2_min, x2_max, resolution))
    #np.meshgrid generates two-dimensional grid coordinate matrices from the two ranges.
    
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T) #Flatten the two-dimensional grid coordinate matrices of xx1 and xx2 into one dimension
    #Use the predict method of the classifier to predict the flattened feature vector and obtain the corresponding category label. For each grid point, a predicted value is obtained.
    
    Z = Z.reshape(xx1.shape)
    
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)): #For each unique class label cl, idx is its position, used to select a color and marker
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.6,
                    color=cmap(idx),
                    edgecolor='black',
                    marker=markers[idx],
                    label=cl)
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

lr = LogisticRegression(multi_class='ovr', random_state=1, solver='lbfgs')
lr = lr.fit(X_train_pca, y_train)
plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

The decision regions for the training data are now plotted in the space spanned by the two principal component axes.

Plotting the decision regions for the test data set as well shows that the classifier separates the samples fairly well.

plot_decision_regions(X_test_pca, y_test, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()
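As a quick follow-up (not part of the original walkthrough), the classifier's accuracy on the PCA-reduced training and test sets can be checked with the score method of LogisticRegression:

# Accuracy of the logistic regression model on the two-dimensional PCA features
print('Training accuracy: %.3f' % lr.score(X_train_pca, y_train))
print('Test accuracy: %.3f' % lr.score(X_test_pca, y_test))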

Initializing the PCA class with the parameter n_components set to None retains all principal components; all sorted principal components are returned instead of performing dimensionality reduction. The explained variance ratios can then be accessed through the explained_variance_ratio_ attribute.

pca = PCA(n_components=None)
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_
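As an aside, scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it keeps just enough components to explain that fraction of the variance; a minimal sketch:

# Keep as many components as needed to explain, for example, 95% of the variance
pca_95 = PCA(n_components=0.95)
X_train_pca_95 = pca_95.fit_transform(X_train_std)
print(pca_95.n_components_)                     # number of components retained
print(pca_95.explained_variance_ratio_.sum())   # cumulative ratio, at least 0.95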
