Contents
A brief introduction
Code Implementation
Dataset division
Select the number of factors
Model training and classification
Call function
A brief introduction
(taken from various sources here)
PLS-DA can be used for both classification and dimensionality reduction. Unlike PCA, which is unsupervised, PLS-DA is supervised: it is partial least squares analysis run in a “supervised” mode. When the differences between sample groups are large and the differences within each group are small, an unsupervised method can separate the groups well. Conversely, when the differences between groups are small, unsupervised methods struggle to tell the groups apart. In addition, if the differences between groups are small and the group sample sizes vary widely, the group with the larger sample size will dominate the model. Supervised analysis (PLS-DA) handles these problems well: the group membership of each sample is known during the analysis, so the characteristic variables that distinguish the groups can be selected more effectively and the relationships between samples determined.
DA stands for discriminant analysis. PLS-DA uses partial least squares regression to “reduce the dimensionality” of the data, builds a regression model, and performs discriminant analysis on the regression output.
This article focuses on using PLS for classification.
Code Implementation
This mainly follows the post at: https://zhuanlan.zhihu.com/p/374412915
Dataset division
First, the data set must be put into a fixed format: identify the independent variables and the dependent variable, split the data into training and test sets, and return the pieces.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def deal_data(path):
    # Read the data matrix of independent and dependent variables;
    # the class label y is in the last column, everything before it is x
    spec = pd.read_excel(path)
    spec = np.array(spec)  # convert directly to a numpy array
    x = spec[:, 0:-1]  # all columns except the last are independent variables
    y = spec[:, -1]
    # Split into training and test sets
    train_X, test_X, train_y, test_y = train_test_split(x, y, test_size=0.2)
    return train_X, test_X, train_y, test_y
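As a sketch of the expected layout, the snippet below builds a small synthetic table with the class label in the last column and splits it in the same way as `deal_data`. The column names and the `stratify=y` and `random_state=0` arguments are additions for illustration and are not part of the original function; stratifying keeps the class proportions similar in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic table: four feature columns, then the class label (illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 4)), columns=["f1", "f2", "f3", "f4"])
df["label"] = np.repeat([0, 1, 2], 10)  # class y goes in the last column

data = df.to_numpy()
x, y = data[:, :-1], data[:, -1]

# Same 80/20 split, but stratified by class and with a fixed seed
train_X, test_X, train_y, test_y = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)
print(train_X.shape, test_X.shape)
```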
Select the number of factors
PLS, like PCA, works with components, and the number of components affects the final result. We therefore train models with different numbers of components and use cross-validation to compare their performance, so that a suitable number can be chosen.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def accuracy_component(xc, xv, yc, yv, component=8, n_fold=5):
    # xc: training set, xv: test set, yc: training labels, yv: test labels
    # component: maximum number of components to try
    # n_fold: number of cross-validation folds (each fold serves once as the validation set)
    k_range = np.arange(1, component + 1)
    kf = KFold(n_splits=n_fold, random_state=None, shuffle=True)
    accuracy_validation = np.zeros((1, component))  # mean cross-validation accuracy per component count
    accuracy_test = np.zeros((1, component))  # test-set accuracy per component count
    for j in range(component):  # j in [0, component-1], j + 1 in [1, component]
        p = 0  # number of folds evaluated
        acc = 0  # summed accuracy; acc / p is the mean accuracy
        # Fit on the full training set and evaluate on the test set
        model_pls = PLSRegression(n_components=j + 1)
        yc_labels = pd.get_dummies(yc, dtype=float)  # one-hot encode the training labels
        model_pls.fit(xc, yc_labels)
        y_pred = model_pls.predict(xv)
        # Map the largest predicted column back to its class label
        # (a bare argmax is only correct when the labels are 0, 1, 2, ...)
        y_pred = yc_labels.columns.to_numpy()[np.argmax(y_pred, axis=1)]
        accuracy_test[:, j] = accuracy_score(yv, y_pred)
        # Cross-validation on the training set
        for train_index, test_index in kf.split(xc):  # n_fold rounds
            X_train, X_test = xc[train_index], xc[test_index]
            y_train, y_test = yc[train_index], yc[test_index]
            YC_labels = pd.get_dummies(y_train, dtype=float)  # one-hot encode the fold's training labels
            model_1 = PLSRegression(n_components=j + 1)
            model_1.fit(X_train, YC_labels)
            Y_pred = model_1.predict(X_test)
            # Convert the one-hot predictions back to class labels
            Y_pred = YC_labels.columns.to_numpy()[np.argmax(Y_pred, axis=1)]
            acc = accuracy_score(y_test, Y_pred) + acc
            p = p + 1
        accuracy_validation[:, j] = acc / p  # mean accuracy for j + 1 components
    # Accuracy on the test set of the model fitted for each number of components
    print('Test set accuracy')
    print(accuracy_test)
    # Cross-validation accuracy on the training set
    print('Average accuracy of cross-validation')
    print(accuracy_validation)
    # Set the font before plotting so that it takes effect
    plt.rc('font', family='Times New Roman')
    plt.rcParams['font.size'] = 10
    plt.plot(k_range, accuracy_test.T, 'o-', label="Test set", color="r")
    plt.plot(k_range, accuracy_validation.T, 'o-', label="Cross-validation", color="b")
    plt.xlabel("N components")
    plt.ylabel("Score")
    plt.legend(loc="best")  # let matplotlib choose the legend position
    plt.show()
    return accuracy_validation, accuracy_test
Running this function prints the accuracies and plots them against the number of components. Because the data used here is random, the exact values are not important. Judging from the result, three to four components perform reasonably well.
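Reading the number of components off the plot can also be done programmatically. The helper below is a hypothetical sketch, not part of the original code: it takes an accuracy array shaped like the `accuracy_validation` array returned by `accuracy_component` and picks the smallest number of components whose accuracy is within a tolerance of the best. The name `pick_n_components`, the default tolerance, and the example accuracies are all assumptions.

```python
import numpy as np

def pick_n_components(accuracy_validation, tol=0.01):
    """Smallest component count whose accuracy is within `tol` of the best."""
    acc = np.asarray(accuracy_validation, dtype=float).ravel()
    good = np.flatnonzero(acc >= acc.max() - tol)
    return int(good[0]) + 1  # index 0 corresponds to 1 component

# Made-up accuracies for 1..6 components (illustration only)
example = np.array([[0.60, 0.78, 0.90, 0.905, 0.90, 0.88]])
print(pick_n_components(example))
```

Preferring the smallest count within a tolerance, rather than the strict maximum, is one common heuristic for keeping the model simple; whether it suits a given data set is a judgment call.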
Model training and classification
Next, a model with a suitable number of components is trained for classification, and the confusion matrix and accuracy are reported.
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import accuracy_score, confusion_matrix

def PLS_DA(train_X, test_X, train_y, test_y, n_components=6):
    # Modeling
    model = PLSRegression(n_components=n_components)
    train_y = pd.get_dummies(train_y, dtype=float)  # one-hot encode the training labels
    model.fit(train_X, train_y)
    # Predict
    y_pred = model.predict(test_X)
    # Convert predictions (class matrix) back to class labels
    y_pred = train_y.columns.to_numpy()[np.argmax(y_pred, axis=1)]
    # Model evaluation: confusion matrix and accuracy
    print('The test set confusion matrix is:\n', confusion_matrix(test_y, y_pred))
    print('Average classification accuracy:\n', accuracy_score(test_y, y_pred))
The resulting accuracy is at least higher than the 33% expected from random classification.
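Beyond overall accuracy, per-class indicators can be derived from the confusion matrix. The following is a small sketch with made-up labels, relying on scikit-learn's convention that rows are true classes and columns are predicted classes; the label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes (illustration only)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)     # rows: true, columns: predicted
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of that class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of that class
accuracy = np.trace(cm) / cm.sum()        # correct / all samples
print(cm, recall, precision, accuracy, sep="\n")
```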
Call function
The pieces above are all building blocks; a main entry point is needed to chain them together, as follows. Running the steps one at a time is recommended, since that makes problems easier to find and fix.
max_component = 8  # maximum number of components to try
n_fold = 10  # number of cross-validation folds
excel_path = './data.xlsx'  # path to the data set

if __name__ == '__main__':
    # Process the data and return the training and test sets; adapt to your own data as needed
    train_X, test_X, train_y, test_y = deal_data(excel_path)
    # accuracy_component(train_X, test_X, train_y, test_y, max_component, n_fold)
    PLS_DA(train_X, test_X, train_y, test_y, n_components=3)