Implementation of PLS-DA classification (based on sklearn)

Contents

A brief introduction

Code Implementation

Dataset division

Select the number of factors

Model training and classification

Calling the functions


A brief introduction

(compiled from various sources)

PLS-DA (partial least squares discriminant analysis) can be used for both classification and dimensionality reduction. Unlike PCA, which is unsupervised, PLS-DA is a supervised method. When the differences between sample groups are large and the differences within each group are small, an unsupervised method can separate the groups well. Conversely, when the between-group differences are small, unsupervised methods struggle to distinguish the groups; and if, in addition, the group sample sizes differ widely, the group with the larger sample size tends to dominate the model. Supervised analysis (PLS-DA) handles these situations well: because the group membership of the samples is known during the analysis, the method can select the characteristic variables that best separate the groups and thereby determine the relationships between samples. DA stands for discriminant analysis; PLS-DA uses partial least squares regression to reduce the dimensionality of the data, fits a regression model, and performs discriminant analysis on the regression output.

This article focuses on using PLS for classification.
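To make the idea concrete, here is a minimal sketch (an addition, not code from the original post; the toy data and variable names are only for illustration): the class labels are one-hot encoded, PLSRegression regresses them on the features, and the predicted column with the largest value is taken as the class.

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))             # toy feature matrix
y = rng.integers(0, 3, size=60)           # toy class labels 0/1/2

pls = PLSRegression(n_components=2)
pls.fit(X, pd.get_dummies(y))             # regress the one-hot labels on X
pred = np.argmax(pls.predict(X), axis=1)  # largest predicted column = predicted class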

Code Implementation

This mainly follows this post: https://zhuanlan.zhihu.com/p/374412915

Dataset division

First, the dataset has to be put into a fixed format: separate the independent variables (X) from the dependent variable (y), split the data into training and test sets, and return them.
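The snippets in this article assume the following imports (the list is simply collected from the functions used below):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import accuracy_score, confusion_matrix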

def deal_data(path):
    # Read the data matrix; the class label y is in the last column, the features x are in the columns before it
    spec = pd.read_excel(path)
    spec = np.array(spec)  # convert directly to a numpy array
    x = spec[:, 0:-1]  # all columns except the last one are independent variables
    y = spec[:, -1]

    # Split into training and test sets
    train_X, test_X, train_y, test_y = train_test_split(x, y, test_size=0.2)
    return train_X, test_X, train_y, test_y

Select the number of factors

PLS, like PCA, has a notion of components (factors), and different numbers of components give different results. We therefore train a model for each candidate number of components and use cross-validation to compare their performance, so that an appropriate number can be chosen.

def accuracy_component(xc, xv, yc, yv, component=8, n_fold=5):
    # xc: training features, xv: test features, yc: training labels, yv: test labels
    # component: maximum number of components to try
    # n_fold: number of folds (each fold serves once as the validation set during cross-validation)
    k_range = np.linspace(start=1, stop=component, num=component)

    kf = KFold(n_splits=n_fold, random_state=None, shuffle=True)  # n_splits = number of subsets for cross-validation

    accuracy_validation = np.zeros((1, component))  # mean cross-validation accuracy for each number of components
    accuracy_train = np.zeros((1, component))  # test-set accuracy of a model trained on the full training set, for each number of components
    for j in range(component):  # j in [0, component - 1], so j + 1 in [1, component]
        p = 0
        acc = 0  # acc accumulates the fold accuracies, p counts the folds, acc / p is the mean accuracy
        # Train on the full training set and evaluate on the held-out test set
        model_pls = PLSRegression(n_components=j + 1)  # use j + 1 components
        yc_labels = pd.get_dummies(yc)  # one-hot encode the training labels
        model_pls.fit(xc, yc_labels)
        y_pred = model_pls.predict(xv)
        y_pred = np.array([np.argmax(i) for i in y_pred])
        accuracy_train[:, j] = accuracy_score(yv, y_pred)  # accuracy on the test set
        # Cross-validation on the training set
        for train_index, test_index in kf.split(xc):  # n_fold rounds of cross-validation
            # Split the training set into fold-training and fold-validation parts
            X_train, X_test = xc[train_index], xc[test_index]
            y_train, y_test = yc[train_index], yc[test_index]
            YC_labels = pd.get_dummies(y_train)  # one-hot encode the fold-training labels
            model_1 = PLSRegression(n_components=j + 1)
            model_1.fit(X_train, YC_labels)
            Y_pred = model_1.predict(X_test)
            Y_pred = np.array([np.argmax(i1) for i1 in Y_pred])  # convert the one-hot style output back to class indices
            acc = accuracy_score(y_test, Y_pred) + acc
            p = p + 1
        accuracy_validation[:, j] = acc / p  # mean cross-validation accuracy with j + 1 components
    # For each number of components: one model trained on the full training set and scored on the test set,
    # plus the mean cross-validation accuracy on the training set
    print('Accuracy on the test set')
    print(accuracy_train)
    print('Mean cross-validation accuracy')
    print(accuracy_validation)
    plt.plot(k_range, accuracy_train.T, 'o-', label="Test set", color="r")
    plt.plot(k_range, accuracy_validation.T, 'o-', label="Cross-validation", color="b")
    plt.xlabel("N components")
    plt.ylabel("Score")
    plt.legend(loc="best")  # place the legend at the best position
    plt.rc('font', family='Times New Roman')
    plt.rcParams['font.size'] = 10
    plt.show()
    return accuracy_validation, accuracy_train

Below is what a run looks like. Because the data used here is random, the exact numbers do not matter much; judging from this run, three to four components work reasonably well.
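If you prefer to pick the number of components programmatically instead of reading it off the plot, one option (an addition, not part of the original post) is to take the component count with the highest mean cross-validation accuracy returned by accuracy_component:

# Illustrative addition: choose the component count with the best cross-validation accuracy
accuracy_validation, accuracy_train = accuracy_component(train_X, test_X, train_y, test_y,
                                                         component=8, n_fold=5)
best_n = int(np.argmax(accuracy_validation)) + 1  # column index 0 corresponds to 1 component
print('Best number of components by cross-validation:', best_n)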

Model training and classification

Next, select an appropriate number of components, train the classifier, and report the confusion matrix and some evaluation metrics.

def PLS_DA(train_X, test_X, train_y, test_y, n_components=6):
    # modeling
    model = PLSRegression(n_components=n_components)
    train_y = pd.get_dummies(train_y)  # one-hot encode the training labels
    model.fit(train_X, train_y)
    # prediction
    y_pred = model.predict(test_X)
    # convert the predictions (a matrix of class scores) to numerical labels
    y_pred = np.array([np.argmax(i) for i in y_pred])
    # model evaluation: confusion matrix and accuracy
    print('The test set confusion matrix is:\n', confusion_matrix(test_y, y_pred))
    print('Average classification accuracy:\n', accuracy_score(test_y, y_pred))

The resulting accuracy is at least clearly higher than the 33% that random classification would give.
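Beyond the confusion matrix and accuracy, per-class metrics are easy to add with sklearn's classification_report. The sketch below assumes a small hypothetical change to PLS_DA, namely that it ends with return y_pred:

from sklearn.metrics import classification_report

# Hypothetical: PLS_DA has been modified to return y_pred
y_pred = PLS_DA(train_X, test_X, train_y, test_y, n_components=3)
print(classification_report(test_y, y_pred))  # per-class precision, recall and F1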

Calling the functions

The pieces above are individual building blocks; finally, a main block is needed to string them together, as shown below. It is recommended to run it step by step, which also makes it easier to find and fix problems.

max_component = 8  # maximum number of components to try
n_fold = 10  # number of cross-validation folds
excel_path = './data.xlsx'  # path to the dataset
if __name__ == '__main__':
    train_X, test_X, train_y, test_y = deal_data(excel_path)  # process the data and return the training and test sets; adapt this to your own data
    # accuracy_component(train_X, test_X, train_y, test_y, max_component, n_fold)
    PLS_DA(train_X, test_X, train_y, test_y, n_components=3)
