Practical data analysis | KNN algorithm – automatic diagnosis and analysis of cases

Table of Contents

1. Data and analysis objects

2. Purpose and analysis tasks

3. Methods and Tools

4. Data reading

5. Data understanding

6. Data preparation

7. Model training

8. Model evaluation

9. Model parameter adjustment

10. Model improvement

11. Model prediction


1. Data and analysis objects

CSV file – “bc_data.csv”

Dataset link: https://download.csdn.net/download/m0_70452407/88524905

This data set mainly records 32 attributes of 569 cases. The main attributes/fields are as follows:

(1) ID: ID of the case.

(2) Diagnosis: M means malignant, B means benign. This dataset contains a total of 357 benign cases and 212 malignant cases.

(3) Ten characteristic values of the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of these ten characteristics, three statistics are provided: the mean, the standard error, and the worst (largest) value, giving 30 feature columns in total.
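The figures above can be checked without the CSV file. The following is a minimal sketch that assumes scikit-learn's bundled Wisconsin Diagnostic breast cancer data is an acceptable stand-in for bc_data.csv; the bundled copy has no id column and encodes the diagnosis as 0/1 instead of M/B, so the names used here are not those of the original file.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data stands in for bc_data.csv
bunch = load_breast_cancer()
X, y = bunch.data, bunch.target

n_cases, n_features = X.shape
# In this bundled copy the target codes are 0 = malignant, 1 = benign
n_malignant, n_benign = np.bincount(y)

print(n_cases, n_features)      # 569 30
print(n_malignant, n_benign)    # 212 357
print(list(bunch.feature_names[:3]))
```

The 30 feature columns plus the id and diagnosis columns account for the 32 attributes mentioned above.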

2. Purpose and Analysis Tasks

Understand how machine learning methods are applied in data analysis, using the KNN method for classification analysis.

(1) Use part of the samples as the training set for supervised learning, with “diagnosis” as the prediction target.

(2) Use the remaining records as the test set for evaluating the KNN model.

(3) Predict the diagnosis type of the test set with the KNN model.

(4) Compare the “predicted type” given by the KNN model with the “actual type” provided by the data set bc_data.csv to verify the effectiveness of the KNN model.

3. Methods and Tools

Python language and scikit-learn package

4. Data reading

import pandas as pd
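# Raw string (r"...") so that backslashes in the Windows path are not read as escape sequences such as "\b"
bc_data=pd.read_csv(r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Machinery Industry Press\Chapter 4 Classification Analysis\bc_data.csv"
                   ,header=0)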
bc_data.head()

5. Data Understanding

Perform exploratory analysis on the data frame bc_data by calling the describe() method of the pandas DataFrame, which summarizes each numeric column.

bc_data.describe()

In addition to the describe() method, the shape attribute gives the number of rows and columns, and the pandas_profiling package can also be used for exploratory analysis of the data frame.

bc_data.shape
(569, 32)

6. Data preparation

In the data frame bc_data, the data useful for breast cancer diagnostic analysis are the 30 feature columns derived from the 10 characteristics of the cell nucleus. To extract them, the “id” and “diagnosis” columns are removed from bc_data by calling the drop() method of the data frame. The frame without “id” is named “data”, the frame that additionally drops “diagnosis” is named “X_data”, and the head() method is used to inspect the result.

data=bc_data.drop(['id'],axis=1)
X_data=data.drop(['diagnosis'],axis=1)
X_data.head()

Next, call NumPy’s ravel() function to flatten the “diagnosis” column of the data frame data into a one-dimensional array, which serves as the label vector y_data.

import numpy as np
y_data=np.ravel(data[['diagnosis']])
y_data[0:6]
array(['M', 'M', 'M', 'M', 'M', 'M'], dtype=object)
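Why the flattening step is needed can be seen on a small example. The sketch below uses a hypothetical four-row frame (toy, with made-up rows) rather than the real data: double brackets select a one-column DataFrame, which is two-dimensional, while scikit-learn expects a one-dimensional label array.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the real "data" frame
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"],
                    "radius_mean": [17.99, 13.54, 12.45, 20.57]})

two_d = toy[["diagnosis"]]           # double brackets select a one-column DataFrame
y_flat = np.ravel(two_d)             # flattened to a 1-D array of labels
y_alt = toy["diagnosis"].to_numpy()  # single brackets give a Series, already 1-D

print(two_d.shape, y_flat.shape)     # (4, 1) (4,)
```

Selecting the column with single brackets, as in y_alt, is an alternative that yields the same labels without the ravel() call.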

To diagnose breast cancer automatically with the KNN algorithm, the data are first randomly divided into a training set and a test set. This is done by calling train_test_split() from the model_selection module of the scikit-learn package, with the training set taking 75% of the records and the test set the remaining 25% (test_size=0.25). The describe() method of the pandas DataFrame is then called on the training set.

from sklearn.model_selection import train_test_split
X_trainingSet,X_testSet,y_trainingSet,y_testSet=train_test_split(
    X_data,y_data,random_state=1,test_size=0.25)
X_trainingSet.describe()

As before, the shape attribute can be used to check the size of the training set.

X_trainingSet.shape
(426, 30)

The same checks are performed on the test set data frame.

X_testSet.describe()

X_testSet.shape
(143, 30)
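The sizes 426 and 143 follow from test_size=0.25 applied to 569 records: scikit-learn rounds the test count up (0.25 × 569 = 142.25, so 143) and leaves the rest for training. A minimal sketch with placeholder arrays of the same shape (X_demo and y_demo are hypothetical names, and the zeros are not real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X_data / y_data (values are not real data)
X_demo = np.zeros((569, 30))
y_demo = np.array(["M"] * 212 + ["B"] * 357)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, test_size=0.25)

print(X_tr.shape, X_te.shape)   # (426, 30) (143, 30)
```

Passing stratify=y_demo would additionally keep the benign/malignant proportions similar in both parts; the original split does not use this option.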

The scaler is first fitted on the training set, which computes the mean and variance of each feature, and the training set and test set are then each standardized with these statistics. This is done with the StandardScaler class of the preprocessing module in the scikit-learn package. The training set is standardized as follows:

from sklearn.preprocessing import StandardScaler
means_normalization=StandardScaler() #Standardization: zero mean and unit variance per feature
means_normalization.fit(X_trainingSet) #Fit on the training set to obtain the mean and variance of each feature
X_train_normalization=means_normalization.transform(X_trainingSet)
X_train_normalization
array([[ 0.30575375, 2.59521918, 0.46246107, ..., 1.81549702,
         2.10164609, 3.38609913],
       [0.23351721, -0.05334893, 0.20573083, ..., 0.5143837,
         0.14721854, 0.05182385],
       [0.15572401, 0.18345881, 0.11343692, ..., 0.69446859,
         0.263409 , -0.10011179],
       ...,
       [0.85586279, 1.19276558, 0.89773369, ..., 1.12967374,
         0.75591781, 2.97065009],
       [-0.02486734, 0.44095848, -0.08606303, ..., -0.52515632,
        -1.1291423 , -0.45561747],
       [-0.30270019, -0.20968802, -0.37543871, ..., -0.967865,
        -1.54361274, -1.31500348]])

The test set is standardized with the same fitted scaler, that is, with the mean and variance of the training set.

X_test_normalization=means_normalization.transform(X_testSet)
X_test_normalization
array([[ 0.15850234, -1.23049032, 0.25369143, ..., -0.05738582,
        -0.08689656, 0.48863884],
       [-0.2638036, -0.15450952, -0.23961754, ..., 1.41330744,
         1.77388495, 2.02105229],
       [-0.32492682, -0.76147305, -0.35407811, ..., -0.1354226,
         0.87210827, 0.71179432],
       ...,
       [0.25852216, -0.06024625, 0.21500053, ..., -0.03937733,
        -1.03202789, -0.84910706],
       [1.46709506, 0.95825694, 1.49824869, ..., 0.62693676,
         0.07438274, -0.45739797],
       [-0.61942964, 0.42256565, -0.6261235 , ..., -0.48013509,
         0.34318156, -0.6134881 ]])
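What fit() and transform() compute can be reproduced by hand. The sketch below uses hypothetical toy arrays (train and test are made-up values) and checks that StandardScaler applies z = (x - mean) / std per feature, with the mean and standard deviation taken from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training rows and 2 test rows with 2 features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 0.0]])

scaler = StandardScaler().fit(train)   # statistics come from the training rows only

# Manual version of transform(): z = (x - mean) / std, using the population std (ddof=0)
mean, std = train.mean(axis=0), train.std(axis=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std      # the test rows reuse the training statistics

print(np.allclose(scaler.transform(train), manual_train))  # True
print(np.allclose(scaler.transform(test), manual_test))    # True
```

Reusing the training statistics for the test rows keeps information from the test set out of the preprocessing step.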

7. Model training

The training set is used to learn the concept “diagnosis result”, and the test set is used to check the KNN model. After the training and test data have been prepared, the next step is to determine the model parameters. The neighbor search in KNN can be implemented by brute force, a KD tree, or a ball tree. Brute force is suitable for small data sets, while the KD tree is more efficient on larger ones. Considering algorithm efficiency and the amount of data in this project, the KD tree is selected. A KNN classifier is created first, and its built-in parameters are used to set the three elements of KNN: the value of k, the distance measure, and the decision rule.

The model is trained with the KNeighborsClassifier class of the neighbors module in the scikit-learn package. The parameters are explained as follows:

(1) algorithm is the algorithm used to search for the nearest neighbors; here it is set to the KD tree ("kd_tree").

(2) leaf_size is the leaf size of the KD tree (or ball tree), which affects the speed of construction and query; the default is 30.

(3) metric is the distance metric; the default is minkowski.

(4) metric_params holds additional keyword arguments for the metric function; the default None is used.

(5) n_jobs is the number of parallel jobs for the neighbor search; the default None is used.

(6) n_neighbors is the number of neighbors, that is, the k value in the KNN algorithm.

(7) p is the power parameter of the Minkowski metric: 1 gives the Manhattan distance and 2 gives the Euclidean distance. Here p is set to 2, so the Euclidean distance is used.

(8) weights is the weight function used in prediction; the default is uniform (all neighbors weighted equally).
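The role of p in the Minkowski metric can be illustrated with a few lines of NumPy. This is a sketch for illustration only; the helper name minkowski is hypothetical and is not the routine scikit-learn uses internally.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

print(minkowski([0, 0], [3, 4], p=1))  # 7.0
print(minkowski([0, 0], [3, 4], p=2))  # 5.0
```

For the points (0, 0) and (3, 4), the Manhattan distance adds the coordinate differences (3 + 4), while the Euclidean distance is the straight-line length.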

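Then the fit() method trains the model on the training set, and the predict() method predicts the labels of the test set so that they can be compared with the known labels. Note that the code below passes X_trainingSet and X_testSet, not the standardized arrays X_train_normalization and X_test_normalization, so the results reported here are for the unscaled features.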

from sklearn.neighbors import KNeighborsClassifier
myModel=KNeighborsClassifier(algorithm="kd_tree",
                             leaf_size=30,
                             metric="minkowski",
                             metric_params=None,
                             n_jobs=None,
                             n_neighbors=5,
                             p=2,
                             weights="uniform")
myModel.fit(X_trainingSet,y_trainingSet)
y_predictSet=myModel.predict(X_testSet)
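The call above fits the model on X_trainingSet, so the standardized arrays prepared earlier are not used. One way to train on standardized features is to chain the scaler and the classifier in a pipeline. The following is a minimal sketch, assuming scikit-learn's bundled Wisconsin Diagnostic data as a stand-in for bc_data.csv; the name scaled_knn is hypothetical, and its score is not claimed to match the figures reported in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data stands in for bc_data.csv
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, test_size=0.25)

# The pipeline fits the scaler on the training rows, then feeds the scaled rows to KNN
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(algorithm="kd_tree", n_neighbors=5, p=2))
scaled_knn.fit(X_tr, y_tr)

y_pred = scaled_knn.predict(X_te)   # the test rows are scaled with the training statistics
print(len(y_pred), scaled_knn.score(X_te, y_te))
```

Because KNN relies on distances, features with large numeric ranges (such as area) otherwise dominate the neighbor search, which is the usual motivation for scaling first.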

The actual labels of the test set are as follows:

y_testSet
array(['B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B',
       'B', 'M', 'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'M',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'B',
       'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B',
       'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'M', 'B', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B'],
      dtype=object)

The results of prediction using the predict() function are as follows:

y_predictSet
array(['M', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B',
       'B', 'M', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'B',
       'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B',
       'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'M', 'M', 'M', 'B', 'B', 'M', 'M', 'B'],
      dtype=object)

Finally, use the get_params() method to query the parameters of the model:

myModel.get_params()
{'algorithm': 'kd_tree',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

As the output shows, get_params() returns the parameters as a dictionary, and the values are consistent with the settings above.

8. Model Evaluation

To evaluate the performance of the model, the prediction accuracy (accuracy score) is used as the metric. It is computed by calling the accuracy_score() function of the metrics module in the scikit-learn package.

from sklearn.metrics import accuracy_score
accuracy_score(y_testSet,y_predictSet)
0.9370629370629371

The output shows that the accuracy of the model on the test set is about 93.71% (134 of 143 cases predicted correctly), so further optimization can be considered.
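The accuracy score is simply the fraction of predictions that match the actual labels. The sketch below shows this on eight hypothetical labels (y_true and y_hat are made up, not taken from the real test set) and also prints a confusion matrix, which separates the two kinds of error that a single accuracy figure hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for eight cases (not taken from the real test set)
y_true = np.array(["B", "M", "B", "M", "M", "B", "B", "M"])
y_hat = np.array(["B", "M", "B", "B", "M", "B", "M", "M"])

manual = np.mean(y_true == y_hat)              # fraction of matching labels
print(manual, accuracy_score(y_true, y_hat))   # 0.75 0.75

# Rows are actual classes, columns are predicted classes, in the order given by labels
print(confusion_matrix(y_true, y_hat, labels=["B", "M"]))
```

In this toy case one malignant case is predicted as benign and one benign case as malignant; in a diagnostic setting the first kind of error is usually the more costly one.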

9. Model parameter adjustment

As noted above, the value of k has a considerable influence on the prediction results. The score() method, which returns the accuracy, is therefore used to compute the test-set accuracy for each k from 1 to 22.

import matplotlib.pyplot as plt
NumberOfNeighbors=range(1,23) #Candidate k values from 1 to 22
KNNs=[KNeighborsClassifier(n_neighbors=i) for i in NumberOfNeighbors]
#Fit each model on the training set and score it on the test set
scores=[knn.fit(X_trainingSet,y_trainingSet).score(X_testSet,y_testSet) for knn in KNNs]
plt.plot(NumberOfNeighbors,scores)
plt.title("Elbow Curve")
plt.xlabel("Number of Neighbors")
plt.ylabel("Score")
plt.show()

As the plot shows, the model prediction score is highest when k (i.e. n_neighbors) is 4, so the model is rebuilt with this value next.
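The best k can also be read off programmatically instead of from the plot. The following is a minimal sketch, again assuming scikit-learn's bundled Wisconsin Diagnostic data as a stand-in for bc_data.csv; the names k_values and best_k are hypothetical, and the k it selects is not claimed to match the value found above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data stands in for bc_data.csv
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, test_size=0.25)

k_values = list(range(1, 23))
scores = [KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
          for k in k_values]

# np.argmax returns the first position of the maximum, so ties go to the smallest k
best_k = k_values[int(np.argmax(scores))]
print(best_k, max(scores))
```

Because k is chosen here by looking at test-set scores, the resulting accuracy is somewhat optimistic; choosing k by cross-validation on the training set is a common alternative.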

10. Model Improvement

myModel_prove=KNeighborsClassifier(algorithm="kd_tree",
                             leaf_size=30,
                             metric="minkowski",
                             metric_params=None,
                             n_jobs=None,
                             n_neighbors=4,
                             p=2,
                             weights="uniform")
myModel_prove.fit(X_trainingSet,y_trainingSet)
y_predictSet=myModel_prove.predict(X_testSet)

11. Model Prediction

The results of prediction using the predict() function are as follows:

y_predictSet
array(['B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'B', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B',
       'B', 'M', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'B',
       'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B',
       'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B'],
      dtype=object)

The accuracy of the improved model is computed with accuracy_score() in the same way as before.

accuracy_score(y_testSet,y_predictSet)
0.9440559440559441

The output shows that the prediction accuracy rose from about 93.71% to about 94.41% (135 of 143 cases predicted correctly), so setting k to 4 improved the model on this test set.