Data Analysis Practice | Bayesian Classification Algorithm – Automatic Case Diagnosis Analysis

Table of Contents

1. Data and analysis objects

2. Purpose and analysis tasks

3. Methods and Tools

4. Data reading

5. Data understanding

6. Data preparation

7. Model training

8. Model evaluation

9. Model parameter adjustment

10. Model prediction


1. Data and analysis objects

CSV file – “bc_data.csv”

Dataset link: https://download.csdn.net/download/m0_70452407/88524905

This data set mainly records 32 attributes of 569 cases. The main attributes/fields are as follows:

(1) ID: ID of the case.

(2) Diagnosis: M means malignant, B means benign. This dataset contains a total of 357 benign cases and 212 malignant cases.

(3) 10 feature values of the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of these 10 features, three statistics are provided: the mean, the standard error (se), and the worst (largest) value.

2. Purpose and Analysis Tasks

Understand the application of machine learning methods in data analysis – using the Naive Bayes algorithm for classification analysis.

(1) Divide the data set into a training set and a test set at a certain ratio.

(2) Use the training set to model the Naive Bayes algorithm.

(3) Use the Naive Bayes classification model to predict the diagnosis results on the test set.

(4) Compare the diagnoses predicted by the Naive Bayes classification model with the true diagnoses to verify the effectiveness of the model.

3. Methods and Tools

Python language and scikit-learn package.

4. Data reading

import pandas as pd

# Use a raw string (r"...") so the backslashes in the Windows path are not
# treated as escape sequences.
df = pd.read_csv(r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Machinery Industry Press\Chapter 4 Classification Analysis\bc_data.csv",
                 header=0)
df.head()
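As a quick sanity check, the column structure should match the description in Section 1: 2 identifier columns (id and diagnosis) plus 10 nucleus features, each with 3 statistics, giving 32 columns in total. A minimal check, assuming df has loaded correctly:

# 2 identifier columns + 10 features x 3 statistics (mean, se, worst) = 32 columns
print(len(df.columns))                                 # expected: 32
print([c for c in df.columns if c.endswith('_mean')])  # the 10 "_mean" features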

5. Data Understanding

To check whether the data set contains missing values, use the isnull() method from the pandas package to test for null values, combined with the any() method to check each feature for missing values.

df.isnull().any()
id False
diagnosis False
radius_mean False
texture_mean False
perimeter_mean False
area_mean False
smoothness_mean False
compactness_mean False
concavity_mean False
concave points_mean False
symmetry_mean False
fractal_dimension_mean False
radius_se False
texture_se False
perimeter_se False
area_se False
smoothness_se False
compactness_se False
concavity_se False
concave points_se False
symmetry_se False
fractal_dimension_se False
radius_worst False
texture_worst False
perimeter_worst False
area_worst False
smoothness_worst False
compactness_worst False
concavity_worst False
concave_points_worst False
symmetry_worst False
fractal_dimension_worst False
dtype: bool

As can be seen from the output, there are no missing values in the data set.
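If missing values were present, the same idea can be extended to count them per column by combining isnull() with sum() instead of any(); this is a standard pandas idiom:

df.isnull().sum()    # per-column count of missing values (all 0 for this data set)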

Conduct exploratory analysis on the data frame df by calling its describe() method from the pandas package, which returns summary statistics for each numeric column.

df.describe()

In addition to the describe() method, the shape attribute can be used for exploratory analysis; it returns the number of rows and columns of the data frame.

df.shape
(569, 32)
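The class balance stated in Section 1 (357 benign, 212 malignant) can also be confirmed directly with the pandas value_counts() method:

df['diagnosis'].value_counts()    # expected: B 357, M 212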

6. Data preparation

The classification task in this project is a binary classification task, so the diagnosis values in the data frame df must be converted to the numeric values 0 and 1. Here, the LabelEncoder class of the preprocessing module in the scikit-learn package is used.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['diagnosis'] = encoder.fit_transform(df['diagnosis'])
df

It can be seen that the original diagnosis column has been converted from M (malignant) and B (benign) to 1 (malignant) and 0 (benign).
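The mapping can be verified through the classes_ attribute of the fitted LabelEncoder: classes are sorted alphabetically, so index 0 corresponds to B and index 1 to M.

encoder.classes_    # array(['B', 'M'], dtype=object), i.e. B -> 0, M -> 1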

Divide the data set into a training set and a test set at a ratio of 7:3. First assign the nucleus feature set (all columns of the data frame df except the first two) to the variable x, and assign the diagnosis column to the variable y for later use. Then use the train_test_split() method of the model_selection module in the scikit-learn package to split the data set.

from sklearn.model_selection import train_test_split

x = df.iloc[:, 2:]
y = df['diagnosis']
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=40,
                                                    stratify=y)
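Because stratify=y is passed, both subsets preserve the benign-to-malignant ratio of the full data set (roughly 63:37 here). A quick check, for illustration:

print(y_train.value_counts(normalize=True))    # class proportions in the training set
print(y_test.value_counts(normalize=True))     # should be nearly identical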

7. Model Training

The naive_bayes module in the scikit-learn package provides several models suited to different feature types and distributions, including GaussianNB, BernoulliNB and MultinomialNB:

(1) GaussianNB assumes that the features follow a normal distribution and is used for continuous-valued features.

(2) BernoulliNB is used for binary (0/1) discrete features.

(3) MultinomialNB is used for multinomial discrete features, such as counts.

The features of the data set here are all continuous variables, so GaussianNB is used for model training.

from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
gnb_clf.fit(x_train, y_train)
GaussianNB()
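To make the Gaussian assumption concrete, the sketch below recomputes the decision rule by hand for a single test sample: for each class c it evaluates log P(c) plus the summed log-densities of per-feature normal distributions, using the fitted attributes theta_ (per-class feature means), var_ (per-class feature variances; named sigma_ in scikit-learn versions before 1.0) and class_prior_. This is an illustration of the idea, not scikit-learn's exact implementation:

import numpy as np

def gaussian_nb_log_scores(x, means, variances, log_priors):
    # log P(c) + sum_i log N(x_i; mean_ic, var_ic), for each class c
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)
                             + (x - means) ** 2 / variances).sum(axis=1)
    return log_priors + log_likelihood

x0 = x_test.iloc[0].to_numpy()
scores = gaussian_nb_log_scores(x0, gnb_clf.theta_, gnb_clf.var_,
                                np.log(gnb_clf.class_prior_))
print(scores.argmax(), gnb_clf.predict(x_test.iloc[[0]])[0])    # the two should agree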

8. Model Evaluation

Here, accuracy, precision, recall and the F1 score are used to evaluate the model. The metrics module in scikit-learn provides the corresponding accuracy_score(), precision_score(), recall_score(), and f1_score() methods.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gnb_ypred = gnb_clf.predict(x_test)
print("Accuracy: %f, Precision: %f, Recall: %f, F1 score: %f"
      % (accuracy_score(y_test, gnb_ypred), precision_score(y_test, gnb_ypred),
         recall_score(y_test, gnb_ypred), f1_score(y_test, gnb_ypred)))
Accuracy: 0.935673, Precision: 0.964912, Recall: 0.859375, F1 score: 0.909091
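Since malignant (1) is treated as the positive class, precision here is the fraction of predicted-malignant cases that are truly malignant, and recall is the fraction of truly malignant cases that are detected. The confusion matrix from scikit-learn's metrics module makes the underlying counts explicit:

from sklearn.metrics import confusion_matrix

# rows = true classes, columns = predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, gnb_ypred))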

9. Model parameter adjustment

GaussianNB accepts two parameters, priors and var_smoothing. priors defines the prior probabilities of the sample classes; by default they are estimated from the training data, so priors is usually left unset. var_smoothing (default 1e-9) is mainly used to improve the numerical stability of the model: a fraction of the largest feature variance, set by var_smoothing, is added to every estimated variance.
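Roughly, the smoothing works as sketched below (an illustration of the documented behaviour, not scikit-learn's source code):

import numpy as np

var_smoothing = 1e-9                                     # the default value
epsilon = var_smoothing * np.var(x_train, axis=0).max()  # fraction of the largest feature variance
# inside the model, each per-class feature variance is estimated as var + epsilon,
# which avoids divisions by (near-)zero variances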

Here, the grid search function of the model_selection module in the scikit-learn package, i.e. the GridSearchCV() method, is called to tune the model parameters. First define a variable params to store the candidate values of var_smoothing (here the search range is taken to be [1e-7, 1e-8, 1e-9, 1e-10, 1e-11, 1e-12]).

from sklearn.model_selection import GridSearchCV

params = {'var_smoothing': [1e-7, 1e-8, 1e-9, 1e-10, 1e-11, 1e-12]}
gnb_grid_clf = GridSearchCV(GaussianNB(), params, cv=5, verbose=2)
gnb_grid_clf.fit(x_train, y_train)
Fitting 5 folds for each of 6 candidates, totaling 30 fits
[CV] END ........................var_smoothing=1e-07; total time= 0.0s
[CV] END ........................var_smoothing=1e-07; total time= 0.0s
[CV] END ........................var_smoothing=1e-07; total time= 0.0s
[CV] END ........................var_smoothing=1e-07; total time= 0.0s
[CV] END ........................var_smoothing=1e-07; total time= 0.0s
[CV] END ........................var_smoothing=1e-08; total time= 0.0s
[CV] END ........................var_smoothing=1e-08; total time= 0.0s
[CV] END ........................var_smoothing=1e-08; total time= 0.0s
[CV] END ........................var_smoothing=1e-08; total time= 0.0s
[CV] END ........................var_smoothing=1e-08; total time= 0.0s
[CV] END ........................var_smoothing=1e-09; total time= 0.0s
[CV] END ........................var_smoothing=1e-09; total time= 0.0s
[CV] END ........................var_smoothing=1e-09; total time= 0.0s
[CV] END ........................var_smoothing=1e-09; total time= 0.0s
[CV] END ........................var_smoothing=1e-09; total time= 0.0s
[CV] END ........................var_smoothing=1e-10; total time= 0.0s
[CV] END ........................var_smoothing=1e-10; total time= 0.0s
[CV] END ........................var_smoothing=1e-10; total time= 0.0s
[CV] END ........................var_smoothing=1e-10; total time= 0.0s
[CV] END ........................var_smoothing=1e-10; total time= 0.0s
[CV] END ........................var_smoothing=1e-11; total time= 0.0s
[CV] END ........................var_smoothing=1e-11; total time= 0.0s
[CV] END ........................var_smoothing=1e-11; total time= 0.0s
[CV] END ........................var_smoothing=1e-11; total time= 0.0s
[CV] END ........................var_smoothing=1e-11; total time= 0.0s
[CV] END ........................var_smoothing=1e-12; total time= 0.0s
[CV] END ........................var_smoothing=1e-12; total time= 0.0s
[CV] END ........................var_smoothing=1e-12; total time= 0.0s
[CV] END ........................var_smoothing=1e-12; total time= 0.0s
[CV] END ........................var_smoothing=1e-12; total time= 0.0s
GridSearchCV(cv=5, estimator=GaussianNB(),
             param_grid={'var_smoothing': [1e-07, 1e-08, 1e-09, 1e-10, 1e-11,
                                           1e-12]},
             verbose=2)

Here, the GridSearchCV() method is passed the GaussianNB model, the variable params holding the parameter values to be searched, the cross-validation parameter cv (five-fold cross-validation here), and the training-log parameter verbose (0: the training process is not displayed; 1: the training process is occasionally output; >1: the training process is output for every sub-model).

Then use the best_params_ attribute of GridSearchCV to view the parameter values that achieved the highest cross-validated accuracy.

gnb_grid_clf.best_params_
{'var_smoothing': 1e-10}

It can be seen that, within the given range of var_smoothing values, the model achieves the highest accuracy at 1e-10.
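The full cross-validation results are available through the cv_results_ attribute of GridSearchCV, for example to inspect the mean cross-validated accuracy per candidate value:

import pandas as pd

results = pd.DataFrame(gnb_grid_clf.cv_results_)
print(results[['param_var_smoothing', 'mean_test_score']])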

10. Model Prediction

A trained model makes predictions through its predict() method. Here, the GaussianNB model with default parameters and the one after parameter tuning are both used to classify the test set, and the evaluation metrics above are used to compare them.

First, use the default GaussianNB to classify the test set, store the predictions in the variable gnb_ypred, and output the model's accuracy, precision, recall and F1 score.

gnb_ypred = gnb_clf.predict(x_test)
print("Accuracy: %f, Precision: %f, Recall: %f, F1 score: %f"
      % (accuracy_score(y_test, gnb_ypred), precision_score(y_test, gnb_ypred),
         recall_score(y_test, gnb_ypred), f1_score(y_test, gnb_ypred)))
Accuracy: 0.935673, Precision: 0.964912, Recall: 0.859375, F1 score: 0.909091

Then, use the tuned GaussianNB to classify the test set, store the predictions in tuned_ypred, and output the model's accuracy, precision, recall and F1 score.

tuned_ypred = gnb_grid_clf.best_estimator_.predict(x_test)
print("Accuracy: %f, Precision: %f, Recall: %f, F1 score: %f"
      % (accuracy_score(y_test, tuned_ypred), precision_score(y_test, tuned_ypred),
         recall_score(y_test, tuned_ypred), f1_score(y_test, tuned_ypred)))
Accuracy: 0.941520, Precision: 0.965517, Recall: 0.875000, F1 score: 0.918033
Equivalently, the model can be refit directly with the best parameter found by the grid search, which yields the same fitted model as best_estimator_:

tuned = GaussianNB(var_smoothing=1e-10)
tuned.fit(x_train, y_train)
tuned_ypred1 = tuned.predict(x_test)
print("Accuracy: %f, Precision: %f, Recall: %f, F1 score: %f"
      % (accuracy_score(y_test, tuned_ypred1), precision_score(y_test, tuned_ypred1),
         recall_score(y_test, tuned_ypred1), f1_score(y_test, tuned_ypred1)))

It can be seen that tuning the var_smoothing parameter of GaussianNB improves all four evaluation metrics to some extent.
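A compact way to see this is to tabulate both sets of metrics side by side; the small helper below is hypothetical, for illustration only:

import pandas as pd

def scores(y_true, y_pred):
    return [accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
            recall_score(y_true, y_pred), f1_score(y_true, y_pred)]

comparison = pd.DataFrame({'default': scores(y_test, gnb_ypred),
                           'tuned': scores(y_test, tuned_ypred)},
                          index=['accuracy', 'precision', 'recall', 'f1'])
print(comparison)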