Practical data analysis | SVM algorithm – automatic diagnosis and analysis of cases

Table of Contents

1. Data analysis and objects

2. Purpose and analysis tasks

3. Methods and Tools

4. Data reading

5. Data understanding

6. Data preparation

7. Model training

8. Model application and evaluation


1. Data analysis and objects

CSV file – “bc_data.csv”

Dataset link: https://download.csdn.net/download/m0_70452407/88524905

This dataset records 32 attributes for each of 569 cases. The main attributes/fields are as follows:

(1) ID: ID of the case.

(2) Diagnosis: M means malignant, B means benign. This dataset contains a total of 357 benign cases and 212 malignant cases.

(3) Ten features of the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of these ten features, three statistics are provided: the mean, the standard error, and the worst (largest) value, giving 30 feature columns in total (plus id and diagnosis, 32 columns).

2. Purpose and Analysis Tasks

(1) Use the training set to train the SVM model.

(2) Use the SVM model to predict the diagnosis results of the Wisconsin breast cancer data set.

(3) Evaluate the SVM model.

3. Methods and Tools

Python language and pandas, NumPy, matplotlib, scikit-learn packages.

Hyperparameters of svm.SVC and their interpretation:

C (float, must be positive; default 1.0): sklearn.svm.SVC penalizes with the square of the L2 norm, and C is the regularization coefficient of that penalty. The larger C is, the heavier the penalty for misclassified training points, so the classifier tries harder to separate the training data exactly; the smaller C is, the lighter the penalty, so more training errors are tolerated.

kernel (one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'; default 'rbf'): the kernel function type. 'rbf' is the radial basis function kernel, 'linear' is the linear kernel, and 'poly' is the polynomial kernel.

degree (int; default 3): the degree of the polynomial when kernel='poly' (the default is a cubic polynomial); ignored by the other kernels.

gamma ('scale', 'auto', or a float; default 'scale' since version 0.22, 'auto' before that): the kernel coefficient for 'rbf', 'poly', and 'sigmoid'.

decision_function_shape ('ovr' or 'ovo'; default 'ovr'): the strategy for multi-class problems. 'ovr' (one-vs-rest) builds k classifiers for k classes; 'ovo' (one-vs-one) builds k*(k-1)/2 classifiers.
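As a quick illustration (a standalone sketch; these parameter values are arbitrary, not the ones used later in this article), the hyperparameters above are passed to svm.SVC like this:

from sklearn import svm

#Soft-margin RBF classifier: a smaller C tolerates more training
#errors; gamma='scale' derives the kernel coefficient from the data
clf_rbf=svm.SVC(C=0.5,kernel='rbf',gamma='scale')

#Cubic polynomial kernel; degree only takes effect when kernel='poly'
clf_poly=svm.SVC(kernel='poly',degree=3)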

4. Data reading

Import required third-party packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Import sklearn's svm module
from sklearn import svm

#Import the metrics evaluation methods
from sklearn import metrics

#train_test_split is used to split the training set and test set
from sklearn.model_selection import train_test_split

#StandardScaler is used to standardize features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

Read in data:

df_bc_data=pd.read_csv(r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Mechanical Industry Press\Chapter 4 Classification Analysis\bc_data.csv")

(The path is written as a raw string so that backslash sequences in the Windows path, such as \b, are not interpreted as escape characters.)

Display the data set:

df_bc_data

5. Data Understanding

Conduct exploratory analysis on the data frame df_bc_data by calling the describe() method of the pandas DataFrame, which reports summary statistics for each numeric column.

df_bc_data.describe()

Check whether there are missing values in the dataset:

df_bc_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 id 569 non-null int64
 1 diagnosis 569 non-null object
 2 radius_mean 569 non-null float64
 3 texture_mean 569 non-null float64
 4 perimeter_mean 569 non-null float64
 5 area_mean 569 non-null float64
 6 smoothness_mean 569 non-null float64
 7 compactness_mean 569 non-null float64
 8 concavity_mean 569 non-null float64
 9 concave points_mean 569 non-null float64
 10 symmetry_mean 569 non-null float64
 11 fractal_dimension_mean 569 non-null float64
 12 radius_se 569 non-null float64
 13 texture_se 569 non-null float64
 14 perimeter_se 569 non-null float64
 15 area_se 569 non-null float64
 16 smoothness_se 569 non-null float64
 17 compactness_se 569 non-null float64
 18 concavity_se 569 non-null float64
 19 concave points_se 569 non-null float64
 20 symmetry_se 569 non-null float64
 21 fractal_dimension_se 569 non-null float64
 22 radius_worst 569 non-null float64
 23 texture_worst 569 non-null float64
 24 perimeter_worst 569 non-null float64
 25 area_worst 569 non-null float64
 26 smoothness_worst 569 non-null float64
 27 compactness_worst 569 non-null float64
 28 concavity_worst 569 non-null float64
 29 concave_points_worst 569 non-null float64
 30 symmetry_worst 569 non-null float64
 31 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
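Besides info(), a per-column count of missing values can be printed with isnull() (a small complementary check, not part of the original walkthrough):

#Count missing values per column; all zeros confirms the dataset is complete
print(df_bc_data.isnull().sum())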

Check whether there is any imbalance in the data:

df_bc_data['diagnosis'].value_counts()
B 357
M 212
Name: diagnosis, dtype: int64
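The counts can also be expressed as proportions (a supplementary view of the same imbalance): roughly 62.7% of the cases are benign and 37.3% malignant.

#Class proportions instead of raw counts
print(df_bc_data['diagnosis'].value_counts(normalize=True))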

6. Data preparation

Since the id column is neither an independent nor a dependent variable, delete this column.

new_bc=df_bc_data.drop(['id'],axis=1)

Replace the values of the diagnosis field: map ‘M’ to 1 and ‘B’ to 0.

new_bc['diagnosis']=new_bc['diagnosis'].map({'M':1,'B':0})

Split the data set into a training set and a test set, using 20% of the data as the test set.

bc_train,bc_test=train_test_split(new_bc,test_size=0.2)
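Note that this split is re-randomized on every run. If reproducibility or preserving the benign/malignant ratio in both subsets matters, random_state and stratify can be passed (an optional variant, not used for the results below; the seed 42 is arbitrary):

#Fixed seed plus stratification keeps the B/M ratio identical
#in the training and test sets
bc_train,bc_test=train_test_split(new_bc,test_size=0.2,
                                  random_state=42,
                                  stratify=new_bc['diagnosis'])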

Split the data attributes and labels of the training set and test set.

#Split the data and labels of the training set
bc_train_data=bc_train.iloc[:,1:]
bc_train_label=bc_train['diagnosis']
#Split the data and labels of the test set
bc_test_data=bc_test.iloc[:,1:]
bc_test_label=bc_test['diagnosis']

In order to eliminate the influence of differing feature scales on the results, the training data and test data need to be standardized. The scaler is fitted on the training data only and then applied to both sets, so that no information from the test set leaks into preprocessing.

scaler=StandardScaler().fit(bc_train_data)
bc_train_data=scaler.transform(bc_train_data)
bc_test_data=scaler.transform(bc_test_data)

7. Model Training

Use the training set to train the SVM model. In addition to specifying hyperparameter values directly, as done here, you can also search for them automatically, e.g. with GridSearchCV (see the sketch after the training code below).

bc_model=svm.SVC(C=0.2,kernel='linear') #Create the SVM classifier
bc_model.fit(bc_train_data,bc_train_label) #Train the model
SVC(C=0.2, kernel='linear')
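As mentioned above, GridSearchCV can automate the choice of C and kernel. A minimal sketch (the parameter grid here is illustrative, not taken from the original analysis):

from sklearn.model_selection import GridSearchCV

#Illustrative grid, evaluated with 5-fold cross-validation
#on the training set only
param_grid={'C':[0.1,0.2,0.5,1.0],'kernel':['linear','rbf']}
search=GridSearchCV(svm.SVC(),param_grid,cv=5)
search.fit(bc_train_data,bc_train_label)
print(search.best_params_) #Best hyperparameter combination found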

8. Model Application and Evaluation

Use the trained SVM model to predict on the test set and report the evaluation metrics.

#Apply the model on the test set and evaluate it
prediction=bc_model.predict(bc_test_data)
#Evaluation metrics
print("Confusion matrix:\n",metrics.confusion_matrix(bc_test_label,prediction))
print("Accuracy:",metrics.accuracy_score(bc_test_label,prediction))
print("Precision:",metrics.precision_score(bc_test_label,prediction))
print("Recall:",metrics.recall_score(bc_test_label,prediction))
print("F1 score:",metrics.f1_score(bc_test_label,prediction))
Confusion matrix:
 [[74  0]
 [ 1 39]]
Accuracy: 0.9912280701754386
Precision: 1.0
Recall: 0.975
F1 score: 0.9873417721518987
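The same results can also be summarized per class with classification_report (an alternative presentation; the target_names below map 0 to benign and 1 to malignant, following the encoding chosen in Section 6):

#Per-class precision, recall and F1 in a single table
print(metrics.classification_report(bc_test_label,prediction,
                                    target_names=['benign','malignant']))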
