Table of Contents
1. Data analysis and objects
2. Purpose and analysis tasks
3. Methods and Tools
4. Data reading
5. Data understanding
6. Data preparation
7. Model training
8. Model application and evaluation
1. Data analysis and objects
CSV file – “bc_data.csv”
Dataset link: https://download.csdn.net/download/m0_70452407/88524905
This data set mainly records 32 attributes of 569 cases. The main attributes/fields are as follows:
(1) ID: ID of the case.
(2) Diagnosis: M means malignant, B means benign. This dataset contains a total of 357 benign cases and 212 malignant cases.
(3) Ten features of the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. For each of these ten features, three statistics are provided: the mean, the standard error and the worst (largest) value.
2. Purpose and Analysis Tasks
(1) Use the training set to train the SVM model.
(2) Use the SVM model to predict the diagnosis results of the Wisconsin breast cancer data set.
(3) Evaluate the SVM model.
3. Methods and Tools
Python, with the pandas, NumPy, matplotlib and scikit-learn packages.
Hyperparameters of svm.SVC and their interpretation:
Parameter name | Parameter type | Description |
C | float, must be positive, default 1.0 | Regularization parameter. The penalty used in sklearn.svm.SVC is a squared L2 penalty, and C is the coefficient of the error term, so the strength of regularization is inversely proportional to C. The larger C is, the more heavily misclassifications are penalized and the more strictly the model fits the training data; the smaller C is, the more training errors are tolerated. |
kernel | one of the strings 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'; default 'rbf' | Kernel function type: 'rbf' is the radial basis function kernel, 'linear' is the linear kernel, and 'poly' is the polynomial kernel. |
degree | int, default 3 | Degree of the polynomial kernel; used only when kernel is 'poly'. The default is a cubic polynomial. |
gamma | 'scale', 'auto' or a float; default 'scale' (before version 0.22 the default was 'auto') | Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. |
decision_function_shape | 'ovr' or 'ovo'; default 'ovr' | Strategy for multi-class problems. 'ovr' is one-vs-rest (one-to-many): with k classes it corresponds to k classifiers. 'ovo' is one-vs-one: with k classes it corresponds to k*(k-1)/2 classifiers. SVC always trains with one-vs-one internally; this parameter only selects the shape of the decision function, and it is ignored for binary classification. |
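To make the table concrete, here is a minimal sketch that prints the default hyperparameters of svm.SVC and compares the decision function shapes for 'ovr' and 'ovo'. The four-class data from make_blobs is an assumption for illustration only; it is not the breast cancer data, which has just two classes.

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Default hyperparameters of svm.SVC
defaults = svm.SVC().get_params()
for name in ("C", "kernel", "degree", "gamma", "decision_function_shape"):
    print(name, "=", defaults[name])

# Synthetic 4-class data (illustrative assumption, not the article's dataset)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr_model = svm.SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovo_model = svm.SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# With k = 4 classes: 'ovr' gives k columns, 'ovo' gives k*(k-1)/2 columns
ovr_shape = ovr_model.decision_function(X).shape
ovo_shape = ovo_model.decision_function(X).shape
print(ovr_shape, ovo_shape)
```

With four classes, the 'ovr' decision function has 4 columns and the 'ovo' one has 6, which matches the k and k*(k-1)/2 counts in the table.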
4. Data reading
Import required third-party packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import sklearn's svm module
from sklearn import svm
# Import the metrics module for evaluation
from sklearn import metrics
# train_test_split splits the data into a training set and a test set
from sklearn.model_selection import train_test_split
# StandardScaler standardizes features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
Read in data:
# Raw string, so that backslashes in the Windows path are not treated as escape sequences
df_bc_data=pd.read_csv(r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Mechanical Industry Press\Chapter 4 Classification Analysis\bc_data.csv")
Display the data set:
df_bc_data
5. Data Understanding
Explore the data frame df_bc_data by calling the describe() method of the pandas DataFrame, which reports summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column.
df_bc_data.describe()
Check whether there are missing values in the dataset:
df_bc_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave_points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
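The info() output shows 569 non-null entries in every column, so the dataset has no missing values. A more direct check counts the missing values per column with isnull().sum(). The sketch below uses a small hypothetical frame (the name toy and its values are made up for illustration) with one deliberately missing entry, to show what a non-zero count looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value (not the article's data)
toy = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 19.69],
    "texture_mean": [10.38, 17.77, 21.25],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing values in the frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Applied to df_bc_data, the same call would be df_bc_data.isnull().sum(); given the info() output above, every count should be zero.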
Check whether there is any imbalance in the data:
df_bc_data['diagnosis'].value_counts()
B    357
M    212
Name: diagnosis, dtype: int64
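Raw counts are easier to judge as proportions. The sketch below rebuilds a label column from the counts reported above (357 B and 212 M) and computes the class shares and the majority-to-minority ratio; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the label column from the reported counts (357 benign, 212 malignant)
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

# Share of each class
proportions = diagnosis.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of the majority class to the minority class
counts = diagnosis.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

About 63% of the cases are benign and 37% malignant, a ratio of roughly 1.68 to 1, so the imbalance is mild.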
6. Data preparation
The id column is neither an independent variable nor a dependent variable, so drop it.
new_bc=df_bc_data.drop(['id'],axis=1)
Encode the diagnosis field numerically: 'M' becomes 1 and 'B' becomes 0.
new_bc['diagnosis']=new_bc['diagnosis'].map({'M':1,'B':0})
Split the data set into a training set and a test set, using 20% of the data as the test set.
bc_train,bc_test=train_test_split(new_bc,test_size=0.2)
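The call above draws a different random split on every run. As a hedged variant, the sketch below fixes random_state for reproducibility and passes stratify so that both subsets keep the same class ratio. The frame demo_bc is a hypothetical stand-in for new_bc with made-up feature columns, and the seed 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for new_bc: same label counts, random feature values
rng = np.random.default_rng(0)
demo_bc = pd.DataFrame({
    "diagnosis": [0] * 357 + [1] * 212,
    "feature_a": rng.normal(size=569),
    "feature_b": rng.normal(size=569),
})

# Reproducible split that preserves the class ratio in both subsets
demo_train, demo_test = train_test_split(
    demo_bc, test_size=0.2, random_state=42, stratify=demo_bc["diagnosis"]
)

print(len(demo_train), len(demo_test))
print(round(demo_test["diagnosis"].mean(), 3))
```

With 569 rows and test_size=0.2, the test set has 114 rows and the training set 455.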
Separate the feature columns from the labels in both the training set and the test set.
# Split the features and labels of the training set
bc_train_data=bc_train.iloc[:,1:]
bc_train_label=bc_train['diagnosis']
# Split the features and labels of the test set
bc_test_data=bc_test.iloc[:,1:]
bc_test_label=bc_test['diagnosis']
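To remove the influence of differing feature scales on the result, standardize the training data and the test data. The scaler is fitted on the training set only, and the same transformation is then applied to the test set, so that no information from the test set leaks into preprocessing.
scaler=StandardScaler()
bc_train_data=scaler.fit_transform(bc_train_data)
bc_test_data=scaler.transform(bc_test_data)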
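A common convention is to learn the mean and standard deviation from the training set and reuse them for the test set. The sketch below illustrates that convention on synthetic arrays (train_features and test_features are hypothetical, not the article's data) and shows that the transformed test set is built from the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features (illustrative assumption, not the breast cancer data)
rng = np.random.default_rng(0)
train_features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test_features = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler = StandardScaler()
# Learn mean and standard deviation from the training set, then transform it
train_scaled = scaler.fit_transform(train_features)
# Reuse the training statistics for the test set
test_scaled = scaler.transform(test_features)

print(train_scaled.mean(axis=0).round(6))  # close to 0 for every column
print(train_scaled.std(axis=0).round(6))   # close to 1 for every column
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected because it is measured against the training distribution.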
7. Model Training
Use the training set to train the SVM model. In addition to directly specifying parameter values, you can also use automatic parameter tuning (such as GridSearchCV) for parameter selection.
bc_model=svm.SVC(C=0.2,kernel='linear') # Create the SVM classifier
bc_model.fit(bc_train_data,bc_train_label) # Train the model
SVC(C=0.2, kernel='linear')
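As a rough illustration of the GridSearchCV option mentioned above, the sketch below searches over C and kernel with 5-fold cross-validation. Several things here are assumptions rather than the article's setup: it loads scikit-learn's bundled copy of the Wisconsin breast cancer data via load_breast_cancer (whose labels are encoded as 0 = malignant, 1 = benign, the opposite of the mapping used earlier), the candidate values in param_grid are arbitrary, and scaling is wrapped in a Pipeline so that it is refitted inside each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the dataset (assumption: stands in for bc_data.csv)
X, y = load_breast_cancer(return_X_y=True)

# Scale inside the pipeline so each CV fold is standardized on its own training part
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC()),
])

# Candidate hyperparameters (arbitrary illustrative values)
param_grid = {
    "svc__C": [0.1, 0.2, 1.0, 10.0],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In the workflow of this article, the search would be fitted on the training features and labels only, keeping the test set for the final evaluation.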
8. Model Application and Evaluation
Apply the trained SVM model to the test set and print the evaluation metrics.
# Apply the model to the test set
prediction=bc_model.predict(bc_test_data)
# Evaluation metrics
print("Confusion matrix:\n",metrics.confusion_matrix(bc_test_label,prediction))
print("Accuracy:",metrics.accuracy_score(bc_test_label,prediction))
print('Precision rate:',metrics.precision_score(bc_test_label,prediction))
print('Recall:',metrics.recall_score(bc_test_label,prediction))
print("F1 value:",metrics.f1_score(bc_test_label,prediction))
Confusion matrix:
 [[74  0]
 [ 1 39]]
Accuracy: 0.9912280701754386
Precision rate: 1.0
Recall: 0.975
F1 value: 0.9873417721518987
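The four scores follow directly from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with 0 = benign and 1 = malignant the matrix reads [[TN, FP], [FN, TP]]. The short sketch below recomputes the metrics by hand from the reported counts.

```python
# Counts taken from the confusion matrix reported above
tn, fp = 74, 0   # true benign: predicted benign / predicted malignant
fn, tp = 1, 39   # true malignant: predicted benign / predicted malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 113 of 114 correct
precision = tp / (tp + fp)                           # no false positives
recall = tp / (tp + fn)                              # 39 of 40 malignant cases found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

The single error is a false negative, a malignant case predicted as benign, which is why recall is below 1 while precision is exactly 1.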