Red Wine Dataset Classification with the K-Nearest Neighbor Algorithm

Table of Contents

  • 1. About the author
  • 2. Introduction to K-Nearest Neighbor Algorithm
    • 2.1 Fundamentals of the KNN algorithm
    • 2.2 Advantages and disadvantages of the algorithm
  • 3. KNN red wine dataset classification experiment
    • 3.1 Get the red wine dataset
    • 3.2 KNN algorithm
    • 3.3 Complete code
  • 4. Problem Analysis
  • Reference links

1. About the author

Lu Zhidong, male, School of Electronic Information, Xi’an Polytechnic University, 2022 graduate student, Zhang Hongwei Artificial Intelligence Research Group
Research direction: machine vision and artificial intelligence
Email: [email protected]

2. Introduction to K-Nearest Neighbor Algorithm

2.1 Fundamentals of the KNN algorithm

Principle: if the majority of the k samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then the sample is assigned to that category. Put simply, KNN measures the distances between points and assigns the data we want to predict to the category of whichever labeled points lie closest.

A picture helps. Suppose blue points are samples of class a and pink points are samples of class b. A new point (yellow) now arrives; how do we judge whether it belongs to class a or class b?

The method: the new point finds the k points closest to itself (k is adjustable). Compute the distance from the new point to every other point, sort the distances from smallest to largest, and take the k nearest points. Among those k points, count how many belong to class a and how many to class b; the new point is assigned to whichever class holds the majority. The distance formula is the familiar Euclidean distance, i.e., the Pythagorean theorem in two dimensions.
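To make the procedure concrete, here is a minimal NumPy sketch of this majority-vote logic (my own illustration, not code from the experiment below; the function name and toy data are made up):

import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((train_x - new_point) ** 2).sum(axis=1))
    # Indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest points
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Toy data: class 'a' clustered near (0, 0), class 'b' near (5, 5)
train_x = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
train_y = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
print(knn_predict(train_x, train_y, np.array([4.5, 5.0]), k=3))  # -> 'b'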

2.2 Advantages and disadvantages of the algorithm

Advantages of the algorithm: simple and easy to understand, no parameters to estimate, and no training phase is required. It is suitable for data volumes ranging from thousands to tens of thousands of samples.

Disadvantages of the algorithm: classifying a test sample requires computing its distance to every training sample, so the computational and memory overhead is large, and the value of k must be tuned repeatedly to achieve the best result. If k is too small, the prediction is easily swayed by outliers (overfitting); if k is too large, distant and irrelevant samples get a vote (underfitting), which hurts accuracy.

3. KNN red wine dataset classification experiment

3.1 Get the red wine dataset

First import sklearn’s built-in dataset module and load the red wine data into the variable wine. load_wine() returns a Bunch object, so wine_data receives all the feature data, an array of 178 rows and 13 columns (one column per feature), and wine_target receives all the target values. The target values (red wine categories) in this dataset are 0, 1, and 2.
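For orientation, one can peek at what the Bunch contains (a quick check I added; it is not part of the original walkthrough):

from sklearn import datasets
wine = datasets.load_wine()
print(wine.data.shape)     # (178, 13): 178 samples, 13 features
print(wine.feature_names)  # names of the 13 features (alcohol, malic_acid, ...)
print(wine.target_names)   # ['class_0' 'class_1' 'class_2']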

Then convert the data we need into DataFrame form. To make the prediction more general, we shuffle the dataset. The operation is as follows:

import pandas as pd
from sklearn import datasets

wine = datasets.load_wine()  # get the wine data
wine_data = wine.data  # feature data, 178 rows and 13 columns
wine_target = wine.target  # classification target values

# Convert the feature data to DataFrame type
wine_data = pd.DataFrame(data=wine_data)

# Insert wine_target as the first column and name its column index 'class'
wine_data.insert(0, 'class', wine_target)

# ==1== df.sample(frac=1) shuffles and reorders the rows
# ==2== df.reset_index(drop=True) makes the index start from 0 again
wine = wine_data.sample(frac=1).reset_index(drop=True)  # shuffle the row order of the DataFrame
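To confirm the shuffle worked (a quick check I added), inspect the resulting frame:

print(wine.shape)   # (178, 14): the 'class' column plus 13 feature columns
print(wine.head())  # first five rows, now in shuffled order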

3.2 KNN algorithm

Generally, 75% of the data is used for training and 25% for testing, so the data must be split before any prediction is made.

Division method:
Use the sklearn.model_selection.train_test_split function to split the data.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=ratio)

Parameters of train_test_split():
x: dataset feature values (features)
y: dataset target values (targets)
test_size: the proportion of test data, expressed as a decimal; for example, 0.25 means 75% for training and 25% for testing.

Return values of train_test_split():
x_train: feature values of the training portion
x_test: feature values of the test portion
y_train: target values of the training portion
y_test: target values of the test portion

# Split into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.25)
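As a sanity check (my addition; features and targets here are the 168 rows left after the 10 hold-out rows are removed in the complete code below), the split sizes can be verified:

print(x_train.shape, x_test.shape)  # (126, 13) (42, 13) when splitting 168 rows at test_size=0.25
print(y_train.shape, y_test.shape)  # (126,) (42,)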

3.3 Complete code

import pandas as pd
from sklearn import datasets

wine = datasets.load_wine()  # get the wine data
wine_data = wine.data  # feature data, 178 rows and 13 columns
wine_target = wine.target  # classification target values

wine_data = pd.DataFrame(data=wine_data)  # convert to DataFrame type
# Insert the target values as the first column, named 'class'
wine_data.insert(0, 'class', wine_target)

# ==1== df.sample(frac=1) shuffles and reorders the rows
# ==2== df.reset_index(drop=True) makes the index start from 0 again (this step can be omitted)
wine = wine_data.sample(frac=1).reset_index(drop=True)

# Hold out 10 rows for verification
wine_predict = wine[-10:].reset_index(drop=True)
wine_predict_feature = wine_predict.drop('class', axis=1)  # feature values to pass to predict()
wine_predict_target = wine_predict['class']  # target values, to compare against the final predictions

wine = wine[:-10]  # remove the last 10 rows
features = wine.drop(columns=['class'])  # drop the class column; what remains are the feature values
targets = wine['class']  # 'class' holds the target values
# In other words, 13 feature values correspond to 1 target


# Split into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.25)

# Standardize before predicting
from sklearn.preprocessing import StandardScaler  # import the standardization scaler
scaler = StandardScaler()  # scaler holds the standardization method

# Pass in the feature values for standardization
x_train = scaler.fit_transform(x_train)  # fit the scaler on the training feature values and standardize them
x_test = scaler.transform(x_test)  # standardize the test feature values with the training-set statistics
wine_predict_feature = scaler.transform(wine_predict_feature)

# Classify using the K-nearest neighbor algorithm
from sklearn.neighbors import KNeighborsClassifier  # import the k-nearest neighbor classifier
# k-nearest neighbor classifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='auto')

# Train: pass in the training feature values and training target values
knn.fit(x_train, y_train)
# Check the accuracy of the model: score() predicts from x_test,
# compares the predictions with the true y_test, and returns the accuracy
accuracy = knn.score(x_test, y_test)
# Predict: pass in the feature values of the 10 held-out rows
result = knn.predict(wine_predict_feature)

# Compare predictions with the held-out true labels
print('accuracy:', accuracy)
print('predicted:', result)
print('actual:   ', wine_predict_target.values)
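Since section 2.2 noted that k must be tuned, a small extension (my own sketch, not part of the original experiment) is to score several k values on the test split and keep the best:

# Try several k values and keep the one with the highest test accuracy
best_k, best_score = 1, 0.0
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    if score > best_score:
        best_k, best_score = k, score
print('best k:', best_k, 'accuracy:', best_score)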

4. Problem Analysis

If you run into an incomplete library installation (typically an ImportError or ModuleNotFoundError when running the code), check whether the relevant libraries are installed (for example, pip install scikit-learn pandas) and whether you are running in the correct environment.

Reference links

https://blog.csdn.net/dgvv4/article/details/121316823