K-nearest neighbor algorithm (KNN)

The KNN algorithm is an instance-based learning method that uses a set of labeled training samples for classification. Its core idea is to measure the similarity between samples and assign a test sample to the majority class among its K most similar neighbors in the training set.

1. KNN classification process

1.1 Import required libraries

In the sample code, we use the Pandas library to read and process data, and the KNeighborsClassifier and accuracy_score functions in the scikit-learn library to implement KNN classification and evaluate classification accuracy.
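As a minimal sketch, the imports for the whole workflow might look like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score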

1.2 Read dataset

The sample code uses the pd.read_csv function to read the dataset file named 'adults.txt', a dataset of personal characteristics and salary levels.
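A sketch of this step, assuming 'adults.txt' is comma-separated with a header row:

data = pd.read_csv('adults.txt')
data.head()  # inspect the first few rows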

1.3 Data preprocessing

The salary label in this dataset is categorical text. The example code converts it into a binary target variable by applying a lambda function, mapping '<=50K' to 0 and '>50K' to 1.
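A sketch of the conversion; the column name 'salary' is an assumption about the dataset:

# map '<=50K' to 0 and '>50K' to 1 (the 'salary' column name is assumed)
data['salary'] = data['salary'].apply(lambda s: 0 if s == '<=50K' else 1)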

1.4 Feature selection and division

According to the needs of the problem, select the appropriate feature columns and the target column. In the sample code, a set of personal characteristics is selected as the independent variables, and the target variable is the salary level. Then the train_test_split function divides the dataset into a training set and a test set, with the test set holding 20% of the samples (test_size=0.2).
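A sketch of this step; the feature list below is illustrative, not necessarily the exact columns used in the original code:

# illustrative feature columns from the adult dataset
features = ['age', 'education', 'occupation', 'hours_per_week']
X = data[features]
y = data['salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)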

1.5 Feature encoding

The sample code uses one-hot encoding (pd.get_dummies) to convert the categorical features into numerical indicator columns that the KNN distance computation can work with.
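A sketch of the encoding; applying pd.get_dummies to the full feature table before splitting keeps the train and test columns aligned:

X_encoded = pd.get_dummies(X)  # expand each categorical column into 0/1 indicator columns
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)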

1.6 KNN model training and prediction (training only stores the samples; distances are computed at prediction time)

Create a KNeighborsClassifier object and use the training set data for model fitting. Then, use the trained model to predict the test set and get the prediction result.
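A sketch of this step; the choice of n_neighbors=5 is illustrative:

knn = KNeighborsClassifier(n_neighbors=5)  # K=5 is an illustrative choice
knn.fit(X_train, y_train)                  # "training" only stores the samples
y_pred = knn.predict(X_test)               # distances are computed here, at prediction time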

1.7 Evaluate classification accuracy

Use the accuracy_score function to compare the predicted result with the real label to calculate the classification accuracy.
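As a sketch:

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: {:.4f}'.format(accuracy))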

1.8 Summary

Advantages:

  • Simple to understand and easy to implement.
  • There are no assumptions about the data distribution and it is applicable to various types of data.
  • It can handle multi-classification problems.
  • No training process is required, and classification is performed directly based on similarity.

Disadvantages:

  • Computational overhead is high, especially for large-scale datasets, where distances between test samples and all training samples need to be calculated.

  • Sensitive to outliers and noise, which can lead to erroneous classifications. If the K value is too small, the result is easily swayed by outliers; if the K value is too large, it is easily affected by class imbalance.

  • The data needs to be normalized to avoid the influence of different feature scales on the results.

  • Sensitive to feature selection; the relative weight of different features can have a large impact on the classification result.

  • In high-dimensional feature spaces, the "curse of dimensionality" makes distances less informative, so KNN's classification performance may be poor.

2. KNN classification case

Labels

  • love movie
  • action movie

Each movie is described by two features: the number of action scenes and the number of love scenes. On what basis do we decide which label a new movie should get?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# configure matplotlib so Chinese characters display correctly
plt.rcParams['font.sans-serif'].insert(0, 'SimHei')
plt.rcParams['axes.unicode_minus'] = False
data = pd.read_excel('./res/moive.xlsx')
data

2.1 Feature Construction

X = data[['action scene','love scene']].copy()
y = data['movie type'].copy()

2.2 Samples to be predicted

# Movie A [4,21]
# Movie B [22,5]
X_test = np.array([[4,21],[22,5]])
X_test
'''
array([[ 4, 21],
       [22, 5]])
'''

2.3 Visualization of samples to be classified

plt.figure(figsize=(8, 5))
plt.scatter(X['action scene'], X['love scene'], s=100,
            c=y.map({'action movie': 0, 'love movie': 1}), cmap=plt.cm.Accent_r)

2.4 Redraw, plotting each class separately

# split the samples by class
action = X.loc[y == 'action movie']
love = X.loc[y == 'love movie']
s = 100
plt.scatter(action['action scene'], action['love scene'], s=s, c='blue', label='action movie')
plt.scatter(love['action scene'], love['love scene'], s=s, c='red', label='love movie')
plt.scatter(X_test[:, 0], X_test[:, 1], marker='*', s=s)  # samples to classify
plt.xlabel('action scene')
plt.ylabel('love scene')
plt.legend()

2.5 Algorithm verification

1) Import knn classifier

# 1. Import knn classifier
from sklearn.neighbors import KNeighborsClassifier

2) Instantiate the algorithm object

# 2. Instantiate the algorithm object
knn = KNeighborsClassifier(n_neighbors=3)

3) Training model

# 3. Training model
knn.fit(X, y)

4) Predict the test samples

# 4. Predict the test samples
knn.predict(X_test)  # array(['love movie', 'action movie'], dtype=object)

3. KNN regression

3.1 Principle

Based on the principle of proximity, the KNN regression algorithm predicts the output value of a new sample from the K training samples closest to it, typically by averaging their target values. Its core idea follows the Chinese idiom "he who stays near vermilion is stained red": the K training samples closest to the new sample have the greatest influence on the prediction. In KNN regression, we usually use Euclidean distance or another distance measure to quantify the similarity between samples.
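As a tiny worked sketch of this idea with made-up one-dimensional data: for K=3, the prediction is just the mean target of the three nearest training samples.

import numpy as np

X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # 1-D training inputs
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # training targets
x_new = 2.6
k = 3

dist = np.abs(X_train - x_new)   # Euclidean distance reduces to |difference| in 1-D
nearest = np.argsort(dist)[:k]   # indices of the 3 closest inputs: 3.0, 2.0, 4.0
y_hat = y_train[nearest].mean()  # (6 + 4 + 8) / 3 = 6.0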

3.2 KNN regression application scenarios

For example, when we need to predict housing prices in a certain area, we can use KNN regression to find the prices of the K houses most similar to the target house and estimate its price by taking their average or a distance-weighted average. KNN regression can also be used to forecast time series data, where past data points serve as training samples for predicting future trends.
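A sketch of the weighted-average idea using scikit-learn's weights='distance' option, so that closer houses influence the estimate more; the house data below is made up for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# hypothetical features: [area in square meters, distance to city center in km]
X_houses = np.array([[80, 5], [120, 3], [60, 8], [100, 4], [90, 6]])
prices = np.array([300, 520, 210, 450, 340])  # made-up prices, in thousands

model = KNeighborsRegressor(n_neighbors=3, weights='distance')
model.fit(X_houses, prices)
print(model.predict([[95, 5]]))  # distance-weighted estimate for a new house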

3.3 KNN regression case

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# create training data
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

# Create test data
X_test = np.array([[2.5], [3.7]])

# Define different K values
k_values = [1, 3, 5]

# Perform regression prediction and visualization for each K value
for k in k_values:
    # Create KNN regression model
    knn = KNeighborsRegressor(n_neighbors=k)
    # fit the model
    knn.fit(X_train, y_train)
    # Make predictions
    y_pred = knn.predict(X_test)
    # Visualize the regression results for this K
    plt.scatter(X_train, y_train, color='blue', label='Training data')
    plt.scatter(X_test, y_pred, color='red', label='Predictions (k={})'.format(k))
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('KNN Regression (k={})'.format(k))
    plt.legend()
    plt.show()

3.4 Parameter adjustment method

When using KNN regression, choosing an appropriate value of K is crucial to the quality of the predictions. A K that is too small tends to overfit and makes the model sensitive to noise, while a K that is too large tends to underfit and produces overly smooth predictions. We therefore choose the best K by methods such as cross-validation. In addition, feature standardization and the choice of distance measure need to be considered to ensure the accuracy and stability of the model.
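A sketch of selecting K by cross-validation with scikit-learn's GridSearchCV; the candidate values are illustrative, and X_train, y_train stand for a reasonably sized training set:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}  # illustrative candidate K values
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                    scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_)  # the K with the best cross-validated score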

3.5 Normalization

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    # read the dating dataset (comma-separated, with a header row)
    data = pd.read_csv('dating.txt', sep=',')
    print(data)
    # scale each selected feature to the [0, 1] range
    transfer = MinMaxScaler(feature_range=(0, 1))
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print(data)

minmax_demo()