K-nearest neighbor algorithm (KNN)
The KNN algorithm is an instance-based learning method that classifies new samples using a set of labeled training samples. Its core idea is to measure the similarity between a test sample and the training samples, find the K most similar training samples (the "neighbors"), and assign the test sample to the category that is most common among them.
1. KNN classification process
1.1 Import required libraries
In the sample code, we use the Pandas library to read and process data, and the KNeighborsClassifier class and the accuracy_score function from the scikit-learn library to implement KNN classification and evaluate classification accuracy.
1.2 Read dataset
The sample code uses the pd.read_csv function to read the dataset file named ‘adults.txt’, which is a dataset containing personal characteristics and salary levels.
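A minimal sketch of the import and read steps. Since 'adults.txt' is not reproduced here, the snippet reads a few made-up rows from an in-memory string instead; the column names (age, education, hours_per_week, salary) are assumptions for illustration and may not match the real file.

```python
import io
import pandas as pd

# Made-up rows standing in for 'adults.txt'; the column names are assumed.
csv_text = """age,education,hours_per_week,salary
39,Bachelors,40,<=50K
50,Masters,45,>50K
23,HS-grad,30,<=50K
"""

# With the real file this would be: pd.read_csv('adults.txt')
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)
print(data.dtypes)
```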
1.3 Data preprocessing
KNN itself does not require a binary target, and scikit-learn accepts string labels directly, but encoding the two salary classes as 0 and 1 makes later processing and evaluation simpler. The example code applies a lambda function to the original target variable, mapping '<=50K' to 0 and '>50K' to 1.
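The encoding step might look like the sketch below. The Series stands in for the target column, whose real name is not given here. Stripping whitespace before comparing is an added precaution, on the assumption that raw text files of this kind can carry stray spaces.

```python
import pandas as pd

# Hypothetical target values; the real column in 'adults.txt' may differ.
salary = pd.Series(['<=50K', '>50K', ' >50K', '<=50K'])

# Strip stray whitespace first, then encode: '<=50K' -> 0, '>50K' -> 1.
target = salary.str.strip().apply(lambda s: 1 if s == '>50K' else 0)
print(target.tolist())  # [0, 1, 1, 0]
```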
1.4 Feature selection and division
According to the needs of the problem, select the appropriate feature columns and the target variable column. In the sample code, a set of features is selected as the independent variables, and the salary level is the target variable. The train_test_split function then divides the data set into a training set and a test set, with the test set taking 20% of the samples (test_size=0.2).
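A short sketch of the split on toy data. The random_state and stratify arguments are not mentioned in the text; they are added here as assumptions to make the split repeatable and to keep the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 10 samples, 2 features, balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the split
# repeatable; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```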
1.5 Encoding categorical features
KNN computes distances, so every feature must be numerical. The sample code uses one-hot encoding (pd.get_dummies) to convert the categorical features into numerical features that the KNN algorithm can use.
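One thing to watch when encoding after the split is that the training and test sets can end up with different dummy columns. The sketch below, on hypothetical frames, shows one common way to handle this by aligning the test columns to the training columns. It illustrates the idea and is not necessarily what the sample code does.

```python
import pandas as pd

# Hypothetical train/test frames with one categorical column.
X_train = pd.DataFrame({'age': [39, 50, 23],
                        'education': ['Bachelors', 'Masters', 'HS-grad']})
X_test = pd.DataFrame({'age': [31, 45],
                       'education': ['Masters', 'Doctorate']})

# One-hot encode: each category becomes its own indicator column.
X_train_enc = pd.get_dummies(X_train)
X_test_enc = pd.get_dummies(X_test)

# The test set may lack some categories or contain unseen ones, so align it
# to the training columns: missing columns are filled with 0, extras dropped.
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

print(list(X_train_enc.columns))
```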
1.6 KNN model training and prediction (only storage during training, calculation during prediction)
Create a KNeighborsClassifier object and use the training set data for model fitting. Then, use the trained model to predict the test set and get the prediction result.
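To make "only storage during training, calculation during prediction" concrete, here is a brute-force sketch of KNN classification on made-up data, checked against KNeighborsClassifier. The helper name knn_predict is hypothetical, and scikit-learn may use faster neighbor-search structures internally.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two well-separated groups (values are made up).
X_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[1.5, 1.5], [8.5, 8.5]])

def knn_predict(X_train, y_train, X_new, k=3):
    """Brute-force KNN: 'training' is just keeping X_train and y_train."""
    preds = []
    for x in X_new:
        # Euclidean distance from x to every stored training sample.
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Labels of the k closest samples, then a majority vote.
        nearest = y_train[np.argsort(dists)[:k]]
        preds.append(Counter(nearest.tolist()).most_common(1)[0][0])
    return np.array(preds)

manual = knn_predict(X_train, y_train, X_new, k=3)
sk = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_new)
print(manual, sk)  # [0 1] [0 1]
```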
1.7 Evaluate classification accuracy
Use the accuracy_score function to compare the predicted result with the real label to calculate the classification accuracy.
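Accuracy is simply the fraction of predictions that match the true labels. A small sketch with made-up label arrays shows that accuracy_score agrees with computing that fraction by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)
manual_acc = (y_true == y_pred).mean()
print(acc, manual_acc)  # 0.8 0.8
```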
1.8 Summary
Advantages:
- Simple to understand and easy to implement.
- There are no assumptions about the data distribution and it is applicable to various types of data.
- It can handle multi-classification problems.
- No explicit training phase is required: the model only stores the training data, and classification is performed directly based on similarity at prediction time.
Disadvantages:
- Computational overhead is high, especially for large-scale datasets, because the distance between each test sample and all training samples needs to be calculated.
- Sensitive to outliers and noise, which may lead to erroneous classification results. If K is too small, the result is easily affected by outliers; if K is too large, the result is easily dominated by the majority class when the classes are imbalanced.
- The data needs to be normalized; otherwise features with larger scales dominate the distance calculation.
- Sensitive to feature selection: every feature contributes to the distance, so irrelevant or poorly weighted features can have a large impact on the classification results.
- In high-dimensional feature spaces, KNN may perform poorly because of the "curse of dimensionality".
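The effect of feature scale can be illustrated with a small sketch. The data below is made up (age in years, income in dollars, with the class following age), and k=1 is chosen only to keep the example easy to trace. Without scaling, the income column dominates the distance and the prediction flips.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Made-up data: [age in years, income in dollars]; the class follows age.
X_train = np.array([[25, 50000], [27, 58000],
                    [60, 51000], [62, 59000]], dtype=float)
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[26, 51200]], dtype=float)

# Without scaling, the income column dominates the Euclidean distance.
raw = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
pred_raw = raw.predict(X_new)

# Fit the scaler on the training data only, then reuse it for new samples.
scaler = MinMaxScaler().fit(X_train)
scaled = KNeighborsClassifier(n_neighbors=1).fit(scaler.transform(X_train), y_train)
pred_scaled = scaled.predict(scaler.transform(X_new))

print(pred_raw, pred_scaled)  # [1] [0]
```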
2. KNN classification case
Labels:
- love movie
- action movie
On what basis do we decide which label a movie should get?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter magic: render plots inside the notebook
%matplotlib inline

# Use a font that can render Chinese characters, and keep the minus sign displayable
plt.rcParams['font.sans-serif'].insert(0, 'SimHei')
plt.rcParams['axes.unicode_minus'] = False

data = pd.read_excel('./res/moive.xlsx')
data
2.1 Feature Construction
X = data[['action scene', 'love scene']].copy()
y = data['movie type'].copy()
2.2 Samples to be predicted
# Movie A: [4, 21]
# Movie B: [22, 5]
X_test = np.array([[4, 21], [22, 5]])
X_test
'''
array([[ 4, 21],
       [22,  5]])
'''
2.3 Visualization of samples to be classified
plt.figure(figsize=(8, 5))
# Color each training sample by its label
plt.scatter(X['action scene'], X['love scene'], s=100,
            c=y.map({'action movie': 0, 'love movie': 1}),
            cmap=plt.cm.Accent_r)
2.4 Plot each class and the samples to be predicted separately
# Features of the action movies and of the love movies
action = X.loc[y == 'action movie']
love = X.loc[y == 'love movie']
s = 100
plt.scatter(action['action scene'], action['love scene'], s=s, c='blue', label='action movie')
plt.scatter(love['action scene'], love['love scene'], s=s, c='red', label='love movie')
# Samples to be predicted, drawn as stars
plt.scatter(X_test[:, 0], X_test[:, 1], marker='*', s=s)
plt.xlabel('action scene')
plt.ylabel('love scene')
plt.legend()
2.5 Algorithm verification
1) Import the KNN classifier
# 1. Import the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
2) Instantiate the algorithm object
# 2. Instantiate the algorithm object
knn = KNeighborsClassifier(n_neighbors=3)
3) Train the model
# 3. Train the model
knn.fit(X, y)
4) Predict the test samples
# 4. Predict the test samples
knn.predict(X_test)
# array(['love movie', 'action movie'], dtype=object)
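To see the basis for each judgment, one can inspect which training samples were the nearest neighbors. The spreadsheet is not reproduced here, so the sketch below uses a hypothetical stand-in table with illustrative scene counts; only the column names and the two test samples come from the text above.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the spreadsheet; the counts are illustrative.
data = pd.DataFrame({
    'action scene': [3, 2, 1, 101, 99, 98],
    'love scene': [104, 100, 81, 10, 5, 2],
    'movie type': ['love movie'] * 3 + ['action movie'] * 3,
})
X = data[['action scene', 'love scene']]
y = data['movie type']
# Same column names as X, so the model sees consistent feature names.
X_test = pd.DataFrame([[4, 21], [22, 5]], columns=X.columns)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# kneighbors returns the distances to, and row positions of, the k nearest
# training samples for each query row.
dist, idx = knn.kneighbors(X_test)
for row, neighbours in zip(X_test.values, idx):
    print(row, '->', y.iloc[neighbours].tolist())

pred = knn.predict(X_test)
print(pred)
```

Each prediction is the majority label among the three neighbors printed for that row.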
3. KNN regression
3.1 Principle
KNN regression follows the same proximity principle: to predict the output for a new sample, it finds the K training samples closest to it and combines their target values, typically by averaging. The underlying idea is that samples close together in feature space tend to have similar outputs (compare the Chinese proverb 近朱者赤, "one who stays near vermilion is stained red"), so only the K nearest training samples determine the prediction. Similarity is usually measured with Euclidean distance, although other distance metrics can also be used.
3.2 KNN regression application scenarios
For example, when we need to predict housing prices in a certain area, we can use KNN regression to find the prices of the K houses most similar to the target house, and estimate the price of the target house as their plain or distance-weighted average. KNN regression can also be used for forecasting time series data, where past data points serve as training samples for predicting future trends.
3.3 KNN regression case
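The difference between a plain and a weighted average can be sketched with KNeighborsRegressor's weights parameter. The data is made up (one feature, arbitrary units); with k=2 and a query at 3.5, the two neighbors lie at distances 0.5 and 1.5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D data: house size (x) and price (y) in arbitrary units.
X_train = np.array([[1.0], [2.0], [4.0], [8.0]])
y_train = np.array([10.0, 20.0, 40.0, 80.0])
X_new = np.array([[3.5]])

# Plain average of the 2 nearest targets: (40 + 20) / 2.
uniform = KNeighborsRegressor(n_neighbors=2, weights='uniform').fit(X_train, y_train)
# Inverse-distance weighting: closer neighbours count more.
weighted = KNeighborsRegressor(n_neighbors=2, weights='distance').fit(X_train, y_train)

pred_uniform = uniform.predict(X_new)[0]
pred_weighted = weighted.predict(X_new)[0]

# The weighted result by hand, using weights 1/0.5 and 1/1.5.
w = 1.0 / np.array([0.5, 1.5])
manual_weighted = (w * np.array([40.0, 20.0])).sum() / w.sum()

print(pred_uniform, pred_weighted, manual_weighted)  # about 30, 35 and 35
```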
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# Create training data
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

# Create test data
X_test = np.array([[2.5], [3.7]])

# Define different K values
k_values = [1, 3, 5]

# Perform regression prediction and visualization for each K value
for k in k_values:
    # Create the KNN regression model
    knn = KNeighborsRegressor(n_neighbors=k)
    # Fit the model
    knn.fit(X_train, y_train)
    # Make predictions
    y_pred = knn.predict(X_test)
    # Visualize the regression results
    plt.scatter(X_train, y_train, color='blue', label='Training data')
    plt.scatter(X_test, y_pred, color='red', label='Predictions (k={})'.format(k))
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('KNN Regression (k={})'.format(k))
    plt.legend()
    plt.show()
3.4 Parameter adjustment method
When using KNN regression, choosing an appropriate value of K is crucial to the quality of the predictions. Smaller K values tend to overfit, making the predictions sensitive to noise, while larger K values tend to underfit, producing overly smooth predictions. K is therefore usually chosen with a method such as cross-validation. In addition, feature standardization and the choice of distance metric need to be considered to ensure the accuracy and stability of the model.
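One way to choose K by cross-validation is a grid search, sketched below on synthetic data (a noisy sine curve). The candidate K values, the 5 folds, and the scoring metric are all assumptions made for illustration. Standardization is placed inside the pipeline so that each fold is scaled using its own training part only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a noisy sine curve (seeded so the run is repeatable).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {'kneighborsregressor__n_neighbors': [1, 3, 5, 9, 15, 25]}

# 5-fold cross-validation, scored by negative mean squared error.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

best_k = search.best_params_['kneighborsregressor__n_neighbors']
best_mse = -search.best_score_
print(best_k, best_mse)
```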
3.5 Normalization
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    # Read the raw data
    data = pd.read_csv('dating.txt', sep=',')
    print(data)
    # Rescale the three feature columns to the range [0, 1]
    transfer = MinMaxScaler(feature_range=(0, 1))
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print(data)
    return None

minmax_demo()
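For feature_range=(0, 1), min-max normalization maps each column with x' = (x - min) / (max - min). The sketch below checks this against MinMaxScaler on made-up values, since 'dating.txt' is not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values; the three columns have very different ranges.
X = np.array([[40000.0, 8.0, 1.0],
              [15000.0, 7.0, 1.5],
              [26000.0, 1.5, 0.8]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Same result by hand: x' = (x - min) / (max - min), column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))  # True
```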