Data Mining – Implementation of KNN Algorithm


Article Directory

1. Introduction to the k-Nearest Neighbor Classification Algorithm

2. Characteristics of k-NN

3. Pseudocode of the KNN Algorithm

4. Python Implementation of the KNN Algorithm


1. Introduction to the k-Nearest Neighbor Classification Algorithm

The k-Nearest Neighbor (KNN) classification algorithm is theoretically mature and one of the simplest machine learning algorithms. Its idea is: in the feature space, if most of the k samples nearest to a given sample (that is, its nearest neighbors in the feature space) belong to a certain category, then that sample also belongs to this category.

As shown in Figure 1, there are two types of sample data, represented by small blue squares and small red triangles; the point marked by the green circle in the middle of the figure is the data to be classified. In other words, we do not yet know which category the green point belongs to (blue square or red triangle), and the problem to solve is: how do we classify this green circle?

[Figure 1: two classes of samples (blue squares and red triangles), with a green circle marking the point to be classified]

We often say that birds of a feather flock together and that people are known by the company they keep. To judge a person's character, we often start with the friends around them. We want to determine which class the green circle in Figure 1 belongs to, so let us start with its neighbors. But how many neighbors should we look at? From Figure 1 you can see:

  • If K = 3, the 3 nearest neighbors of the green point are 2 red triangles and 1 blue square. By majority vote, the green point is assigned to the red triangle class.

  • If K = 5, the 5 nearest neighbors of the green point are 2 red triangles and 3 blue squares. Again by majority vote, the green point is assigned to the blue square class.

Here we see that when we cannot directly decide which of the known classes the point to be classified belongs to, we can, following statistical theory, look at its neighborhood, weigh the votes of its neighbors, and assign it to the class with the greater weight. This is the core idea of the k-nearest neighbor algorithm; a minimal sketch of the majority vote is shown below.
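To make the K = 3 versus K = 5 example concrete, here is a minimal sketch of the majority vote; the neighbor labels are hypothetical, taken from the Figure 1 scenario and already sorted by distance:

from collections import Counter

# Hypothetical labels of the neighbors from Figure 1, sorted from nearest to farthest
neighbors = ['triangle', 'triangle', 'square', 'square', 'square']

for k in (3, 5):
    votes = Counter(neighbors[:k])          # count labels among the k nearest
    label, count = votes.most_common(1)[0]  # majority vote
    print(f"K={k}: {dict(votes)} -> classified as {label}")
# K=3: {'triangle': 2, 'square': 1} -> classified as triangle
# K=5: {'triangle': 2, 'square': 3} -> classified as square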

Steps:

1: Let k be the number of nearest neighbors and D be the set of training samples

2: for each test sample z = (x', y') do

3:   Compute the distance d(x', x) between z and every training sample (x, y) ∈ D

4:   Select D_z, the set of the k training samples in D closest to z

5:   y' = argmax_v Σ_{(xi, yi) ∈ Dz} 1(v = yi)   (majority vote)

6: end for

  • Distance-weighted voting:

    y' = argmax_v Σ_{(xi, yi) ∈ Dz} wi × 1(v = yi),  where the weight wi = 1 / d(x', xi)²

  • Here v is a class label, yi is the class label of one of the nearest neighbors, and 1(·) is the indicator function, which returns 1 when its argument is true and 0 otherwise.
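The implementation in Section 4 below uses the plain majority vote; as a hedged sketch, the distance-weighted variant described above could be implemented like this (the function and variable names are illustrative, not part of the original code):

import numpy as np

def knn_weighted_vote(train_X, train_y, test_x, k):
    # Distance-weighted KNN vote: each of the k nearest neighbors votes with weight 1/d^2.
    # train_X: (M, N) array, train_y: length-M label array, test_x: length-N array
    dists = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of the k nearest samples
    weights = 1.0 / (dists[nearest] ** 2 + 1e-12)           # small constant avoids division by zero
    votes = {}
    for idx, w in zip(nearest, weights):
        label = train_y[idx]
        votes[label] = votes.get(label, 0.0) + w            # accumulate weighted votes per class
    return max(votes, key=votes.get)                        # class with the largest total weight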

2. Characteristics of k-NN

(1) It is instance-based learning

  • A proximity metric is needed to determine the similarity or distance between instances

(2) There is no need to build a model, but it is very expensive to classify a test sample

  • The distance to all training instances needs to be calculated

(3) Because prediction is based only on local information, it is very sensitive to noise

(4) The nearest neighbor classifier can generate a decision boundary of any shape

  • Decision trees and rule-based classifiers, by contrast, typically produce rectilinear (axis-parallel) decision boundaries

(5) Appropriate proximity metrics and data preprocessing are required

  • This prevents the proximity measure from being dominated by a single attribute

(6) Selection of k value:

  • Sensitive to noisy points if k is too small
  • If k is too large, the neighborhood may contain many points of other classes

(7) Scaling problem (standardization)

  • Attributes may need to be standardized to prevent the distance metric from being dominated by attributes with large value ranges; a small sketch of this effect follows below
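As a quick illustration of points (5) and (7), the sketch below uses made-up two-attribute data in which one attribute spans a much larger range than the other; without standardization that attribute dominates the Euclidean distance:

import numpy as np

# Hypothetical samples: attribute 0 lies in [0, 1], attribute 1 in the thousands
a = np.array([0.1, 5000.0])
b = np.array([0.9, 5010.0])   # far from a in attribute 0, close in attribute 1
c = np.array([0.1, 5200.0])   # identical to a in attribute 0, farther in attribute 1

print(np.linalg.norm(a - b))  # ~10.0: attribute 0 barely affects the distance
print(np.linalg.norm(a - c))  # ~200.0

# Z-score standardization, the same idea as normalize() in Section 4
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(Z[0] - Z[1]))  # ~2.1
print(np.linalg.norm(Z[0] - Z[2]))  # ~2.2: after scaling, attribute 0 matters as much as attribute 1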

3. Pseudocode of the KNN Algorithm

Input parameters: k (the number of neighbors), trainingSamples (the training data set, an M×N matrix where M is the number of samples and N is the number of attributes), trainingLabels (the class labels of the training data, 0, 1, 2, …, an M×1 matrix), testingSample (the test data, a 1×N matrix)
Output parameter: class (the class label predicted for the test data)

Algorithm flow:
1. Get the size M, N of the training data set trainingSamples
2. Initialize the Distance array (M×1), used to store the distance between each training sample and the test sample

3. For each training sample trainingSamples(i, :) [for i in range(M)], calculate its distance to the test sample testingSample and store it in Distance[i]
4. Sort the Distance array in ascending order [argsort function]
5. Take the indices corresponding to the k smallest distances and assign the class labels of those training samples to labs
6. Get the unique elements of the labs array and store them in the array All_labs [unique function]

7. Get the number of unique labels, LabNum (the size of All_labs)
8. [for i in range(LabNum)] For each unique class label All_labs[i], count how many of the k nearest labels in labs equal All_labs[i], and use this count as the vote count Vote[i] for class i

9. Find the index ind of the maximum value in Vote
10. All_labs[ind] is the class label with the most votes, which is the output result class of the algorithm

4. Python Implementation of the KNN Algorithm

KNN_Classify_E:

import numpy as np
import math

def KNN_Classify_E(trainingSamples, trainingLabels, testingSample, k):
    M = trainingSamples.shape[0]
    N = trainingSamples.shape[1]
    Distance = np.zeros((M, 1))
    # Step 3: distance between the test sample and every training sample
    for i in range(M):
        training = trainingSamples[i, :]
        Distance[i] = dist_E(training, testingSample)
    # Step 4: argsort along the columns (axis=0) gives indices in ascending order of distance
    ind = np.argsort(Distance, axis=0)
    # Step 5: class labels of the k nearest training samples
    labs = trainingLabels[ind[:k]]
    labs = np.array(labs)  # convert from mat to array, otherwise unique misbehaves
    All_labs = np.unique(labs)
    LabNum = All_labs.size
    Vote = np.zeros((LabNum, 1))
    # Steps 7-8: count the votes for each unique class label
    for i in range(LabNum):
        vect = labs[labs == All_labs[i]]
        Vote[i] = vect.size
    # Steps 9-10: the class with the most votes is the prediction
    ind = Vote.argmax(0)
    c = All_labs[ind]
    return c

def dist_E(vect1, vect2):
    # Euclidean distance between two vectors of equal length
    dist = -1
    if vect1.size != vect2.size:
        print('length of input vectors must agree')
    else:
        t = np.multiply((vect1 - vect2), (vect1 - vect2))
        dist = math.sqrt(t.sum())
    return dist
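
Before running it on the diabetes data, KNN_Classify_E can be sanity-checked on a tiny made-up data set (the values below are invented purely for illustration):

import numpy as np
from KNN_Classify_E import KNN_Classify_E

# Two made-up classes: class 0 near the origin, class 1 near (5, 5)
trainingSamples = np.array([[0.0, 0.0],
                            [0.5, 0.5],
                            [5.0, 5.0],
                            [5.5, 4.5]])
trainingLabels = np.array([0, 0, 1, 1])
testingSample = np.array([0.2, 0.3])

# The 3 nearest neighbors are two class-0 points and one class-1 point,
# so the majority vote predicts class 0
print(KNN_Classify_E(trainingSamples, trainingLabels, testingSample, k=3))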

TestE:

import numpy as np
from KNN_Classify_E import *


def classify_data(Tr_file_path, Tst_file_path):
    # Load the comma-separated training and test sets
    Tr = np.loadtxt(Tr_file_path, delimiter=",", dtype="double")
    Tst = np.loadtxt(Tst_file_path, delimiter=",", dtype="double")

    Tr = np.mat(Tr)
    Tst = np.mat(Tst)

    # The last column holds the class label, the remaining columns are attributes
    trAttr = Tr[:, :-1]
    trLabels = Tr[:, -1]
    tstAttr = Tst[:, :-1]
    tstLabels = Tst[:, -1]

    # Z-score standardization so that no attribute dominates the distance
    trAttr = normalize(trAttr)
    tstAttr = normalize(tstAttr)

    k = 3

    # Classify each test sample and compute the accuracy
    predictlabel = np.zeros((tstLabels.size, 1))
    for i in range(tstLabels.size):
        predictlabel[i, 0] = KNN_Classify_E(trAttr, trLabels, tstAttr[i, :], k)

    predict_right = predictlabel[predictlabel == tstLabels]
    acc = predict_right.size / predictlabel.size
    return acc


def normalize2(Samples):
    # Vectorized z-score standardization (equivalent to normalize below)
    meanValue = np.mean(Samples, axis=0)
    stdValue = np.std(Samples, axis=0)
    Samples2 = (Samples - meanValue) / stdValue
    return Samples2


def normalize(Samples):
    # Column-wise z-score standardization: (x - mean) / std for each attribute
    Samples = np.mat(Samples)
    M = Samples.shape[0]
    N = Samples.shape[1]
    Samples2 = np.mat(np.zeros((M, N)))
    for i in range(N):
        allAtr = Samples[:, i]
        STD = allAtr.std()
        MEAN = allAtr.mean()
        x = (allAtr - MEAN) / STD
        Samples2[:, i] = x
    return Samples2


if __name__ == "__main__":
    Tr_file_path = '0data/diabets_Tr.csv'
    Tst_file_path = '0data/diabets_Tst.csv'
    acc = classify_data(Tr_file_path, Tst_file_path)
    print(acc)

diabetes_Tr (training data) and diabetes_Tst (test data): CSV files in which each row is one sample, the columns before the last hold the attribute values, and the last column holds the class label. (Screenshots of the two data files are omitted here.)