K-means (K-means) algorithm

K-means (k-means, also recorded as kmeans) is one of the clustering algorithms. Due to its simple principle, strong interpretability, easy implementation, and fast convergence speed, it is widely used in data mining, cluster analysis, data clustering, It has a wide range of applications in the fields of pattern recognition, financial risk control, data science, intelligent marketing and data operations.

K-means is our most commonly used clustering algorithm based on Euclidean distance. It believes that the closer the distance between two targets, the greater the similarity.

1. K-means basics

1.1 Priest-Villager Model

K-means has a famous explanation: the priest-villager model:

Four pastors went to preach in the suburbs. At first, the pastors randomly selected a few preaching points and announced the conditions of these preaching points to all the villagers in the suburbs. So each villager went to the preaching point closest to his home. Listen to the lecture.

After listening to the class, everyone felt that the distance was too far, so each pastor counted the addresses of all the villagers in his class, moved them to the center of all addresses, and updated the location of his preaching point on the poster.

Every time the pastor moves, he cannot be closer to everyone. Some people find that after Pastor A moves, it is better to go to Pastor B to listen to the class to be closer, so each villager goes to the preaching point closest to them…

In this way, the pastor updated his location every week, and the villagers chose preaching points according to their own circumstances, and eventually the situation stabilized.

We can see that the purpose of the priest is to minimize the distance between each villager and its nearest center point.

1.2 Clustering

What is clustering?

In layman’s terms, clustering is dividing a bunch of data into different groups.

Popular definition of Clustering: Dividing a bunch of data into different groups. A type of unsupervised learning in which the resulting categories are unknown.

Academic definition of clustering: The process of dividing a collection of data objects into clusters (subsets) so that objects within clusters are similar to each other and objects between clusters are dissimilar.

1.3 Cluster classification

Among clustering algorithms, K-means and DBSCAN are commonly used, but this article focuses on K-means.

HMM is the Hidden Markov Model (HiddenMarkovModel) which is widely used in speech recognition, machine translation, Chinese word segmentation, named entity recognition, It is widely used in fields such as part-of-speech tagging and gene recognition.

1.4 Clustering algorithm based on partitioning

Academic definition: Divide data into several groups according to a certain objective, and the result of the division is to maximize (or minimize) the value of the objective function.

Popular definition: Divide sample features into several categories based on their similarity or distance.

1.4.1 Similarity

What is similarity? That is, the degree of similarity between two objects.

The principle of similarity
	Principle	Representation
Measurement based on distance	Small distance means greater similarity	Euclidean distance
Measurement based on angle	Small angle, large similarity	Cosine similarity

1.4.2 Distance

What is distance? That is the distance between two points.

2. K-means principle

K-means, where K refers to the number of classes and means refers to the mean.

2.1K-means principle

K-means is a clustering algorithm based on the partition of sample sets and is a type of unsupervised learning.

The idea of K-means:

Divide the sample set into k subsets to form k classes;
Divide n samples into k classes, and the distance between each sample and the center of the class to which it belongs is the smallest.

The assumption of K-means: a sample only belongs to one class, or the intersection of classes is the empty set.

How does K-means determine categories, and how does it determine similarity?

Through the principle of K-means algorithm, it can be seen that the essence of K-means is that birds of a feather flock together.

2.2K-means algorithm

The algorithm steps of K-means are:

Select the initialized k samples as the initial clustering center $a = a_{1},a_{2},...a_{k}$ ;
For each sample in the dataset $x_{i}$ calculate it to k distance between cluster centers and assign them to the cluster corresponding to the cluster center with the smallest distance;
For each category $a_{j}$ , recalculate its cluster Class center $a_{j} = \frac{1}{|c_{j}|}\sum _{x\in c_{j}}x$ (that is, it belongs to the centroid of all samples of the class);
Repeat the above two steps 2 and 3 until a certain termination condition is reached (number of iterations, minimum error change, etc.).

The main steps of K-means clustering algorithm:

Step 1: Initialize the clustering center: Randomly select k samples as the initial clustering center. (k needs to be determined in advance)
Step 2: Assign samples to cluster centers: Calculate the distance between each sample and each cluster center, and assign each sample to the cluster center closest to it (selection of initial center affect the clustering results)
Step 3: Move the cluster center: The new cluster center is moved to the average of all samples in this cluster (empty clusters may exist)
Step 4: Stop moving: Repeat step 2 until the cluster center no longer moves.

Note: The K-means algorithm uses an iterative method to obtain the local optimal solution.

2.2.1. How does K-means determine the K value?

K-means often determines the K value based on SSE and silhouette coefficients.

Method 1: Try different k values: Select several k values, compare the clustering effects, and choose the optimal k value.

Method 2: Combining business characteristics: Suppose you want to divide the article into three types: table tennis, basketball, and comprehensive, set k=3.

Method 3: Based on SSE and silhouette coefficient: The smaller the SSE, the better the clustering effect; the larger the silhouette coefficient, the better the clustering effect.

2.2.2. How to select the initial center point for K-means?

K-means chooses different initial centers and will get different clustering results.

K-means The K-means++ method is often used to determine the initial center point.

K-means++: Select the mutual distance between the initial cluster centers to be as far as possible.

Bipartite K-means: Select the class with the largest error and perform a binary split.

2.2.3. How does K-means handle empty clusters?

Cluster centers are not assigned to samples and are often deleted.

Empty cluster problem: K-means planned to cluster into 20 categories, but ended up clustering into 19 categories, and 1 category was empty.

Cause of empty cluster: The cluster center is not assigned to the sample.

Solution:

Method 1: Mean points of other cluster centers
Method 2: Delete the empty family
Method 3: The point farthest from the cluster center

2.3. K-means feature engineering

Neither categorical features nor large numerical features are suitable for K-means clustering.

Reason: K-means is based on distance, while categories have no distance.

k-means is obvious for outliers, such as age, amount, etc.

2.4. K-means evaluation

What kind of K-means clustering is good K-means clustering?

In practical applications, SSE (Sum of Squared Errors) is often used in combination with Silhouette Coefficient to evaluate the effect of the clustering model.

SSE: The Sum of Squared Errors is the smallest and the clustering effect is the best.

Silhouette Coefficient: The larger the silhouette coefficient, the better the clustering effect.

2.4.1. SSE

The smaller the SSE, the better the clustering effect.

2.4.2. Contour coefficient

The larger the silhouette coefficient, the better the clustering effect.

2.5 complexity

Let’s look at the pseudocode first:

Get data n pieces of m-dimensional data
Randomly generate K m-dimensional points
while(t)
    for(int i=0;i < n;i + + )
        for(int j=0;j < k;j + + )
            Calculate the distance from point i to class j
    for(int i=0;i < k;i + + )
        1. Find all data points belonging to your category
        2. Modify your own coordinates to the center point coordinates of these data points
end

Time complexity: $O(tknm)$ , where t is iteration degree, k is the number of clusters, n is the number of sample points, and m is the sample point dimension.

Space complexity: $O(m(n + k))$ , where k is the number of clusters, m is the sample point dimension, and n is the number of sample points.

3. K-means application

Python implementation of 3.1K-means

# -*- coding:utf-8 -*-
import numpy as np
from matplotlib import pyplot


class K_Means(object):
    # k is the number of groups; tolerance center point error’; max_iter is the number of iterations
    def __init__(self, k=2, tolerance=0.0001, max_iter=300):
        self.k_ = k
        self.tolerance_ = tolerance
        self.max_iter_ = max_iter

    def fit(self, data):
        self.centers_ = {}
        for i in range(self.k_):
            self.centers_[i] = data[i]

        for i in range(self.max_iter_):
            self.clf_ = {}
            for i in range(self.k_):
                self.clf_[i] = []
            # print("Particle:",self.centers_)
            for feature in data:
                # distances = [np.linalg.norm(feature-self.centers[center]) for center in self.centers]
                distances = []
                for center in self.centers_:
                    # Euler distance
                    # np.sqrt(np.sum((features-self.centers_[center])**2))
                    distances.append(np.linalg.norm(feature - self.centers_[center]))
                classification = distances.index(min(distances))
                self.clf_[classification].append(feature)

            # print("Group situation:",self.clf_)
            prev_centers = dict(self.centers_)
            for c in self.clf_:
                self.centers_[c] = np.average(self.clf_[c], axis=0)

            # Is the 'center point' within the error range?
            optimized=True
            for center in self.centers_:
                org_centers = prev_centers[center]
                cur_centers = self.centers_[center]
                if np.sum((cur_centers - org_centers) / org_centers * 100.0) > self.tolerance_:
                    optimized=False
            if optimized:
                break

    def predict(self, p_data):
        distances = [np.linalg.norm(p_data - self.centers_[center]) for center in self.centers_]
        index = distances.index(min(distances))
        return index


if __name__ == '__main__':
    x = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
    k_means = K_Means(k=2)
    k_means.fit(x)
    print(k_means.centers_)
    for center in k_means.centers_:
        pyplot.scatter(k_means.centers_[center][0], k_means.centers_[center][1], marker='*', s=150)

    for cat in k_means.clf_:
        for point in k_means.clf_[cat]:
            pyplot.scatter(point[0], point[1], c=('r' if cat == 0 else 'b'))

    predict = [[2, 1], [6, 9]]
    for features in predict:
        cat = k_means.predict(predict)
        pyplot.scatter(feature[0], feature[1], c=('r' if cat == 0 else 'b'), marker='x')

    pyplot.show()

The running results are as follows:

{0: array([1.16666667, 1.46666667]), 1: array([7.33333333, 9. ])}

Note: * is the “center point” of the two sets of data; x is the prediction point grouping.

Sklearn implementation of 3.2K-means

import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot

if __name__ == '__main__':
    x = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Divide the above data points into two groups (unsupervised learning)
clf = KMeans(n_clusters=2)
clf.fit(x) #Group

centers = clf.cluster_centers_ # Center points of two sets of data points
labels = clf.labels_ #The group to which each data point belongs
print(centers)
print(labels)

for i in range(len(labels)):
    pyplot.scatter(x[i][0], x[i][1], c=('r' if labels[i] == 0 else 'b'))
pyplot.scatter(centers[:, 0], centers[:, 1], marker='*', s=100)

# predict
predict = [[2, 1], [6, 9]]
label = clf.predict(predict)
for i in range(len(label)):
    pyplot.scatter(predict[i][0], predict[i][1], c=('r' if label[i] == 0 else 'b'), marker='x')

pyplot.show()

The running results are as follows:

[[7.33333333 9. ]
 [1.16666667 1.46666667]]
[1 1 0 0 1 0]

Note: * is the “center point” of the two sets of data; x is the prediction point grouping.

3.3. User clustering

Dataset: titanic.xls (list of Titanic victims and survivors)

Data set acquisition address of titanic.xls:

titanic.xls

Task: Group users (live/dead) using k-means based on data except survived field

Clustered user groups are often used in early stages to attempt user exploration. In actual implementation, user tags or user portraits are often used to group users.

The python code for user clustering is as follows:

# -*- coding:utf-8 -*-
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing
import pandas as pd

# Download Data
df = pd.read_excel('titanic.xls')
df.drop(['body', 'name', 'ticket'], 1, inplace=True)
df.fillna(0, inplace=True) # Replace NaN with 0

# Map strings to numbers, such as {female:1, male:0}
df_map = {}
cols = df.columns.values
for col in cols:
    if df[col].dtype != np.int64 and df[col].dtype != np.float64:
        temp = {}
        x = 0
        for ele in set(df[col].values.tolist()):
            if ele not in temp:
                temp[ele] = x
                x + = 1

        df_map[df[col].name] = temp
        df[col] = list(map(lambda val: temp[val], df[col]))

# Standardize each column of features to a standard normal distribution
x = np.array(df.drop(['survived'], 1).astype(float))
x = preprocessing.scale(x)
clf = KMeans(n_clusters=2)
clf.fit(x)

# Calculate grouping accuracy
y = np.array(df['survived'])
correct = 0
for i in range(len(x)):
    predict_data = np.array(x[i].astype(float))
    predict_data = predict_data.reshape(-1, len(predict_data))
    predict = clf.predict(predict_data)
    if predict[0] == y[i]:
        correct + = 1

print(correct * 1.0 / len(x))

Results of the:

First execution: 0.6974789915966386

Second execution: 0.3017570664629488

Note: The results fluctuate greatly because it randomly assigns groups (live: 0, dead: 1) (live: 1, dead: 0).

4. K-means summary

Advantages and disadvantages of 4.1K-means

4.1.1 Advantages

It is easy to understand that the clustering effect is good. Although it is a local optimum, the local optimum is often enough;
The principle is simple, easy to implement, and fast in convergence speed;
This algorithm can ensure better scalability when processing large data sets;
When the cluster approximates a Gaussian distribution, the effect is very good;
Algorithm complexity is low;
The clustering effect is better;
The model has strong interpretability;
Parameter adjustment only requires adjusting the number of classes k.

4.1.2 Disadvantages

The K value needs to be set manually. Different K values will give different results, and the selection of k is difficult to grasp;
Sensitive to the initial cluster center, different selection methods will yield different results;
Sensitive to outliers;
Samples can only be classified into one category and are not suitable for multi-classification tasks;
It is not suitable for classification that is too discrete, classification of unbalanced sample categories, and classification of non-convex shapes;
Clustering will not perform well if the types of data are unbalanced
An iterative method is used to obtain the local optimal solution.
Sensitive to noise and outliers

What is a convex set?

convex set

A subset of Euclidean space in which a line connecting any two points still falls entirely within the subset. For example, the two graphs below are both convex sets:

Therefore, a convex data set, that is, the samples of the data set exhibit a convex set distribution.

In contrast, neither of the following two graphs is convex:

Therefore, the data set is not convex, that is, the samples of the data set do not present a convex set distribution.

Improvements of 4.2K-means

In response to the shortcomings of K-means, K-means has many improved algorithms, such as data preprocessing (removing outliers), reasonable selection of K values, high-dimensional mapping, etc.

K-means improvements
Disadvantages	Improvements
1. The selection of k is not good Grasp	ISODATA
2. Sensitive to the initial clustering center	k-means + +
3. It is difficult to converge for data sets that are not convex	DESCAN
4. If the type of data is unbalanced, then The clustering effect is not good	CUSBoost
5. It uses an iterative method to obtain the local optimal solution	Bisection K -means
6. Sensitive to noise and outliers	K-Mediods

[Machine Learning] K-means (very detailed) – Zhihu (zhihu.com)https://zhuanlan.zhihu.com/p/ 78798251

Kmeans + + clustering algorithm principle and implementation – Zhihu (zhihu.com)https://zhuanlan.zhihu.com/p/152286357

4.3The difference between clustering and classification

What is the difference between clustering and classification?

The biggest difference is: clustering is unsupervised; classification is supervised learning.

The classification of machine learning can be divided into binary classification (Binary Classification), multi-class classification (Multi-Class Classification) and multi-label classification (Multi-Label Classification) according to the output category (label).

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 57237 people are learning the system