K-means (k-means, also recorded as kmeans) is one of the clustering algorithms. Due to its simple principle, strong interpretability, easy implementation, and fast convergence speed, it is widely used in data mining, cluster analysis, data clustering, It has a wide range of applications in the fields of pattern recognition, financial risk control, data science, intelligent marketing and data operations.
K-means is our most commonly used clustering algorithm based on Euclidean distance. It believes that the closer the distance between two targets, the greater the similarity.
1. K-means basics
1.1 Priest-Villager Model
K-means has a famous explanation: the priest-villager model:
Four pastors went to preach in the suburbs. At first, the pastors randomly selected a few preaching points and announced the conditions of these preaching points to all the villagers in the suburbs. So each villager went to the preaching point closest to his home. Listen to the lecture.
After listening to the class, everyone felt that the distance was too far, so each pastor counted the addresses of all the villagers in his class, moved them to the center of all addresses, and updated the location of his preaching point on the poster.
Every time the pastor moves, he cannot be closer to everyone. Some people find that after Pastor A moves, it is better to go to Pastor B to listen to the class to be closer, so each villager goes to the preaching point closest to them…
In this way, the pastor updated his location every week, and the villagers chose preaching points according to their own circumstances, and eventually the situation stabilized.
We can see that the purpose of the priest is to minimize the distance between each villager and its nearest center point.
1.2 Clustering
What is clustering?
In layman’s terms, clustering is dividing a bunch of data into different groups.
Popular definition of Clustering: Dividing a bunch of data into different groups. A type of unsupervised learning in which the resulting categories are unknown.
Academic definition of clustering: The process of dividing a collection of data objects into clusters (subsets) so that objects within clusters are similar to each other and objects between clusters are dissimilar.
1.3 Cluster classification
Among clustering algorithms, K-means and DBSCAN are commonly used, but this article focuses on K-means.
HMM is the Hidden Markov Model (HiddenMarkovModel) which is widely used in speech recognition, machine translation, Chinese word segmentation, named entity recognition, It is widely used in fields such as part-of-speech tagging and gene recognition.
1.4 Clustering algorithm based on partitioning
Academic definition: Divide data into several groups according to a certain objective, and the result of the division is to maximize (or minimize) the value of the objective function.
Popular definition: Divide sample features into several categories based on their similarity or distance.
1.4.1 Similarity
What is similarity? That is, the degree of similarity between two objects.
Principle | Representation | |
---|---|---|
Measurement based on distance | Small distance means greater similarity | Euclidean distance |
Measurement based on angle | Small angle, large similarity | Cosine similarity |
1.4.2 Distance
What is distance? That is the distance between two points.
2. K-means principle
K-means, where K refers to the number of classes and means refers to the mean.
2.1K-means principle
K-means is a clustering algorithm based on the partition of sample sets and is a type of unsupervised learning.
The idea of K-means:
- Divide the sample set into k subsets to form k classes;
- Divide n samples into k classes, and the distance between each sample and the center of the class to which it belongs is the smallest.
The assumption of K-means: a sample only belongs to one class, or the intersection of classes is the empty set.
How does K-means determine categories, and how does it determine similarity?
Through the principle of K-means algorithm, it can be seen that the essence of K-means is that birds of a feather flock together.
2.2K-means algorithm
The algorithm steps of K-means are:
- Select the initialized k samples as the initial clustering center;
- For each sample in the dataset calculate it to k distance between cluster centers and assign them to the cluster corresponding to the cluster center with the smallest distance;
- For each category, recalculate its cluster Class center (that is, it belongs to the centroid of all samples of the class);
- Repeat the above two steps 2 and 3 until a certain termination condition is reached (number of iterations, minimum error change, etc.).
The main steps of K-means clustering algorithm:
- Step 1: Initialize the clustering center: Randomly select k samples as the initial clustering center. (k needs to be determined in advance)
- Step 2: Assign samples to cluster centers: Calculate the distance between each sample and each cluster center, and assign each sample to the cluster center closest to it (selection of initial center affect the clustering results)
- Step 3: Move the cluster center: The new cluster center is moved to the average of all samples in this cluster (empty clusters may exist)
- Step 4: Stop moving: Repeat step 2 until the cluster center no longer moves.
Note: The K-means algorithm uses an iterative method to obtain the local optimal solution.
2.2.1. How does K-means determine the K value?
K-means often determines the K value based on SSE and silhouette coefficients.
Method 1: Try different k values: Select several k values, compare the clustering effects, and choose the optimal k value.
Method 2: Combining business characteristics: Suppose you want to divide the article into three types: table tennis, basketball, and comprehensive, set k=3.
Method 3: Based on SSE and silhouette coefficient: The smaller the SSE, the better the clustering effect; the larger the silhouette coefficient, the better the clustering effect.
2.2.2. How to select the initial center point for K-means?
K-means chooses different initial centers and will get different clustering results.
K-means The K-means++ method is often used to determine the initial center point.
K-means++: Select the mutual distance between the initial cluster centers to be as far as possible.
Bipartite K-means: Select the class with the largest error and perform a binary split.
2.2.3. How does K-means handle empty clusters?
Cluster centers are not assigned to samples and are often deleted.
Empty cluster problem: K-means planned to cluster into 20 categories, but ended up clustering into 19 categories, and 1 category was empty.
Cause of empty cluster: The cluster center is not assigned to the sample.
Solution:
- Method 1: Mean points of other cluster centers
- Method 2: Delete the empty family
- Method 3: The point farthest from the cluster center
2.3. K-means feature engineering
Neither categorical features nor large numerical features are suitable for K-means clustering.
Reason: K-means is based on distance, while categories have no distance.
k-means is obvious for outliers, such as age, amount, etc.
2.4. K-means evaluation
What kind of K-means clustering is good K-means clustering?
In practical applications, SSE (Sum of Squared Errors) is often used in combination with Silhouette Coefficient to evaluate the effect of the clustering model.
SSE: The Sum of Squared Errors is the smallest and the clustering effect is the best.
Silhouette Coefficient: The larger the silhouette coefficient, the better the clustering effect.
2.4.1. SSE
The smaller the SSE, the better the clustering effect.
2.4.2. Contour coefficient
The larger the silhouette coefficient, the better the clustering effect.
2.5 complexity
Let’s look at the pseudocode first:
Get data n pieces of m-dimensional data Randomly generate K m-dimensional points while(t) for(int i=0;i < n;i + + ) for(int j=0;j < k;j + + ) Calculate the distance from point i to class j for(int i=0;i < k;i + + ) 1. Find all data points belonging to your category 2. Modify your own coordinates to the center point coordinates of these data points end
Time complexity: , where t is iteration degree, k is the number of clusters, n is the number of sample points, and m is the sample point dimension.
Space complexity: , where k is the number of clusters, m is the sample point dimension, and n is the number of sample points.
3. K-means application
Python implementation of 3.1K-means
# -*- coding:utf-8 -*- import numpy as np from matplotlib import pyplot class K_Means(object): # k is the number of groups; tolerance center point error’; max_iter is the number of iterations def __init__(self, k=2, tolerance=0.0001, max_iter=300): self.k_ = k self.tolerance_ = tolerance self.max_iter_ = max_iter def fit(self, data): self.centers_ = {} for i in range(self.k_): self.centers_[i] = data[i] for i in range(self.max_iter_): self.clf_ = {} for i in range(self.k_): self.clf_[i] = [] # print("Particle:",self.centers_) for feature in data: # distances = [np.linalg.norm(feature-self.centers[center]) for center in self.centers] distances = [] for center in self.centers_: # Euler distance # np.sqrt(np.sum((features-self.centers_[center])**2)) distances.append(np.linalg.norm(feature - self.centers_[center])) classification = distances.index(min(distances)) self.clf_[classification].append(feature) # print("Group situation:",self.clf_) prev_centers = dict(self.centers_) for c in self.clf_: self.centers_[c] = np.average(self.clf_[c], axis=0) # Is the 'center point' within the error range? optimized=True for center in self.centers_: org_centers = prev_centers[center] cur_centers = self.centers_[center] if np.sum((cur_centers - org_centers) / org_centers * 100.0) > self.tolerance_: optimized=False if optimized: break def predict(self, p_data): distances = [np.linalg.norm(p_data - self.centers_[center]) for center in self.centers_] index = distances.index(min(distances)) return index if __name__ == '__main__': x = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]) k_means = K_Means(k=2) k_means.fit(x) print(k_means.centers_) for center in k_means.centers_: pyplot.scatter(k_means.centers_[center][0], k_means.centers_[center][1], marker='*', s=150) for cat in k_means.clf_: for point in k_means.clf_[cat]: pyplot.scatter(point[0], point[1], c=('r' if cat == 0 else 'b')) predict = [[2, 1], [6, 9]] for features in predict: cat = k_means.predict(predict) pyplot.scatter(feature[0], feature[1], c=('r' if cat == 0 else 'b'), marker='x') pyplot.show()
The running results are as follows:
{0: array([1.16666667, 1.46666667]), 1: array([7.33333333, 9. ])}
Note: * is the “center point” of the two sets of data; x is the prediction point grouping.
Sklearn implementation of 3.2K-means
import numpy as np from sklearn.cluster import KMeans from matplotlib import pyplot if __name__ == '__main__': x = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]) # Divide the above data points into two groups (unsupervised learning) clf = KMeans(n_clusters=2) clf.fit(x) #Group centers = clf.cluster_centers_ # Center points of two sets of data points labels = clf.labels_ #The group to which each data point belongs print(centers) print(labels) for i in range(len(labels)): pyplot.scatter(x[i][0], x[i][1], c=('r' if labels[i] == 0 else 'b')) pyplot.scatter(centers[:, 0], centers[:, 1], marker='*', s=100) # predict predict = [[2, 1], [6, 9]] label = clf.predict(predict) for i in range(len(label)): pyplot.scatter(predict[i][0], predict[i][1], c=('r' if label[i] == 0 else 'b'), marker='x') pyplot.show()
The running results are as follows:
[[7.33333333 9. ] [1.16666667 1.46666667]] [1 1 0 0 1 0]
Note: * is the “center point” of the two sets of data; x is the prediction point grouping.
3.3. User clustering
Dataset: titanic.xls (list of Titanic victims and survivors)
Data set acquisition address of titanic.xls:
titanic.xls
Task: Group users (live/dead) using k-means based on data except survived field
Clustered user groups are often used in early stages to attempt user exploration. In actual implementation, user tags or user portraits are often used to group users.
The python code for user clustering is as follows:
# -*- coding:utf-8 -*- import numpy as np from sklearn.cluster import KMeans from sklearn import preprocessing import pandas as pd # Download Data df = pd.read_excel('titanic.xls') df.drop(['body', 'name', 'ticket'], 1, inplace=True) df.fillna(0, inplace=True) # Replace NaN with 0 # Map strings to numbers, such as {female:1, male:0} df_map = {} cols = df.columns.values for col in cols: if df[col].dtype != np.int64 and df[col].dtype != np.float64: temp = {} x = 0 for ele in set(df[col].values.tolist()): if ele not in temp: temp[ele] = x x + = 1 df_map[df[col].name] = temp df[col] = list(map(lambda val: temp[val], df[col])) # Standardize each column of features to a standard normal distribution x = np.array(df.drop(['survived'], 1).astype(float)) x = preprocessing.scale(x) clf = KMeans(n_clusters=2) clf.fit(x) # Calculate grouping accuracy y = np.array(df['survived']) correct = 0 for i in range(len(x)): predict_data = np.array(x[i].astype(float)) predict_data = predict_data.reshape(-1, len(predict_data)) predict = clf.predict(predict_data) if predict[0] == y[i]: correct + = 1 print(correct * 1.0 / len(x))
Results of the:
First execution: 0.6974789915966386
Second execution: 0.3017570664629488
Note: The results fluctuate greatly because it randomly assigns groups (live: 0, dead: 1) (live: 1, dead: 0).
4. K-means summary
Advantages and disadvantages of 4.1K-means
4.1.1 Advantages
- It is easy to understand that the clustering effect is good. Although it is a local optimum, the local optimum is often enough;
- The principle is simple, easy to implement, and fast in convergence speed;
- This algorithm can ensure better scalability when processing large data sets;
- When the cluster approximates a Gaussian distribution, the effect is very good;
- Algorithm complexity is low;
- The clustering effect is better;
- The model has strong interpretability;
- Parameter adjustment only requires adjusting the number of classes k.
4.1.2 Disadvantages
- The K value needs to be set manually. Different K values will give different results, and the selection of k is difficult to grasp;
- Sensitive to the initial cluster center, different selection methods will yield different results;
- Sensitive to outliers;
- Samples can only be classified into one category and are not suitable for multi-classification tasks;
- It is not suitable for classification that is too discrete, classification of unbalanced sample categories, and classification of non-convex shapes;
- Clustering will not perform well if the types of data are unbalanced
- An iterative method is used to obtain the local optimal solution.
- Sensitive to noise and outliers
What is a convex set?
convex set
A subset of Euclidean space in which a line connecting any two points still falls entirely within the subset. For example, the two graphs below are both convex sets:
Therefore, a convex data set, that is, the samples of the data set exhibit a convex set distribution.
In contrast, neither of the following two graphs is convex:
Therefore, the data set is not convex, that is, the samples of the data set do not present a convex set distribution.
Improvements of 4.2K-means
In response to the shortcomings of K-means, K-means has many improved algorithms, such as data preprocessing (removing outliers), reasonable selection of K values, high-dimensional mapping, etc.
Disadvantages | Improvements |
---|---|
1. The selection of k is not good Grasp | ISODATA |
2. Sensitive to the initial clustering center | k-means + + |
3. It is difficult to converge for data sets that are not convex | DESCAN |
4. If the type of data is unbalanced, then The clustering effect is not good | CUSBoost |
5. It uses an iterative method to obtain the local optimal solution | Bisection K -means |
6. Sensitive to noise and outliers | K-Mediods |
[Machine Learning] K-means (very detailed) – Zhihu (zhihu.com)https://zhuanlan.zhihu.com/p/ 78798251
Kmeans + + clustering algorithm principle and implementation – Zhihu (zhihu.com)https://zhuanlan.zhihu.com/p/152286357
4.3The difference between clustering and classification
What is the difference between clustering and classification?
The biggest difference is: clustering is unsupervised; classification is supervised learning.
The classification of machine learning can be divided into binary classification (Binary Classification), multi-class classification (Multi-Class Classification) and multi-label classification (Multi-Label Classification) according to the output category (label).
The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 57237 people are learning the system