Clustering algorithms: the K-Means algorithm


1. Clustering concepts:

  1. Unsupervised problem: the data carries no labels; the algorithm must discover the groupings by itself.

  2. Clustering: similar samples in a data set are grouped together; each group is a category, also called a cluster.

  3. Difficulty: evaluating the quality of clustering results, and tuning the parameters (such as the number of clusters) to suit the data. One common quality measure is sketched below.
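One common way to evaluate clustering quality is the silhouette coefficient. A minimal sketch, assuming MATLAB's Statistics and Machine Learning Toolbox (which provides the silhouette function); the data and labels here are made up for illustration:

% Toy data: two visually obvious groups, with hypothetical cluster labels
X = [1, 2; 1.5, 1.8; 8, 8; 9, 11];
labels = [1; 1; 2; 2];

% Silhouette values lie in [-1, 1]; values near 1 mean a sample sits well
% inside its own cluster, so a higher mean suggests a better clustering
s = silhouette(X, labels);
disp(mean(s));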

2. K-Means algorithm:

(1) Basic concepts:

  1. Number of clusters: we must decide how many categories to divide the data into, i.e., set the value of k.
  2. Centroid: the mean of a cluster, obtained by averaging each dimension of the member vectors.
  3. Distance measure: Euclidean distance and cosine similarity are commonly used; the features should be standardized before distances are computed.
  4. Optimization objective: min\sum_{i=1}^{k}\sum_{x\in C_{i}}dist(c_{i},x)^{2} . All k clusters are optimized jointly: for each cluster, we compute the distance from every point in it to the cluster's centroid and want the total to be as small as possible. The smaller these distances, the more compact and similar each cluster, and the better the clustering result. A small MATLAB sketch of this objective follows.
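The objective can be computed directly. A minimal sketch, assuming data (one sample per row), centroids (one center per row), and assignments (one cluster index per sample) laid out as in the full implementation later in this article:

% Within-cluster sum of squared distances (the k-means objective)
J = 0;
for j = 1:size(centroids, 1)
    pts = data(assignments == j, :);               % members of cluster j
    J = J + sum(sum((pts - centroids(j, :)).^2));  % squared distances to the centroid
end
disp(J);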

3. Distance calculation: the most commonly used measure is the Euclidean distance, computed as follows:

  1. Two-dimensional data: for two points A(x_{1},y_{1}) and B(x_{2},y_{2}), the Euclidean distance d is
    d=\sqrt{(x_{2}-x_{1})^{2} + (y_{2}-y_{1})^{2}}
  2. n-dimensional data: for two n-dimensional vectors A(x_{1},x_{2},...,x_{n}) and B(y_{1},y_{2},...,y_{n}),
    d=\sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^{2}}=\sqrt{(x_{1}-y_{1})^{2} + (x_{2}-y_{2})^{2} + ... + (x_{n}-y_{n})^{2}}
  3. Before clustering with Euclidean distance, the data must first be standardized (normalized) so that values fall in the range 0 to 1. Applied to the distances themselves, min-max scaling looks like this (see the sketch after this list):
    d_{normalized}=\frac{d-min(D)}{max(D)-min(D)}
    where: d is the original Euclidean distance,
    D is the set of Euclidean distances between all samples,
    max(D) is the maximum of all distances, and
    min(D) is the minimum of all distances.
  4. Optimization: iteration after iteration, the sum of distances from each point to its cluster's centroid decreases. The smaller this sum, the more similar the points within each cluster, and the better the clustering result.
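A minimal MATLAB sketch of the distance and normalization formulas above (the vectors and the distance set are made-up values for illustration):

% Euclidean distance between two n-dimensional vectors
A = [1, 2, 3];
B = [4, 6, 3];
d = sqrt(sum((A - B).^2))        % sqrt(9 + 16 + 0) = 5

% Min-max normalization of a set of distances D into [0, 1]
D = [5, 2, 9, 4];
D_normalized = (D - min(D)) ./ (max(D) - min(D))   % [0.4286, 0, 1, 0.2857]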

4. Workflow:

  1. Assignment: randomly select k points (k is set by the user) as initial centroids, then compute the Euclidean distance from every sample to each centroid; each sample belongs to the cluster of its nearest centroid. For example, if points A, B, and C are all nearest to centroid k_{2}, then all three belong to the cluster represented by k_{2}.
  2. Update: after each assignment pass ends, the centroid of every cluster is recomputed from its current members. Once all centroids are updated, step 1 is repeated: all points are traversed and assigned again.
  3. Convergence: when the assignments of the sample points barely change between iterations, the current clustering is taken as the final result. A one-iteration walk-through follows.
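To make the assignment and update steps concrete, here is a single iteration on made-up one-dimensional data:

x = [1; 2; 10; 12];            % four 1-D samples
c = [1; 12];                   % k = 2 initial centroids
d = abs(x - c');               % 4x2 matrix of distances |x_i - c_j|
[~, idx] = min(d, [], 2);      % assignment step: idx = [1; 1; 2; 2]
c = [mean(x(idx == 1)); mean(x(idx == 2))]   % update step: c = [1.5; 11]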

5. Advantages and disadvantages of the K-Means algorithm:

Advantages:

  1. Simple: the principle and the procedure are easy to understand.
  2. Fast: the only computation is repeatedly updating the distances between each sample point and the centroids.
  3. Suitable for regular (roughly convex, well-separated) data sets.

Disadvantages:

  1. The value of k is hard to determine: when the data set is large or widely scattered, it is not obvious how many clusters it should be divided into, so several values of k must be tried and the best-performing one taken as the number of clusters (see the sketch after this list).
  2. Complexity grows linearly with the number of samples: when the sample count is very large, the required time and space grow accordingly.
  3. It struggles with clusters of arbitrary shape: for example, a smiley-face-shaped data set cannot be clustered sensibly by k-means.
    (Figure: k-means clustering of the smiley-face data set.)
  4. It is highly sensitive to initialization: different initial centroids can lead to very different final clustering results.
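Weaknesses 1 and 4 are commonly handled by running k-means several times. A minimal sketch, assuming MATLAB's Statistics and Machine Learning Toolbox (its built-in kmeans supports random restarts via the 'Replicates' option); the toy data matrix is the same one used in the implementation below:

data = [1, 2; 1.5, 1.8; 5, 8; 8, 8; 1, 0.6; 9, 11];

% Elbow method: try several k values and plot the total within-cluster
% sum of distances; the "elbow" of the curve suggests a reasonable k
maxK = 4;
cost = zeros(maxK, 1);
for k = 1:maxK
    [~, ~, sumd] = kmeans(data, k, 'Replicates', 10);  % best of 10 restarts
    cost(k) = sum(sumd);
end
plot(1:maxK, cost, '-o'); xlabel('k'); ylabel('total within-cluster distance');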

6. MATLAB K-Means algorithm code implementation:

% Input data matrix, each row is a sample, each column is a feature
data = [1, 2; 1.5, 1.8; 5, 8; 8, 8; 1, 0.6; 9, 11];

% Set the number of clusters
k = 2;

% Initialize the cluster centers: randomly select k samples from the data as the initial centers
initialCentroids = data(randperm(size(data, 1), k), :);

% Set the maximum number of iterations
maxIter = 100;

% Run k-means algorithm
[centroids, assignments] = kMeans(data, initialCentroids, maxIter);

% Output the cluster centers and the assignment result of each sample
disp('Cluster centers:');
disp(centroids);

disp('Assignment of each sample:');
disp(assignments);

% Define k-means algorithm function
function [centroids, assignments] = kMeans(data, initialCentroids, maxIter)
    centroids = initialCentroids;
    numData = size(data, 1);
    k = size(initialCentroids, 1);
    assignments = zeros(numData, 1);

    for iter = 1:maxIter
        prevAssignments = assignments;

        % Assignment step: assign each sample to the nearest cluster center
        % (centroids - data(i, :) relies on implicit expansion, R2016b+)
        for i = 1:numData
            distances = sqrt(sum((centroids - data(i, :)).^2, 2));
            [~, assignments(i)] = min(distances);
        end

        % Stop early once no sample changes cluster (convergence)
        if isequal(assignments, prevAssignments)
            break;
        end

        % Update step: recompute each center as the mean of its members
        for j = 1:k
            clusterPoints = data(assignments == j, :);
            if ~isempty(clusterPoints)  % keep the old center if a cluster is empty
                centroids(j, :) = mean(clusterPoints, 1);
            end
        end
    end
end

Code description:

  1. Run steps:

    a. Prepare data: define the data matrix in the code; each row is a sample and each column is a feature.

    b. Set the number of clusters: Specify the number of clusters to divide the data into through the variable k.

    c. Initialize clustering centers: Randomly select k samples from the data as initial clustering centers.

    d. Set the number of iterations: Specify the maximum number of iterations for the algorithm to run through maxIter.

    e. Run k-means: call the kMeans function to execute the algorithm and obtain the cluster centers (centroids) and the assignment result of each sample (assignments).

    f. Output results: print the cluster centers and the assignment result of each sample.

  2. Variable meaning:

    • data: Input data matrix, each row represents a sample, and each column represents a feature.

    • k: The number of clusters, specifies the number of clusters into which the data is divided.

    • initialCentroids: Initial clustering centers, randomly selected samples from the data.

    • maxIter: Maximum number of iterations, specifying the maximum number of iterations for the algorithm to run.

    • centroids: the final cluster centers; each row is the center point of one cluster.

    • assignments: the assignment result of each sample, i.e., the index of the cluster to which each sample belongs.

  3. Run the example:

    In MATLAB, copy and paste the above code into the editor and run the entire script. Prepare the data in advance and adjust parameters such as the number of clusters to suit your needs. Finally, view the output: the cluster centers and the assignment of each sample. One possible output is shown below.
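For the toy data matrix in the script, the algorithm converges to two well-separated clusters. One possible output (the two cluster labels may be swapped, since the initial centroids are chosen at random):

Cluster centers:
    1.1667    1.4667
    7.3333    9.0000

Assignment of each sample:
     1
     1
     2
     2
     1
     2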
