K-means clustering algorithm implements iris clustering

1.Experimental purpose:

Understand the principle of K-means algorithm and use K-means algorithm to implement iris clustering.

2.K-meansIntroduction to the algorithm

Clustering is a process of classifying and organizing data members that are similar in some aspects of the data set. Clustering is a technique for discovering this inherent structure. Clustering technology is often called unsupervised learning. K-means clustering is the most famous partitioning clustering algorithm. Due to its simplicity and efficiency, it has become the most widely used among all clustering algorithms. Given a set of data points and the required number of clusters k, where k is specified by the user, the k-means algorithm repeatedly divides the data into k clusters based on a certain distance function.

3.K-meansAlgorithm principles and steps

The K-means algorithm is a typical clustering algorithm based on distance (Euclidean distance, Manhattan distance). It uses distance as the evaluation index of similarity, that is, it is considered that the closer the distance between two objects, the greater the similarity. This algorithm believes that clusters are composed of objects that are close together, so it takes the ultimate goal of obtaining compact and independent clusters.

Algorithm steps:

First define how many clusters there are in total, and randomly select K samples as cluster centers.
Calculate the distances from all samples to the randomly selected K cluster centers.
The sample is assigned to which cluster center is closest to the center.
Calculate the mean of each center sample (the simplest method is to find the mean of each point of the sample) as the new cluster center.
Repeat 2, 3, and 4 until the new center is basically unchanged from the original center, and the algorithm ends.

Algorithm end condition:

k-menas can be stopped when the centroid of each cluster no longer changes. When the number of cycles reaches the predetermined number, k-means is stopped.

Principle diagram:

K-means flow chart:

Euclidean distance calculation formula: The Euclidean distance between points x= (x_1,x_2,…,x_n) and y= (y_1,y_2,…,y_n) is:

Manhattan distance: The Manhattan distance between points x= (x_1,x_2,…,x_n) and y= (y_1,y_2,…,y_n) is:

Cluster center update formula: During each iteration, we need to update the center of each cluster. Assuming that the data points contained in cluster C are x(1), x(2)…x(m), then the center of cluster C can be obtained by calculating the mean of all data points in the cluster:

4.Iris data analysis

1) Contains 3 data types, a total of 150 pieces of data

2) Contains 4 features: sepal length, sepal width, petal length, and petal width.

5.Experimental Tools

Windows 11, PyCharm.

6.Code Implementation

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets

# Get the data set directly from sklearn
iris = datasets.load_iris()
X = iris.data[:, :4] # means we take 4 dimensions in the feature space
print(X.shape)

# Take the first two dimensions (sepal length, sepal width) and draw the data distribution chart
plt.scatter(X[:, 0], X[:, 1], c="red", marker='o', label='see')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()

def Model(n_clusters):
    estimator = KMeans(n_clusters=n_clusters)# Construct clusterer
    return estimator

def train(estimator):
    estimator.fit(X) # Clustering

# Initialize the instance and start training fitting
estimator=Model(3)
train(estimator)

label_pred = estimator.labels_ # Get clustering labels
# Plot k-means results
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker=' + ', label='label2')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()


# Euclidean distance calculation
def distEclud(x, y):
    return np.sqrt(np.sum((x - y) ** 2)) # Calculate Euclidean distance
# Construct a set of K random centroids for the given data set
def randCent(dataSet, k):
    m, n = dataSet.shape # m=150,n=4
    centroids = np.zeros((k, n)) # 4*4
    for i in range(k): # Execute four times
        index = int(np.random.uniform(0, m)) # Generate random numbers from 0 to 150 (randomly pick a vector in the data set as the initial value of the centroid)
        centroids[i, :] = dataSet[index, :] # Pass the four dimensions of the corresponding row to the set of centroids
    return centroids
# k-means clustering algorithm
def KMeans(dataSet, k):
    m = np.shape(dataSet)[0] # Number of rows 150
    # The first column stores which cluster each sample belongs to (four clusters). The second column stores the error of each sample to the center point of the cluster.
    clusterAssment = np.mat(np.zeros((m, 2))) # .mat() creates a 150*2 matrix
    clusterChange=True
    # 1. Initialize centroids
    centroids = randCent(dataSet, k) # 4*4
    while clusterChange:
        # Stop iteration when the cluster to which the sample belongs is no longer updated
        clusterChange = False
        # Traverse all samples (number of rows 150)
        for i in range(m):
            minDist = 100000.0
            minIndex = -1
            # Traverse all centroids
            # 2. Find the nearest centroid
            for j in range(k):
                # Calculate the Euclidean distance between the sample and the four centroids and find the nearest centroid minIndex
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist=distance
                    minIndex = j
            # 3. Update the cluster to which the row sample belongs
            if clusterAssment[i, 0] != minIndex:
                clusterChange=True
                clusterAssment[i, :] = minIndex, minDist ** 2
        # 4. Update centroid
        for j in range(k):
            # np.nonzero(x) returns the subscript of the element whose value is not zero. Its return value is a tuple with length x.ndim (number of axes of x)
            # Each element of the tuple is an integer array, and its value is the value of the subscript of the non-zero element on the corresponding axis.
            #Matrix name.A represents converting the matrix into array array type

            #Here, take the first column of all rows of the matrix clusterAssment, convert it into an array, compare it with j (cluster label value), and return true or false
            # Generate an array through np.nonzero, which is the subscript value (x) of all points corresponding to the cluster class
            # Then use these subscript values to find the corresponding rows in the dataSet data set, and save them as pointsInCluster (x*4)
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]] # Get all the points of the corresponding cluster class (x*4)
            centroids[j, :] = np.mean(pointsInCluster, axis=0) # Find the mean and generate a new centroid
            # axis=0, then the output is 1 row and 4 columns. What is found is the average value of each column of pointsInCluster, that is, what is the axis, which indicates which dimension is compressed into 1

    print("cluster complete")
    return centroids, clusterAssment
def draw(data, center, assment):
    length = len(center)
    fig = plt.figure
    data1 = data[np.nonzero(assment[:, 0].A == 0)[0]]
    data2 = data[np.nonzero(assment[:, 0].A == 1)[0]]
    data3 = data[np.nonzero(assment[:, 0].A == 2)[0]]
    #Select the first two dimensions to draw a scatter plot of the original data
    plt.scatter(data1[:, 0], data1[:, 1], c="red", marker='o', label='label0')
    plt.scatter(data2[:, 0], data2[:, 1], c="green", marker='*', label='label1')
    plt.scatter(data3[:, 0], data3[:, 1], c="blue", marker=' + ', label='label2')
    # Draw the centroid point of the cluster
    for i in range(length):
        plt.annotate('center', xy=(center[i, 0], center[i, 1]), xytext= \
            (center[i, 0] + 1, center[i, 1] + 1), arrowprops=dict(facecolor='yellow'))
        # plt.annotate('center',xy=(center[i,0],center[i,1]),xytext=\
        # (center[i,0] + 1,center[i,1] + 1),arrowprops=dict(facecolor='red'))
    plt.show()

    #Select the last two dimensions to draw a scatter plot of the original data
    plt.scatter(data1[:, 2], data1[:, 3], c="red", marker='o', label='label0')
    plt.scatter(data2[:, 2], data2[:, 3], c="green", marker='*', label='label1')
    plt.scatter(data3[:, 2], data3[:, 3], c="blue", marker=' + ', label='label2')
    # Draw the centroid point of the cluster
    for i in range(length):
        plt.annotate('center', xy=(center[i, 2], center[i, 3]), xytext= \
            (center[i, 2] + 1, center[i, 3] + 1), arrowprops=dict(facecolor='yellow'))
    plt.show()
dataSet = X
k=3
centroids, clusterAssment = KMeans(dataSet, k)
draw(dataSet, centroids, clusterAssment)

The drawing results are shown below:

7.Experimental Summary

Advantages of K-means:

Simple, easy to understand and implement; fast convergence, generally only 5-10 iterations are needed, efficient

shortcoming:

Different choices of K value will have a big difference in the results.
It is sensitive to the selection of initial points, and the clustering results obtained by different random initial points may be completely different.
It is more difficult to converge for data sets that are not convex.
It is too sensitive to noise because the algorithm is based on the mean.
The result is not necessarily the global optimal, but can only guarantee the local optimal.
The effect of grouping spherical clusters is better, but the effect of grouping non-clusters, clusters of different sizes and different densities is not good.