(Experiment 4) Implementation of the k-means algorithm

1. Load the data set from the file

import numpy as np

inf = 0x3f3f3f3f  # large constant used as an "infinity" sentinel

def loadDataSet(fileName):
    # parse a tab-separated text file into a list of float lists
    dataMat = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
    return dataMat
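
The loader expects one point per line, with the coordinates separated by tabs. As a quick illustration (the file contents below are made up for this example, not taken from the real KMeans-testSet.txt), a file containing the two lines

1.5	2.0
-3.0	4.5

would load as the nested list [[1.5, 2.0], [-3.0, 4.5]].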

2. Calculate the Euclidean distance of two vectors

def disEclud(vecA, vecB):
    # Euclidean distance between two row vectors
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
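
A quick sanity check for this function is the 3-4-5 right triangle: the distance between (0, 0) and (3, 4) must be exactly 5. A minimal check (not part of the original experiment):

print(disEclud(np.mat([0.0, 0.0]), np.mat([3.0, 4.0])))  # prints 5.0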

3. Construct a set containing K random centroids

def randCent(dataSet, k):
    # build a k x n matrix of centroids drawn uniformly at random
    # within the min/max range of each column of the data
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = np.mat(minJ + rangeJ * np.random.rand(k, 1))
    return centroids
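
Since each coordinate is drawn uniformly within its column's [min, max] range, a simple property to check is that every generated centroid stays inside that range. A minimal sketch using a tiny hand-made dataset (the values are illustrative only, not from the experiment's data file):

demo = np.mat([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
cents = randCent(demo, 2)
print(cents)  # column 0 lies in [0, 4], column 1 in [10, 30]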

After writing the above functions, we can first test them from the main program:

def testBasicFunc():
    datMat = np.mat(loadDataSet('KMeans-testSet.txt'))

    print('min(datMat[:,0])=',min(datMat[:,0]))
    print('min(datMat[:,1])=',min(datMat[:,1]))
    print('max(datMat[:,1])=',max(datMat[:,1]))
    print('max(datMat[:,0])=',max(datMat[:,0]))

    print('randCent(datMat,2) = ',randCent(datMat,2))

    print('disEclud(datMat[0],datMat[1])=',disEclud(datMat[0],datMat[1]))

4. K-Means clustering algorithm (note: this algorithm is easy to get wrong while writing it out, so check your code carefully)

def kmeans(dataSet, k, distMeas=disEclud, createCent=randCent):
    m = np.shape(dataSet)[0]
    # clusterAssment column 0: assigned cluster index; column 1: squared distance
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    num = 1
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            # assign point i to its nearest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        print('The result of iteration %d is:' % num)
        num += 1
        print(centroids)
        for cent in range(k):
            # recompute each centroid as the mean of the points assigned to it
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment

def testKMeans():
    datMat = np.mat(loadDataSet('KMeans-testSet.txt'))
    myCentroids, clustAssing = kmeans(datMat, 4)
    print('centroids=', myCentroids)


if __name__ == '__main__':
    testKMeans()

Run the above code; the result will look like the figure below.

(Note: this is only one possible random result; individual runs will differ, and each run produces a new result. Please take screenshots of at least two results.)
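
If you need a reproducible run for a screenshot, one option (not part of the original experiment) is to seed NumPy's random generator before calling the test:

np.random.seed(42)  # any fixed integer makes randCent repeatable
testKMeans()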

5. Defects of the K-Means clustering algorithm

When testing kmeans, the algorithm may occasionally fall into a local minimum (a locally optimal result that is not globally optimal). There are many possible causes: the value of k may be inappropriate, the distance function may be inappropriate, the randomly chosen initial centroids may be too close together, or the distribution of the data itself may be the problem.

To mitigate this, the generated clusters can be post-processed. One approach is to split the cluster with the largest SSE into two: filter out the points belonging to that cluster and run k-means on just those points with k set to 2. To keep the total number of clusters unchanged, two other clusters can then be merged. It is obvious from the figure above that the two erroneous cluster centroids in its lower part should be merged. This raises a question: clustering on two-dimensional data is easy to visualize, but what should we do with, say, 40-dimensional data?

There are two quantifiable methods: merge the two nearest centroids, or merge the two clusters whose merger minimizes the increase in total SSE. The first idea computes the distance between every pair of centroids and merges the closest pair (see the sketch below). The second requires merging each candidate pair of clusters, computing the resulting total SSE, and repeating this over all possible pairs until the best pair is found. Because this post-processing is cumbersome, another algorithm was proposed: bisecting K-Means.
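
As a concrete illustration of the first idea, here is a minimal sketch of merging the two nearest centroids. The helper name mergeClosestCentroids and the choice to replace the pair with its midpoint are assumptions of this sketch, not code from the original experiment; it assumes at least two centroids.

def mergeClosestCentroids(centroids, distMeas=disEclud):
    # hypothetical helper: find the pair of centroids with the smallest
    # mutual distance and replace it with the pair's midpoint (k -> k-1)
    k = np.shape(centroids)[0]
    bestPair, bestDist = (0, 1), inf
    for i in range(k):
        for j in range(i + 1, k):
            d = distMeas(centroids[i, :], centroids[j, :])
            if d < bestDist:
                bestDist, bestPair = d, (i, j)
    i, j = bestPair
    merged = (centroids[i, :] + centroids[j, :]) / 2.0
    keep = [r for r in range(k) if r not in (i, j)]
    return np.vstack([centroids[keep, :], merged])

Replacing the pair with its midpoint ignores cluster sizes; a mean weighted by the number of points in each cluster would be more faithful, but the midpoint keeps the sketch short.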

6. Bisecting K-Means clustering algorithm

This algorithm first treats all points as a single cluster and splits that cluster in two. It then selects one of the resulting clusters to split further, choosing the cluster whose split most reduces the total SSE (Sum of Squared Errors).

The above SSE-based partitioning process is repeated until the user-specified number of clusters is obtained.

7. Bisecting K-Means clustering algorithm code (note: this algorithm is easy to get wrong while writing it out, so check your code carefully)

def biKMeans(dataSet, k, distMeas=disEclud):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    # start with a single cluster whose centroid is the mean of all points
    centroid = np.mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for j in range(m):
        clusterAssment[j, 1] = distMeas(np.mat(centroid), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            # try splitting cluster i in two with ordinary k-means
            ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kmeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = np.sum(splitClustAss[:, 1])
            sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit and sseNotSplit:", sseSplit, sseNotSplit)
            if sseSplit + sseNotSplit < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit

        # relabel the two halves of the best split: half 1 becomes a brand-new
        # cluster index, half 0 keeps the index of the cluster that was split
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print("the best centroid to split is:", bestCentToSplit)
        print("the length of bestClustAss is:", len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss

    return np.mat(centList), clusterAssment

def testBiKMeans():
    datMat = np.mat(loadDataSet('KMeans-testSet.txt'))
    centList, myNewAssments = biKMeans(datMat, 3)
    print('centList = ', centList)

if __name__ == '__main__':
    testBiKMeans()

Running the above function multiple times, the clustering is far more likely to converge to a good (low-SSE) solution, whereas the original kmeans() function occasionally gets stuck in a local minimum.
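
This claim can be checked numerically: since column 1 of clusterAssment stores each point's squared distance to its assigned centroid, the total SSE is just the sum of that column. A minimal comparison sketch (the helper totalSSE is an assumption of this sketch, and exact numbers depend on the random initialization):

def totalSSE(clusterAssment):
    # total sum of squared errors over all points
    return float(np.sum(clusterAssment[:, 1]))

datMat = np.mat(loadDataSet('KMeans-testSet.txt'))
_, assKM = kmeans(datMat, 3)
_, assBi = biKMeans(datMat, 3)
print('SSE of kmeans  :', totalSSE(assKM))
print('SSE of biKMeans:', totalSSE(assBi))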

Complete code

import numpy as np

inf = 0x3f3f3f3f  # large constant used as an "infinity" sentinel

def loadDataSet(fileName):
    # parse a tab-separated text file into a list of float lists
    dataMat = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
    return dataMat

def disEclud(vecA, vecB):
    # Euclidean distance between two row vectors
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

def randCent(dataSet, k):
    # build a k x n matrix of centroids drawn uniformly at random
    # within the min/max range of each column of the data
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = np.mat(minJ + rangeJ * np.random.rand(k, 1))
    return centroids

def testBasicFunc():
    datMat = np.mat(loadDataSet('KMeans-testSet.txt'))

    print('min(datMat[:,0])=',min(datMat[:,0]))
    print('min(datMat[:,1])=',min(datMat[:,1]))
    print('max(datMat[:,1])=',max(datMat[:,1]))
    print('max(datMat[:,0])=',max(datMat[:,0]))

    print('randCent(datMat,2) = ',randCent(datMat,2))

    print('disEclud(datMat[0],datMat[1])=',disEclud(datMat[0],datMat[1]))

def kmeans(dataSet, k, distMeas=disEclud, createCent=randCent):
    m = np.shape(dataSet)[0]
    # clusterAssment column 0: assigned cluster index; column 1: squared distance
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    num = 1
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            # assign point i to its nearest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        print('The result of iteration %d is:' % num)
        num += 1
        print(centroids)
        for cent in range(k):
            # recompute each centroid as the mean of the points assigned to it
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment

def testKMeans():
    datMat = np.mat(loadDataSet('KMeans-testSet.txt'))
    myCentroids, clustAssing = kmeans(datMat, 4)
    print('centroids=', myCentroids)

def biKMeans(dataSet, k, distMeas=disEclud):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    # start with a single cluster whose centroid is the mean of all points
    centroid = np.mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for j in range(m):
        clusterAssment[j, 1] = distMeas(np.mat(centroid), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            # try splitting cluster i in two with ordinary k-means
            ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kmeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = np.sum(splitClustAss[:, 1])
            sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit and sseNotSplit:", sseSplit, sseNotSplit)
            if sseSplit + sseNotSplit < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit

        # relabel the two halves of the best split: half 1 becomes a brand-new
        # cluster index, half 0 keeps the index of the cluster that was split
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print("the best centroid to split is:", bestCentToSplit)
        print("the length of bestClustAss is:", len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss

    return np.mat(centList), clusterAssment

def testBiKMeans():
    datMat = np.mat(loadDataSet('KMeans-testSet.txt'))
    centList, myNewAssments = biKMeans(datMat, 3)
    print('centList = ', centList)

if __name__ == '__main__':
    # testBasicFunc() and testKMeans() can be called here as well
    testBiKMeans()
