Data mining experiment five, k-means clustering algorithm

1. The purpose of the experiment:
(1) Become familiar with the VC++ programming environment and the k-means clustering algorithm.
(2) Using VC++, write a program that performs k-means clustering on the training sample set, run the k-means clustering algorithm on the task-relevant data, and debug the experiment.
(3) Master the distance calculation methods and the clustering evaluation criteria.
(4) Write an experiment report.
2. Experimental principle:
1. k-means clustering
K-means clustering is a centroid-based partitioning technique. The specific iterative calculation steps are as follows:

  1. Randomly generate k centroid coordinates in the attribute vector space.
  2. For each data object Ti (1 ≤ i ≤ n) in the data set D, calculate the distance measure Dist(i, j) (1 ≤ j ≤ k) to each of the k centroids, and assign Ti to the cluster with the smallest distance measure. That is, Ti ∈ CJ, where J = argmin(Dist(i, j)), i.e., J is the value of j that minimizes Dist(i, j).
  3. Recompute the centroid of each cluster according to the centroid definition; these become the k centroids for the next iteration.
  4. If the termination condition is not satisfied, go to step 2 and continue iterating; otherwise, stop.
    The centroid of a cluster can be defined in different ways: it can be, for example, the mean of the attribute vectors of the data objects in the cluster (that is, the center of gravity), or a representative center point. The distance measure can likewise be defined in different ways; commonly used measures are the Euclidean distance, the Manhattan (city block) distance, and the Minkowski distance (see the sketch below). For the termination condition, the iteration can end when no reallocation of objects occurs.
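    For concreteness, here is a minimal sketch of the three distance measures named above for two dim-dimensional attribute vectors. It is illustrative only and is not part of the experiment code below, which uses the Euclidean distance exclusively:

#include <math.h>
// Illustrative sketch (not from the experiment code): three common
// distance measures between dim-dimensional vectors a and b.
double euclidean(const double a[], const double b[], int dim)
{
    double s = 0;
    for(int i = 0; i < dim; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(s);
}
double manhattan(const double a[], const double b[], int dim)
{
    double s = 0;
    for(int i = 0; i < dim; i++)
        s += fabs(a[i] - b[i]);
    return s;
}
// Minkowski distance of order p: p = 2 gives the Euclidean distance,
// p = 1 the Manhattan distance.
double minkowski(const double a[], const double b[], int dim, double p)
{
    double s = 0;
    for(int i = 0; i < dim; i++)
        s += pow(fabs(a[i] - b[i]), p);
    return pow(s, 1.0 / p);
}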
    2. Termination conditions
    The termination condition can be any of the following (a sketch of condition 3 follows the list):
    1) No objects (or only a minimal number) are reassigned to different clusters.
    2) No cluster centers (or only a minimal number) change any further.
    3) The sum of squared errors reaches a local minimum.
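    As an illustration of condition 3), a minimal sketch of an SSE-based termination test follows. The array shapes and names (points, centers, assign, eps) are assumptions for illustration and do not appear in the experiment code:

#include <math.h>
// Hypothetical sketch of termination condition 3): stop when the sum
// of squared errors (SSE) no longer decreases noticeably between two
// successive iterations. Two-dimensional data is assumed.
double sse(const double points[][2], int num_points,
           const double centers[][2], const int assign[])
{
    double total = 0;
    for(int i = 0; i < num_points; i++)
        for(int d = 0; d < 2; d++)
        {
            double diff = points[i][d] - centers[assign[i]][d]; //offset from own centroid
            total += diff * diff;
        }
    return total;
}
// Terminate when the SSE improvement between iterations falls below eps.
bool sse_converged(double old_sse, double new_sse, double eps)
{
    return fabs(old_sse - new_sse) < eps;
}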
3. Experimental content:
  1. Experimental content
  1. Following the calculation steps of the k-means clustering algorithm, draw the program flow chart for k = 3;
  2. Implement the k-means clustering algorithm by programming from the k-means flow chart;
  3. Include in the experiment report a series of screenshots of the k-means clustering process, showing the gradual evolution of each cluster;
  4. In the report, point out and explain the choice of the initial centroids, the termination condition, and the distance measure used in the experimental code.
  2. Experimental procedure
    The program implements the following functions:
  1. First, input the attribute vectors of the data set D = {D1, D2, D3} as the experimental data;
  2. Implement the k-means clustering algorithm according to the flow chart and run it on the experimental data;
  3. Pause at appropriate iteration counts during execution and display the intermediate results, such as the positions of the cluster centers and the nearest-centroid cluster assignments;
  3. Block diagram
  4. Key code
#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <cstdlib>
#include <math.h>
using namespace std;
#define k 3 //number of clusters
#define n 2 //data dimension
#define size 30 //data size
typedef struct { //storage structure for one data object
    double d[n]; //attribute vector
    double distance[k]; //distance to each of the k centroids
}Data;
typedef struct {
    Data center[k]; //the cluster centers (centroids)
    int cluster[k][size]; //indices of the data objects in each cluster
    int cluster_num[k]; //number of data objects in each cluster
    Data old_center[k]; //centroids from the previous iteration
    int isend[k][n]; //1 if the corresponding centroid coordinate is unchanged
    int is; //1 while iteration should continue, 0 when finished
} Tdata;
Data data[size]; //the data set read from file
Tdata td; //global clustering state
void Init() //create the stream and read the data
{
    char name1[50] = "data.txt";
    ifstream infile;
    cout<<"The file to be opened is: data.txt"<<endl;
    infile.open(name1, ios::in);
    if(infile.fail())
    {
        cout << "error open!" << endl;
        exit(1);
    }
    for(int i = 0; i < size; i++)
        for(int j = 0; j < n; j++)
            infile>>data[i].d[j];
    cout<<"the data are:"<<endl;
    for(int i = 0; i < size; i++) //output the data to be processed
    {
        for(int j = 0; j < n; j++)
            cout<<data[i].d[j]<<" ";
        cout<<endl;
    }
    infile.close(); //close the file
}
void Init_center() //initialize the centroids: the first k data points are used
{
    for(int i = 0; i < k; i++)
    {
        cout<<"Initial centroid "<<i + 1<<":"<<endl;
        for(int j = 0; j < n; j++)
        {
            td.center[i].d[j] = data[i].d[j];
            cout<<td.center[i].d[j]<<" ";
        }
        cout<<endl;
    }
}
double Euclid(int x, int y) //Euclidean distance from data object x to centroid y
{
    double distance = 0;
    for(int i = 0; i < n; i++)
        distance += pow((data[x].d[i] - td.center[y].d[i]), 2);
    return sqrt(distance);
}
void calculate_distance() //Euclidean distance from every data object to all k centroids
{
    int i, j;
    for(i = 0; i < size; i++)
        for(j = 0; j < k; j++)
            data[i].distance[j] = Euclid(i, j); //i: data object index, j: centroid index
}
void new_cluster() //assign each data object to its nearest cluster
{
    int i, j;
    double min;
    for(i = 0; i < k; i++) //reset the cluster sizes
        td.cluster_num[i] = 0;
    for(i = 0; i < size; i++)
    {
        int index = 0; //index of the centroid with the smallest Euclidean distance
        min = data[i].distance[0];
        for(j = 1; j < k; j++)
        {
            if(data[i].distance[j] < min) //keep the minimum distance seen so far
            {
                min = data[i].distance[j];
                index = j;
            }
        }
        td.cluster[index][td.cluster_num[index]++] = i; //place object i into its cluster
    }
}
void new_center() //update the centroids
{
    int i, j, m;
    double sum;
    for(i = 0; i < k; i++)
        for(j = 0; j < n; j++)
        {
            sum = 0;
            td.old_center[i].d[j] = td.center[i].d[j]; //remember the old centroid
            for(m = 0; m < td.cluster_num[i]; m++)
                sum += data[td.cluster[i][m]].d[j]; //sum dimension j over cluster i
            td.center[i].d[j] = sum / td.cluster_num[i]; //mean value; assumes no cluster is empty
        }
}
void comp() //compare old and new centroids to evaluate the termination condition
{
    int i, j;
    new_center(); //update the centroids
    for(i = 0; i < k; i++)
        for(j = 0; j < n; j++) //1 if this centroid coordinate is unchanged
            td.isend[i][j] = td.old_center[i].d[j] != td.center[i].d[j] ? 0 : 1;

    td.is = 1;
    for(i = 0; i < k; i++)
        for(j = 0; j < n; j++) //product stays 1 only if every coordinate is unchanged
            td.is = td.isend[i][j] * td.is;

    td.is = td.is == 0 ? 1 : 0; //continue iterating only if some centroid moved
}
void output_info() //print the current centroids and cluster members
{
    int i, j, m;
    for(i = 0; i < k; i++)
    {
        cout<<"centroid "<<i + 1<<":"<<endl;
        for(m = 0; m < n; m++)
            cout<<td.center[i].d[m]<<" ";
        cout<<endl;
        cout<<"cluster "<<i + 1<<":"<<endl;
        for(j = 0; j < td.cluster_num[i]; j++)
        {
            for(m = 0; m < n; m++)
                cout<<data[td.cluster[i][j]].d[m]<<" ";
            cout<<endl;
        }
    }
}
int main()
{
    int count = 0;
    Init(); //read the data file
    Init_center(); //select the initial centroids
    td.is = 1;
    while (td.is) //loop until the termination condition is met
    {
        calculate_distance(); //Euclidean distance from every object to every centroid
        new_cluster(); //assign the data objects to clusters
        count++;
        cout<<"______________________________________________________"<<endl;
        cout<<"Iteration "<<count<<" clustering:"<<endl;
        output_info(); //output the intermediate result
        comp(); //compare centroids and update the termination flag
    }
    system("PAUSE");
    return 0;
}
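The listing above targets VC++, hence the system("PAUSE") call at the end. As a usage note, it should build essentially unchanged on other toolchains, for example with g++ (an assumption, since the lab environment specified is VC++), as long as data.txt containing the 30 attribute vectors listed below sits in the program's working directory.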

4. Experimental results:

  1. Experimental data (data.txt)
    The number of clusters is known; set k = 3. The data set D consists of three non-overlapping subsets, D = {D1, D2, D3}. For ease of display, this experiment uses two-dimensional attribute vectors. All attributes have been normalized to the interval [0, 1]; each row represents one attribute vector, with the first column giving the horizontal coordinate and the second column the vertical coordinate in attribute space.
    D1 =
    0.1000 0.8000
    0.2000 0.7000
    0.1000 0.9000
    0.1500 0.7500
    0.2500 0.8000
    0.1800 0.8100
    0.2200 0.6600
    0.3000 0.7800
    0.1200 0.6300
    0.2700 0.8800
    D2 =
    0.4500 0.1000
    0.5500 0.1500
    0.6500 0.2800
    0.6300 0.3200
    0.5600 0.3600
    0.5300 0.3400
    0.5100 0.2800
    0.4700 0.2300
    0.4300 0.1800
    D3 =
    0.6700 0.7500
    0.7200 0.8200
    0.7800 0.8800
    0.8500 0.8800
    0.8800 0.9200
    0.8400 0.9500
    0.8000 0.9400
    0.7700 0.9100
    0.7300 0.8800
    0.7000 0.7900
    0.6600 0.7700

  2. Process results

    (First read in the file and initialize the centroid)

    (for the first clustering calculation)

    (for the second clustering calculation)

    (Perform the third clustering calculation and get the result)
    The program reads the file automatically, obtains the initial centroids, carries out the 3-means clustering algorithm using Euclidean distance calculations, and produces the target result once the termination condition is met.

5. Experimental conclusion
The k-means clustering algorithm is an iterative algorithm; if no termination condition is imposed, it can keep iterating indefinitely. By partitioning the data into k clusters, it makes each category easy to distinguish, so that further operations can then be performed on them.

The above is the content of Data Mining Experiment Five.