Practical Data Analysis | K-means Algorithm – Analysis of Protein Consumption Characteristics

Table of Contents

1. Data and analysis objects

2. Purpose and analysis tasks

3. Methods and Tools

4. Data reading

5. Data understanding

6. Data preparation

7. Model training

8. Model evaluation

9. Model parameter tuning and prediction


1. Data and analysis objects

The txt file “protein.txt” mainly records 9 food-related attributes of 25 countries. The main attributes are as follows:

(1) ID: The ID of the country.

(2) Country (country name): the data concerns the consumption of meat and other protein sources in 25 European countries.

(3) 9 columns on meat and other foods: RedMeat (red meat), WhiteMeat (white meat), Eggs (eggs), Milk (milk), Fish (fish), Cereals (cereals), Starch (starchy foods), Nuts (nuts), and Fr&Veg (fruits and vegetables).

2. Purpose and Analysis Tasks

Understand the application of machine learning methods in data analysis by using the k-means method for cluster analysis:

(1) After importing the data set, k initial cluster centers are chosen at random during the initialization stage.

(2) Based on this initial partition, re-determine each cluster center by computing the center point (mean) of the samples in the cluster.

(3) Iteratively repeat the process of “calculating distances – determining cluster centers – assigning samples to clusters” (a minimal sketch of this loop follows the list).

(4) Verify the correctness and reasonableness of the k-means clustering by evaluating specific metrics.
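
As an illustration of steps (1)–(3), here is a minimal sketch of the k-means loop written in plain NumPy for this text; it is only an example under the stated assumptions (Euclidean distance, random initial samples as centers) and not the scikit-learn implementation used later.

import numpy as np

def simple_kmeans(X, k, max_iter=100, seed=0):
    """Bare-bones k-means loop: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # (1) initialization: pick k random samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # compute the distance from every sample to every center and assign clusters
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (2) re-determine each cluster center as the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # (3) stop iterating once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers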

3. Methods and Tools

Python toolkits such as scikit-learn, pandas, and matplotlib are used.

4. Data reading

import pandas as pd
protein=pd.read_table(r"D:\Download\JDK\Data Analysis Theory and Practice by Chaolemen_Machinery Industry Press\Chapter 5 Cluster Analysis\protein.txt",
                     sep='\t')
protein.head()

5. Data Understanding

Conduct exploratory analysis on the data frame protein. The implementation method used here is to call the describe() method of the data frame (DataFrame) in the pandas package.

protein.describe()

In addition to the describe() method, you can also call the shape attribute and the pandas_profiling package to perform exploratory analysis on the data frame.

protein.shape
(25, 10)
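
A profiling report with pandas_profiling could be generated roughly as follows; this is only a sketch, assuming the pandas_profiling package is installed, and the output file name is arbitrary.

# Optional: generate an HTML exploratory report for the data frame
from pandas_profiling import ProfileReport
profile = ProfileReport(protein, title="protein dataset profile")
profile.to_file("protein_profile.html")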

6. Data preparation

When analyzing protein consumption in different countries, only the data useful for the analysis needs to be extracted, namely the 9 columns on meat and other foods. The specific implementation is to call the drop() method of the pandas data frame to remove the column named “Country”.

sprotein=protein.drop(['Country'],axis=1)
sprotein.head()

After extracting the data to be clustered, the data set needs to be standardized around the mean. The Z-score standardization method in statistics is used here.
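
For each column, the z-score of a value x is z = (x − mean) / standard deviation, so after scaling every feature has a mean of 0 and a standard deviation of 1.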

from sklearn import preprocessing
sprotein_scaled=preprocessing.scale(sprotein)
sprotein_scaled
array([[ 0.08294065, -1.79475017, -2.22458425, -1.1795703 , -1.22503282,
         0.9348045, -2.29596509, 1.24796771, -1.37825141],
       [-0.28297397, 1.68644628, 1.24562107, 0.40046785, -0.6551106,
        -0.39505069, -0.42221774, -0.91079027, 0.09278868],
       [1.11969872, 0.38790475, 1.06297868, 0.05573225, 0.06479116,
        -0.5252463, 0.88940541, -0.49959828, -0.07694671],
       [-0.6183957, -0.52383718, -1.22005113, -1.2657542, -0.92507375,
         2.27395937, -1.98367386, 0.32278572, 0.03621022],
       [-0.03903089, 0.96810416, -0.12419682, -0.6624669, -0.6851065,
         0.19082957, 0.45219769, -1.01358827, -0.07694671],
       [0.23540507, 0.8023329, 0.69769391, 1.13303099, 1.68457011,
        -0.96233157, 0.3272812 , -1.21918427, -0.98220215],
       [-0.43543839, 1.02336124, 0.69769391, -0.86356267, 0.33475432,
        -0.71124003, 1.38907137, -1.16778527, -0.30326057],
       [-0.10001666, -0.82775116, -0.21551801, 2.38269753, 0.45473794,
        -0.55314536, 0.51465594, -1.06498727, -1.5479868 ],
       [2.49187852, 0.55367601, 0.33240914, 0.34301192, 0.42474204,
        -0.385751, 0.3272812, -0.34540128, 1.33751491],
       [0.11343353, -1.35269348, -0.12419682, 0.07009624, 0.48473385,
         0.87900638, -1.29663317, 2.4301447, 1.33751491],
       [-1.38071781, 1.24438959, -0.03287563, -1.06465843, -1.19503691,
         0.73021139, -0.17238476, 1.19656871, 0.03621022],
       [1.24167025, 0.58130455, 1.61090584, 1.24794286, -0.62511469,
        -0.76703815, 1.20169663, -0.75659327, -0.69930983],
       [-0.25248108, -0.77249407, -0.03287563, -0.49009911, -0.26516381,
         0.42332173, -1.35909141, 0.63117972, 1.45067184],
       [-0.10001666, 1.57593211, 0.60637272, 0.90320726, -0.53512697,
        -0.91583314, -0.04746827, -0.65379528, -0.24668211],
       [-0.13050955, -0.88300824, -0.21551801, 0.88884328, 1.62457829,
        -0.86003502, 0.20236471, -0.75659327, -0.81246676],
       [-0.89283166, 0.63656164, -0.21551801, 0.31428395, -0.38514744,
         0.35822393, 1.0143219, -0.55099728, 1.39409338],
       [-1.10628185, -1.15929368, -1.67665709, -1.75412962, 2.97439408,
        -0.48804755, 1.0143219, 0.83677571, 2.12961342],
       [-1.10628185, -0.44095155, -1.31137232, -0.86356267, -0.98506557,
         1.61368162, -0.73450896, 1.14516971, -0.75588829],
       [-0.83184589, -1.24217931, 0.14976676, -1.22266225, 0.81468882,
        -0.28345445, 0.88940541, 1.45356371, 1.73356417],
       [0.02195488, -0.0265234, 0.51505153, 1.08993904, 0.96466835,
        -1.18552405, -0.35975949, -0.85939127, -1.20851601],
       [0.99772718, 0.60893309, 0.14976676, 0.96066319, -0.59511878,
        -0.61824316, -0.9218837 , -0.34540128, 0.43225947],
       [2.30892121, -0.60672281, 1.61090584, 0.50101573, 0.00479935,
        -0.73913909, 0.26482296, 0.16858872, -0.47299597],
       [-0.16100243, -0.91063679, -0.76344517, -0.07354359, -0.38514744,
         1.05570042, 1.32661312, 0.16858872, -0.69930983],
       [0.47934814, 1.27201813, 1.06297868, 0.24246404, -0.26516381,
        -1.26922123, 0.57711418, -0.80799227, -0.19010364],
       [-1.65515377, -0.80012261, -1.5853359, -1.0933864, -1.10504919,
         2.19956187, -0.79696721, 1.35076571, -0.52957443]])
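
As a quick sanity check (added here for illustration, not part of the original steps), the column-wise mean and standard deviation of the standardized matrix should be approximately 0 and 1:

import numpy as np
# each column of the z-score-scaled matrix should have mean ~0 and standard deviation ~1
print(np.round(sprotein_scaled.mean(axis=0), 6))
print(np.round(sprotein_scaled.std(axis=0), 6))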

7. Model Training

Before applying the k-means algorithm to the data set, a range of candidate values of k (the number of clusters) is tried in the initialization stage. In scikit-learn, the score() method of a fitted KMeans model returns the opposite of the within-cluster sum of squared distances (the inertia), which can be used to compare how well the model fits the data for different numbers of clusters. The implementation here calls the fit() and score() methods of KMeans from the sklearn.cluster module.

# Selection of the k value
from sklearn.cluster import KMeans
NumberOfClusters=range(1,20)
kmeans=[KMeans(n_clusters=i) for i in NumberOfClusters]
score=[kmeans[i].fit(sprotein_scaled).score(sprotein_scaled) for i in range(len(kmeans))]
score
[-225.00000000000003,
 -139.5073704483181,
 -110.40242709032155,
 -93.99077697163418,
 -77.34315775475405,
 -64.22779496320605,
 -52.68794493054983,
 -46.148487504020046,
 -41.95277071693863,
 -35.72656669867341,
 -30.429164116494334,
 -26.430420929243024,
 -22.402609200395787,
 -19.80060035995764,
 -16.86466531666006,
 -13.979313757304398,
 -11.450822978023083,
 -8.61897844282252,
 -6.704106008601115]

The above output lists the score of each KMeans(n_clusters=i) model (1<=i<=19). In order to observe how the score changes with k more intuitively, an elbow curve can be drawn. The specific implementation calls the pyplot module of the matplotlib package.

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(NumberOfClusters,score)
plt.xlabel('Number of Clusters')
plt.ylabel('score')
plt.title('Elbow Curve')
plt.show()
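
For reference, the same quantity can also be read from each fitted model's inertia_ attribute (the within-cluster sum of squared distances); score() evaluated on the training data is simply its negative. A small check, added here for illustration:

# the negative inertia of each fitted model should match the scores computed above
inertia_scores = [-km.inertia_ for km in kmeans]
print(inertia_scores[:3])   # compare with the first three values of score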

Next, tentatively set the number of clusters to 5, perform k-means clustering on the data matrix on this basis, and view the model's predictions. The specific implementation calls the KMeans() class and its predict() method from the scikit-learn package. The main parameters that need to be set in KMeans() are as follows:

(1) algorithm uses the default value “auto”, which selects between the “elkan” and “full” k-means implementations depending on whether the sample data are dense or sparse.

(2) n_clusters represents the number of classification clusters.

(3) n_init represents the number of times the algorithm is run with different initial centroids; the best result is kept.

(4) max_iter represents the maximum number of iterations.

(5) verbose controls logging; the default value 0 is used here, so no log information is output.

myKMeans=KMeans(algorithm="auto",
               n_clusters=5,
               n_init=10,
               max_iter=200,
               verbose=0)
myKMeans.fit(sprotein_scaled)
y_kmeans=myKMeans.predict(sprotein)
print(y_kmeans)
[2 4 4 2 4 4 4 3 4 2 2 4 2 4 4 4 0 2 2 4 4 4 2 4 2]

Through the above analysis, the data set protein is divided into 5 clusters based on k=5, numbered 0, 1, 2, 3 and 4. Next, display the cluster to which each sample belongs.

protein["the cluster it belongs to"]=y_kmeans
protein
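
To make the result easier to read, the countries in each cluster can be listed by grouping on the new column (a small illustrative snippet, not from the original text):

# list the countries belonging to each cluster
for label, group in protein.groupby("the cluster it belongs to"):
    print(label, list(group["Country"]))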

8. Model Evaluation

It can be seen that the k-means algorithm completes the corresponding clustering. Next, the silhouette coefficient is introduced to evaluate the clustering results of the algorithm. The implementation here calls the silhouette_score() method of the sklearn.metrics module, which returns the mean silhouette coefficient of all samples. Its value range is [-1, 1]; the larger the silhouette coefficient, the better the clustering.
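
For a single sample i, the silhouette coefficient is defined as s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other samples in its own cluster and b(i) is the mean distance from i to the samples in the nearest neighbouring cluster; silhouette_score() returns the average of s(i) over all samples.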

from sklearn.metrics import silhouette_score
silhouette_score(sprotein,y_kmeans)
0.2222236683250513
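
To choose k more systematically, the silhouette coefficient can be computed for a range of k values and plotted against k:
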
number=range(2,20)
myKMeans_list=[KMeans(algorithm="auto",
               n_clusters=i,
               n_init=10,
               max_iter=200,
               verbose=0) for i in number]
y_kmeans_list=[myKMeans_list[i].fit(sprotein_scaled).
               predict(sprotein_scaled) for i in range(len(number))]
score=[silhouette_score(sprotein,y_kmeans_list[i]) for i in range(len(number))]
score
[0.4049340501486218,
 0.31777138102456476,
 0.16996270462188423,
 0.21041645106099247,
 0.1943500298289292,
 0.16862742616667453,
 0.1868090290661263,
 0.08996856437394235,
 0.10531808817576255,
 0.13528249120860153,
 0.07381598489593617,
 0.09675173868153258,
 0.056460835203354785,
 0.10871862224667578,
 0.04670651599769748,
 0.03724019668260051,
 0.0074356180520073045,
 0.013165944671952217]
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False
plt.plot(number,score)
plt.xlabel("k value")
plt.ylabel("Silhouette coefficient")
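
The k with the highest average silhouette coefficient can also be read off programmatically (a small snippet added for illustration):

import numpy as np
# k value that maximizes the average silhouette coefficient
best_k = number[int(np.argmax(score))]
print(best_k)   # 2 for the scores listed above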

9. Model Parameter Tuning and Prediction

From the analysis of the silhouette coefficient, the best number of cluster centers is determined to be 2; the sample data set protein is clustered again on this basis.

estimator=KMeans(algorithm="auto",
               n_clusters=2,
               n_init=10,
               max_iter=200,
               verbose=0)
estimator.fit(sprotein_scaled)
y_pred=estimator.predict(sprotein_scaled)
print(y_pred)
[1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1]
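
To interpret the two clusters, the average consumption of each food category can be computed in the original (unstandardized) units by grouping on the predicted labels (an illustrative snippet, not from the original text):

# mean protein consumption per cluster, in the original units
print(sprotein.groupby(y_pred).mean())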

Draw a cluster plot:

# collect the RedMeat/WhiteMeat coordinates of each predicted cluster
x1=[]
y1=[]
x2=[]
y2=[]
for i in range(len(y_pred)):
    if y_pred[i]==0:
        x1.append(sprotein['RedMeat'][i])
        y1.append(sprotein['WhiteMeat'][i])
    if y_pred[i]==1:
        x2.append(sprotein['RedMeat'][i])
        y2.append(sprotein['WhiteMeat'][i])
# draw the two clusters in different colours
plt.scatter(x1,y1,c="red")
plt.scatter(x2,y2,c="orange")
plt.show()
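
Equivalently, the same figure can be produced in a single call by passing the label array as the colour argument (a compact alternative, not the original code):

# colour each country's point by its predicted cluster label
plt.scatter(sprotein['RedMeat'], sprotein['WhiteMeat'], c=y_pred, cmap='coolwarm')
plt.xlabel('RedMeat')
plt.ylabel('WhiteMeat')
plt.show()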