Clustering Algorithm Case Study: Segmenting Users by Item-Category Preference

Directory

  • Goal
  • Import module
  • Get data
  • View data
  • Basic data processing
    • Merge tables
    • Crosstab merge
  • Feature Engineering – PCA
  • Machine learning (K-means)

Goal

Use PCA and K-means to segment users according to their preferences for item categories (aisles).

Import module

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Get data

# Get data
order_product = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

The data files are as follows (a quick check of their shapes and join keys appears after this list):

  • order_products__prior.csv: order and product information
    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: customers' order information
    • Fields: order_id, user_id, eval_set, order_number, …
  • aisles.csv: the item category (aisle) each product belongs to
    • Fields: aisle_id, aisle
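
The tables link together through their id columns: order_products__prior joins products on product_id, joins orders on order_id, and products joins aisles on aisle_id, which is exactly what the merges below rely on. A minimal sketch (not part of the original notebook) to confirm the shapes and columns of what was just loaded:

# Print each frame's shape and columns to confirm the join keys are present
for name, df in [("order_product", order_product), ("products", products),
                 ("orders", orders), ("aisles", aisles)]:
    print(name, df.shape, list(df.columns))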

View data

order_product.head()
   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
products.head()
   product_id                                        product_name  aisle_id  department_id
0           1                          Chocolate Sandwich Cookies        61             19
1           2                                    All-Seasons Salt       104             13
2           3                Robust Golden Unsweetened Oolong Tea        94              7
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38              1
4           5                           Green Chile Anytime Sauce         5             13
orders.head()
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0
3   2254736        1    prior             4          4                  7                    29.0
4    431534        1    prior             5          4                 15                    28.0
aisles.head()
   aisle_id                       aisle
0         1       prepared soups salads
1         2           specialty cheeses
2         3         energy granola bars
3         4               instant foods
4         5  marinades meat preparation

Basic data processing

Merge tables

# Basic data processing
# Merge tables
table1 = pd.merge(order_product, products, on="product_id")
table1
          order_id  product_id  add_to_cart_order  reordered                    product_name  aisle_id  department_id
0                2       33120                  1          1              Organic Egg Whites        86             16
1               26       33120                  5          0              Organic Egg Whites        86             16
2              120       33120                 13          0              Organic Egg Whites        86             16
3              327       33120                  5          1              Organic Egg Whites        86             16
4              390       33120                 28          1              Organic Egg Whites        86             16
...            ...         ...                ...        ...                             ...       ...            ...
32434484   3265099       43492                  3          0        Gourmet Burger Seasoning       104             13
32434485   3361945       43492                 19          0        Gourmet Burger Seasoning       104             13
32434486   3267201       33097                  2          0  Piquillo & Jalapeno Bruschetta        81             15
32434487   3393151       38977                 32          0                  Original Jerky       100             21
32434488   3400803       23624                  7          0     Flatbread Pizza All Natural        79              1

32434489 rows × 7 columns

table2 = pd.merge(table1, orders, on="order_id")
table = pd.merge(table2, aisles, on="aisle_id")
table.shape

(32434489, 14)

table.head()
   order_id  product_id  add_to_cart_order  reordered        product_name  aisle_id  department_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order aisle
0         2       33120                  1          1  Organic Egg Whites        86             16   202279    prior             3          5                  9                     8.0  eggs
1        26       33120                  5          0  Organic Egg Whites        86             16   153404    prior             2          0                 16                     7.0  eggs
2       120       33120                 13          0  Organic Egg Whites        86             16    23750    prior            11          6                  8                    10.0  eggs
3       327       33120                  5          1  Organic Egg Whites        86             16    58707    prior            21          6                  9                     8.0  eggs
4       390       33120                 28          1  Organic Egg Whites        86             16   166654    prior            48          0                 12                     9.0  eggs

Crosstab merge

# Crosstab merge
data = pd.crosstab(table["user_id"], table["aisle"])
data.shape

(206209, 134)

data.head()
aisle    air fresheners candles  asian foods  baby accessories  baby bath body care  baby food formula  bakery desserts  baking ingredients  baking supplies decor  beauty  beers coolers  ...  spreads  tea  tofu meat alternatives  tortillas flat bread  trail mix snack mix  trash bags liners  vitamins supplements  water seltzer sparkling water  white wines  yogurt
user_id
1        0  0  0  0  0  0  0  0  0  0  ...  1  0  0  0  0  0  0  0  0   1
2        0  3  0  0  0  0  2  0  0  0  ...  3  1  1  0  0  0  0  2  0  42
3        0  0  0  0  0  0  0  0  0  0  ...  4  1  0  0  0  0  0  2  0   0
4        0  0  0  0  0  0  0  0  0  0  ...  0  0  0  1  0  0  0  1  0   0
5        0  2  0  0  0  0  0  0  0  0  ...  0  0  0  0  0  0  0  0  0   3

5 rows × 134 columns
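
Each row of this crosstab is one user and each column is one aisle; the value is how many times that user bought a product from that aisle. If groupby feels more natural than crosstab, the same user × aisle count matrix can be built as in this sketch (an equivalent alternative, not part of the original workflow):

# Equivalent construction of the user x aisle purchase-count matrix via groupby,
# using the merged `table` built above
counts = table.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)
print(counts.shape)  # should match data.shape: (206209, 134)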

# Take only the first 1,000 users to keep the demo fast
new_data = data[:1000]

Feature Engineering – PCA

# Feature engineering - PCA: keep enough components to retain 90% of the variance
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)
trans_data.shape

(1000, 22)
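
With n_components=0.9, PCA keeps the smallest number of components whose explained variance adds up to at least 90%; here that turns out to be 22. An optional sanity check (a sketch, not in the original):

# How many components were kept and how much variance they jointly explain
print(transfer.n_components_)                    # 22
print(transfer.explained_variance_ratio_.sum())  # >= 0.9 by construction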

Machine learning (K-means)

# Cluster the PCA-reduced user vectors into 5 groups
estimator = KMeans(n_clusters=5)
y_pre = estimator.fit_predict(trans_data)
y_pre
array([2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2,
       0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 0, 0,
       0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 2, 0, 2, 1, 2, 0, 2, 2, 0, 3, 2, 2, 2, 1, 2, 0, 2, 2, 0, 0,
       0, 0, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 1,
       2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 0, 2, 2,
       2, 2, 3, 4, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2,
       0, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
       2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 3, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 3, 0, 2, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 3, 2, 2, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 3, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 0,
       2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 2, 2, 0, 1, 2, 2, 1, 0, 2, 0, 0, 2, 1, 2, 0, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 2, 0, 2, 2, 2,
       0, 0, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2,
       2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2,
       2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 1, 0, 2, 2, 2, 1, 2, 3, 0, 2, 0, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 2, 1, 2, 3, 2, 2, 2, 0, 2, 2, 2, 2, 1, 1,
       0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 2, 2, 2, 1,
       0, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 0, 1, 2], dtype=int32)
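Before evaluating, it can help to see how many of the 1,000 users fall into each cluster. A minimal sketch (not in the original):

# How many of the 1,000 users landed in each cluster
print(pd.Series(y_pre).value_counts().sort_index())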
# Model evaluation: mean silhouette coefficient
silhouette_score(trans_data, y_pre)

0.4793021644455867

sklearn.metrics.silhouette_score(X, labels)

  • Computes the mean silhouette coefficient over all samples
  • X: the feature matrix (here, the PCA-reduced data)
  • labels: the cluster label assigned to each sample by the clustering model
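
The choice of n_clusters=5 above is not the only reasonable one. A small sketch for comparing a few values of k by their mean silhouette coefficient on the same PCA-reduced data (random_state is an added assumption, used only for reproducibility):

# Compare several cluster counts on the PCA-reduced data; higher silhouette is better
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=22)
    labels = km.fit_predict(trans_data)
    print(k, silhouette_score(trans_data, labels))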