Clustering Algorithm Case Study: Segmenting Users by Item-Category Preference

Directory

  • Goal
  • Import module
  • Get data
  • View data
  • Basic data processing
    • Merge tables
    • Crosstab merge
  • Feature Engineering – PCA
  • Machine learning (K-means)

Goal

Use PCA and K-means to segment users according to their preferences for item categories (aisles).

Import module

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Get data

# Get data
order_product = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

The data files are as follows (a quick check of their shapes and join keys appears after this list):

  • order_products__prior.csv: order and product information
    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: customers' order information
    • Fields: order_id, user_id, eval_set, order_number, …
  • aisles.csv: the item category (aisle) each product belongs to
    • Fields: aisle_id, aisle
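
The tables link together through their id columns: order_products__prior joins products on product_id, joins orders on order_id, and products joins aisles on aisle_id, which is exactly what the merges below rely on. A minimal sketch (not part of the original notebook) to confirm the shapes and columns of what was just loaded:

# Print each frame's shape and columns to confirm the join keys are present
for name, df in [("order_product", order_product), ("products", products),
                 ("orders", orders), ("aisles", aisles)]:
    print(name, df.shape, list(df.columns))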

View data

order_product.head()
   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
products.head()
   product_id                                        product_name  aisle_id  department_id
0           1                          Chocolate Sandwich Cookies        61             19
1           2                                    All-Seasons Salt       104             13
2           3                Robust Golden Unsweetened Oolong Tea        94              7
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38              1
4           5                           Green Chile Anytime Sauce         5             13
orders.head()
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0
3   2254736        1    prior             4          4                  7                    29.0
4    431534        1    prior             5          4                 15                    28.0
aisles.head()
   aisle_id                       aisle
0         1       prepared soups salads
1         2           specialty cheeses
2         3         energy granola bars
3         4               instant foods
4         5  marinades meat preparation

Basic data processing

Merge tables

# Basic data processing
# Merge tables
table1 = pd.merge(order_product, products, on="product_id")
table1
          order_id  product_id  add_to_cart_order  reordered                    product_name  aisle_id  department_id
0                2       33120                  1          1              Organic Egg Whites        86             16
1               26       33120                  5          0              Organic Egg Whites        86             16
2              120       33120                 13          0              Organic Egg Whites        86             16
3              327       33120                  5          1              Organic Egg Whites        86             16
4              390       33120                 28          1              Organic Egg Whites        86             16
...            ...         ...                ...        ...                             ...       ...            ...
32434484   3265099       43492                  3          0        Gourmet Burger Seasoning       104             13
32434485   3361945       43492                 19          0        Gourmet Burger Seasoning       104             13
32434486   3267201       33097                  2          0  Piquillo & Jalapeno Bruschetta        81             15
32434487   3393151       38977                 32          0                  Original Jerky       100             21
32434488   3400803       23624                  7          0     Flatbread Pizza All Natural        79              1

32434489 rows × 7 columns

table2 = pd.merge(table1, orders, on="order_id")
table = pd.merge(table2, aisles, on="aisle_id")
table.shape

(32434489, 14)

table.head()
   order_id  product_id  add_to_cart_order  reordered        product_name  aisle_id  department_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order aisle
0         2       33120                  1          1  Organic Egg Whites        86             16   202279    prior             3          5                  9                     8.0  eggs
1        26       33120                  5          0  Organic Egg Whites        86             16   153404    prior             2          0                 16                     7.0  eggs
2       120       33120                 13          0  Organic Egg Whites        86             16    23750    prior            11          6                  8                    10.0  eggs
3       327       33120                  5          1  Organic Egg Whites        86             16    58707    prior            21          6                  9                     8.0  eggs
4       390       33120                 28          1  Organic Egg Whites        86             16   166654    prior            48          0                 12                     9.0  eggs

Crosstab merge

# Crosstab merge
data = pd.crosstab(table["user_id"], table["aisle"])
data.shape

(206209, 134)

data.head()
aisle    air fresheners candles  asian foods  baby accessories  baby bath body care  baby food formula  bakery desserts  baking ingredients  baking supplies decor  beauty  beers coolers  ...  spreads  tea  tofu meat alternatives  tortillas flat bread  trail mix snack mix  trash bags liners  vitamins supplements  water seltzer sparkling water  white wines  yogurt
user_id
1        0  0  0  0  0  0  0  0  0  0  ...  1  0  0  0  0  0  0  0  0   1
2        0  3  0  0  0  0  2  0  0  0  ...  3  1  1  0  0  0  0  2  0  42
3        0  0  0  0  0  0  0  0  0  0  ...  4  1  0  0  0  0  0  2  0   0
4        0  0  0  0  0  0  0  0  0  0  ...  0  0  0  1  0  0  0  1  0   0
5        0  2  0  0  0  0  0  0  0  0  ...  0  0  0  0  0  0  0  0  0   3

5 rows × 134 columns
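
Each row of this crosstab is one user and each column is one aisle; the value is how many times that user bought a product from that aisle. If groupby feels more natural than crosstab, the same user × aisle count matrix can be built as in this sketch (an equivalent alternative, not part of the original workflow):

# Equivalent construction of the user x aisle purchase-count matrix via groupby,
# using the merged `table` built above
counts = table.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)
print(counts.shape)  # should match data.shape: (206209, 134)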

# Take only the first 1,000 users to keep the demo fast
new_data = data[:1000]

Feature Engineering – PCA

# Feature engineering - PCA: keep enough components to retain 90% of the variance
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)
trans_data.shape

(1000, 22)
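
With n_components=0.9, PCA keeps the smallest number of components whose explained variance adds up to at least 90%; here that turns out to be 22. An optional sanity check (a sketch, not in the original):

# How many components were kept and how much variance they jointly explain
print(transfer.n_components_)                    # 22
print(transfer.explained_variance_ratio_.sum())  # >= 0.9 by construction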

Machine learning (K-means)

# Cluster the PCA-reduced user vectors into 5 groups
estimator = KMeans(n_clusters=5)
y_pre = estimator.fit_predict(trans_data)
y_pre
array([2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2,
       0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 0, 0,
       0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 2, 0, 2, 1, 2, 0, 2, 2, 0, 3, 2, 2, 2, 1, 2, 0, 2, 2, 0, 0,
       0, 0, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 1,
       2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 0, 2, 2,
       2, 2, 3, 4, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2,
       0, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
       2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 3, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 3, 0, 2, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 3, 2, 2, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 3, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 0,
       2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 2, 2, 0, 1, 2, 2, 1, 0, 2, 0, 0, 2, 1, 2, 0, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 2, 0, 2, 2, 2,
       0, 0, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2,
       2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2,
       2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 1, 0, 2, 2, 2, 1, 2, 3, 0, 2, 0, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 2, 1, 2, 3, 2, 2, 2, 0, 2, 2, 2, 2, 1, 1,
       0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 2, 2, 2, 1,
       0, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 0, 1, 2], dtype=int32)
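Before evaluating, it can help to see how many of the 1,000 users fall into each cluster. A minimal sketch (not in the original):

# How many of the 1,000 users landed in each cluster
print(pd.Series(y_pre).value_counts().sort_index())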
# Model evaluation: mean silhouette coefficient
silhouette_score(trans_data, y_pre)

0.4793021644455867

sklearn.metrics.silhouette_score(X, labels)

  • Computes the mean silhouette coefficient over all samples
  • X: the feature matrix (here, the PCA-reduced data)
  • labels: the cluster label assigned to each sample by the clustering model
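
The choice of n_clusters=5 above is not the only reasonable one. A small sketch for comparing a few values of k by their mean silhouette coefficient on the same PCA-reduced data (random_state is an added assumption, used only for reproducibility):

# Compare several cluster counts on the PCA-reduced data; higher silhouette is better
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=22)
    labels = km.fit_predict(trans_data)
    print(k, silhouette_score(trans_data, labels))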