Practical data analysis | Association rules analysis – shopping cart analysis

Table of Contents

1. Data and analysis objects

2. Purpose and analysis tasks

3. Methods and Tools

4. Data reading

5. Data understanding

6. Data preprocessing

7. Generate frequent itemsets

8. Calculate association rules

9. Visualization


1. Data and analysis objects

Dataset link: Online Retail.xlsx

This dataset records 541,909 online retail transaction records from December 1, 2010 to December 9, 2011, and contains the following 8 attributes:

(1) InvoiceNo: order number, represented by a 6-digit integer; return order numbers start with the letter "C";

(2) StockCode: product code; each distinct product is represented by a unique 5-digit integer;

(3) Description: product description;

(4) Quantity: product quantity, i.e., the quantity of each product in each transaction;

(5) InvoiceDate: order date and time, indicating when each transaction was generated;

(6) UnitPrice: unit price, the price of each product in pounds sterling;

(7) CustomerID: customer number; each customer is represented by a unique 5-digit integer;

(8) Country: country name, the name of the country/region where each customer is located.

2. Purpose and Analysis Tasks

The purpose is to understand a concrete application of the Apriori algorithm:

(1) Calculate the frequent itemsets of products purchased by German customers with a minimum support of 0.07.

(2) Calculate the association rules between products purchased by German customers, with a minimum confidence of 0.8 and a lift of no less than 2.

3. Methods and Tools

Third-party Python toolkits that implement the Apriori algorithm include mlxtend, kiwi-apriori, apyori, apriori_python, efficient-apriori, etc. The more commonly used ones are mlxtend, apriori_python, and efficient-apriori. This project uses the mlxtend package.
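Before touching the real data, the basic mlxtend workflow can be sketched on a few made-up transactions (the item names below are purely illustrative):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Three toy shopping carts
transactions = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
# One-hot encode the transactions into a True/False data frame
te = TransactionEncoder()
df_demo = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
# Frequent itemsets with support >= 0.5, then rules with lift >= 1
freq = apriori(df_demo, min_support=0.5, use_colnames=True)
rules = association_rules(freq, metric="lift", min_threshold=1)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])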

4. Data reading

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Use a raw string (r"...") so the backslashes in the Windows path are not
# treated as escape sequences
df_Retails = pd.read_excel(r"C:\Users\LEGION\AppData\Local\Temp\360zip$Temp\360$0\Online Retail.xlsx")
df_Retails.head()
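If read_excel() raises an error about the Excel engine, the openpyxl package needs to be installed for .xlsx files; it can also be specified explicitly (the relative path below assumes the file sits next to the script):

# pip install openpyxl
df_Retails = pd.read_excel("Online Retail.xlsx", engine="openpyxl")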

5. Data Understanding

Call the shape attribute to view the shape of the data frame df_Retails.

df_Retails.shape

View the column names:

df_Retails.columns
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

Perform exploratory analysis on the data frame df_Retails.

df_Retails.describe()

Here count, mean, std, min, 25%, 50%, 75% and max denote the count, mean, standard deviation, minimum, lower quartile, median, upper quartile and maximum, respectively.

In addition to the describe() method, the info() method can be called to view a summary of the sample data:

df_Retails.info()

As can be seen from the output results, there are missing values in the Description and CustomerID columns of the data frame df_Retails.

Look at the country column:

df_Retails.Country.unique()
array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)
df_Retails["Country"].unique()
array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

View the number of purchase records for each country:

df_Retails["Country"].value_counts()
United Kingdom 495478
Germany 9495
France 8557
EIRE 8196
Spain 2533
Netherlands 2371
Belgium 2069
Switzerland 2002
Portugal 1519
Australia 1259
Norway 1086
Italy 803
Channel Islands 758
Finland 695
Cyprus 622
Sweden 462
Unspecified 446
Austria 401
Denmark 389
Japan 358
Poland 341
Israel 297
USA 291
Hong Kong 288
Singapore 229
Iceland 182
Canada 151
Greece 146
Malta 127
United Arab Emirates 68
European Community 61
RSA 58
Lebanon 45
Lithuania 35
Brazil 32
Czech Republic 30
Bahrain 19
Saudi Arabia 10
Name: Country, dtype: int64

It can be seen that customers in the United Kingdom account for the largest number of purchase records, 495,478, followed by customers in Germany with 9,495 records.
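As a quick optional visualization (assuming matplotlib is installed), the top countries can be plotted as a bar chart:

import matplotlib.pyplot as plt

# Bar chart of the ten countries with the most purchase records
df_Retails["Country"].value_counts().head(10).plot(kind="bar")
plt.ylabel("Number of purchase records")
plt.tight_layout()
plt.show()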

Check whether there are duplicate values in the order number (InvoiceNo) column.

df_Retails.duplicated(subset=["InvoiceNo"]).any()
True

Repeated order numbers indicate that the same order contains multiple products purchased together, which matches the data requirements of the Apriori algorithm.
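To get a feel for this, the number of rows can be compared with the number of distinct orders (a small illustrative check; the exact counts are not reproduced here):

# Many more rows than distinct invoices means multi-product orders
print(len(df_Retails), df_Retails["InvoiceNo"].nunique())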

6. Data preprocessing

Check if there are any missing values in the data.

df_Retails.isna().sum()
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080
Country 0
dtype: int64

It can be seen that there are 1454 missing values for Description and 135080 missing values for CustomerID.

Remove the whitespace at the beginning and end of the strings in the product name (Description) column:

df_Retails['Description']=df_Retails['Description'].str.strip()

Look at the dataset shape again:

df_Retails.shape
(541909, 8)

Check the number of missing values in the product name (Description) column:

df_Retails['Description'].isna().sum()
1455

After stripping the whitespace in the product name (Description) column, one more missing value appeared.
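A plausible explanation, assuming the column contains a stray non-string value, is that pandas .str methods return NaN for non-string elements, as this toy example shows:

# .str.strip() keeps strings but turns non-string elements into NaN
s = pd.Series([" WHITE MUG ", 12345])
print(s.str.strip())

Now remove all missing values: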

df_Retails.dropna(axis=0,subset=['Description'],inplace=True)

Look at the dataset shape again:

df_Retails.shape
(540454, 8)

Check whether there are still missing values in the product name (Description) column:

df_Retails['Description'].isna().sum()
0

It can be seen that all missing values in the product name (Description) column in the data frame df_Retails have been deleted.

Since the returned orders begin with the letter "C", remove these cancelled orders:

df_Retails['InvoiceNo']=df_Retails['InvoiceNo'].astype('str')
df_Retails=df_Retails[~df_Retails['InvoiceNo'].str.contains('C')]
df_Retails.shape
(531166, 8)

Next, reshape the data so that each row represents one order. Considering memory limits, and given that Germany ranks second in the number of purchase records, this project calculates frequent itemsets and association rules only for the goods purchased by German customers; performing the calculation over the full dataset would be too expensive.

df_ShoppingCarts=(df_Retails[df_Retails['Country']=="Germany"].groupby(['InvoiceNo','Description'])['Quantity'].sum().unstack()
                 .reset_index().fillna(0).set_index('InvoiceNo'))
df_ShoppingCarts.shape
(457, 1695)
df_ShoppingCarts.head()

There are 457 shopping records in Germany, containing a total of 1,695 different products.
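The groupby/unstack step is easier to read on a toy frame (the order numbers and items below are made up):

toy = pd.DataFrame({"InvoiceNo": ["A1", "A1", "A2"],
                    "Description": ["MUG", "PEN", "MUG"],
                    "Quantity": [2, 1, 3]})
# One row per order, one column per product, total quantity as the value, 0 where absent
print(toy.groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().fillna(0))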

Since the df parameter of the apriori() method only accepts values of 0/1 or True/False, convert the quantities in the data frame to 0/1 form, i.e., into a format the model accepts, so that frequent itemsets and association rules can then be calculated.

def encode_units(x):
    # Map each quantity to 0/1: 0 if nothing was bought, 1 if at least one unit was
    if x <= 0:
        return 0
    if x >= 1:
        return 1

# Apply the encoding to every cell of the shopping-cart table
df_ShoppingCarts_sets = df_ShoppingCarts.applymap(encode_units)
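An equivalent vectorized alternative, which also yields the boolean form that newer mlxtend versions prefer, is:

# True where at least one unit was bought, False otherwise
df_ShoppingCarts_sets = df_ShoppingCarts > 0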

7. Generate frequent itemsets

The apriori() method of mlxtend.frequent_patterns calculates frequent itemsets; set the minimum support to 0.07:

df_Frequent_Itemsets=apriori(df_ShoppingCarts_sets,min_support=0.07,use_colnames=True)
df_Frequent_Itemsets

View the shape of the data frame df_Frequent_Itemsets:

df_Frequent_Itemsets.shape
(39, 2)

It can be seen that there are 39 frequent itemsets that meet the minimum support of 0.07.
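To inspect the most frequent itemsets first, the result can be sorted by support:

# The ten itemsets with the highest support
df_Frequent_Itemsets.sort_values("support", ascending=False).head(10)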

8. Calculate association rules

Use lift as the metric for calculating association rules and set the threshold to 1, i.e., keep only the rules whose lift is at least 1 (positive association). This task is implemented by the association_rules() method of mlxtend.frequent_patterns:

df_AssociationRules=association_rules(df_Frequent_Itemsets,metric="lift",min_threshold=1)
df_AssociationRules

The result shows the detailed information of each association rule.
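Task (2) asks for rules with a confidence of at least 0.8 and a lift of at least 2; these can be filtered from the rule table:

# Keep only the strong, positively associated rules from task (2)
df_Rules_Filtered = df_AssociationRules[(df_AssociationRules["confidence"] >= 0.8) &
                                        (df_AssociationRules["lift"] >= 2)]
df_Rules_Filtered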

Taking the first association rule, {6 RIBBONS RUSTIC CHARM} -> {POSTAGE}, as an example: the support of {6 RIBBONS RUSTIC CHARM} is 0.102845, the support of {POSTAGE} is 0.818381, and the itemset {