Table of Contents
1. Data and analysis objects
2. Purpose and analysis tasks
3. Methods and Tools
4. Data reading
5. Data understanding
6. Data preprocessing
7. Generate frequent itemsets
8. Calculate association rules
9. Visualization
1. Data and analysis objects
Dataset link: Online Retail.xlsx
This dataset records 541,909 online retail transactions from December 1, 2010 to December 9, 2011, and contains the following 8 attributes:
(1) InvoiceNo: order number, a 6-digit integer; return (cancelled) order numbers start with the letter "C";
(2) StockCode: product number; each distinct product is represented by a unique 5-digit integer;
(3) Description: product description;
(4) Quantity: product quantity, the quantity of each product in each transaction;
(5) InvoiceDate: order date and time, the date and time at which each transaction was generated;
(6) UnitPrice: unit price of each product, in pounds sterling;
(7) CustomerID: customer number; each customer is represented by a unique 5-digit integer;
(8) Country: country name, the name of the country/region where each customer is located.
2. Purpose and Analysis Tasks
Understand a concrete application of the Apriori algorithm:
(1) Compute the frequent itemsets of products purchased by German customers, with a minimum support of 0.07.
(2) Compute association rules between products purchased by German customers, with a minimum confidence of 0.8 and a lift of at least 2.
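The three metrics used in these tasks (support, confidence, lift) can be illustrated on a toy example. The baskets below are hypothetical and not taken from the real dataset:

```python
# support(X)        = fraction of baskets that contain all items of X
# confidence(X->Y)  = support(X and Y) / support(X)
# lift(X->Y)        = confidence(X->Y) / support(Y)
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset):
    # fraction of baskets of which `itemset` is a subset
    return sum(itemset <= b for b in baskets) / len(baskets)

sup_bread = support({"bread"})           # 3 of 4 baskets -> 0.75
sup_milk = support({"milk"})             # 3 of 4 baskets -> 0.75
sup_both = support({"bread", "milk"})    # 2 of 4 baskets -> 0.50
confidence = sup_both / sup_bread        # 0.50 / 0.75 = 2/3
lift = confidence / sup_milk             # (2/3) / 0.75 = 8/9
```

A lift below 1, as here, means buying bread makes buying milk slightly less likely than its base rate; the task above keeps only rules with lift of at least 2.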
3. Methods and Tools
Third-party Python toolkits that implement the Apriori algorithm include mlxtend, kiwi-apriori, apyori, apriori_python, efficient-apriori, etc. The more commonly used ones are mlxtend, apriori_python, and efficient-apriori. This project uses the mlxtend package.
4. Data reading
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
df_Retails = pd.read_excel(r"C:\Users\LEGION\AppData\Local\Temp\360zip$Temp\360$0\Online Retail.xlsx")  # raw string so the backslashes are not treated as escape sequences
df_Retails.head()
5. Data Understanding
Call the shape attribute to view the shape of the data frame df_Retails.
df_Retails.shape
View column names
df_Retails.columns
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country'], dtype='object')
Perform exploratory analysis on the data frame df_Retails.
df_Retails.describe()
Here count, mean, std, min, 25%, 50%, 75% and max denote the number of non-null values, mean, standard deviation, minimum, lower quartile, median, upper quartile and maximum, respectively.
In addition to the describe() method, you can also call the info() method to view an overview of the relevant information of the sample data:
df_Retails.info()
As can be seen from the output results, there are missing values in the Description and CustomerID columns of the data frame df_Retails.
Look at the country column:
df_Retails.Country.unique()
array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany', 'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal', 'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland', 'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria', 'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore', 'Lebanon', 'United Arab Emirates', 'Saudi Arabia', 'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA', 'European Community', 'Malta', 'RSA'], dtype=object)
Check out the number of purchases by country:
df_Retails["Country"].value_counts()
United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon                     45
Lithuania                   35
Brazil                      32
Czech Republic              30
Bahrain                     19
Saudi Arabia                10
Name: Country, dtype: int64
It can be seen that customers in the United Kingdom purchased the largest number of goods, with 495,478 records, followed by customers in Germany, with 9,495 records.
Check whether there are duplicate values in the order number (InvoiceNo) column.
df_Retails.duplicated(subset=["InvoiceNo"]).any()
True
A repeated order number means that several products were purchased in the same order, which matches the transaction format required by the Apriori algorithm.
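The duplicated() check above can be illustrated on a toy frame (the invoice numbers and stock codes here are hypothetical):

```python
import pandas as pd

# a toy frame where invoice "A1" appears twice, i.e. one order with two line items
toy = pd.DataFrame({"InvoiceNo": ["A1", "A1", "A2"],
                    "StockCode": ["10001", "10002", "10001"]})

# duplicated(subset=...) flags rows whose InvoiceNo was already seen;
# any() asks whether at least one such repeat exists
has_multi_item_orders = bool(toy.duplicated(subset=["InvoiceNo"]).any())
# has_multi_item_orders is True
```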
6. Data preprocessing
Check if there are any missing values in the data.
df_Retails.isna().sum()
InvoiceNo 0 StockCode 0 Description 1454 Quantity 0 InvoiceDate 0 UnitPrice 0 CustomerID 135080 Country 0 dtype: int64
It can be seen that there are 1454 missing values for Description and 135080 missing values for CustomerID.
Delete the blank characters at the beginning and end of the string in the product name (Description) column:
df_Retails['Description']=df_Retails['Description'].str.strip()
Look at the dataset shape again:
df_Retails.shape
(541909, 8)
Check the number of missing values in the product name (Description) column:
df_Retails['Description'].isna().sum()
1455
After stripping whitespace in the product name (Description) column, the number of missing values increased by one. Remove all rows with a missing Description:
df_Retails.dropna(axis=0,subset=['Description'],inplace=True)
Look at the dataset shape again:
df_Retails.shape
(540454, 8)
Check if there are any missing values in the data set at this time:
df_Retails['Description'].isna().sum()
0
It can be seen that all missing values in the product name (Description) column in the data frame df_Retails have been deleted.
Since returned orders begin with the letter "C", remove the cancelled orders whose InvoiceNo contains the letter "C":
df_Retails['InvoiceNo'] = df_Retails['InvoiceNo'].astype('str')
df_Retails = df_Retails[~df_Retails['InvoiceNo'].str.contains('C')]
df_Retails.shape
(531166, 8)
Next, reshape the data so that each row is one transaction. Given the memory limit, and since Germany ranks second in number of purchase records, this project computes frequent itemsets and association rules only for the goods purchased by German customers; computing them for all countries would be too expensive.
df_ShoppingCarts = (df_Retails[df_Retails['Country'] == "Germany"]
                    .groupby(['InvoiceNo', 'Description'])['Quantity']
                    .sum().unstack().reset_index()
                    .fillna(0).set_index('InvoiceNo'))
df_ShoppingCarts.shape
(457, 1695)
df_ShoppingCarts.head()
There are 457 shopping records in Germany, containing a total of 1,695 different products.
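The groupby/unstack pivot used above can be sketched on a toy frame. The invoice numbers and product names below are hypothetical, not taken from the real dataset:

```python
import pandas as pd

# two invoices; "A1" contains two products, "A2" contains one
toy = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A2"],
    "Description": ["MUG", "PEN", "MUG"],
    "Quantity":    [2, 1, 3],
})

# sum quantities per (invoice, product), then unstack products into columns;
# missing (invoice, product) combinations become NaN and are filled with 0
basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
             .sum().unstack().reset_index()
             .fillna(0).set_index("InvoiceNo"))
# basket now has one row per invoice and one column per product
```

This is the same shape as df_ShoppingCarts: rows are orders, columns are products, cell values are quantities.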
Check whether there are duplicate values in the order number (InvoiceNo) column:
df_Retails.duplicated(subset=["InvoiceNo"]).any()
True
A repeated order number means that several products were purchased in the same order, which matches the transaction format required by the Apriori algorithm. Since the apriori() method expects its input data frame to contain only 0/1 (or True/False) values, the quantities are converted to 0/1 form, i.e. into a format the model accepts, before computing frequent itemsets and association rules.
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

df_ShoppingCarts_sets = df_ShoppingCarts.applymap(encode_units)
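An equivalent vectorized alternative is to compare the whole frame against zero and cast the resulting booleans to integers. This is a sketch on a toy frame (hypothetical values), assuming all quantities are numeric:

```python
import pandas as pd

# toy quantity matrix: rows are orders, columns are products
df = pd.DataFrame({"MUG": [2, 0], "PEN": [0, 5]}, index=["A1", "A2"])

# (df > 0) yields True/False; astype(int) maps True -> 1, False -> 0
encoded = (df > 0).astype(int)
```

This avoids the element-wise applymap call and tends to be faster on large frames.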
7. Generate frequent itemsets
The apriori() method of mlxtend.frequent_patterns computes frequent itemsets; set the minimum support to 0.07:
df_Frequent_Itemsets = apriori(df_ShoppingCarts_sets, min_support=0.07, use_colnames=True)
df_Frequent_Itemsets
View the shape of the data frame df_Frequent_Itemsets:
df_Frequent_Itemsets.shape
(39, 2)
It can be seen that there are 39 frequent itemsets that meet the minimum support of 0.07.
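A common follow-up is to inspect the frequent itemsets by size, e.g. to pull out only the 2-itemsets. The sketch below builds a small mock of the apriori() output (the support values are hypothetical stand-ins, not the real results):

```python
import pandas as pd

# mock of mlxtend's apriori() output: a 'support' column and an
# 'itemsets' column of frozensets (values here are hypothetical)
df_freq = pd.DataFrame({
    "support": [0.82, 0.10, 0.08],
    "itemsets": [frozenset({"POSTAGE"}),
                 frozenset({"6 RIBBONS RUSTIC CHARM"}),
                 frozenset({"6 RIBBONS RUSTIC CHARM", "POSTAGE"})],
})

# add an itemset-length column, then filter to itemsets of size 2
df_freq["length"] = df_freq["itemsets"].apply(len)
pairs = df_freq[df_freq["length"] == 2]
```

The same two lines work unchanged on the real df_Frequent_Itemsets frame.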
8. Calculate association rules
Use lift as the metric for generating association rules and set the threshold to 1, i.e. keep only positively correlated rules. This task is implemented by the association_rules() method of mlxtend.frequent_patterns:
df_AssociationRules = association_rules(df_Frequent_Itemsets, metric="lift", min_threshold=1)
df_AssociationRules
The results show the detailed information of each association rule.
Taking the first association rule {6 RIBBONS RUSTIC CHARM} -> {POSTAGE} as an example: the support of {6 RIBBONS RUSTIC CHARM} is 0.102845, the support of {POSTAGE} is 0.818381, and the itemset {