Grocery Market Basket Analysis with Apriori Algorithm
Market basket analysis is a data analysis technique commonly used by retail companies to identify purchasing patterns and relationships between products that customers buy together. It helps businesses cross-sell products, improve marketing strategies, understand customer preferences, and spot purchasing trends.
In this article, I will walk through the process of market basket analysis using Python. Below is the process I use:
- Gather transactional data, such as purchase history, shopping carts, or invoices. I found an ideal dataset for this task; you can download the dataset from Here.
- Perform data preprocessing and analyze sales and product trends.
- Use the Apriori algorithm to discover frequent itemsets and generate association rules.
- Interpret the discovered association rules into usable insights, and design strategies based on the analysis results.
Gathering Data
We have grocery market transaction data consisting of 38765 rows and 3 columns. The dataset contains no missing values, but it does contain 759 duplicated rows, which we need to delete in the preprocessing step. We also need to convert the ‘Date’ column to the DateTime data type.
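The checks above can be sketched as follows. The tiny DataFrame below stands in for the real file; in practice the data would be loaded with something like pd.read_csv("Groceries_dataset.csv"), where the file name is an assumption:

```python
import pandas as pd

# Stand-in for the real dataset: same three columns
# (Member_number, Date, itemDescription), with one deliberate duplicate row.
data = pd.DataFrame({
    "Member_number": [1808, 2552, 2300, 1808],
    "Date": ["21-07-2015", "05-01-2015", "19-09-2015", "21-07-2015"],
    "itemDescription": ["tropical fruit", "whole milk", "pip fruit", "tropical fruit"],
})

print(data.isnull().sum().sum())   # 0 missing values
print(data.duplicated().sum())     # 1 duplicated row
```

On the real dataset, the same two calls report 0 missing values and 759 duplicates.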
Data Preprocessing
# drop duplicate
data.drop_duplicates(inplace=True)
After dropping the duplicates, the next step is to convert the Date column to the DateTime data type. We then create new columns, namely year, month, day, and day name, to analyze sales and product trends.
# Convert 'Date' data type
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y')
# Add 'year', 'month', and 'day' columns
data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
data['day_name'] = data['Date'].dt.day_name()
Exploratory Data Analysis
10 items with the highest sales
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.barplot(x=data['itemDescription'].value_counts()[:10].index,
            y=data['itemDescription'].value_counts()[:10].values)
plt.xticks(size=12, rotation=45)
plt.title('10 Best Selling Items')
plt.show()
Number of transactions each month
# Number of transactions each month
plt.figure(figsize=(12,6))
sns.barplot(x=data['month'].value_counts().index,
            y=data['month'].value_counts().values)
plt.xticks(size=12)
plt.title('Number of Transactions Each Month')
plt.show()
Number of transactions each day
# Number of transactions each day
plt.figure(figsize=(12,6))
sns.barplot(x=data['day_name'].value_counts().index,
            y=data['day_name'].value_counts().values)
plt.xticks(size=12)
plt.title('Number of Transactions Each Day')
plt.show()
From the exploratory data analysis above, we can see that:
- Whole milk is the item with the highest sales, followed by other vegetables and rolls/buns.
- The number of sales varies greatly from month to month; the highest sales are in August and the lowest in February.
- The difference in the number of sales across days is very slight; the highest sales are on Thursday and the lowest on Monday.
Data Preparation
Before we carry out modeling with the Apriori algorithm, we must first prepare the data. At this stage, we group the data and store it in the ‘item_count’ variable, so that each row tells us the member number, date, item description, and number of items purchased.
item_count = data.groupby(['Member_number', 'Date', 'itemDescription'])['itemDescription'].count().reset_index(name='Count')
item_count.head(10)
Next, we will create a Transaction column that assigns a unique transaction number to each combination of member and date: all items a member buys on the same day share one transaction number, and every new member/day combination receives the next number, so the transaction value always increases.
item_count['Transaction'] = item_count.groupby(['Member_number', 'Date'])['Date'].ngroup()+1
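The numbering behavior of ngroup() can be seen on a tiny, made-up example: the two purchases by member 1000 on the same date share one transaction ID, while the other member/date combinations each get their own:

```python
import pandas as pd

# Toy version of item_count: four purchase rows across three (member, date) pairs.
item_count = pd.DataFrame({
    "Member_number": [1000, 1000, 1000, 2000],
    "Date": ["2015-01-01", "2015-01-01", "2015-02-01", "2015-01-01"],
    "itemDescription": ["whole milk", "yogurt", "rolls/buns", "whole milk"],
    "Count": [1, 1, 1, 1],
})

# Every unique (Member_number, Date) pair becomes one transaction number.
item_count["Transaction"] = item_count.groupby(["Member_number", "Date"])["Date"].ngroup() + 1
print(item_count["Transaction"].tolist())  # [1, 1, 2, 3]
```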
Next, we pivot the data so that we can analyze the relationships between items with the Apriori algorithm. Each row of the pivot is a transaction and each column is an item; a cell holds the number of times that item appears in the transaction, and 0 if it was not purchased.
item_count_pivot = item_count.pivot_table(index='Transaction', columns='itemDescription', values='Count', aggfunc='sum').fillna(0)
The pivoted values are floats (fillna produces values such as 0.0), so we convert them to the int data type.
# convert float counts to integers
item_count_pivot = item_count_pivot.astype('int32')
Apriori Modelling
First, we import the required functions, namely ‘association_rules’ and ‘apriori’ from mlxtend. Next, we create a basket variable to hold the grouped data in list form, with one list of items per transaction. After that, the basket_encoded variable holds the one-hot encoded dataset that will be analyzed with Apriori.
from mlxtend.frequent_patterns import association_rules, apriori
basket = item_count.groupby('Transaction')['itemDescription'].apply(list).reset_index()
basket_encoded = basket['itemDescription'].str.join('|').str.get_dummies('|')
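The encoding step above is easiest to see on a toy basket: each transaction's item list is joined with '|' and then expanded by get_dummies into 0/1 indicator columns, one per item:

```python
import pandas as pd

# Two toy transactions: the first contains two items, the second one item.
basket = pd.DataFrame({
    "Transaction": [1, 2],
    "itemDescription": [["whole milk", "yogurt"], ["whole milk"]],
})

# Join each list into "item1|item2|..." and expand into indicator columns.
encoded = basket["itemDescription"].str.join("|").str.get_dummies("|")
print(encoded)
#    whole milk  yogurt
# 0           1       1
# 1           1       0
```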
Next, we will look at the frequent itemsets.
frequent_itemsets = apriori(basket_encoded, min_support=0.01, use_colnames=True)
frequent_itemsets.sort_values('support', ascending=False).head(10)
Now, let’s use the Apriori algorithm to create association rules. The Apriori algorithm is used to discover frequent item sets in large transactional datasets. It aims to identify items that are frequently purchased together in transactional data. It helps uncover patterns in customer behavior, allowing businesses to make informed decisions about product placement, promotions, and marketing. Here’s how to implement Apriori to generate association rules:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=0.5)
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values('confidence', ascending=False).head(10)
Finally, association_rules turns the frequent itemsets into association rules that reveal which items tend to be purchased together. The following is an explanation of the columns above:
- Antecedents: the items that form the starting point or “if” part of the association rule. For example, yogurt, rolls/buns, and other vegetables appear as antecedents in this analysis.
- Consequents: These are the items that tend to be purchased along with the antecedents or the “then” part of the association rule.
- Support: Support measures how frequently a particular combination of items (both antecedents and consequents) appears in the dataset. It is essentially the proportion of transactions in which the items are bought together. For example, the first rule indicates that yogurt and whole milk are bought together in approximately 1.11% of all transactions.
- Confidence: Confidence quantifies the likelihood of the consequent item being purchased when the antecedent item is already in the basket. In other words, it shows the probability of buying the consequent item when the antecedent item is bought. For example, the first rule tells us that there is a 12.99% chance of buying whole milk when yogurt is already in the basket.
- Lift: Lift measures how much more likely two items are to be purchased together than would be expected if they were purchased independently. For the first rule, the lift value is 0.822940, which is less than 1, indicating that “yogurt” and “whole milk” are less likely to occur together than independence would predict. This may indicate a negative or insignificant relationship between these two items.
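All three metrics can be computed by hand from transaction counts, which makes the definitions concrete. The counts below are illustrative, chosen to roughly match the first rule, not taken from the real dataset:

```python
# Illustrative counts (assumptions, not the real dataset's numbers).
n_transactions = 10000   # total transactions
n_yogurt = 860           # transactions containing yogurt
n_milk = 1580            # transactions containing whole milk
n_both = 111             # transactions containing both

# support = P(yogurt and whole milk)
support = n_both / n_transactions
# confidence = P(whole milk | yogurt)
confidence = n_both / n_yogurt
# lift = confidence / P(whole milk)
lift = confidence / (n_milk / n_transactions)

print(round(support, 4), round(confidence, 4), round(lift, 4))  # 0.0111 0.1291 0.8169
```

A lift below 1, as here, means the pair co-occurs less often than the items' individual popularity alone would suggest.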