Recommendation algorithm based on Jaccard similarity—example

Directory

  • Data display
  • Classification of recommendation algorithms
    • Based on similarity
    • Based on popularity/context/social networks
  • Jaccard similarity
    • Characteristics of analytical data
    • Methods that can be considered
    • Advantages and disadvantages of calculation methods
    • Calculate Jaccard similarity between users
    • Get the 10 users most similar to the given one
    • 10 books recommended to user 1713353

Data display

import pandas as pd
import numpy as np

# Read CSV file
data = pd.read_csv('E:/recommended_s/Books.csv', header=None, names=['userid', 'bookid', 'rating', 'timestamp'])[:10000]
print(data.head(10))

Classification of recommendation algorithms

Based on similarity

  • Jaccard similarity: Represent each user's preferences as a set, and measure similarity as the ratio of the number of elements in the intersection of the two sets to the number of elements in their union.
  • Cosine similarity: Represent the user's preferences as vectors, and measure similarity by calculating the cosine of the angle between the two vectors. Cosine similarity depends only on the direction of the vectors, not on their length, so it is suitable when users rate on different overall scales.
  • Pearson correlation coefficient: Measure similarity by dividing the covariance of two user preference vectors by the product of their standard deviations. The Pearson correlation coefficient measures linear correlation and is suitable when dealing with user ratings.
  • Euclidean distance: Represent user preferences as vectors, and calculate the straight-line distance between the two vectors. The smaller the distance, the more similar the users are.
  • Manhattan distance: Represent user preferences as vectors, and calculate the sum of the absolute differences between their components. The smaller the distance, the more similar the users are.
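
As a rough illustration of the vector-based measures above, the sketch below compares the ratings that two hypothetical users gave to the same five books, using NumPy. The vectors `u` and `v` are made-up values, not taken from Books.csv.

```python
import numpy as np

# Hypothetical ratings given by two users to the same five books
u = np.array([5.0, 3.0, 4.0, 4.0, 1.0])
v = np.array([4.0, 2.0, 5.0, 3.0, 1.0])

# Cosine similarity: dot product divided by the product of the vector norms
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson correlation: covariance divided by the product of the standard deviations
pearson = np.corrcoef(u, v)[0, 1]

# Euclidean distance: square root of the summed squared differences
euclidean = np.linalg.norm(u - v)

# Manhattan distance: sum of the absolute differences
manhattan = np.abs(u - v).sum()

print(f"cosine={cosine:.3f} pearson={pearson:.3f} "
      f"euclidean={euclidean:.3f} manhattan={manhattan:.1f}")
```

The first two values grow as the users become more similar, while the two distances shrink, so they have to be sorted in opposite directions when ranking neighbours.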

Based on popularity/context/social networks

  • Popularity-based recommendations: Recommend the most popular items to users. This method assumes that users are likely to be interested in popular items and is suitable for new users or situations where personalized information is lacking.
  • Context-based recommendation: Consider the user’s contextual information, such as time, location, device, etc., to make recommendations. Recommend appropriate items based on the user’s current context. For example, recommend breakfast recipes in the morning and movies in the evening.
  • Recommendation based on social networks: Make recommendations using users’ relationships and interaction information in social networks. For example, recommendations can be made based on the preferences of the user’s friends, or using the community structure in social networks to make recommendations.
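
Of the three approaches above, the popularity-based one is the simplest to sketch. The example below is only an illustration: the `ratings` table is made up (using the same column names as Books.csv), the helper `most_popular_books` is hypothetical, and "popularity" is assumed to mean the number of distinct users who rated a book.

```python
import pandas as pd

# Hypothetical ratings in the same column layout as Books.csv
ratings = pd.DataFrame({
    'userid': ['u1', 'u2', 'u3', 'u1', 'u2', 'u3', 'u4'],
    'bookid': ['b1', 'b1', 'b1', 'b2', 'b2', 'b3', 'b4'],
    'rating': [5.0, 4.0, 5.0, 3.0, 4.0, 5.0, 2.0],
})

def most_popular_books(ratings, top_n=2):
    # Popularity is taken here to be the number of distinct users who rated each book
    counts = ratings.groupby('bookid')['userid'].nunique()
    return counts.sort_values(ascending=False).head(top_n).index.tolist()

print(most_popular_books(ratings))
```

Every user receives the same list, which is why this approach is usually a fallback for users with no history.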

Jaccard similarity

  • Jaccard similarity is a measure used to compare two sets. It measures their similarity by calculating the ratio of the number of elements in their intersection to the number of elements in their union.
  • Specifically, given two sets A and B, the Jaccard similarity is calculated as J(A, B) = |A ∩ B| / |A ∪ B|, where |A ∩ B| is the number of elements in the intersection of A and B, and |A ∪ B| is the number of elements in their union.
  • The value range of Jaccard similarity is between 0 and 1. The closer the value is to 1, the more similar the two sets are. The closer the value is to 0, the less similar the two sets are.
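
A small worked example of the formula, using Python sets. The book IDs are hypothetical, and the helper `jaccard` is an illustrative name that returns 0.0 when both sets are empty (a convention chosen here, since the formula is undefined in that case).

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical book sets for two users
books_a = {'b1', 'b2', 'b3'}
books_b = {'b2', 'b3', 'b4', 'b5'}

# Intersection is {b2, b3} (2 elements); union is {b1, ..., b5} (5 elements)
print(jaccard(books_a, books_b))  # 2 / 5 = 0.4
```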

Characteristics of analytical data

  • Low dimension. For products, there is only the book ID; there is no other information about the book, such as its content or attributes. [Of course, the book ID itself could also be broken down further, but due to time constraints this is not covered here]
  • The number of books per user varies widely. Some users have read only one book, while others have read around 100.
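
One way to check the second point is to count the distinct books per user. The sketch below runs on a tiny made-up table named `sample`; on the real data the same two lines would be applied to the `data` frame loaded earlier.

```python
import pandas as pd

# Hypothetical ratings standing in for the first rows of Books.csv
sample = pd.DataFrame({
    'userid': ['u1', 'u2', 'u2', 'u2', 'u3', 'u3'],
    'bookid': ['b1', 'b1', 'b2', 'b3', 'b2', 'b4'],
})

# Number of distinct books per user
books_per_user = sample.groupby('userid')['bookid'].nunique()
print(books_per_user.describe())

# Share of users who appear with exactly one book
single_book_share = (books_per_user == 1).mean()
print(single_book_share)
```

A large gap between the minimum and maximum in the `describe()` output is the imbalance described above.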

Methods that can be considered

  • User-based collaborative filtering recommendation algorithm: Since the data contains user rating records for different items, recommendations can be made based on the similarity of rating behaviors between users. By calculating the similarity between users, we find other users with similar interests to the target user, and then recommend items that the target user may be interested in based on the rating records of these users.
  • Content-based recommendation algorithm: The item ID in the data can be used to represent the content characteristics of the item, such as keywords, tags, etc. Recommendations can be made based on content similarities between items. By calculating the similarity between items, other items similar to the target item are found and then recommended to the user.
  • Time-based recommendation algorithm: The timestamp in the data records when the user rated the item, so the most recent or recently popular items can be recommended to users. For example, you can recommend items that have been popular in the recent period, or predict items that the user may be interested in later based on the user's historical rating records.
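
The time-based idea can be sketched as "popular within a recent window". This is only an illustration: the `ratings` table is made up, the helper `recent_popular_books` is hypothetical, and the timestamps are assumed to be Unix seconds, which the source does not state.

```python
import pandas as pd

# Hypothetical ratings; timestamps are assumed to be Unix seconds
ratings = pd.DataFrame({
    'userid': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'bookid': ['b1', 'b1', 'b2', 'b2', 'b2'],
    'timestamp': [1000000000, 1000086400, 1400000000, 1400050000, 1400086400],
})

def recent_popular_books(ratings, window_days=30, top_n=1):
    # Keep only ratings made within window_days of the latest rating in the data
    cutoff = ratings['timestamp'].max() - window_days * 86400
    recent = ratings[ratings['timestamp'] >= cutoff]
    # Rank the remaining books by how many ratings they received
    return recent['bookid'].value_counts().head(top_n).index.tolist()

print(recent_popular_books(ratings))
```

The window is measured back from the latest rating in the data rather than from the current date, so the result does not depend on when the code is run.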

Advantages and disadvantages of calculation methods

  • Simple and intuitive: The calculation method of Jaccard similarity is simple and clear, easy to understand and implement. [Main reason for choosing this algorithm]
  • Not affected by data scale: Jaccard similarity divides the number of common elements by the size of the union, so the result always lies between 0 and 1 regardless of how many books each user has read.
  • Suitable for sparse data: For sparse user-item matrices, Jaccard similarity can effectively measure the preference similarity between users.
  • Disadvantage: Jaccard similarity only considers whether an element belongs to a set. It ignores weight differences between elements, such as the rating a user gave to a book.
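
The last point can be seen with a small made-up example: two hypothetical users rated exactly the same books but with opposite tastes. Jaccard similarity sees them as identical, while a rating-aware measure such as the Pearson correlation (used here only for contrast) does not.

```python
import numpy as np

# Hypothetical ratings: both users rated the same three books, with opposite tastes
ratings_a = {'b1': 5.0, 'b2': 3.0, 'b3': 1.0}
ratings_b = {'b1': 1.0, 'b2': 3.0, 'b3': 5.0}

# Jaccard only looks at which books were rated
books_a, books_b = set(ratings_a), set(ratings_b)
jaccard = len(books_a & books_b) / len(books_a | books_b)

# Pearson correlation on the shared books takes the rating values into account
shared = sorted(books_a & books_b)
pearson = np.corrcoef([ratings_a[b] for b in shared],
                      [ratings_b[b] for b in shared])[0, 1]

print(jaccard, round(pearson, 3))
```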

Calculate Jaccard similarity between users

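# Calculate Jaccard similarity between the given user and every other user
def Jaccard_similarity(user_id, data):
    # Build the set of book IDs for each user
    books_by_user = data.groupby('userid')['bookid'].apply(set)
    # Books of the given user (empty set if the user is not in the data)
    user_books = books_by_user.get(user_id, set())
    similarities = []
    for other_user_id, other_user_books in books_by_user.items():
        # Skip the given user itself
        if other_user_id == user_id:
            continue
        intersection = user_books.intersection(other_user_books)
        union = user_books.union(other_user_books)
        # The small constant avoids division by zero when the union is empty
        Jaccard = float(len(intersection)) / (len(union) + 1e-8)
        similarities.append((other_user_id, Jaccard))

    return similarities

# user_id must have the same type as the values in data['userid']
user_id = '1713353'
similarities = Jaccard_similarity(user_id, data)
print(similarities)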

Get the 10 users most similar to the given one

def get_similar_users(user_id, data, top_n=10):
    similar_users = Jaccard_similarity(user_id, data)

    # Sort by similarity in descending order so the most similar users come first
    similar_users.sort(key=lambda x: x[1], reverse=True)
    # Get the book IDs that the given user has read
    user_books = set(data[data['userid'] == user_id]['bookid'])
    # Create an empty list to store recommended book IDs
    recommended_books = []
    # Traverse the top_n users most similar to the given user
    for other_user, _ in similar_users[:top_n]:
        # Get the book IDs rated by the current similar user
        other_user_books = set(data[data['userid'] == other_user]['bookid'])
        # Keep books the given user has not read, skipping books already recommended
        for book in other_user_books:
            if book not in user_books and book not in recommended_books:
                recommended_books.append(book)
    # Return at most 10 recommended book IDs
    return recommended_books[:10]

10 books recommended to user 1713353

# For the user with user ID 1713353, recommend 10 books
user_id = '1713353'
recommended_books = get_similar_users(user_id, data)
# Output recommended books
for book in recommended_books:
    print(book)