Python artificial intelligence practice: automatic recommendation system

1. Background introduction

An automatic recommendation system is a very active research field in the Internet industry. Depending on the application scenario and user needs, its goals fall into the following three categories:

Personalized recommendation: The system recommends products, articles, etc. that are most relevant to the user’s interests based on the user’s historical behavioral data;
Sentiment analysis recommendation: perform sentiment analysis based on the user’s evaluation text or comment content, and then combine the user’s preferences and the system’s internal algorithm model to make recommendations;
Collaborative filtering recommendation: Recommend products, services or brands that may be of interest to users based on their previous evaluations and purchases of other products, services or brands.
Because these three recommendation models differ, designers of recommendation systems need to select an algorithm that fits the business needs of their own products, the goals of the recommendation system, and the actual situation. This article discusses the design and implementation of a recommendation system based on the collaborative filtering recommendation algorithm.

2. Core concepts and connections

2.1 User portrait

Recommendation systems are an indispensable part of the Internet information environment. When designing and developing one, we often face an important question: how do we build good user portraits? A user portrait is a user profile formed by observing, recording and analyzing a series of user behaviors and characteristics. It mainly includes three aspects:

Demographic profile: Customer personal information, such as age, gender, geographical location, consumption habits, education, occupation, hobbies, etc.;
Behavioral profile: Customer’s usage behavior of certain objects, such as browsing history, search history, likes, collections, shopping records, etc.;
Social profile: Customer’s relationship network, such as information about friends, close friends, and colleagues.

2.2 Collaborative filtering recommendation algorithm

The collaborative filtering recommendation algorithm makes recommendations from users’ historical behavior data. It assumes that if user A and user B both like item i, their tastes are similar; so if user A also likes item j, user B is likely to like item j as well. Because the system recommends items according to the similarity between users, computed from their behavior data, this approach is called collaborative filtering recommendation.
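
As a rough illustration of this idea, the sketch below scores items for one user from the ratings of similar users. The rating matrix and the helper name `cosine_similarity` are made up for the example and do not come from the article; unrated items are simply treated as 0, which a production system would handle more carefully.

```python
import numpy as np

# Hypothetical ratings: rows = users A, B, C; columns = items i, j, k (0 = not rated)
ratings = np.array([
    [5.0, 4.0, 0.0],   # user A likes i and j
    [5.0, 0.0, 0.0],   # user B likes i and has not seen j
    [0.0, 1.0, 5.0],   # user C has different tastes
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Similarity of user B (row 1) to every user
target = 1
sims = np.array([cosine_similarity(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0.0  # ignore self-similarity

# Predicted score per item: similarity-weighted average of the other users' ratings
# (assumes at least one other user has non-zero similarity)
predicted = sims @ ratings / sims.sum()

print(np.round(sims, 3))
print(np.round(predicted, 3))
```

Here only user A overlaps with user B, so B's predicted score for item j is inherited from A's rating of it.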

2.3 SVD matrix decomposition

SVD (Singular Value Decomposition) matrix factorization is a method for extracting low-rank latent factors. NumPy provides it as numpy.linalg.svd, and it can be used to analyze the user-item matrix in recommendation systems. The steps of SVD matrix decomposition are as follows:

Build the user behavior matrix R, with one row per user and one column per item.
Perform singular value decomposition on R to obtain U (the user factor matrix), the singular values, and V (the item factor matrix), keeping only the k largest singular values.
Calculate the similarity matrix between items from the rows of the item factor matrix.
Through the similarity matrix between items, the items are sorted and the recommendation results are output.
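
A minimal sketch of the factorization step using `numpy.linalg.svd`. The matrix `R` and the choice `k = 2` are assumptions made for illustration; they are not taken from the article's data.

```python
import numpy as np

# Hypothetical user-item rating matrix: 4 users x 3 items (0 = not rated)
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

# Thin SVD: R = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                  # number of latent factors to keep
user_factors = U[:, :k] * s[:k]        # one row per user, shape (4, k)
item_factors = Vt[:k, :].T             # one row per item, shape (3, k)

# Rank-k approximation of R: a predicted score for every user-item pair
R_hat = user_factors @ item_factors.T

print(user_factors.shape, item_factors.shape)
print(np.round(R_hat, 2))
```

Keeping only the largest singular values discards the weakest directions in the data, which is what lets the approximation fill in plausible scores for unrated pairs.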

3. Detailed explanation of core algorithm principles, specific operation steps and mathematical model formulas

3.1 Data set and preprocessing

Suppose an e-commerce website wants to push personalized product recommendations to newly registered users. To collect user information and build a recommendation model, we can gather some basic user information on the registration page, such as mobile phone number, gender, age, place of residence, and consumption habits. The user’s interaction data is then obtained from the website’s transaction logs, for example viewing a product details page, adding a product to the shopping cart, or paying for an order. These behaviors are recorded as our training data set. Next, we preprocess the data in the following steps:

Removal of invalid data: Users may make errors when filling in information, such as incorrect input of mobile phone numbers, or data duplication caused by too many repeated clicks. Therefore, valid data needs to be screened first.
Data standardization: Different attributes use inconsistent units and types; for example, age may be stored as an integer or a floating point number. The data therefore needs to be standardized so that different attributes can be compared.
Generate a training set: Divide the original data and generate a training set.
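
The three steps above could be sketched with pandas as follows. The column names, the sample rows, and the 80/20 split ratio are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical raw interaction log; the column names are assumptions
raw = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u4', 'u5'],
    'item_id': ['i1', 'i1', 'i2', 'i3', 'i4', 'i5'],
    'age':     [25.0, 25.0, 31.0, None, 42.0, 19.0],
    'rating':  [5, 5, 4, 3, 2, 5],
})

# 1. Remove invalid data: exact duplicates and rows with missing values
clean = raw.drop_duplicates().dropna().reset_index(drop=True)

# 2. Standardize numeric columns to zero mean and unit variance (z-score)
for col in ['age', 'rating']:
    clean[col + '_z'] = (clean[col] - clean[col].mean()) / clean[col].std()

# 3. Split into training and test sets (80/20, fixed seed for reproducibility)
train = clean.sample(frac=0.8, random_state=42)
test = clean.drop(train.index)

print(len(raw), len(clean), len(train), len(test))
```

Real validation rules (for example, checking phone number formats) would replace the simple duplicate and missing-value filters used here.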

3.2 Feature Engineering

3.2.1 Content-based features

For product recommendation, in addition to the product’s own description, attributes such as price, color and size can also be used to represent it. Based on these attributes, we can construct a vector representation of each product. This feature vector can then be used when constructing the user-item interaction matrix.

3.2.2 Features based on interactive behavior

In addition to the feature vector of the product, we can further construct features based on the user’s interactive behavior. For example, the length of time the user stays on the product details page, whether the user likes or collects the product, etc. Based on these behavioral characteristics, we can construct the user’s interaction vector representation.

3.2.3 Combination features

Finally, we concatenate the two feature vectors to obtain the final user-item interaction matrix.
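
For a single user-item pair, the concatenation might look like the sketch below. The feature layout and values are hypothetical and only show the shape of the result.

```python
import numpy as np

# Hypothetical content features for one item:
# scaled price, one-hot color (red, blue), scaled size
item_vec = np.array([0.35, 1.0, 0.0, 0.5])

# Hypothetical behavior features for one user on that item:
# scaled dwell time, liked flag, collected flag
behavior_vec = np.array([0.8, 1.0, 0.0])

# Combined feature vector for this user-item pair
combined = np.concatenate([item_vec, behavior_vec])
print(combined.shape)
```

Stacking one such vector per observed user-item pair yields the matrix that the modeling step consumes.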

3.3 Modeling process

3.3.1 Construction of user factor matrix

First, we need to build a user-item interaction matrix. Then, we can use the SVD matrix decomposition algorithm to obtain two latent matrices: the user factor matrix U and the item factor matrix V. U is an n × k matrix in which each row is one user’s feature vector, where n is the number of users and k is the number of factors; V is an m × k matrix in which each row is the feature vector of one item, where m is the number of items.

3.3.2 Calculation of similarity matrix

From the similarity between the items’ feature vectors, we can calculate the item-item similarity matrix. For each user, we can then predict the items they may be interested in by combining all of their known item interactions with this similarity matrix.
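
One common way to realize this is cosine similarity between item factor vectors, followed by a similarity-weighted average of the user's known ratings. The sketch below assumes a small hand-written item factor matrix `V` and one user's ratings; neither comes from the article.

```python
import numpy as np

# Hypothetical item factor matrix V: 4 items x 2 latent factors
V = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
])

# Cosine similarity between every pair of items: normalize rows, then take dot products
unit = V / np.linalg.norm(V, axis=1, keepdims=True)
item_sim = unit @ unit.T               # shape (4, 4), symmetric, ones on the diagonal

# One user's known ratings (0 = not rated)
user_ratings = np.array([5.0, 0.0, 0.0, 1.0])
rated = user_ratings > 0

# Score each item as a similarity-weighted average of the user's rated items
weights = item_sim[:, rated]
scores = weights @ user_ratings[rated] / weights.sum(axis=1)
scores[rated] = -np.inf                # never recommend what the user has already rated

best = int(np.argmax(scores))
print(best)
```

Item 1 wins here because it sits close to item 0, which this user rated highly, while item 2 resembles the poorly rated item 3.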

3.4 Output of recommended results

Based on the user-item interaction matrix and the similarity matrix, we can predict the items that each user is interested in and output the recommendations in descending order of predicted score.
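
The final ranking step can be sketched as a small top-N helper. The function name `top_n`, the item ids, and the scores are hypothetical placeholders for whatever the model produces.

```python
import numpy as np

item_ids = ['i1', 'i2', 'i3', 'i4', 'i5']          # hypothetical catalogue
predicted = np.array([3.1, 4.6, 2.0, 4.9, 3.8])    # hypothetical predicted scores for one user
seen = {'i4'}                                      # items the user already interacted with

def top_n(item_ids, scores, seen, n=3):
    """Return the n unseen items with the highest predicted score, best first."""
    order = np.argsort(-scores)                    # indices sorted by descending score
    ranked = [(item_ids[i], float(scores[i]))
              for i in order if item_ids[i] not in seen]
    return ranked[:n]

recommendations = top_n(item_ids, predicted, seen)
print(recommendations)
```

Item i4 has the highest score but is filtered out because the user has already interacted with it.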

4. Specific code examples and detailed explanations

4.1 Data set preparation

import pandas as pd

data = {'user_id': ['u1', 'u2', 'u3', 'u4', 'u4'],
        'item_id': ['i1', 'i2', 'i3', 'i4', 'i5'],
        'rating': [5, 4, 3, 2, 5],
        'timestamp': ['2019-10-1', '2019-10-2', '2019-10-3', '2019-10-4', '2019-10-5']}

df = pd.DataFrame(data=data)

print("Raw Data")
print(df)

Output:

Raw Data
  user_id item_id  rating  timestamp
0      u1      i1       5  2019-10-1
1      u2      i2       4  2019-10-2
2      u3      i3       3  2019-10-3
3      u4      i4       2  2019-10-4
4      u4      i5       5  2019-10-5

4.2 Data preprocessing

def preprocess_data(df):
    # Delete invalid data. For illustration, the first record is treated as
    # invalid; a real pipeline would apply validation rules instead.
    cleaned = df.drop(index=[0]).reset_index(drop=True)

    # Compute the mean and standard deviation of the ratings
    mean_rating = cleaned['rating'].mean()
    std_rating = cleaned['rating'].std()
    print('Mean Rating:', mean_rating)
    print('Std of Rating:', round(std_rating, 4))

    return cleaned

preprocessed_data = preprocess_data(df)

print('\nPreprocessed data')
print(preprocessed_data)

Output:

Mean Rating: 3.5
Std of Rating: 1.291

Preprocessed data
  user_id item_id  rating  timestamp
0      u2      i2       4  2019-10-2
1      u3      i3       3  2019-10-3
2      u4      i4       2  2019-10-4
3      u4      i5       5  2019-10-5

4.3 Feature Engineering

4.3.1 Product feature vector

def create_item_features():
    items = preprocessed_data['item_id'].unique()
    features = []

    for item in items:
        item_ratings = preprocessed_data[preprocessed_data['item_id'] == item]['rating']

        # Summarize each item by its average, lowest and highest rating,
        # converted to plain floats so that they print cleanly
        avg_rating = float(item_ratings.mean())
        min_rating = float(item_ratings.min())
        max_rating = float(item_ratings.max())

        features.append((avg_rating, min_rating, max_rating))

    return dict(zip(items, features)), items

item_features, all_items = create_item_features()

for k, v in item_features.items():
    print('{} : {}'.format(k, v))

Output:

i2 : (4.0, 4.0, 4.0)
i3 : (3.0, 3.0, 3.0)
i4 : (2.0, 2.0, 2.0)
i5 : (5.0, 5.0, 5.0)

4.3.2 User interaction vector

def predict_item_rating(user, item):
    # Placeholder predictor (an assumption; the original does not define it):
    # use the item's average rating. A real system would score the
    # user-item pair with a trained model instead.
    return float(item_features[item][0])

def create_user_interactions():
    users = preprocessed_data['user_id'].unique()
    interactions = {}

    for user in users:
        user_rows = preprocessed_data[preprocessed_data['user_id'] == user]
        seen_items = sorted(set(user_rows['item_id']))
        unseen_items = [item for item in all_items if item not in seen_items]

        # Score every unseen item and keep the five highest predictions
        predictions = [{'item_id': item, 'predicted_rating': predict_item_rating(user, item)}
                       for item in unseen_items]
        predictions.sort(key=lambda p: p['predicted_rating'], reverse=True)

        interactions[user] = {
            'seen_items': seen_items,
            'unseen_items': predictions[:5],
        }

    return interactions

user_interactions = create_user_interactions()

for k, v in user_interactions.items():
    print("{}'s Interactions".format(k))
    print("Seen Items:", v['seen_items'])
    print("Recommended Items:")
    for recommended_item in v['unseen_items']:
        print(recommended_item)

Output:

u2's Interactions
Seen Items: ['i2']
Recommended Items:
{'item_id': 'i5', 'predicted_rating': 5.0}
{'item_id': 'i3', 'predicted_rating': 3.0}
{'item_id': 'i4', 'predicted_rating': 2.0}
u3's Interactions
Seen Items: ['i3']
Recommended Items:
{'item_id': 'i5', 'predicted_rating': 5.0}
{'item_id': 'i2', 'predicted_rating': 4.0}
{'item_id': 'i4', 'predicted_rating': 2.0}
u4's Interactions
Seen Items: ['i4', 'i5']
Recommended Items:
{'item_id': 'i2', 'predicted_rating': 4.0}
{'item_id': 'i3', 'predicted_rating': 3.0}

4.4 Model training

from sklearn.decomposition import TruncatedSVD

# Build the user-item rating matrix: one row per user, one column per item,
# with 0 standing in for "no rating"
ratings_matrix = preprocessed_data.pivot_table(
    index='user_id', columns='item_id', values='rating', fill_value=0)

# n_components must not exceed the number of items (4 here)
svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
X = svd.fit_transform(ratings_matrix.to_numpy(dtype=float))

user_ids = ratings_matrix.index.tolist()

print("User Factor Matrix Shape", X.shape)
print("\nUser Id\t| Feature Vector")
for user_id, vec in zip(user_ids, X):
    print("{}\t| {}".format(user_id, vec))

Output:

User Factor Matrix Shape (3, 2)

User Id | Feature Vector

One line per user (u2, u3, u4) follows, each with two factor values. In this toy data set no two users rated the same item, so u4 loads almost entirely on the first component and u2 on the second, while u3 stays near zero because its single rating belongs to the discarded third component.