WWW2023 | How to set the temperature coefficient? A method for adaptively adjusting the representation modulus length for recommendation


Source: Machine Learning and Recommendation Algorithms
This article is about 4000 words; an 8-minute read is recommended.

TLDR: This paper focuses on the representation modulus length in recommender systems, emphasizing the importance of normalization through theory and experiments. At the same time, to address the resulting sensitivity to the temperature coefficient, it proposes an adaptive, personalized strategy that solves practical problems in recommender systems.

57a649cb51ffa99c1e4b8aacfb60effb.png

Paper:
https://arxiv.org/abs/2302.04775

Code:
https://github.com/junkangwu/Adap_tau

Homepage:
https://junkangwu.github.io/

1. Summary

In recent years, approaches based on representation learning have achieved great success in recommender systems. Despite their promising performance, we identify a potential limitation of these methods: the representation modulus length is not explicitly regulated, which may exacerbate popularity bias and training instability, preventing the model from making correct recommendations. By normalizing user and item representation moduli and rescaling them with a specific value (controlled by the temperature τ), we observe significant performance gains (9% on average) on four real-world datasets. At the same time, however, we also reveal a serious flaw in applying normalization to recommendation: model performance is extremely sensitive to the choice of the temperature coefficient τ.

To fully exploit the advantages of normalization while circumventing its limitations, this paper studies how to set an appropriate τ adaptively. To this end, we first conduct a comprehensive analysis of τ to fully understand its role in recommendation. We then propose Adap-τ, an adaptive fine-grained strategy for the temperature coefficient that satisfies four desirable properties: adaptivity, personalization, efficiency, and model-agnosticism. Extensive experiments verify the effectiveness of the method.

2. Research Background

2.1 Loss function

There are many choices of loss functions for training recommendation models, including pointwise losses (such as BCE, MSE), pairwise losses (such as BPR), and the Softmax loss. Recent work [1] found that the Softmax loss mitigates popularity bias, achieves good training stability, and has a consistent relationship with the ranking metric (NDCG). Furthermore, the Softmax loss can be seen as an extension of the commonly used BPR loss. We therefore take Softmax as the representative loss for analysis; it can be formulated as:

$$\mathcal{L}_{\mathrm{Softmax}} = -\sum_{(u,i)\in\mathcal{D}} \log \frac{\exp\big(f(u,i)/\tau\big)}{\sum_{j\in\mathcal{I}} \exp\big(f(u,j)/\tau\big)}$$
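As a concrete sketch of this loss (our own NumPy illustration; the function name and toy scores are not from the paper's code):

```python
import numpy as np

def softmax_loss(pos_score, neg_scores, tau):
    """Softmax loss for one (user, positive item) pair:
    -log( exp(pos/tau) / sum_j exp(s_j/tau) )."""
    logits = np.concatenate(([pos_score], neg_scores)) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # NLL of the positive item

# With the same score gap, a lower temperature sharpens the distribution,
# so the positive dominates more and the loss is smaller.
negs = np.array([0.1, 0.2, 0.0])
loss_sharp = softmax_loss(0.8, negs, tau=0.1)
loss_flat = softmax_loss(0.8, negs, tau=1.0)
```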

2.2 Representation modulus length

This work investigates the properties of the representation modulus length in recommendation. Instead of the raw inner product, we use the normalized representation as the prediction score:

$$\hat{y}_{ui} = \underbrace{\frac{\langle \mathbf{e}_u, \mathbf{e}_i\rangle}{\|\mathbf{e}_u\|\,\|\mathbf{e}_i\|}}_{\text{cosine similarity}} \cdot \frac{1}{\tau}$$

The representation moduli of users and items are thus rescaled: the first factor can be understood as cosine similarity, and the second factor, 1/τ, replaces the free modulus lengths with a single controllable scale.
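This decomposition is easy to verify numerically (a minimal sketch; the example vectors are arbitrary):

```python
import numpy as np

def normalized_score(e_u, e_i, tau):
    """Inner product = cosine similarity x product of modulus lengths;
    the normalized prediction keeps only the cosine factor and rescales
    it by a single controllable value 1/tau."""
    inner = float(e_u @ e_i)
    norm_prod = np.linalg.norm(e_u) * np.linalg.norm(e_i)
    cos = inner / norm_prod
    assert np.isclose(inner, cos * norm_prod)     # the decomposition holds
    return cos / tau

e_u = np.array([0.3, -0.5, 0.8])                  # arbitrary embeddings
e_i = np.array([1.2, 0.4, 0.9])
s_one = normalized_score(e_u, e_i, tau=1.0)       # bounded by [-1, 1]
s_half = normalized_score(e_u, e_i, tau=0.5)      # same cosine, doubled scale
```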

dad7de6370b53a8401038b4e32a96aab.png

We note that instead of directly introducing a penalty term constraining the representation modulus length, we borrow a similar idea from contrastive learning and leverage the temperature coefficient used there. This connection allows our findings to generalize better to other fields.

3. Theoretical analysis of representation modulus length

3.1 Theoretical Analysis

Lemma 1: When the inner product is used to compute user-item similarity, the item representation modulus satisfies the following during the training iterations:

62cb8b6e2f7f51eaa36e0604facf975f.png

In particular, in the early stage of training it is approximately proportional to item popularity.

It can be seen from the expression that, at the initial stage of training, random initialization keeps the distributions of user and item representations relatively uniform, with no significant differences among them, so item popularity becomes the dominant factor controlling the item representation modulus.
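A toy simulation of this mechanism (our own illustration, not the paper's derivation): each positive interaction pushes the item embedding a step toward a roughly random user embedding, so the item's modulus grows like a random walk and items with more interactions accumulate larger moduli:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr = 32, 0.1
# One popular item (200 positive interactions) and one long-tail item (5).
popularity = {"popular": 200, "tail": 5}
norms = {}
for name, n_inter in popularity.items():
    e_i = np.zeros(dim)
    for _ in range(n_inter):
        e_u = rng.standard_normal(dim)  # a (roughly random) user embedding
        e_i += lr * e_u                 # positive gradient pushes e_i toward e_u
    norms[name] = np.linalg.norm(e_i)
# The popular item's modulus ends up far larger than the tail item's:
# the modulus grows with the number of accumulated updates.
```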

3.2 Experimental Analysis

3.2.1 Experiment Setup

To show the effect of freely varying representation modulus lengths, we conduct four experiments: (1) We first visualize how the item representation modulus length evolves during training for items of different popularity (figure, top left). Here, following [1], we divide items into ten groups by popularity; the larger the group ID, the more popular the items it contains. (2) We also report performance on item groups of different popularity (figure, top right). (3) With respect to whether normalization is applied (i.e., whether the representation modulus length is controlled), we show how the scores of positive samples change during training (figure, bottom left). (4) We compare the convergence of the two models during training (figure, bottom right).
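The popularity grouping in step (1) can be sketched as follows (hypothetical counts; `popularity_groups` is our own helper, not from the paper's code):

```python
import numpy as np

def popularity_groups(item_counts, n_groups=10):
    """Split item ids into n_groups of (roughly) equal size by ascending
    popularity; a larger group ID means more popular items."""
    order = np.argsort(item_counts)              # least popular first
    return np.array_split(order, n_groups)

counts = np.array([5, 300, 12, 47, 8, 150, 2, 90, 33, 61])  # hypothetical
groups = popularity_groups(counts, n_groups=5)
# groups[0] holds the least popular item ids, groups[-1] the most popular
```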

3.2.2 Experimental Analysis

358b8707f8bd8a01c30158e7b0751a46.jpeg

  • If we focus on the early stage of training (top left), the representation moduli of popular items rise rapidly, consistent with the theoretical proof. Popular items therefore tend to receive higher scores, because the representation modulus length contributes directly to the model prediction. In addition, differing representation modulus lengths also hurt the training of user representations: in the gradient of the user representation (4a327b47b5a37c1a666ef1cbc224d9c3.png), the signal from popular items overwhelms the contributions of other items, causing the model to fall into biased predictions (top right). It can be seen that the model with normalization produces fairer results than the model without it.

  • If we turn to the change in prediction scores (bottom left), we observe that even late in training (e.g., epoch 500), the prediction scores and representation moduli of inner-product-based MF are still increasing rather than converging, while performance keeps declining (bottom right). Interestingly, once normalization is applied, the model converges extremely fast and remains stable thereafter.

  • To further validate the benefits of normalization, we test recommendation performance while varying whether user and/or item representations are normalized (table below). The model with two-sided normalization (i.e., both user and item representations normalized, denoted Y-Y) is significantly better than the models with one-sided normalization (Y-N or N-Y), and all of them outperform the model without normalization (N-N).

2c0113dafe15ea88dbadd778a03071b0.png

3.3 The pitfalls of normalization

Although the theory and experiments above show that normalization greatly helps recommendation performance, our further study reveals an obvious drawback: it is extremely sensitive to the choice of the temperature coefficient. To verify this, we test how recommendation performance varies for τ ranging from 0.02 to 1 with a step size of 0.02. The result is shown in the figure below; the ordinate is performance relative to the best, so that different datasets can be compared with each other. We draw the following observations:

7e2a59c98fe32e0705f119056b012228.jpeg

1) Performance is highly sensitive to τ. Even a small fluctuation (e.g., τ moving from 0.08 to 0.12 on Amazon-Book) can cause a large performance drop (e.g., 10%);

2) Different datasets require quite different τ. For example, the τ that achieves the best performance on Amazon-Book differs from the one that is best on MovieLens. If we simply transfer the best τ from one dataset (e.g., MovieLens) to another (e.g., Amazon-Book), we get rather poor performance (e.g., a 30%+ reduction).


3.4 The meaning of temperature coefficient

Given the findings above, although normalization has obvious advantages for recommender systems, its high sensitivity to the temperature coefficient still limits its application, so we further examine the coefficient's properties:


3.4.1 Avoid vanishing gradients:

The temperature mainly affects the gradient of the loss function with respect to the prediction score. For convenience, let g(τ) denote the logit (softmax probability) of an instance as controlled by the temperature τ, specifically:

f36a063323f19c0ce130ee267389659a.png

The gradient can be written as:

742cc4e6a75ac261d9bd222e6ad5b9b8.png

The gradient above can be understood as the product of the total logit mass of the positive samples (denote it P⁺) and that of the negatives (1 − P⁺). When τ is too small, the exploding nature of the exponential function magnifies the differences between scores: positive instances usually obtain much larger logits than negatives, P⁺ approaches 1, and the gradient vanishes. Conversely, when τ is too large, scores show little difference after scaling; but due to the long-tailed nature of RS, i.e., the number of negative instances far exceeds that of positives, the positive logit mass P⁺ becomes very small and the gradient vanishes again.
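This two-sided vanishing is easy to reproduce numerically (a sketch under assumed scores: one positive at 0.8 and 1000 easy negatives at 0.1; the P⁺(1 − P⁺) reading follows the text above):

```python
import numpy as np

def grad_magnitude(pos, negs, tau):
    """p+ * (1 - p+) / tau, where p+ is the softmax probability mass of
    the positive instance under temperature tau."""
    logits = np.concatenate(([pos], negs)) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    p_pos = probs[0]
    return p_pos * (1 - p_pos) / tau

negs = np.full(1000, 0.1)        # many easy negatives: the long tail of RS
g = {t: grad_magnitude(0.8, negs, t) for t in (0.01, 0.1, 10.0)}
# tiny tau: p+ -> 1 and the gradient vanishes; huge tau: p+ -> ~1/1001 and
# it vanishes again; a mid-range tau keeps the gradient alive.
```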

3.4.2 Mining hard negative samples:

Some recent work in contrastive learning reveals the hard-mining effect of the temperature. Here we borrow their ideas but provide a more insightful analysis for RS scenarios. As mentioned above, a smaller τ amplifies the differences between samples: hard negative samples with larger scores receive extremely high weights and thus contribute more to model training. Conversely, a larger τ makes the model treat negative samples more equally.

This property strongly motivates us to assign personalized temperatures to users. Note that in a typical RS, data quality varies from user to user. For users with a lot of noisy feedback, focusing too much on hard negative samples is unwise, since those are likely to be noisy. But for users with clean and sufficient feedback, lowering τ is a better choice, because it surfaces more informative samples, thus enhancing the convergence and discriminative power of the model. Therefore, keeping a fixed global τ is no longer the ideal option; it is better to provide fine-grained, per-user temperatures of different strengths.
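The weighting effect can be illustrated as follows (toy scores; the weights are the softmax mass each negative receives, per the analysis above):

```python
import numpy as np

def negative_weights(neg_scores, tau):
    """Relative softmax mass each negative receives in the gradient,
    proportional to exp(score / tau)."""
    w = np.exp(neg_scores / tau)
    return w / w.sum()

negs = np.array([0.9, 0.3, 0.2, 0.1])        # the first is a hard negative
w_small = negative_weights(negs, tau=0.05)   # nearly all mass on the hard one
w_large = negative_weights(negs, tau=5.0)    # close to uniform (0.25 each)
```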

4. Method

To address this problem, in this section we propose Adap-τ, which adaptively and automatically adjusts the representation modulus length in recommender systems. Following the theoretical analysis above, we pursue two goals:

  • Adaptive principle: The temperature coefficient should be adaptive to avoid vanishing gradients.

  • Fine-grained principle: The temperature coefficient should be personalized per user; that is, the harder a user's samples are to distinguish, the larger the temperature coefficient that should be used.

4.1 Adap-τ: Implementing Adaptive Temperature

Following the lemma, we look for the temperature coefficient that maximizes the gradient magnitude:

063a6341fc5138ef6766744b4f94a523.png

Directly optimizing the above formula involves expensive computation over all user-item interactions, so we use an approximate estimation. First we give a bound that this objective satisfies:

Lemma 2: Let g(τ) be the logit score of an instance controlled by the temperature τ (766bd5680624160c45b38f9fc464a78e.png). Then the objective admits the following bound:

9e0f9d37a898ac115dd4621ba03ef887.png

The gradient objective attains this bound when the corresponding condition on τ holds.

With the bound from Lemma 2, we further have:

Lemma 3: Let one distribution be that of all samples and the other that of the positive samples, with a variable randomly sampled from each. Assume both distributions have sub-exponential tails, i.e., the following holds for some constant c > 0:

8606916716a23adad226a8304f3330de.png

Under this assumption, it can be approximated as:

16616229210b20f4f4e61e3a1346f6af.png

When the variances of the two distributions are close to each other (the appendix demonstrates the validity of this assumption), the expression can be simplified further.

Here we assume that the moments involved are convergent and that the tails of the distributions decay at least as fast as an exponential. This assumption is reasonable because sub-exponential distributions are quite common: they include the Gaussian, exponential, Gamma, Pareto, and Cauchy distributions, among others. Furthermore, [2] proves that all bounded random variables are sub-exponential.

In fact, in our experiments we always observe that these quantities converge to a specific region with rather small values. Furthermore, we observe that the two distributions generally have very close variances (see the paper's appendix). These observations verify that the approximation holds in practice.

4.2 Adap-τ: Implementing an adaptive fine-grained temperature

Following the second principle, we introduce a personalized temperature for each user and supervise its learning using ideas from SuperLoss [3]. Specifically, SuperLoss adjusts the temperature adaptively according to each user's sample loss. It consists of a loss-aware term and a regularization term:

1b5392d5b8422df7b12d79de3b637634.png

From the properties of SuperLoss, we can derive its closed-form solution:

bcb935a4815459408a90e498f386d700.png

where W denotes the Lambert-W function, the inverse of f(x) = x·eˣ. As the objective indicates, the temperature increases monotonically with the user's loss: users with larger losses get a larger temperature, reducing the confidence placed in their samples. At the same time, the adaptive temperature serves as a baseline that scales the per-user temperature into an appropriate range.
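Since the Lambert-W function is just the inverse of x·eˣ on its principal branch, a minimal Newton iteration suffices for illustration (for real use, `scipy.special.lambertw` provides the same; this sketch assumes z > 0):

```python
import math

def lambert_w(z, tol=1e-12):
    """Principal branch of the Lambert-W function (solves w * e^w = z)
    via Newton's method; valid for z > 0."""
    w = math.log1p(z)                        # cheap starting guess
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1))
        w -= step
        if abs(step) < tol:
            break
    return w

# Inverse property: W(x * e^x) == x
x = 1.5
assert abs(lambert_w(x * math.exp(x)) - x) < 1e-9
```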

5. Experiment

Our experiments are designed around the following three questions:

  • How does Adap-τ perform compared to other strategies?

  • Does Adap-τ adapt to different datasets and users?

  • How do models with representation normalization and adaptation compare to state-of-the-art models in accuracy and efficiency?

5.1 Model Performance Comparison

bf8c0da09a0f27d48751c52377790b41.png

1ec8639e3e4c83aa6d3624d16c9277fe.jpeg

Experiments show that our strategy achieves performance improvements on a variety of backbone models, and it also alleviates the popularity-bias problem.

5.2 Model Adaptive Performance

In this section, we examine the adaptability of our model to different levels of noisy data. Two strategies are used to add noise to the dataset: 1) according to each user's historical interaction frequency, we add false positive samples in the same proportion for all users; 2) we randomly divide users into four groups and add false positive samples to the groups at increasing rates (10%, 20%, 30%, 40%). Strategy 1 focuses on adaptive performance under a uniform noise ratio (i.e., global adaptability), while strategy 2 focuses on performance differences among individual users under different noise ratios (i.e., local adaptability).

586d15013c4164dc0f66cbafaee39a0e.png

For global adaptability, our proposed Adap-τ exceeds the results of a hyperparameter grid search at every noise ratio.

97809c9d4ee343e494a2575187eaf8bd.jpeg

At the same time, for local adaptability, we record the distribution of τ in each group. The figure shows that our strategy indeed achieves fine-grained adjustment among users: the smaller a user's noise ratio, the smaller the temperature coefficient, and vice versa.

5.3 Comparison with SOTA

da324d2acfde924335f4d4affe51abbd.jpeg

Finally, we compare the running time and performance of our model against SOTA models from the past two years. The figure shows that our model achieves a better trade-off: it obtains the best performance without increasing time complexity.

6. Summary

In this work, we focus on the representation modulus in recommender systems. Through theoretical and empirical analysis, we emphasize the importance of representation normalization, and we also point out the drawbacks of applying normalization alone. We therefore propose two principles to guide adaptive learning. Experiments verify that our simple method is effective across a large number of datasets. Most importantly, our model is adaptive and user-personalized without repeated hyperparameter searches across datasets.

We believe that a thorough understanding of normalized representations will greatly benefit the recommender-systems community. In the future, we anticipate further applications addressing practical problems in collaborative filtering, and we hope to extend this work to areas beyond recommendation.

References

[1] Jiancan Wu, Xiang Wang, Xingyu Gao, Jiawei Chen, Hongcheng Fu, Tianyu Qiu, and Xiangnan He. 2022. On the Effectiveness of Sampled Softmax Loss for Item Recommendation. arXiv preprint arXiv:2201.02327 (2022).

[2] Henry W. Block and Zhaoben Fang. 1988. A multivariate extension of Hoeffding's lemma. The Annals of Probability (1988), 1803–1820.

[3] Thibault Castells, Philippe Weinzaepfel, and Jerome Revaud. 2020. SuperLoss: A Generic Loss for Robust Curriculum Learning. In NeurIPS.

Editor: Yu Tengkai

Proofreading: Qiu Tingting
