Looking at Attention’s Scale operation from gradient maximization


©PaperWeekly Original · Author | Su Jianlin

Affiliation | Moonshot AI

Research interests | NLP, neural networks

We know that the Scale factor of Scaled Dot-Product Attention is $\frac{1}{\sqrt{d}}$, where $d$ is the dimension of $\boldsymbol{q},\boldsymbol{k}$. The usual explanation of this factor is: without dividing by $\sqrt{d}$, the initial Attention would be very close to a one-hot distribution, which causes vanishing gradients and makes the model untrainable. However, it can also be shown that when the Scale equals 0 there is a vanishing-gradient problem as well, which means the Scale hurts when it is either too large or too small.
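To make the near-one-hot claim concrete, here is a minimal Mathematica sketch (the sizes d = 256 and n = 32 are arbitrary, and q, k are assumed to have i.i.d. standard normal components):

(* raw dot products of N(0,1) vectors have standard deviation about Sqrt[d], so the
   softmax over them is close to one hot, while dividing by Sqrt[d] keeps it spread out *)
d = 256; n = 32;
q = RandomVariate[NormalDistribution[], d];
ks = RandomVariate[NormalDistribution[], {n, d}];
soft[s_] := Exp[s - Max[s]]/Total[Exp[s - Max[s]]]
{Max[soft[ks . q]], Max[soft[(ks . q)/Sqrt[d]]], 1./n}
(* typically: the first entry is close to 1, the second is far below 1, the third is the uniform level *)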

So what Scale is appropriate? Is $\frac{1}{\sqrt{d}}$ already the best choice? This article tries to answer these questions from the perspective of gradients.


Existing results

In “A Brief Talk on the Initialization, Parameterization and Standardization of Transformer” [1], we already derived the standard $\frac{1}{\sqrt{d}}$ Scale factor. The idea is simple: assume that at initialization the components of $\boldsymbol{q},\boldsymbol{k}\in\mathbb{R}^d$ are sampled independently from a distribution with mean 0 and variance 1; then one can compute

$$\mathbb{E}[\boldsymbol{q}\cdot\boldsymbol{k}] = 0,\qquad \mathrm{Var}[\boldsymbol{q}\cdot\boldsymbol{k}] = d \tag{1}$$

So we divide by $\sqrt{d}$, which makes the variance of the Attention Score equal to 1. In other words, the earlier derivation rests purely on the belief that “mean 0, variance 1” is better: it neither explains why the variance of the Attention Score should be 1, nor checks whether this actually solves the vanishing-gradient problem.
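As a quick numerical sanity check of (1), here is a minimal sketch, again assuming i.i.d. standard normal components; the sizes d = 64 and m = 100000 sample pairs are arbitrary:

(* Monte-Carlo check of E[q·k] = 0 and Var[q·k] = d *)
d = 64; m = 100000;
qs = RandomVariate[NormalDistribution[], {m, d}];
ks = RandomVariate[NormalDistribution[], {m, d}];
dots = MapThread[Dot, {qs, ks}];
{Mean[dots], Variance[dots]}   (* should come out close to {0, 64} *)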

Of course, judging from existing experiments, the problem is at least alleviated to some degree; but these are, after all, empirical results, and we would still like to know theoretically what that “degree” is.


Computing the gradient

Since gradients are what we care about, the best approach is to compute them explicitly and then set up an optimization objective. Write the Attention probabilities with a Scale factor $\alpha$ as $p_i = \frac{e^{\alpha s_i}}{Z}$, where $Z = \sum_{i=1}^n e^{\alpha s_i}$ is the normalization factor; then a direct calculation gives:

$$\frac{\partial p_i}{\partial s_j} = \alpha\left(p_i\delta_{ij} - p_i p_j\right) \tag{2}$$

or, in abbreviated form, $\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{s}} = \alpha\left(\operatorname{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^{\top}\right)$. Clearly, the gradient is 0 when $\boldsymbol{p}$ is a one-hot distribution.

To make optimization easier, we should choose $\alpha$ so that the gradient is as large as possible. To this end, we use the L1 norm as the measure of gradient size:

$$\left\Vert\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{s}}\right\Vert_1 = \sum_{i,j}\alpha\left|p_i\delta_{ij} - p_i p_j\right| = 2\alpha\left(1 - \sum_i p_i^2\right) \tag{3}$$

It is not hard to guess from the final result that the real reason for choosing the L1 norm rather than any other is simply that its value works out to a clean enough expression. It is worth pointing out that $\sum_i p_i^2$ appears here; it is essentially the “Rényi entropy” introduced in “How to Measure the Sparsity of Data?”, which, like Shannon entropy, is a measure of uncertainty.
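Both (2) and (3) are easy to verify numerically. The following sketch compares (2) against finite differences and checks the closed form in (3); the sizes n = 8 and α = 2 are arbitrary:

(* compare the analytic softmax Jacobian (2) with finite differences, and check (3) *)
n = 8; a = 2.; eps = 10.^-6;
s = RandomVariate[NormalDistribution[], n];
p[v_] := Exp[a v]/Total[Exp[a v]]
jac = a (DiagonalMatrix[p[s]] - Outer[Times, p[s], p[s]]);
numJac = Transpose@Table[(p[s + eps UnitVector[n, j]] - p[s])/eps, {j, n}];
{Max@Abs[jac - numJac],                        (* small: (2) matches finite differences *)
 Total[Abs[jac], 2], 2 a (1 - Total[p[s]^2])}  (* the last two agree, as in (3) *)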

Once we have the objective, we can set about maximizing it. Note that the definition of $p_i$ also contains $\alpha$, so this is a complicated nonlinear objective in $\alpha$. An analytical solution looks out of reach, but we can find approximate solutions for some special cases.


Normal distribution

First, we can build on the earlier result: assume the scores have already been divided by $\sqrt{d}$ so that the Attention Score has mean 0 and variance 1, and adopt the approximating hypothesis that each $s_i$ follows the standard normal distribution; then we solve for the optimal $\alpha^*$. If $\alpha^* = 1$, the original $\frac{1}{\sqrt{d}}$ is already the optimal Scale; otherwise $\frac{\alpha^*}{\sqrt{d}}$ is the optimal Scale.

We estimate the sums $\sum_i e^{\alpha s_i}$ and $\sum_i e^{2\alpha s_i}$ by their expectations:

$$\sum_{i=1}^n e^{\alpha s_i}\approx n\,\mathbb{E}_{s}\!\left[e^{\alpha s}\right],\qquad \sum_{i=1}^n e^{2\alpha s_i}\approx n\,\mathbb{E}_{s}\!\left[e^{2\alpha s}\right] \tag{4}$$

For $s$ following the standard normal distribution, we have

$$\mathbb{E}_{s}\!\left[e^{\alpha s}\right] = e^{\alpha^2/2} \tag{5}$$

Substituting this into (4) and then into equation (3), we obtain

$$2\alpha\left(1 - \frac{e^{\alpha^2}}{n}\right) \tag{6}$$
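Before maximizing (6), we can sanity-check the approximation behind it, namely $\sum_i p_i^2 \approx e^{\alpha^2}/n$, with a small Monte-Carlo sketch (n = 1000 and α = 1 are arbitrary choices):

(* Monte-Carlo check of Σ p_i^2 ≈ Exp[α^2]/n for i.i.d. standard normal scores *)
n = 1000; a = 1.;
sumP2 := With[{s = RandomVariate[NormalDistribution[], n]},
   Total[(Exp[a s]/Total[Exp[a s]])^2]];
{Mean[Table[sumP2, {200}]], Exp[a^2]/n}   (* the two values should roughly agree *)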

Although this final approximation is simple enough, its maximum is still not easy to obtain in closed form. That is not a problem: we can sweep over a range of $n$, solve for the maximizer $\alpha^*$ numerically, and thereby see roughly how $\alpha^*$ depends on $n$. Mathematica reference code:

(* the approximate objective for the standard normal case *)
f[a_, n_] := a*(1 - Exp[a^2]/n)
(* find the maximizer a* of f for a given n *)
FindArg[n_] :=
  Module[{a}, a = a /. Last@NMaximize[{f[a, n], a > 0}, a][[2]]; a]
(* the range of n *)
nRange = 40*Range[1, 500];
(* the a* corresponding to each n *)
args = FindArg /@ nRange;
(* plot a* against n, together with the fitted curve 0.84*Sqrt[Log[n]] *)
ListLinePlot[{args, 0.84*Log[nRange]^0.5},
  DataRange -> {40, 20000}, AxesLabel -> {"n", "a"},
  PlotLegends -> {Row[{"a", Superscript["", "*"]}],
    TraditionalForm[HoldForm[0.84*Sqrt[Log[n]]]]}]

After fitting, the author found that within this range the maximizer roughly satisfies $\alpha^* \approx 0.84\sqrt{\log n}$, so the corresponding approximate curve is drawn in the same figure:

▲ The optimal $\alpha^*$ versus $n$ for the standard normal distribution

It can be seen that over a fairly wide range of $n$ (here 40 to 20000) the optimal $\alpha^*$ stays roughly between 1.6 and 2.6, so as a compromise one can simply take $\alpha = 2$, i.e. use $\frac{2}{\sqrt{d}}$ as Attention's Scale factor, which is theoretically more conducive to optimization.
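For concreteness, this is how such a compromise would plug into a single-query attention step; a purely illustrative sketch, with made-up sizes and a made-up helper name attn:

(* single-query scaled dot-product attention with Scale alpha/Sqrt[d] *)
d = 64; n = 16;
q = RandomVariate[NormalDistribution[], d];
ks = RandomVariate[NormalDistribution[], {n, d}];
vs = RandomVariate[NormalDistribution[], {n, d}];
attn[alpha_] := Module[{s, p},
   s = alpha (ks . q)/Sqrt[d];
   p = Exp[s - Max[s]]/Total[Exp[s - Max[s]]];
   p . vs];
{attn[1], attn[2]} // Dimensions   (* attn[1]: the standard 1/Sqrt[d]; attn[2]: the compromise 2/Sqrt[d] *)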


Cosine distribution

Now let us consider another, less common example: if we normalize $\boldsymbol{q},\boldsymbol{k}$ into unit vectors, their inner product becomes the cosine of the angle between them, which approximately follows the distribution of the angle cosine between two random vectors in $d$-dimensional space. Some readers may not be familiar with this distribution, but we have already discussed it in “The Angle Distribution of Two Random Vectors in n-Dimensional Space” [2]. Its probability density has the form

$$p(s) = \frac{\Gamma\!\left(\frac{d}{2}\right)}{\Gamma\!\left(\frac{d-1}{2}\right)\sqrt{\pi}}\left(1-s^2\right)^{\frac{d-3}{2}},\qquad s\in[-1,1] \tag{7}$$
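As a quick empirical check of this density (a sketch with an arbitrary $d = 128$): the cosine of the angle between two random Gaussian vectors should have mean 0 and variance $1/d$, which follows from the density above.

(* cosine of the angle between two random d-dimensional Gaussian vectors:
   empirically mean ≈ 0 and variance ≈ 1/d *)
d = 128; m = 50000;
us = Normalize /@ RandomVariate[NormalDistribution[], {m, d}];
vs = Normalize /@ RandomVariate[NormalDistribution[], {m, d}];
cs = MapThread[Dot, {us, vs}];
{Mean[cs], Variance[cs], 1./d}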

The density does not look complicated, but it is actually much harder to handle than the normal distribution, mainly because $\mathbb{E}_s[e^{\alpha s}]$ can no longer be written as an elementary function as in equation (5). This is not a big obstacle to a numerical solution in Mathematica, however. Following the same idea as the previous section, approximation (4) still applies: first solve numerically for the maximizer, then fit. The results are as follows (the figure uses $d = 128$, and the fitted curve depends on $d$):

▲ The optimal $\alpha^*$ versus $n$ for the cosine distribution

It can be seen that the fit with $3.5\log n$ is also quite good (with a different $d$, the coefficient 3.5 changes). Over this fairly wide range, $\alpha^*$ lies roughly between 13 and 35, so if cosine values are used as the Attention Score, they need to be multiplied by a Scale of that order for the model to train easily. This also explains why, when we build a Softmax distribution from cosine values (as in AM-Softmax, SimCSE [3], etc.), we have to multiply by a Scale of about 30: without it, the model is very hard to train.
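A small sketch of this last point, reusing the gradient measure (3); the sizes n = 1000, d = 128 and the two Scale values compared are arbitrary choices:

(* raw cosine scores have standard deviation about 1/Sqrt[d]; without a large Scale
   the softmax is nearly uniform and the gradient measure 2α(1 - Σ p_i^2) stays small *)
n = 1000; d = 128;
s = RandomVariate[NormalDistribution[0, 1/Sqrt[d]], n];   (* stand-in for cosine scores *)
gradSize[a_] := With[{p = Exp[a s - Max[a s]]/Total[Exp[a s - Max[a s]]]},
   2 a (1 - Total[p^2])];
{gradSize[1], gradSize[30]}   (* typically roughly {2, a few tens}: Scale ≈ 30 gives a far larger gradient *)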

For different $d$ and $n$, readers can modify the following code to compute the optimal $\alpha^*$:

(* E[e^{a s}] under the cosine density, up to normalization (d kept symbolic, set to 128 below) *)
h[a_] :=
  Integrate[Exp[a*s]*(1 - s^2)^((d - 3)/2), {s, -1, 1},
    Assumptions -> {d > 10}]
g[a_] = h[a]/h[0] // FullSimplify;
(* the approximate objective for the cosine case *)
f[a_, n_] := a (1 - g[2*a]/g[a]^2/n) /. {d -> 128}
(* find the maximizer a* of f for a given n *)
FindArg[n_] :=
  Module[{a}, a = a /. Last@NMaximize[{f[a, n], a > 0}, a][[2]]; a]
(* the range of n *)
nRange = 40*Range[1, 500];
(* the a* corresponding to each n *)
args = FindArg /@ nRange;
(* plot a* against n, together with the fitted curve 3.5*Log[n] *)
ListLinePlot[{args, 3.5*Log[nRange]},
  DataRange -> {40, 20000}, AxesLabel -> {"n", "a"},
  PlotLegends -> {Row[{"a", Superscript["", "*"]}],
    TraditionalForm[HoldForm[3.5*Log[n]]]}]


Related Thoughts

The title and results of this article, especially the finding that $\alpha^*$ is approximately proportional to $\log n$ for the cosine distribution, naturally bring to mind another article on Attention's Scale, “Looking at Attention's Scale Operation from the Invariance of Entropy”.

In fact, the connection between the two articles is real. The “Rényi entropy” appears in the optimization objective (3) of this article, while the entropy in “entropy invariance” is the Shannon entropy, and the two behave very similarly. Maximizing (3) pushes the distribution into a “slowly varying” region, in which the Rényi entropy changes very slowly with $n$; this in turn means the Shannon entropy also changes very slowly with $n$, which is approximately the statement of entropy invariance.

In addition, for bidirectional Attention (an Encoder), if all training samples have the same length, then $n$ is a constant, so we can compute the corresponding optimal $\alpha^*$ and fix it in the model. For unidirectional Attention (a Decoder), however, $n$ differs from token to token (it equals the position id plus 1), so in theory equation (3) cannot be maximized for every token at once. Since $\alpha^*$ changes only slowly with $n$, a single representative value (the $\alpha^*$ computed at some typical $n$) is enough, and it is already reasonably friendly to the gradients of most tokens.
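To make the “slow variation” concrete, here is a tiny sketch based on the earlier normal-distribution fit $\alpha^* \approx 0.84\sqrt{\log n}$ (the positions listed are arbitrary):

(* α* as a function of the per-token n in a Decoder changes only slowly *)
alphaStar[n_] := 0.84*Sqrt[Log[n]]
alphaStar /@ {64., 512., 1024., 2048., 4096.}
(* ≈ {1.71, 2.10, 2.21, 2.32, 2.42}: one intermediate value already serves most positions well *)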


Article summary

This article discussed how to choose Attention's Scale factor from the perspective of gradients. As is well known, the “standard answer” for this factor is $\frac{1}{\sqrt{d}}$, but its derivation never addresses optimality. The author therefore defined an optimization objective based on the gradient of Softmax and discussed the optimal Scale factor by maximizing that objective. The results can be used to improve the Scale factor of Attention, and also to explain the temperature parameter in contrastive learning of similarity.


References


[1] https://kexue.fm/archives/8620#NTK parameterization

[2] https://kexue.fm/archives/7076

[3] https://kexue.fm/archives/8348
