Comparison of the NLLLoss, KLDivLoss, and CrossEntropyLoss loss functions

Prerequisite knowledge

These three loss functions are very common in deep learning models, especially in knowledge distillation, where they are frequently compared with one another.

1. Softmax function

The softmax function is commonly used as a normalization function for multi-class classification. Its formula is as follows:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
The softmax function has some key features:

  • The softmax outputs are all greater than 0 and sum to 1, so they form a valid probability distribution. This is easy to see: the denominator is simply the sum of all the numerator terms.
  • It widens the gap between values. This is caused by the exponential function $e^x$: as its graph shows, the larger the input $x$, the faster the output grows. Consider the following example.

x           1      2      3      4
softmax(x)  0.032  0.087  0.237  0.644
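
As a quick check, here is a minimal PyTorch sketch (the tensor values are just the ones from the table above) that reproduces these softmax outputs:

import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

# softmax normalizes the exponentials so that they sum to 1
print(F.softmax(x, dim=0))                 # ≈ [0.032, 0.087, 0.237, 0.644]

# equivalent manual computation from the formula
print(torch.exp(x) / torch.exp(x).sum())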

This raises a further question: is there a way to make the gap between the values less extreme? Here a hyperparameter, the temperature T, is used to control the gap. The formula is as follows:

$$\mathrm{softmax}(x, T)_i = \frac{e^{x_i / T}}{\sum_{j=1}^{n} e^{x_j / T}}$$
Following the previous example, we set T = 0.5, 1, 2, 4 and observe how the values change.

T \ x   1      2      3      4
0.5     0.002  0.016  0.117  0.865
1       0.032  0.087  0.237  0.644
2       0.102  0.167  0.276  0.455
4       0.165  0.212  0.273  0.350
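
A minimal sketch of the temperature-scaled softmax: dividing the inputs by T before applying softmax reproduces the rows of the table above.

import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

for T in (0.5, 1.0, 2.0, 4.0):
    # temperature scaling: divide the inputs by T before softmax
    print(T, F.softmax(x / T, dim=0))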

It can be seen that as T increases, the gap between the values of the different categories shrinks (more attention is paid to the negative labels, i.e., the incorrect classes), but their relative ordering does not change. The figure below shows another plot of softmax values versus temperature.
[Figure: softmax output values at different temperatures]

2. log_softmax function

The log_softmax function takes the output of softmax and feeds it into the logarithm function:

$$\log(\mathrm{softmax}(x))$$
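
In PyTorch this is available as F.log_softmax, which is equivalent to taking the log of the softmax output but is computed in a more numerically stable way. A minimal sketch:

import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

# two-step version: log applied to the softmax output
print(torch.log(F.softmax(x, dim=0)))

# fused, numerically stabler version: same values
print(F.log_softmax(x, dim=0))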

NLLLoss function

NLLLoss (negative log-likelihood loss) measures the gap between two distributions p and q. The formula is as follows:

$$\mathrm{NLLLoss}(p, q) = -\sum_{i} q_i \log p_i$$
The concept of information entropy is involved here; you can learn more about it from my blog post on Bayesian machine learning. In simple terms, the greater the difference between p and q, the larger the final loss value, and the negative sign is what makes this relationship hold.
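
A minimal sketch of how F.nll_loss consumes log-probabilities in PyTorch. With integer class targets (the logits and target tensors below are made-up illustrative values), it simply picks out the negative log-probability of the target class and averages over the batch:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                       [4.0, 1.0, 1.0, 1.0]])
target = torch.tensor([3, 0])            # ground-truth class indices

log_p = F.log_softmax(logits, dim=1)     # NLLLoss expects log-probabilities
print(F.nll_loss(log_p, target))

# manual equivalent: -log p of the target class, averaged over the batch
print(-log_p[torch.arange(2), target].mean())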

CrossEntropyLoss function

The CrossEntropyLoss function is also known as the cross-entropy loss function. Its formula has exactly the same form as NLLLoss, but the specific meanings of p and q differ: here the input is first passed through log_softmax, which CrossEntropyLoss performs internally in PyTorch (i.e., CrossEntropyLoss is equivalent to log_softmax followed by NLLLoss).

$$\mathrm{CrossEntropyLoss}(p, q) = -\sum_{i} q_i \log p_i$$
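
Because of this, F.cross_entropy applied to raw logits gives the same result as log_softmax followed by F.nll_loss. A minimal sketch with randomly generated logits and labels:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 4)               # raw, unnormalized scores
target = torch.randint(0, 4, (8,))       # ground-truth class indices

# CrossEntropyLoss = log_softmax + NLLLoss
a = F.cross_entropy(logits, target)
b = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(torch.allclose(a, b))              # True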

KLDivLoss function

KLDivLoss is used to judge how well two distributions fit/match each other (their similarity). Suppose there are two probability distributions P and Q; their KL divergence is:

$$D_{KL}(P \| Q) = -\sum_i P(i) \ln\frac{Q(i)}{P(i)} = \sum_i P(i) \ln\frac{P(i)}{Q(i)}$$

$$D_{KL}(p \| q) = \sum_i p(x_i) \ln\frac{p(x_i)}{q(x_i)} = \sum_i p(x_i) \left[\log(p(x_i)) - \log(q(x_i))\right]$$
KL divergence is a useful distance measure for continuous distributions, and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.
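
A minimal sketch of F.kl_div in PyTorch (the distributions here are random placeholders). Note that, with the default settings, the first argument is expected to contain log-probabilities while the target contains probabilities, and reduction='batchmean' matches the mathematical definition above:

import torch
import torch.nn.functional as F

p = F.softmax(torch.randn(8, 4), dim=1)           # target distribution P
log_q = F.log_softmax(torch.randn(8, 4), dim=1)   # log of the predicted distribution Q

# D_KL(P || Q) = sum_i P(i) * (log P(i) - log Q(i)), averaged over the batch
loss = F.kl_div(log_q, p, reduction='batchmean')

# manual equivalent
manual = (p * (p.log() - log_q)).sum(dim=1).mean()
print(torch.allclose(loss, manual))               # True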

The differences between NLLLoss, CrossEntropyLoss, and KLDivLoss

The difference between NLLLoss and CrossEntropyLoss is just a log_softmax() call: CrossEntropyLoss applies log_softmax to its input internally, while NLLLoss expects log-probabilities that have already been through log_softmax. The difference between KLDivLoss and the other two lies in what is summed: KL divergence sums the weighted difference of two log terms, p(x_i)[log p(x_i) − log q(x_i)], whereas the other two sum only −q_i log p_i (and of course the concrete formulas differ). The code snippets below show the corresponding usage in PyTorch.

# NLLLoss: the model itself outputs log-probabilities via log_softmax
def forward(self, x):
    x = self.fc2(x)
    x = F.log_softmax(x, dim=1)
    return x

# output: log-probabilities from forward(); target: ground-truth class indices
loss = F.nll_loss(output, target)

# CrossEntropyLoss: the model returns raw logits; log_softmax is applied inside the loss
def forward(self, x):
    x = self.fc2(x)
    return x

# output: raw logits from forward(); target: ground-truth class indices
loss = F.cross_entropy(output, target)
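
To make the comparison concrete, here is a sketch (the logits and labels are random placeholders) showing that, for a one-hot target distribution, KLDivLoss reduces to the cross-entropy loss, because the entropy of a one-hot vector is zero:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 4)
target = torch.randint(0, 4, (8,))
onehot = F.one_hot(target, num_classes=4).float()   # one-hot version of the labels

ce = F.cross_entropy(logits, target)
# zero entries in the target contribute 0 to the KL sum (0 * log 0 is treated as 0)
kl = F.kl_div(F.log_softmax(logits, dim=1), onehot, reduction='batchmean')
print(torch.allclose(ce, kl))                       # True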