CrossEntropyLoss (cross-entropy loss function in PyTorch)

Introduction

The cross-entropy loss function is mainly used for multi-class classification tasks. It computes the cross-entropy between the model's output and the ground-truth labels, and serves as the objective function for model optimization.

In a multi-class classification task, each sample belongs to one of several possible categories, and the model outputs a probability distribution over those categories for each sample. The cross-entropy loss measures the distance between this predicted distribution and the true label, thereby guiding the optimization of the model.

Usage in the PyTorch library

class torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

Parameter introduction

  • weight: a one-dimensional tensor of size M, where M is the number of classes; it assigns a weight to each class when computing the loss.
  • ignore_index: an int used to specify a class index that is ignored in the loss. Defaults to -100, meaning no class is ignored.
  • reduction: specifies how the per-sample losses are reduced. Options are ‘none’ (return the loss for each sample), ‘mean’ (return the average of the per-sample losses, the default), and ‘sum’ (return their sum). A short sketch after this list illustrates these parameters.
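A minimal sketch of how these parameters behave; the class count, logits, and labels below are made-up values for illustration only:

import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # 4 samples, 3 classes, raw scores
target = torch.tensor([0, 2, 1, 2])   # class indices

# weight: one entry per class, length 3 here
weighted = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 0.5]))

# ignore_index: samples labeled -100 (the default) contribute nothing to the loss
masked = nn.CrossEntropyLoss(ignore_index=-100)

# reduction='none': keep one loss value per sample instead of averaging
per_sample = nn.CrossEntropyLoss(reduction='none')

print(weighted(logits, target))    # scalar, weighted average loss
print(masked(logits, target))      # scalar, same as the default here (no label is -100)
print(per_sample(logits, target))  # tensor of shape [4]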

Concrete usage examples

import torch
import torch.nn as nn

batch_size = 32
class_num = 3
inputs = torch.rand(batch_size, class_num)                 # raw model scores (logits), shape [32, 3]
target = torch.randint(0, class_num, size=(batch_size,))   # class indices, shape [32]

# The loss module must be instantiated first, then called on (inputs, target).
# CrossEntropyLoss applies log-softmax internally, so the inputs should be raw
# logits; there is no need to pass them through nn.Softmax beforehand.
loss_func = nn.CrossEntropyLoss()
loss = loss_func(inputs, target)
print(loss)

# Note that size=(batch_size,) needs the trailing comma so that it is a tuple.

Model input

  • inputs: the output of the model, of shape (batch_size, class_num), where class_num is the number of categories. These are raw, unnormalized scores (logits); CrossEntropyLoss applies log-softmax to them internally, so they do not need to be converted to probabilities first.
  • target: the ground-truth labels, of shape (batch_size,), where each element is the index of the class the sample belongs to.
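As a sanity check, the built-in loss can be reproduced by applying log-softmax and the negative log-likelihood loss explicitly; the logits and labels below are arbitrary illustration values:

import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # arbitrary raw scores for 4 samples, 3 classes
target = torch.tensor([2, 0, 1, 1])   # arbitrary class indices

ce = nn.CrossEntropyLoss()(logits, target)

# Equivalent two-step computation: log-softmax followed by negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)

print(ce, nll)  # the two values are identical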

Calculation method

Binary cross-entropy loss function

L = \frac{1}{N} \sum_i L_i = -\frac{1}{N} \sum_i \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right]

Parameter introduction

  • N is the number of samples
  • L_i is the loss of the i-th sample
  • y_i is the label of the i-th sample: 1 if the sample belongs to the positive class, 0 otherwise
  • p_i is the probability output by the model that the i-th sample belongs to the positive class, a value between 0 and 1
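A minimal sketch that evaluates the formula directly and compares it against PyTorch's built-in binary cross-entropy; the probabilities and labels are made-up values:

import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2, 0.7, 0.4])   # made-up predicted probabilities p_i
y = torch.tensor([1.0, 0.0, 1.0, 1.0])   # made-up binary labels y_i

# Loss computed directly from the formula above
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# Same value from the built-in binary cross-entropy
builtin = F.binary_cross_entropy(p, y)

print(manual, builtin)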

Multi-class cross-entropy loss function

L = \frac{1}{N} \sum_i L_i = -\frac{1}{N} \sum_i \sum_{c=1}^{M} y_{ic} \log(p_{ic})

Parameter introduction

  • N is the number of samples
  • M is the number of categories
  • y_{ic} is the label of the i-th sample for the c-th category: 1 if sample i belongs to category c, 0 otherwise
  • p_{ic} is the probability output by the model that the i-th sample belongs to the c-th category
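A minimal sketch that evaluates the multi-class formula with one-hot labels and softmax probabilities, then checks it against nn.CrossEntropyLoss (which works on raw logits); the sizes, logits, and labels are made-up values:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(5, 4)                 # 5 samples, 4 classes, raw scores
target = torch.tensor([0, 3, 1, 2, 2])     # class indices

# Manual computation following the formula: one-hot labels y_ic and
# softmax probabilities p_ic, then L = -1/N * sum_i sum_c y_ic * log(p_ic)
y = F.one_hot(target, num_classes=4).float()
p = F.softmax(logits, dim=1)
manual = -(y * torch.log(p)).sum(dim=1).mean()

# The built-in loss gives the same value
builtin = nn.CrossEntropyLoss()(logits, target)

print(manual, builtin)  # identical up to floating-point error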

Advantages

When the model is optimized with backpropagation and gradient descent, the parameter updates depend on the learning rate and on the gradient (the partial derivatives of the loss). The learning rate is set manually, so the behaviour is governed by the gradient: the worse the model's predictions are, the larger the gradient of the cross-entropy loss, and the faster the model learns. Using the cross-entropy loss, the model therefore learns quickly and converges more easily while its performance is still poor.
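A small sketch illustrating this: for cross-entropy computed on logits, the gradient with respect to the logits equals softmax(logits) minus the one-hot label, so it is large exactly when the prediction is far from the label. The logits below are made-up and deliberately favour the wrong class:

import torch
import torch.nn as nn
import torch.nn.functional as F

# One sample, three classes; the made-up logits deliberately favour class 0
logits = torch.tensor([[3.0, 0.5, -1.0]], requires_grad=True)
target = torch.tensor([2])   # the true class is 2

loss = nn.CrossEntropyLoss()(logits, target)
loss.backward()

# The gradient equals softmax(logits) - one_hot(target): large here because the
# prediction is badly wrong, which drives fast learning
print(logits.grad)
print(F.softmax(logits.detach(), dim=1) - F.one_hot(target, num_classes=3).float())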

Disadvantages

The loss focuses purely on classification: it cares mainly about the accuracy of the predicted probability for the correct class, so it learns to separate the categories but easily ignores the differences and relationships among the other labels, and the learned feature representations tend to be relatively loose.
