Softmax, Cross-entropy Loss and Gradient derivation and Implementation

Table of Contents

1. Overview

2. Sigmoid function

3. Softmax function

3.1 Definition

3.2 Partial derivatives of softmax function

3.3 Gradient of softmax function

3.4 python implementation of softmax and its gradient

4. cross-entropy loss

4.2 logistic loss

4.3 logistic loss gradient

4.4 cross-entropy loss

4.5 cross-entropy loss gradient

5. Gradient of “softmax + cross-entropy” combo

6. batch implementation


1. Overview

This article briefly introduces the softmax function and the cross-entropy loss function commonly used in machine learning and deep learning, and derives their gradients (in particular, the gradient of the cascade of softmax and cross-entropy loss). The derivation proceeds from the partial derivative with respect to a single variable, to the gradient with respect to the input vector, and finally to the matrix representation of the gradient over an entire batch. The corresponding Python implementations are given along the way. These form a basic building block for a completely DIY implementation of a classification neural network in Python.

2. Sigmoid function

The sigmoid function (also known as the logistic function) is a commonly used nonlinear activation function that maps any input value into the range between 0 and 1. It is defined as:

\sigma(z)=\frac{1}{1 + e^{-z}} (1)

The graph of the sigmoid function is S-shaped, and the function is often used as an activation in neural networks to convert an output value into a probability. Its advantage is its good mathematical properties: it is monotonic, continuous, and differentiable, and its derivative has a very “beautiful” form (it can be expressed in terms of the sigmoid function itself, which makes it very convenient to implement). The derivative of the sigmoid function is:

\sigma'(z)=\sigma(z)(1-\sigma(z)) (2)

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    '''
    Sigmoid (logistic) function, formula (1). Works elementwise on numpy arrays.
    '''
    return 1 / (1 + np.exp(-z))

# evaluate sigmoid and its derivative (formula (2)) on [-5, 4.9]
x = np.arange(100) * 0.1 - 5
y = sigmoid(x)
y_derivative = y * (1 - y)
plt.plot(x, y, label='sigmoid')
plt.plot(x, y_derivative, label='derivative of sigmoid')
plt.legend()
plt.show()

Figure 1: sigmoid and its derivative

However, from the perspective of deep learning, the shortcomings of sigmoid are also obvious. As the figure above shows, when the absolute value of the input is large, the derivative is close to 0, which easily leads to the vanishing gradient problem and hurts the training of the neural network. Therefore, in modern deep learning, sigmoid as a hidden-layer activation has mostly been replaced by more effective activation functions (such as ReLU, LeakyReLU, etc.); it is usually only used in the output layer of binary classification networks.

3. Softmax function

3.1 Definition

The sigmoid function is suitable for binary classification. For multi-class problems it needs to be extended, and this extension is the softmax function. The input of softmax is a vector and the output is a vector of the same size. Its characteristic is that it converts any real vector into a probability distribution (strictly speaking, into a set of values that satisfies the probability axioms, so they can be used as probabilities): the output vector sums to 1, and each element represents the probability of the corresponding category. The formula is:

\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum\limits_{j=1}^{n} e^{z_j}} (3)

where \mathbf{z} is a vector with n components (one per class in an n-class problem).

Consider the case of n=2,

\text{softmax}(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2-z_1}} (4)

\text{softmax}(\mathbf{z})_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_1-z_2}} (5)

Assume z_1 and z_2 are the scores for category 1 and category 2 respectively. Then z_1 > z_2 means the input should be judged as category 1, and z_1 < z_2 means category 2. Letting z = z_1 - z_2, z > 0 means category 1 and z < 0 means category 2.

P(\text{class1} \mid \text{data}) = \frac{1}{1 + e^{z_2-z_1}} = \frac{1}{1 + e^{-z}} = \sigma(z) (6)

It can be seen that in the binary case (deciding between category 1 and category 2 can be reduced to deciding whether the input is category 1), the softmax function degenerates into the sigmoid function.
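As a quick numerical check of this equivalence (not part of the original article; a minimal sketch reusing the sigmoid function defined above), the first softmax component of a 2-vector should equal the sigmoid of the score difference z_1 - z_2, as in formula (6):

import numpy as np

# two arbitrary scores for class 1 and class 2 (illustrative values)
z1, z2 = 1.7, -0.3

# softmax component for class 1, formula (4)
p1_softmax = np.exp(z1) / (np.exp(z1) + np.exp(z2))

# sigmoid of the score difference, formula (6); reuses sigmoid() defined above
p1_sigmoid = sigmoid(z1 - z2)

print(p1_softmax, p1_sigmoid)  # both approximately 0.8808
np.testing.assert_almost_equal(p1_softmax, p1_sigmoid)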

3.2 Partial derivatives of softmax function

The partial derivatives of the softmax function (partial derivatives because there are multiple inputs) also have a very beautiful form. As with sigmoid, this comes from the derivative properties of the exponential function, together with the normalization of the output (i.e., the components sum to 1).

First, for the sake of simplicity, write the softmax function and its output as:

y_i = s(z_i) = \text{softmax}(\mathbf{z})_i, \qquad \mathbf{y} = s(\mathbf{z}) = \text{softmax}(\mathbf{z}) (7)

The partial derivative of s(z_i) with respect to z_j can be derived as follows (note that both s(\mathbf{z}) and each s(z_i) are functions of all n variables \{z_1, z_2, \cdots, z_n\}). A case analysis is needed. For brevity, \sum below denotes \sum\limits_{j=1}^n e^{z_j}.

case1: i = j

\frac{\partial y_i}{\partial z_j} = \frac{\partial}{\partial z_i}\left(\frac{e^{z_i}}{\sum}\right) = \frac{e^{z_i}\sum - (e^{z_i})^2}{\sum^2} = \frac{e^{z_i}}{\sum}\left(\frac{\sum - e^{z_i}}{\sum}\right) = y_i(1-y_i) (8)

Doesn’t it look similar to formula (2) above?

case2: i != j

\frac{\partial y_i}{\partial z_j} = -\frac{e^{z_i} e^{z_j}}{\sum^2} = -y_i y_j (9)

Combined, it can be written as follows:

\frac{\partial y_i}{\partial z_j} = I(i=j)\, y_i - y_i y_j (10)

where I(\cdot) denotes the indicator function: it equals 1 if the condition in parentheses holds, and 0 otherwise.

3.3 The gradient of the softmax function

In short, the gradient represents each partial derivative in the form of a vector (or matrix). In general, think of gradients as row vectors.

First consider the gradient of the component y_i (a scalar-by-vector derivative). A note on notation: letters without subscripts (y, z) denote vectors and letters with subscripts denote their components. Vectors are sometimes, but not always, written in bold here, so the absence of a subscript is the reliable cue that a symbol is a vector.

\nabla_{\mathbf{z}}\, y_i = \left[\frac{\partial y_i}{\partial z_1}, \frac{\partial y_i}{\partial z_2}, \cdots, \frac{\partial y_i}{\partial z_n}\right] (11)

The gradient of the entire vector y (a vector-by-vector derivative, i.e., the Jacobian) is obtained by stacking the gradients of the components above vertically into a matrix. This convention is called the numerator layout in vector/matrix calculus; the alternative is the denominator layout, but the former is more common.

\nabla_{\mathbf{z}}\, \mathbf{y} = \begin{bmatrix} \partial_{11} & \partial_{12} & \cdots & \partial_{1n} \\ \partial_{21} & \partial_{22} & \cdots & \partial_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{n1} & \partial_{n2} & \cdots & \partial_{nn} \end{bmatrix} (12)

where the shorthand \partial_{ij} = \frac{\partial y_i}{\partial z_j} is used.

3.4 python implementation of softmax

import numpy as np

def softmax(logits):
    '''
    Softmax over a batch of logits of shape (N, D): N samples, D classes.
    Before doing the division, the row sums must be reshaped into a one-column
    matrix, otherwise NumPy cannot broadcast the division correctly.
    '''
    # subtract the row-wise max for numerical stability (does not change the result)
    shifted = logits - np.max(logits, axis=1).reshape(-1, 1)
    tmp = np.exp(shifted)
    tmp = tmp / np.sum(tmp, axis=1).reshape(-1, 1)
    # each row should now sum to 1
    np.testing.assert_almost_equal(np.sum(tmp, axis=1), 1)
    return tmp


# test of softmax
# logits = np.random.rand(16).reshape(4,4)
logits = np.array([[2,2,2,2]])
s = softmax(logits)
print(logits)
print(s)

Output:
[[2 2 2 2]]
[[0.25 0.25 0.25 0.25]]
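The table of contents also mentions the gradient of softmax. Following formulas (10) and (12), a minimal sketch of the Jacobian for a single logits vector could look like the following (this is not part of the original code; the function name softmax_jacobian is chosen here for illustration, and it reuses the softmax function defined above):

def softmax_jacobian(z):
    '''
    Jacobian of softmax for a single 1-D logits vector z of length n,
    following formula (12): J[i, j] = I(i == j) * y_i - y_i * y_j,
    i.e. J = diag(y) - outer(y, y).
    '''
    y = softmax(z.reshape(1, -1)).ravel()  # reuse the batch softmax above
    return np.diag(y) - np.outer(y, y)

J = softmax_jacobian(np.array([1.0, 2.0, 3.0]))
print(J)
# each row sums to 0, because the softmax outputs always sum to 1
print(J.sum(axis=1))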

4. cross-entropy loss

Cross-entropy loss is a logarithm-based loss function (log loss). Also known as negative log likelihood (NLL) loss, it is a common loss function used for classification problems in machine learning and deep learning. It is used to compare the difference between the probability distribution of the model output and the probability distribution of the true label.

In binary classification, where the logistic (i.e., sigmoid) function is used, the cross-entropy loss is also called the logistic loss. In other words, logistic loss is the special case of cross-entropy loss for binary classification.

4.2 logistic loss

The expression of logistic loss (single sample) is as follows:

\text{logistic loss} = L = -\left\{ y \log(p) + (1-y)\log(1-p) \right\} (13)

where y is the true label of the sample (0 or 1) and p = \hat{y} is the predicted probability that the sample belongs to the positive class, i.e., the output of the sigmoid function (the output of another activation function could also be used, as long as it can be interpreted as a probability).

Clearly, when the true label y is 1 the loss is -\log(\hat{y}), and when the true label y is 0 the loss is -\log(1-\hat{y}). Therefore, the closer the prediction is to the true label, the smaller the loss; the further the prediction is from the true label, the larger the loss.

Generally speaking, sample data in deep learning are processed in (mini-)batches. The total loss of a batch is the average of the per-sample losses, as shown below:

\text{batch logistic loss} = \frac{1}{N} \sum\limits_{n=1}^N -\left\{ y^n \log(p^n) + (1-y^n)\log(1-p^n) \right\} (14)

Here superscripts denote sample indices (subscripts are reserved for vector components, as mentioned above).
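A minimal NumPy sketch of formula (14) follows (the function name batch_logistic_loss and the eps clipping are illustrative assumptions, not from the original text):

def batch_logistic_loss(y_true, p_pred, eps=1e-12):
    '''
    y_true: array of 0/1 labels, shape (N,)
    p_pred: predicted probabilities of the positive class, shape (N,)
    Returns the mean logistic loss of formula (14).
    eps clipping avoids log(0); an illustrative safeguard, not in the text.
    '''
    p = np.clip(p_pred, eps, 1 - eps)
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(batch_logistic_loss(y, p))  # approximately 0.2798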

4.3 logistic loss gradient

A simple single-variable differentiation gives the gradient of the logistic loss (an ordinary derivative, since there is only one variable):

\frac{\partial L}{\partial \hat{y}} = -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right) = \frac{\hat{y}-y}{(1-\hat{y})\,\hat{y}} (15)
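Formula (15) can be sanity-checked with a central finite difference (a throwaway check, not part of the original code; the helper logistic_loss is defined only for this check):

def logistic_loss(y, y_hat):
    # single-sample logistic loss, formula (13)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y, y_hat, h = 1.0, 0.7, 1e-6

# analytic gradient from formula (15)
grad_analytic = (y_hat - y) / ((1 - y_hat) * y_hat)

# numerical gradient via central difference
grad_numeric = (logistic_loss(y, y_hat + h) - logistic_loss(y, y_hat - h)) / (2 * h)

print(grad_analytic, grad_numeric)  # both approximately -1.4286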

4.4 cross-entropy loss

The cross-entropy loss for a single sample can be written as the direct extension of the logistic loss to the multi-class case (i.e., treating the multi-class problem as a combination of several binary classification problems), as in the following formula:

\text{cross-entropy loss} = L = -\sum\limits_{i=1}^k \left\{ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right\} (16)

In the above formula, each term of the sum is the logistic loss for one category, and k is the number of categories. p_i = \hat{y}_i denotes the predicted probability that the sample belongs to category i.

A more concise expression of the cross-entropy loss is the following (in the multi-class case with one-hot labels, the (1-y_i)\log(1-p_i) terms are redundant: that information is already carried by the \log(p_i) terms of the other categories):

\text{cross-entropy loss} = L = -\sum\limits_{i=1}^k y_i \log(p_i) (17)

Likewise, with batch processing, the cross-entropy loss of a batch is the average of the per-sample cross-entropy losses:

Loss = -\frac{1}{N}\sum\limits_{i=1}^N \sum\limits_{j=1}^M y_j^i \log \hat{y}_j^i (18)

where N is the number of samples, M is the number of categories, y_j^i is the probability that the true label of sample i is category j (1 if the true label of sample i is category j, 0 otherwise; i.e., the label is a one-hot vector), and \hat{y}_j^i is the predicted probability that sample i belongs to category j.

The intuitive meaning of cross-entropy loss is that the closer the probability distribution predicted by the model is to the probability distribution of the true label, the smaller the loss. Therefore, when the model predicts the classification of samples more accurately, the cross-entropy loss is smaller.
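A minimal sketch of formula (18) for one-hot labels follows (the function name cross_entropy_loss and the eps clipping are illustrative assumptions, not from the original text):

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    '''
    y_true: one-hot labels, shape (N, M)
    y_pred: predicted probabilities (e.g. softmax output), shape (N, M)
    Returns the mean cross-entropy loss of formula (18).
    '''
    p = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

y_true = np.array([[0, 0, 1], [1, 0, 0]])
y_pred = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
print(cross_entropy_loss(y_true, y_pred))  # approximately 0.290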

4.5 cross-entropy loss gradient

Let’s first look at the partial derivative of loss with respect to one of its components (so simple):

\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} (19)

The gradient of cross-entropy loss is just to piece together the partial derivatives of each component into a vector (note that, as mentioned above, in mathematical analysis, the gradient is usually regarded as a row vector):

\frac{\partial L}{\partial \mathbf{\hat{y}}} = \left[\frac{\partial L}{\partial \hat{y}_1}, \frac{\partial L}{\partial \hat{y}_2}, \cdots, \frac{\partial L}{\partial \hat{y}_n}\right] = \left[-\frac{y_1}{\hat{y}_1}, -\frac{y_2}{\hat{y}_2}, \cdots, -\frac{y_n}{\hat{y}_n}\right] (20)
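In code, formula (20) is just an elementwise division (a sketch with illustrative values; note that when a predicted \hat{y}_i is close to 0 this gradient blows up numerically, which is one practical reason the next section fuses softmax and cross-entropy into a single module):

y_true = np.array([0.0, 0.0, 1.0])   # one-hot label
y_hat  = np.array([0.1, 0.2, 0.7])   # softmax output

# formula (20): dL/dy_hat_i = -y_i / y_hat_i
grad = -y_true / y_hat
print(grad)  # [-0.         -0.         -1.42857143]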

5. Gradient of “softmax + cross-entropy” combo

Figure 2: Combo of softmax and cross-entropy loss

In machine learning or deep learning, softmax and cross-entropy are usually used in combination. Therefore, the combination of the two can be regarded as a component/module, that is:

\text{SML}(\mathbf{l}) = L(\mathbf{\hat{y}}) = L(\text{softmax}(\mathbf{l})) (21)

Here, SML denotes the combo of softmax and cross-entropy loss, \mathbf{l} is the logits input to softmax (a vector, as mentioned before), and \mathbf{\hat{y}} is the output of softmax.

The gradient of SML means the gradient of the final loss with respect to the logits input of softmax. Since SML is a cascade of two functions, i.e., a composite function, the chain rule is needed to compute the partial derivatives. Moreover, as shown below, since \mathbf{l} and \mathbf{\hat{y}} are both vectors, the total-derivative (multivariable) form of the chain rule is required.

Let’s first look at the partial derivatives of component i of logits.

step1: \frac{\partial L}{\partial l_i} = \sum\limits_{j=1}^n \frac{\partial L}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial l_i} (22-1)

First, L is a function of \mathbf{\hat{y}} = \{\hat{y}_1, \cdots, \hat{y}_n\}, and each \hat{y}_j is in turn a function of l_i, so the partial derivative of L with respect to l_i must account for the contribution of every \hat{y}_j as an intermediary. This is the so-called total derivative; see a multivariable calculus textbook for details.

step2: From Section 4.5:

\frac{\partial L}{\partial \hat{y}_j} = -\frac{y_j}{\hat{y}_j} (22-2)

step3: From Section 3.2 (with \sum now denoting \sum\limits_{k=1}^n e^{l_k}):

\frac{\partial \hat{y}_j}{\partial l_i} = I(i=j)\,\hat{y}_j - \hat{y}_j \hat{y}_i (22-3)

step4: Chaining (22-1) through (22-3) together gives:

\frac{\partial L}{\partial l_i} = \sum\limits_{j=1}^n \left(-\frac{y_j}{\hat{y}_j}\right)\left(I(i=j)\,\hat{y}_j - \hat{y}_j \hat{y}_i\right) = -y_i + \hat{y}_i \sum\limits_{j=1}^n y_j = \hat{y}_i - y_i

Figure 3: SML vs logits derivative

So beautiful! Such a dazzling pile of terms finally collapses into such a simple result!

Furthermore, the gradient representation of L can be obtained (that is, arranging each component into a vector) as follows:

\nabla_{\mathbf{l}} L = \left[\frac{\partial L}{\partial l_1}, \frac{\partial L}{\partial l_2}, \cdots, \frac{\partial L}{\partial l_n}\right] = [\hat{y}_1 - y_1, \hat{y}_2 - y_2, \cdots, \hat{y}_n - y_n] = \mathbf{\hat{y}} - \mathbf{y} (22-4)
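This result is easy to verify numerically (a throwaway check reusing the softmax function defined above; the helper name sml_loss is illustrative, not from the original code):

def sml_loss(logits, y_true):
    # cross-entropy of softmax(logits) against a one-hot label, formula (17)
    y_hat = softmax(logits.reshape(1, -1)).ravel()
    return -np.sum(y_true * np.log(y_hat))

logits = np.array([0.5, -1.0, 2.0])
y_true = np.array([0.0, 0.0, 1.0])

# analytic gradient from (22-4): y_hat - y
y_hat = softmax(logits.reshape(1, -1)).ravel()
grad_analytic = y_hat - y_true

# numerical gradient via central differences
h = 1e-6
grad_numeric = np.array([
    (sml_loss(logits + h * e, y_true) - sml_loss(logits - h * e, y_true)) / (2 * h)
    for e in np.eye(3)
])

print(grad_analytic)
print(grad_numeric)  # the two should agree to about 6 decimal places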

6. batch implementation

coming soon

For a further and more complete treatment of the mathematical derivation of backpropagation, see: Tutorial: Mathematical Derivation of Backpropagation

