The derivation of softmax


(Figure from Prof. Li Hongyi's lecture slides)

Anyone familiar with machine learning / deep learning will know softmax. It sometimes appears in multi-class classification to produce the probability of each category, and sometimes in binary classification to produce the probability of the positive class (in which case it takes the form of sigmoid).

1. From sigmoid to softmax

The sigmoid function appears everywhere in machine learning / deep learning, from logistic regression to activation functions in neural networks. Let's first look at its form:

\[sigmoid(z) = \sigma(z) = \frac{1}{1 + e^{-z}}, \:\:\:\: z \in \mathbb{R} \]

Let's plot its graph:

import numpy as np
import matplotlib.pyplot as plt

# evaluate sigmoid on a grid over [-5, 5)
x = np.arange(-5, 5, 0.05)
z = 1 / (1 + np.exp(-x))

plt.plot(x, z)
plt.show()

(Figure: the sigmoid curve)
Clearly, sigmoid squashes values from the real line into the interval \((0, 1)\). So how is it used in binary classification? Take logistic regression as an example. In logistic regression, the label of a sample is usually assumed to follow a Bernoulli distribution, that is:

\[P(y=1) = p_1 \\ P(y=0) = p_0 \]

with \(p_1 + p_0 = 1\). When using logistic regression to solve a binary classification problem, we model the probability that the sample is positive (by the Bernoulli assumption, the probability of it being negative then follows immediately): the regression value produced by logistic regression is fed into sigmoid, which gives \(p_1\), that is:

\[z = f(x) \\ p_1 = \frac{1}{1 + e^{-z}} \]

Here \(f\) is a regression function. So what does sigmoid have to do with softmax? Let's first look at the definition of softmax:

\[softmax(\mathbf{z}) = \frac{e^{\mathbf{z}}}{\sum_{i=1}^{n} e^{z_i}}, \:\:\:\: \mathbf{z} \in \mathbb{R}^n \]
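
To make the definition concrete, here is a minimal numpy sketch (the function name softmax, the max-subtraction trick for numerical stability, and the example vector are my own illustration, not part of the original derivation):

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the common factor e^{-max} cancels
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p)        # [0.09003057 0.24472847 0.66524096]
print(p.sum())  # 1.0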

Clearly, softmax is a function of a vector, as shown in the diagram at the beginning of this article: it normalizes the values in a vector, and in multi-class classification we usually interpret the result as the probabilities of the sample belonging to the different categories. Suppose we have a binary classification model whose output is a vector \(\mathbf{z} \in \mathbb{R}^2\), where \(z_0, z_1\) are the unnormalized scores for the sample belonging to classes 0 and 1 respectively. If we use \(softmax\) to get the class probabilities, then:

\[p_0 = \frac{e^{z_0}}{e^{z_0} + e^{z_1}} \\ p_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} \]

\(p_0, p_1\) are the probabilities that the sample belongs to class 0 and class 1 respectively. We make the following transformation:

\[\begin{aligned} p_1 &= \frac{e^{z_1}}{e^{z_0} + e^{z_1}} \\ &= \frac{e^{z_1}}{e^{z_0}(e^{z_0 - z_0} + e^{z_1 - z_0})} \\ &= \frac{e^{z_1 - z_0}}{e^{z_0 - z_0} + e^{z_1 - z_0}} \\ &= \frac{e^{\Delta}}{e^{0} + e^{\Delta}} \\ &= \frac{e^{\Delta}}{1 + e^{\Delta}} \\ &= \frac{1}{1 + e^{-\Delta}} \end{aligned} \]

where \(\Delta = z_1 - z_0\); similarly, \(p_0 = \frac{1}{e^{\Delta} + 1}\). Recall that logistic regression models the probability that the sample belongs to class 1, so \(\Delta\) is exactly the regression value that is fed into sigmoid. Let's take a look at what \(\Delta\) really is:

\[\frac{p_1}{p_0} = \frac{\frac{e^{\Delta}}{1 + e^{\Delta}}}{\frac{1}{1 + e^{\Delta}}} = e^{\Delta} \\ \log \frac{p_1}{p_0} = \Delta \]

From the above we can see that logistic regression regresses the logarithm of the odds, i.e. the log of the ratio between the probability that the sample is 1 and the probability that it is 0. So logistic regression really is a kind of regression; it just regresses the log odds of the sample, and from the log odds we then recover the probability that the sample belongs to class 1.
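
As a quick sanity check of this equivalence, the snippet below (with arbitrary example scores of my own choosing) verifies that the second component of a two-class softmax equals sigmoid applied to \(\Delta = z_1 - z_0\), and that the log odds recover \(\Delta\):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z0, z1 = 0.3, 1.7                   # arbitrary unnormalized scores
p = softmax(np.array([z0, z1]))     # two-class softmax
delta = z1 - z0                     # the log odds

print(p[1], sigmoid(delta))         # both ~0.802
print(np.log(p[1] / p[0]), delta)   # log odds recovered: both 1.4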

Now let's return to the relationship between sigmoid and softmax. From the derivation above, sigmoid is just a special case of softmax: applying sigmoid to \(\Delta = z_1 - z_0\) implicitly computes the two-element softmax, with the other element absorbed into \(\Delta\). When modeling a classification task, in the binary case we usually model a single category and assume it follows a Bernoulli distribution; in the multi-class case we treat the label as a multinomial (categorical) variable and model the score of each category separately: each component of \(\mathbf{z}\) is the unnormalized log probability of the corresponding category, and each component of \(softmax(\mathbf{z})\) is the probability of that category.

2. Differentiating the softmax loss

In multi-class tasks we usually use the log loss (i.e. the cross-entropy loss, which in binary classification reduces to the familiar binary cross-entropy):

\[\mathcal{L} = -\frac{1} {N} \sum_{i=1}^N \sum_{j=1}^C y_{ij} \log \hat {y}_{ij} \]

Here \(N, C\) are the number of samples and the number of categories respectively, \(y_{ij} \in \{0, 1\}\) indicates whether sample \(x_i\) belongs to category \(j\), and \(\hat{y}_{ij}\) is the corresponding predicted probability. In multi-class classification the probabilities are usually obtained via softmax, that is:

\[\hat{y}_{ij} = softmax(\mathbf{z}_i)_j = \frac{e^{z_{ij}}} {\sum_{k=1}^C e^{z_{ik}}} \]
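
For concreteness, here is a small numpy sketch of the batch loss above; the names logits and labels and the toy numbers are assumptions made purely for illustration:

import numpy as np

def softmax(z):
    # row-wise softmax over a batch of logits with shape (N, C)
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # mean log loss: -1/N * sum_i sum_j y_ij * log(y_hat_ij)
    y_hat = softmax(logits)
    y = np.eye(logits.shape[1])[labels]   # one-hot targets y_ij
    return -np.sum(y * np.log(y_hat)) / logits.shape[0]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2,  0.3]])     # N = 2 samples, C = 3 classes
labels = np.array([0, 2])                 # true class indices
print(cross_entropy(logits, labels))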

Here we only consider the loss of one sample, namely:

\[\mathcal{l} = -\sum_{j=1}^C y_{j} \log \hat{y}_{j} \\ \hat{y}_{j} = softmax(\mathbf{z})_j = \frac{e^{z_{j}}}{\sum_{k=1}^C e^{z_{k}}} \]

Okay, now for the main event: take the partial derivative of the multi-class log loss with respect to \(z_k\):

\[\begin{aligned} \frac{\partial \mathcal{l}}{\partial z_k} &= - \frac{\partial}{\partial z_k} \Big( \sum_{j=1}^C y_{j} \log \hat{y}_{j} \Big) \\ &= - \sum_{j=1}^C y_{j} \frac{\partial \log \hat{y}_{j}}{\partial z_k} \\ &= - \sum_{j=1}^C y_{j} \frac{1}{\hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_k} \end{aligned} \]

This part is actually quite simple: all that is left to compute is \(\frac{\partial \hat{y}_j}{\partial z_k}\), so let's do it:

\[\begin{aligned} \frac{\partial \hat{y}_j}{\partial z_k} &= \frac{\partial}{\partial z_k} \Big( \frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}} \Big) \\ &= \frac{\frac{\partial e^{z_j}}{\partial z_k} \cdot \sum - e^{z_j} \cdot \frac{\partial \sum}{\partial z_k}}{(\sum)^2} \\ &= \frac{\frac{\partial e^{z_j}}{\partial z_k} \cdot \sum - e^{z_j} \cdot e^{z_k}}{(\sum)^2} \end{aligned} \]

where \(\sum = \sum_{c=1}^C e^{z_c}\). The term \(\frac{\partial e^{z_j}}{\partial z_k}\) needs to be handled case by case:

\[\frac{\partial e^{z_j}}{\partial z_k} = \begin{cases} e^{z_j} & k = j \\ 0 & k \neq j \end{cases} \]

Therefore,

\[\frac{\partial \hat{y}_j}{\partial z_k} = \begin{cases} \frac{e^{z_j} \cdot \sum - (e^{z_j})^2}{(\sum)^2} & k = j \\ \frac{0 \cdot \sum - e^{z_j} \cdot e^{z_k}}{(\sum)^2} & k \neq j \end{cases} \]

It looks a bit complicated, so let's simplify it as follows:

\[\frac{\partial \hat{y}_j}{\partial z_k} = \begin{cases} \hat{y}_j (1 - \hat{y}_j) & k = j \\ - \hat{y}_j \cdot \hat{y}_k & k \neq j \end{cases} \]
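
These two cases are exactly the softmax Jacobian, which can be written compactly as \(\frac{\partial \hat{y}_j}{\partial z_k} = \hat{y}_j(\delta_{jk} - \hat{y}_k)\). The following sketch (with a test vector chosen arbitrarily by me) checks it against a finite-difference approximation:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.2, -1.0, 3.0])
y_hat = softmax(z)

# analytic Jacobian: J[j, k] = y_hat_j * (1 if j == k else 0) - y_hat_j * y_hat_k
J_analytic = np.diag(y_hat) - np.outer(y_hat, y_hat)

# numerical Jacobian via central differences
eps = 1e-6
J_numeric = np.zeros((3, 3))
for k in range(3):
    dz = np.zeros(3)
    dz[k] = eps
    J_numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # True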

Call it a day? No! Our goal is to find \(\frac{\partial \mathcal{l}} {\partial z_k}\):

\[\begin{aligned} \frac{\partial \mathcal{l}}{\partial z_k} &= -\Big[ y_k \frac{1}{\hat{y}_k} \frac{\partial \hat{y}_k}{\partial z_k} + \sum_{j \neq k} y_j \frac{1}{\hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_k} \Big] \\ &= -\Big[ y_k \frac{1}{\hat{y}_k} \cdot \hat{y}_k (1 - \hat{y}_k) + \sum_{j \neq k} y_j \frac{1}{\hat{y}_j} \cdot (- \hat{y}_j \cdot \hat{y}_k) \Big] \\ &= -\Big[ y_k \cdot (1 - \hat{y}_k) - \sum_{j \neq k} y_j \cdot \hat{y}_k \Big] \\ &= -\Big[ y_k - y_k \cdot \hat{y}_k - \sum_{j \neq k} y_j \cdot \hat{y}_k \Big] \\ &= -\Big[ y_k - \sum_{j} y_j \cdot \hat{y}_k \Big] \\ &= \sum_{j} y_j \cdot \hat{y}_k - y_k \\ &= \hat{y}_k \cdot \sum_{j} y_j - y_k \\ &= \hat{y}_k - y_k \end{aligned} \]

Although the calculation looks a bit involved, the final result is very elegant: predicted value minus true value.

Here are a few points to note:

  • In the first step of the final derivation, the sum over \(j\) is split into two parts according to whether \(j\) equals \(k\);
  • In the penultimate step, the multi-class target is one-hot: exactly one \(y_j\) equals 1, so \(\sum_j y_j = 1\).
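
To close the loop, the final result \(\frac{\partial \mathcal{l}}{\partial z_k} = \hat{y}_k - y_k\) can also be verified numerically; the sketch below (with arbitrary logits and a one-hot target of my own choosing) compares it against central differences:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # single-sample cross-entropy: -sum_j y_j * log(softmax(z)_j)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target, true class is 1

grad_analytic = softmax(z) - y   # the elegant result: y_hat - y

eps = 1e-6
grad_numeric = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    grad_numeric[k] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True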

That's it!