3.6 Simple implementation of softmax regression

Use a deep learning framework to concisely implement the softmax regression model.

Reference material: Li Mu's "Dive into Deep Learning (PyTorch Edition)", Chapter 3, Linear Neural Networks
Open-source address: Dive into Deep Learning
Link to the previous section: 3.5 Implementation of softmax regression from scratch
This article is just a learning record. For more detailed content, refer to the open-source book and code, as well as Mr. Li Mu's "Dive into Deep Learning" video course on Bilibili.

Article directory

  • 1. Simple implementation of softmax regression
    • 1.1 Import Python packages
    • 1.2 Import data
    • 1.3 Initialize model parameters
    • 1.4 Optimization algorithm
    • 1.5 Training
  • 2. Citation

1. Simple implementation of softmax regression

1.1 Import Python packages

import torch
from torch import nn
from d2l import torch as d2l

1.2 Import data

Use the Fashion-MNIST dataset and keep the batch size at 256. (For how to download it, refer to the Fashion-MNIST dataset section.)

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
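Before moving on, it can help to peek at one mini-batch as a sanity check. The snippet below is only an illustrative sketch; it assumes the d2l iterator yields (image, label) batches shaped as in the book (images of size 1×28×28).

# Take one batch from the training iterator and inspect its shape
X, y = next(iter(train_iter))
print(X.shape, X.dtype)  # expected: torch.Size([256, 1, 28, 28]) torch.float32
print(y.shape, y.dtype)  # expected: torch.Size([256]) torch.int64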

1.3 Initialize model parameters

The output layer of softmax regression is a fully connected layer, so we only need to add a fully connected layer with 10 outputs to a Sequential container. Again, Sequential is not strictly necessary here, but it is the foundation for implementing deeper models. As before, the weights are randomly initialized with mean 0 and standard deviation 0.01.

# PyTorch does not reshape input implicitly. Therefore, a flattening layer (flatten) is defined before the linear layer to adjust the shape of the network input

# nn.Sequential defines a sequential container holding a flattening layer and a fully connected layer.
# nn.Flatten keeps the batch dimension and flattens all remaining dimensions (e.g. 1x28x28 -> 784); nn.Linear defines a fully connected layer.
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

# The init_weights function initializes the weights of the model.
# Its input parameter is an nn.Module object, representing one layer of the model.
def init_weights(m):
    
    # If this layer is a fully connected layer, its weights are randomly initialized from a normal distribution.
    if type(m) == nn.Linear:
        # Use the nn.init.normal_ function to initialize the weights, where std represents the standard deviation of the normal distribution.
        nn.init.normal_(m.weight, std=0.01)

# Use the apply method to apply the init_weights function to each layer of the model to initialize the weights of the fully connected layer.
net.apply(init_weights);
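To see concretely what nn.Flatten does, we can push a dummy batch through the network. This is just an illustrative check, not part of the original code:

# A dummy batch shaped like Fashion-MNIST images: (batch, channel, height, width)
X = torch.rand(2, 1, 28, 28)
# nn.Flatten keeps the batch dimension and flattens the rest: (2, 1, 28, 28) -> (2, 784),
# and nn.Linear(784, 10) then maps each flattened image to 10 unnormalized outputs (logits)
print(net(X).shape)  # expected: torch.Size([2, 10])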

The output of the model is computed and then fed into the cross-entropy loss. Mathematically, this is a perfectly reasonable thing to do. From a computational perspective, however, the exponentials can cause numerical stability problems. The softmax function is

$$\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)},$$

where $\hat y_j$ is the predicted probability distribution and $o_j$ is the $j$-th element of the unnormalized predictions $\mathbf{o}$. If some of the $o_k$ are very large, then $\exp(o_k)$ may exceed the largest number the data type allows, i.e. overflow. This turns the numerator or denominator into inf (infinity), and $\hat y_j$ ends up as 0, inf, or nan (not a number). In these cases, a well-defined cross-entropy value cannot be obtained.
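This overflow is easy to reproduce. The snippet below is a small illustration (not part of the original text); naive_softmax is a hypothetical helper mirroring the from-scratch implementation of the previous section.

# A naive softmax, as implemented from scratch in Section 3.5
def naive_softmax(o):
    o_exp = torch.exp(o)
    return o_exp / o_exp.sum(dim=1, keepdim=True)

# In float32, exp overflows to inf once its argument exceeds roughly 88.7
o = torch.tensor([[1000.0, 10.0, -1000.0]])
print(torch.exp(o))      # roughly tensor([[inf, 2.2026e+04, 0.]])
print(naive_softmax(o))  # tensor([[nan, 0., 0.]]) -- inf / inf gives nan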

One trick that solves this problem: before continuing with the softmax calculation, subtract $\max(o_k)$ from all of the $o_k$. Shifting each $o_k$ by a constant does not change the value returned by softmax:

$$\begin{aligned} \hat y_j & = \frac{\exp(o_j - \max(o_k))\exp(\max(o_k))}{\sum_k \exp(o_k - \max(o_k))\exp(\max(o_k))} \\ & = \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))}. \end{aligned}$$
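Continuing the illustrative snippet above, subtracting the row-wise maximum before exponentiating keeps every exponent at or below exp(0) = 1 and avoids overflow. Again a sketch, with stable_softmax as a hypothetical helper name:

# Stabilized softmax: shift each row by its maximum before exponentiating
def stable_softmax(o):
    o_shifted = o - o.max(dim=1, keepdim=True).values
    o_exp = torch.exp(o_shifted)
    return o_exp / o_exp.sum(dim=1, keepdim=True)

o = torch.tensor([[1000.0, 10.0, -1000.0]])
print(stable_softmax(o))  # roughly tensor([[1., 0., 0.]]) -- finite, no nan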

After the subtraction and normalization steps, some of the $o_j - \max(o_k)$ may be large negative values. Due to limited precision, $\exp(o_j - \max(o_k))$ may then take values close to zero, i.e. underflow. These values may be rounded to zero, making $\hat y_j$ zero and giving $\log(\hat y_j)$ the value -inf. After a few steps of backpropagation, we may find ourselves facing a screen full of dreaded nan results.
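The remaining hazard shows up as soon as we take the log of those stabilized probabilities directly, as the short illustration below does (it reuses the hypothetical stable_softmax sketch from above):

# Taking the log of an underflowed probability still produces -inf
o = torch.tensor([[1000.0, 10.0, -1000.0]])
p = stable_softmax(o)
print(torch.log(p))  # tensor([[0., -inf, -inf]]) -- and -inf tends to turn into nan during backpropagation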

Although we are computing exponential functions, we end up taking their logarithms when computing the cross-entropy loss. By combining softmax and cross-entropy, we avoid numerical stability issues that can plague us during backpropagation.

As the equation below shows, we avoid computing $\exp(o_j - \max(o_k))$ and can use $o_j - \max(o_k)$ directly, because the $\log(\exp(\cdot))$ cancels out:

$$\begin{aligned} \log(\hat y_j) & = \log\left( \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))} \right) \\ & = \log(\exp(o_j - \max(o_k))) - \log\left( \sum_k \exp(o_k - \max(o_k)) \right) \\ & = o_j - \max(o_k) - \log\left( \sum_k \exp(o_k - \max(o_k)) \right). \end{aligned}$$

Instead of passing softmax probabilities into the loss function, we therefore pass the unnormalized outputs (logits) and compute the softmax and its log all at once inside the cross-entropy loss, which is exactly what PyTorch's nn.CrossEntropyLoss does.

loss = nn.CrossEntropyLoss(reduction='none')
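The loss therefore expects the raw logits of the network together with integer class labels. A quick illustrative check (not from the original text) that it stays finite even for extreme logits:

# nn.CrossEntropyLoss combines log-softmax and negative log-likelihood,
# so it takes raw logits and integer class labels, not probabilities
logits = torch.tensor([[1000.0, 10.0, -1000.0],
                       [0.1, 0.2, 0.3]])
labels = torch.tensor([0, 2])
print(loss(logits, labels))  # per-example losses (reduction='none'); both values are finite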

1.4 Optimization algorithm

Here, mini-batch stochastic gradient descent with a learning rate of 0.1 is used as the optimization algorithm.

trainer = torch.optim.SGD(net.parameters(), lr=0.1)
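Because the loss was created with reduction='none', it returns one loss value per example, so a training step reduces it to a scalar (here, a mean) before backpropagating. The following is only a rough sketch of what the training loop does for each mini-batch:

# One hand-written update step (a sketch of what happens per mini-batch during training)
X, y = next(iter(train_iter))
l = loss(net(X), y)   # per-example losses, shape (batch_size,)
trainer.zero_grad()   # clear gradients from the previous step
l.mean().backward()   # reduce to a scalar, then backpropagate
trainer.step()        # update the parameters of net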

1.5 Training

Next, call the training function defined in Section 3.5 to train the model.

num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
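After training, we can check how the model does on the test set. The accuracy computation below is a self-contained sketch in plain PyTorch (the d2l training function already reports similar metrics); test_accuracy is a hypothetical helper name.

# Compute classification accuracy on the test set with plain PyTorch
def test_accuracy(net, data_iter):
    net.eval()                # switch to evaluation mode
    correct, total = 0, 0
    with torch.no_grad():     # no gradients are needed for evaluation
        for X, y in data_iter:
            y_hat = net(X).argmax(dim=1)   # predicted class = index of the largest logit
            correct += (y_hat == y).sum().item()
            total += y.numel()
    return correct / total    # fraction of test images classified correctly

print(test_accuracy(net, test_iter))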

2. Citation

To cite the original book:

@book{zhang2019dive,
    title={Dive into Deep Learning},
    author={Aston Zhang and Zachary C. Lipton and Mu Li and Alexander J. Smola},
    note={\url{http://www.d2l.ai}},
    year={2020}
}