Taking MNIST handwritten digit recognition as an example to customize gradient descent (MindSpore framework)

1. Principle introduction

The formula that defines gradient descent is as follows:

w^{t + 1} = w^t - learning\_rate \cdot \nabla J(w^t)

Here w represents the parameter to be updated, t is the iteration index, learning_rate is the learning rate, and J(w) is the objective function. Suppose that at the t-th iteration w sits at the position marked by the red arrow in the figure above, where the gradient of the objective function (green arrow) is negative. Plugging this into the update formula, the value of w increases at iteration t + 1, which intuitively moves it closer to the minimum of the objective function; this is how gradient descent works. As for the learning rate lr: if it is set too high, the parameter updates keep oscillating around the minimum and fail to converge, so the trained network performs poorly; if it is set too low, training becomes very slow.
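
To make the update rule concrete, here is a minimal plain-Python sketch that applies the formula above to the toy objective J(w) = (w - 3)^2 (the function and values are illustrative only):

# Toy objective J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3).
def gradient(w):
    return 2 * (w - 3)

w = 0.0               # initial parameter value
learning_rate = 0.1   # too large oscillates around the minimum, too small converges slowly
for t in range(50):
    w = w - learning_rate * gradient(w)   # the gradient descent update rule
print(w)              # approaches the minimum at w = 3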

2. Load external libraries

import mindspore
from mindspore import nn
from mindspore.dataset import vision, transforms
from mindspore.dataset import MnistDataset
import mindspore.context as context
from mindspore import ops
from download import download
import matplotlib.pyplot as plt
import time

Import the modules required for the experiment through the import command. The mindspore module is used to construct a fully connected neural network classifier for the MNIST handwritten digit recognition task. The download module is used to download the MNIST dataset. The matplotlib module is used to visualize the training process. The time module is used for debugging.

3. Device settings

context.set_context(mode = context.GRAPH_MODE, device_target = 'GPU')

In the experiment, I deployed the training process of the neural network model to the GPU. Compared with the CPU, the model training speed is faster.

4. Download data set

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/" \
      "notebook/datasets/MNIST_Data.zip"
path = download(url, "./", kind = "zip", replace = True)

Download the MNIST data set through the download module. The MNIST data set contains 10 data types, corresponding to Arabic numerals 0 to 9.

5. Hyperparameter settings

batch_size = 16
lr = 1e-3 # 1e-2
epochs = 5
weight_decay = 1e-2

For batch_size: if it is too large, the number of parameter updates decreases; training time shortens, but the classification performance of the network tends to drop. If it is too small, training time increases substantially and the convergence process is prone to oscillation. For lr: if the learning rate is set too high, the parameter updates keep oscillating around the minimum and fail to converge, so the trained network performs poorly; if it is set too low, training becomes very slow. For epochs: with fewer epochs the parameters are updated fewer times, weakening the classification performance and generalization ability of the trained network; with more epochs the training time grows considerably, and while the classification performance improves markedly at first, it may change little later on. It is worth noting that weight_decay is the coefficient placed in front of the L2 regularization term: the larger weight_decay is, the closer the model parameters are pushed towards 0. weight_decay thereby limits how much the parameter values can change; large swings in parameter values indicate that the model itself changes drastically and easily overfits the training samples. Keeping the parameter values small also helps avoid gradient explosion.
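
As a side note on weight_decay: in MindSpore the L2 penalty is usually applied by passing weight_decay to the optimizer. A minimal sketch, assuming the nn.Adam constructor in your MindSpore version accepts a weight_decay argument (the optimizer itself is created later, in Section 10):

# Sketch only: pass weight_decay to the optimizer so the L2 penalty takes effect.
# Assumes nn.Adam accepts a weight_decay keyword; check your MindSpore version.
optimizer = nn.Adam(model.trainable_params(),
                    learning_rate=lr,
                    weight_decay=weight_decay)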

6. Data set loading

train_dataset = MnistDataset('MNIST_Data/train')
test_dataset = MnistDataset('MNIST_Data/test')

Load the locally downloaded MNIST dataset through the mindspore.dataset.MnistDataset API interface provided by the MindSpore framework. The generated dataset has two columns [image, label]. The data type of the image column is uint8, and the data type of the label column is uint32. Among the input parameters of the API interface, ‘MNIST_Data/train’ is the root directory path containing the dataset files.
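
As a quick sanity check, the column names and the dtype/shape of one raw sample can be inspected; the values in the comments are what the standard MNIST layout is expected to produce:

# Inspect the raw dataset before any preprocessing.
print(train_dataset.get_col_names())                     # ['image', 'label']
sample = next(train_dataset.create_dict_iterator(output_numpy=True))
print(sample['image'].shape, sample['image'].dtype)      # expected (28, 28, 1), uint8
print(sample['label'].dtype)                             # expected uint32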

7. Data preprocessing

def data_pre_processing(dataset):
    image_transforms = [
        vision.Rescale(1.0 / 255.0, 0), # rescale pixel values into [0, 1]
        vision.Normalize(mean=(0.1307,), std=(0.3081,)), # standardization
        vision.HWC2CHW()
    ]
    label_transform = transforms.TypeCast(mindspore.int32)

    dataset = dataset.map(image_transforms, 'image')
    dataset = dataset.map(label_transform, 'label')
    return dataset

# Data preprocessing
train_dataset = data_pre_processing(train_dataset)
test_dataset = data_pre_processing(test_dataset)

Data augmentation is applied to the dataset object sequentially through the mindspore.dataset.Dataset.map API interface provided by the MindSpore framework. Among the parameters passed in, image_transforms / label_transform are the user-defined augmentation operations, and ‘image’ / ‘label’ specify the data columns on which each operation acts.

The mindspore.dataset.vision.Rescale API interface provided by the MindSpore framework rescales the pixel values of the image by a given rescale factor and shift. Among the parameters passed in, 1.0 / 255.0 is the rescale factor and 0 is the shift; this operation maps the pixel values into [0, 1].

The input image is normalized according to the given mean and standard deviation through the mindspore.dataset.vision.Normalize API interface provided by the MindSpore framework. Among the parameters passed in, mean is the mean and std is the standard deviation.

The mindspore.dataset.vision.HWC2CHW API interface provided by the MindSpore framework converts the shape of the input image from (H, W, C) to (C, H, W), where H is the image height, W is the image width and C is the number of channels.

Convert the input Tensor to the specified data type through the mindspore.dataset.transforms.TypeCast API interface provided by the MindSpore framework. In the passed parameters, int32 represents the target data type.
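
To see what Rescale and Normalize actually do to a pixel value, the same arithmetic can be reproduced in plain NumPy (illustration only; the vision operators apply it element-wise to whole images):

import numpy as np

pixel = np.float32(128)
rescaled = pixel * (1.0 / 255.0) + 0          # Rescale: x * rescale + shift
standardized = (rescaled - 0.1307) / 0.3081   # Normalize: (x - mean) / std
print(rescaled, standardized)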

8. Data shuffling and partitioning

train_dataset = train_dataset.shuffle(60000)
train_dataset = train_dataset.batch(batch_size)
test_dataset = test_dataset.shuffle(10000)
test_dataset = test_dataset.batch(batch_size)

The .shuffle method provided by the MindSpore framework creates a buffer of buffer_size rows and uses it to shuffle the dataset; when buffer_size equals the dataset size, the shuffle is global. The shuffling procedure is as follows: 1. fill a shuffle buffer with buffer_size rows of data; 2. randomly pick one row from the shuffle buffer and pass it to the next operation; 3. fetch the next row from the upstream operation (if any) and put it into the shuffle buffer; 4. repeat steps 2 and 3 until the shuffle buffer is empty. Without shuffling, the neural network easily overfits to data of a single label for a stretch of training, which degrades model performance and generalization ability.

Through the .batch method provided by the MindSpore framework, consecutive batch_size pieces of data in the data set are combined into one batch of data. The batch operation requires that the data in each column have the same shape.
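
The buffer-based shuffle described above can be sketched in plain Python (illustration only, not MindSpore code); with buffer_size equal to the dataset size it degenerates into a global shuffle:

import random

def buffered_shuffle(data, buffer_size):
    it = iter(data)
    buffer, out = [], []
    for _ in range(buffer_size):              # step 1: fill the shuffle buffer
        try:
            buffer.append(next(it))
        except StopIteration:
            break
    while buffer:
        idx = random.randrange(len(buffer))   # step 2: pick a random row and emit it
        out.append(buffer.pop(idx))
        try:
            buffer.append(next(it))           # step 3: refill the buffer from upstream
        except StopIteration:
            pass                              # step 4: drain once upstream is exhausted
    return out

print(buffered_shuffle(list(range(10)), buffer_size=4))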

9. Construct a fully connected neural network

class Network(nn.Cell):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.Dense1 = nn.Dense(28*28, 1024)
        self.ReLU1 = nn.ReLU()
        self.Dense2 = nn.Dense(1024, 512)
        self.ReLU2 = nn.ReLU()
        self.Dense3 = nn.Dense(512, 10)
        

    def construct(self, x):
        x = self.flatten(x)
        x = self.Dense1(x)
        x = self.ReLU1(x)
        x = self.Dense2(x)
        x = self.ReLU2(x)
        x = self.Dense3(x)
        return x

model = Network()

A four-layer fully connected neural network is constructed. The original low-dimensional input features are first mapped to a high-dimensional space by one layer, which helps feature interaction and feature extraction. The following two layers then progressively reduce the dimensionality, trying to distill valuable features and improve the performance of the network. The output of each layer is passed through an activation function for a nonlinear transformation, which lets the model handle problems a linear model cannot and increases the expressive power of the fully connected network. I chose ReLU as the activation function. The Sigmoid activation function has three problems: 1. its maximum derivative is only 0.25 and its saturated regions are large, where the derivative approaches 0, which easily causes vanishing gradients during backpropagation; 2. its output is always greater than 0 rather than centered at 0, which slows down the convergence of the network; 3. it involves an exponential operation, which is relatively expensive. By comparison, the ReLU activation function avoids vanishing gradients in the positive interval, and because it drives some neuron outputs to 0 it introduces sparsity, reduces the interdependence between parameters and alleviates overfitting; it also converges faster and is computationally cheaper.
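
The claim about the Sigmoid derivative is easy to verify numerically: σ'(x) = σ(x)(1 − σ(x)) peaks at 0.25 at x = 0 and decays towards 0 in the saturated regions.

import numpy as np

x = np.linspace(-10, 10, 1001)
s = 1.0 / (1.0 + np.exp(-x))       # sigmoid
print(np.max(s * (1 - s)))         # ~0.25, the maximum of the sigmoid derivative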

10. Construct training function

# Define loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.Adam(model.trainable_params(), learning_rate = lr) # nn.Adagrad nn.RMSProp nn.Adam

# Forward propagation does not calculate gradients
def forward(data, label):
    logits = model(data)
    loss = loss_fn(logits, label)
    return loss, logits

# Generate a gradient function that computes both the forward result and the gradients of the given function
grad_fn = mindspore.value_and_grad(forward, None, optimizer.parameters, has_aux=True)

# One training step
def train_step(data, label):
    (loss, logits), grads = grad_fn(data, label)
    
    # Call the API to implement the optimizer
    optimizer(grads)

    # Custom implementation of the optimizer
    # Attempt 1: in-place subtraction does not actually assign to the Parameter at runtime
    # for i, parameter in enumerate(model.get_parameters()):
    #     parameter -= lr * grads[i]
    # Attempt 2: runtime assignment via ops.assign, as described in the official documentation
    # index = 0
    # for (name, _) in model.parameters_and_names():
    #     lay_name, attr_name = name.split('.')
    #     lay = getattr(model, lay_name)
    #     attr = getattr(lay, attr_name)
    #     ops.assign(attr, attr - lr * grads[index])
    #     index += 1

    return loss, logits

def train(model, dataset):
    num_batches = dataset.get_dataset_size()
    model.set_train()
    total, train_loss, accuracy = 0, 0, 0
    for batch, (data, label) in enumerate(dataset.create_tuple_iterator()):
        total += len(data)
        loss, pred = train_step(data, label)
        train_loss += loss.asnumpy()
        accuracy += (pred.argmax(1) == label).asnumpy().sum()
    train_loss /= num_batches
    accuracy /= total
    print(f"Train: \\
 Accuracy: {(100*accuracy):>0.1f}%, Avg loss: {train_loss:>8f} \\
")

    return accuracy

First, the forward propagation function forward is defined. In PyTorch, the gradient information needed for backpropagation is recorded by default during the forward computation; in the inference phase this is redundant and costs extra time, so PyTorch provides torch.no_grad to switch it off. MindSpore, by contrast, only builds the backward graph from the forward graph when grad is called and records no gradient information during forward propagation, i.e. MindSpore's forward computations all behave as if they were under torch.no_grad. Therefore we call mindspore.value_and_grad to obtain a differentiation function that computes both the forward result and the gradients. Then we can either call the optimizer API provided by the MindSpore framework, or implement a custom gradient descent optimizer using the runtime parameter assignment method from the official documentation, to update the network parameters. Finally, the create_tuple_iterator API provided by MindSpore creates a batch iterator over the dataset object, and the fully connected network is trained batch by batch in a loop.
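
For reference, here is a more compact sketch of the custom update from Attempt 2. It relies on grads and optimizer defined above and iterates directly over optimizer.parameters, so that grads[i] lines up with the i-th parameter passed to value_and_grad (a sketch only; the built-in optimizer call remains the tested path):

# Custom SGD step: w <- w - lr * grad, applied with ops.assign.
# optimizer.parameters is the same list grad_fn differentiates against,
# so the gradients align with the parameters by position.
def custom_sgd_step(grads):
    for param, grad in zip(optimizer.parameters, grads):
        ops.assign(param, param - lr * grad)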

For the loss function we chose the cross-entropy loss. By comparison, the partial derivative of the mean squared error with respect to the parameters contains the derivative of the activation function as a factor; because the saturated regions of some activation functions are large, this factor easily becomes 0 and the parameters stop updating. The cross-entropy loss describes the difference between the predicted distribution and the true distribution, so it models the prediction quality well, and its partial derivatives with respect to the parameters do not contain the activation function's derivative, which avoids the above problem and makes gradient computation easier. Experiments confirm that on the current task the cross-entropy loss performs better than the mean squared error loss.
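
For a single sigmoid unit with pre-activation z, prediction a = \sigma(z) and target y, the standard derivation makes the contrast explicit:

L_{MSE} = \frac{1}{2}(a - y)^2, \qquad \frac{\partial L_{MSE}}{\partial z} = (a - y)\,\sigma'(z)

L_{CE} = -\left[ y \ln a + (1 - y)\ln(1 - a) \right], \qquad \frac{\partial L_{CE}}{\partial z} = a - y

The MSE gradient carries the factor \sigma'(z), which shrinks towards 0 in the saturated regions, whereas in the cross-entropy gradient this factor cancels and only the prediction error a - y remains.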

It is worth mentioning that nn.CrossEntropyLoss, the cross-entropy loss API provided by the MindSpore framework, behaves almost the same as its PyTorch counterpart: it does not require one-hot encoded targets and has softmax built in.
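
A minimal usage sketch (random logits and hypothetical label values) showing that raw logits and integer class indices can be passed in directly:

import numpy as np
from mindspore import Tensor

demo_loss = nn.CrossEntropyLoss()
logits = Tensor(np.random.randn(4, 10).astype(np.float32))  # batch of 4, 10 classes
labels = Tensor(np.array([3, 0, 7, 1], dtype=np.int32))     # integer targets, no one-hot
print(demo_loss(logits, labels))                            # softmax is applied internally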

11. Construct test function

def test(model, dataset, loss_fn):
    num_batches = dataset.get_dataset_size()
    model.set_train(False)
    total, test_loss, accuracy = 0, 0, 0
    for data, label in dataset.create_tuple_iterator():
        pred = model(data)
        total += len(data)
        test_loss += loss_fn(pred, label).asnumpy()
        accuracy += (pred.argmax(1) == label).asnumpy().sum()
    test_loss /= num_batches
    accuracy /= total
    print(f"Test: \\
 Accuracy: {(100*accuracy):>0.1f}%, Avg loss: {test_loss:>8f} \\
")

    return accuracy

The implementation is roughly the same as that of the training function, so it is not repeated here. It is worth noting that before feeding the test data into the model we must call set_train(False), which switches layers such as Dropout and BatchNorm to their inference behaviour. This differs from PyTorch, where torch.no_grad() must also be called during inference so that gradients are not computed in the forward pass, saving GPU memory and speeding up computation; MindSpore does not record any gradient information during forward propagation anyway. When inferring the class of a test sample, the softmax function converts the outputs into probabilities and enlarges their differences, but it does not change which class is largest, so we can simply take the index of the largest component of the output vector as the prediction.

12. Model training and testing

Train_Accuracy_List = []
Test_Accuracy_List = []
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}\n----------------------------------")
    train_accuracy = train(model, train_dataset)
    Train_Accuracy_List.append(train_accuracy)
    test_accuracy = test(model, test_dataset, loss_fn)
    Test_Accuracy_List.append(test_accuracy)
print("Done!")

13. Experimental results
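
The accuracy curves recorded in Train_Accuracy_List and Test_Accuracy_List can be visualized with matplotlib (this is what the matplotlib import at the top is for); a minimal sketch:

# Plot the per-epoch training and test accuracy recorded above.
plt.plot(range(1, epochs + 1), Train_Accuracy_List, label='Train accuracy')
plt.plot(range(1, epochs + 1), Test_Accuracy_List, label='Test accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()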

14. Summary

I believe that by reading this blog, beginners can strengthen their theoretical knowledge and gain a clear understanding of how to use the MindSpore framework to build a neural network model and complete its training and inference. At the same time, readers can learn how to customize the gradient descent algorithm to update the network parameters.