DNN: Deep Neural Network

DNN basic structure

A deep neural network (DNN) is an extension of the multilayer perceptron (MLP) described above; it can be understood as a neural network with many hidden layers. The MLP can be regarded as the predecessor of the DNN and is roughly divided into three parts: an input layer, hidden layers, and an output layer. An MLP is usually very shallow, with only one or two hidden layers, whereas a DNN generally has more than two hidden layers and uses a wider variety of activation functions.

img

DNN training process

As shown in the figure, DNN training is divided here into four processes: forward propagation (FP), backward propagation (BP), weight gradient calculation (WG), and weight update (WU). For convenience, BP and WG will be collectively referred to as the reverse process.

First, the training data is fed to the network in batches and the forward computation proceeds layer by layer up to the output layer. The current network output is then compared with the true labels, and the loss is computed with a loss function; common choices include mean squared error (MSE) and cross-entropy. Taking a typical classification task with the cross-entropy loss as an example, the forward computation is:

\hat{y}_i = f^L(\dots f^2(f^1(x_i))) \\
p_i = \mathrm{softmax}(\hat{y}_i) \\
\mathrm{Loss} = -\frac{1}{N}\sum_{i}^{N}\sum_{c}^{C} y_{ic}\log(p_{ic})
where N represents the number of samples and C represents the number of categories.
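
The forward computation and cross-entropy loss above can be reproduced in a few lines of PyTorch. The following is only a sketch: the layer sizes, batch size, and ReLU activation are assumptions made for illustration and do not come from the text.

import torch
import torch.nn.functional as F

# Arbitrary sizes for illustration: a batch of 4 samples, 20 features, 3 classes.
x = torch.randn(4, 20)
target = torch.tensor([0, 2, 1, 2])            # true class index for each sample

f1 = torch.nn.Linear(20, 16)                   # hidden layer f^1
fL = torch.nn.Linear(16, 3)                    # output layer f^L

y_hat = fL(torch.relu(f1(x)))                  # \hat{y}_i = f^L(f^1(x_i)), the logits
p = F.softmax(y_hat, dim=1)                    # p_i = softmax(\hat{y}_i)
loss = -torch.log(p[torch.arange(4), target]).mean()  # -1/N * sum_i log(p_{i, y_i})

# F.cross_entropy fuses the softmax and log terms, so the two values should match.
print(loss.item(), F.cross_entropy(y_hat, target).item())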

The reverse process computes the gradient of the loss function layer by layer according to the chain rule. BP and WG are its two computational branches: BP computes the gradient of the loss with respect to the activations (usually called the error in the literature, i.e. δ1, δ2, δ3 in Figure 1, written σ1, σ2 in the formulas below), and WG computes the gradient with respect to the weights (Wg0, Wg1, Wg2). Their computation is as follows:

\sigma_1 = \frac{\partial L}{\partial \alpha_1} = \frac{\partial L}{\partial \alpha_2}\frac{\partial \alpha_2}{\partial \alpha_1} = \sigma_2 W_1^T \\
W_{g1} = \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \alpha_2}\frac{\partial \alpha_2}{\partial W_1} = \sigma_2 \alpha_1
In PyTorch, the reverse process is carried out directly by calling loss.backward().
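
To connect these formulas to autograd, the sketch below builds a single linear layer (no bias), computes the error σ and the weight gradient by hand, and checks them against what loss.backward() produces. The squared-error loss, the tensor shapes, and the row-vector convention (so that the weight gradient comes out as α1ᵀσ2) are assumptions made for this illustration.

import torch

# One linear layer with no bias: alpha2 = alpha1 @ W1 (row-vector convention).
alpha1 = torch.randn(4, 5, requires_grad=True)   # activations entering the layer
W1 = torch.randn(5, 3, requires_grad=True)
target = torch.randn(4, 3)

alpha2 = alpha1 @ W1
loss = 0.5 * ((alpha2 - target) ** 2).sum()      # simple squared-error loss (an assumption)
loss.backward()                                   # PyTorch runs BP and WG for us

with torch.no_grad():
    sigma2 = alpha2 - target                      # dL/d(alpha2): the error of the next layer
    sigma1 = sigma2 @ W1.t()                      # BP branch: sigma_1 = sigma_2 W_1^T
    Wg1 = alpha1.t() @ sigma2                     # WG branch: gradient w.r.t. W_1

print(torch.allclose(sigma1, alpha1.grad))        # True
print(torch.allclose(Wg1, W1.grad))               # True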

Finally, the weights are updated using the weight gradients obtained in the reverse process. With the basic stochastic gradient descent (SGD) algorithm: W1′ ← W1 − ηWg1, where η is the learning rate.
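
In PyTorch this update is what optimizer.step() applies for plain torch.optim.SGD without momentum; the minimal sketch below writes it out by hand. The learning rate and tensor shapes are arbitrary assumptions.

import torch

W1 = torch.randn(5, 3)           # current weights (arbitrary shape for illustration)
Wg1 = torch.randn(5, 3)          # weight gradient from the reverse process
eta = 0.01                       # learning rate (arbitrary value)

W1 -= eta * Wg1                  # W1' <- W1 - eta * Wg1, the basic SGD update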

Supplementary notes:

① DNN forward propagation algorithm

Using the same idea as the perceptron, the output of the previous layer is used to compute the output of the next layer; a minimal sketch is given after the figure below.

img
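
A minimal sketch of this layer-by-layer computation, assuming the familiar rule a^(l) = σ(W^(l) a^(l-1) + b^(l)) with a sigmoid activation; the layer sizes and random parameters are arbitrary choices for illustration.

import torch

torch.manual_seed(0)
layer_sizes = [8, 6, 4, 2]                       # input, two hidden layers, output (arbitrary)
Ws = [torch.randn(m, n) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [torch.randn(m, 1) for m in layer_sizes[1:]]

x = torch.randn(8, 1)                            # one input sample as a column vector
a = x
for W, b in zip(Ws, bs):
    a = torch.sigmoid(W @ a + b)                 # a^(l) = sigma(W^(l) a^(l-1) + b^(l))
print(a)                                         # output of the final layer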

② DNN backpropagation algorithm

Suppose we have m training samples {(x1, y1), (x2, y2), …, (xm, ym)}, where x is the input vector with feature dimension n_in and y is the output vector with feature dimension n_out. We want to use these m samples to train a model that, given a new test sample (xtest, ?), can predict the output vector ytest.

If we adopt the DNN model, the input layer has n_in neurons, the output layer has n_out neurons, and there are several hidden layers with some number of neurons each. We then need to find suitable linear coefficient matrices W and bias vectors b for all hidden layers and the output layer, so that the outputs computed from the training inputs are equal to, or as close as possible to, the sample outputs. How do we find these parameters?

A suitable loss function is used to measure the loss on the training samples, and this loss function is then optimized to find its minimum; the corresponding series of linear coefficient matrices W and bias vectors b are our final result. In a DNN, the most common way to optimize the loss function is to iterate step by step with gradient descent, although other iterative methods such as Newton's method and quasi-Newton methods can also be used.

Before deriving the DNN backpropagation algorithm, we need to choose a loss function that measures the discrepancy between the output computed from a training sample and its true output. Many loss functions are available; to keep the focus on the algorithm, the most common mean squared error is used here. That is, for each sample we want to minimize the following:

img
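
The formula in the figure is not reproduced here; for a single training sample (x, y), the per-sample mean-squared-error objective is conventionally written as below, where a^L denotes the output of the final layer L and the factor 1/2 is included only to simplify the derivative. This is a reconstruction of the standard form, not the original image.

J(W, b, x, y) = \frac{1}{2}\left\| a^L - y \right\|_2^2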

With the loss function in place, gradient descent is used to iteratively solve for the W and b of each layer.

**First, consider the output layer L.** Note that the W and b of the output layer satisfy the following formula:

img

For the parameters of the output layer, the loss function becomes:

img

Taking the gradients with respect to W and b:

img

img

img
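
The figures above are likewise not reproduced. Under the usual notation for this derivation, with z^L = W^L a^{L-1} + b^L, a^L = σ(z^L), and ⊙ denoting the element-wise product, the output-layer relation and the resulting gradients take the standard form below; this is a reconstruction, not the original images.

a^L = \sigma(z^L) = \sigma(W^L a^{L-1} + b^L) \\
J(W, b, x, y) = \frac{1}{2}\left\| \sigma(W^L a^{L-1} + b^L) - y \right\|_2^2 \\
\frac{\partial J}{\partial W^L} = \left[(a^L - y) \odot \sigma'(z^L)\right](a^{L-1})^T \\
\frac{\partial J}{\partial b^L} = (a^L - y) \odot \sigma'(z^L)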

DNN advantages and disadvantages

① Advantages of DNN

Since a DNN can approximate almost any function, its nonlinear fitting ability is very strong.

② Disadvantages of DNN

1) Parameter explosion. Because a DNN is fully connected, the connections in the structure bring a huge number of weight parameters, which easily leads to over-fitting and also makes it easy to fall into local optima.

2) Local optima. As the network gets deeper, the optimization objective is more likely to get stuck in a local optimum and deviate from the true global optimum; with limited training data, performance can even be worse than that of a shallow network.

3) Vanishing gradients. With the sigmoid activation (transfer) function, the gradient is attenuated as it is backpropagated through BP. As the number of layers increases, the attenuation accumulates, and the gradient is essentially 0 by the time it reaches the bottom layers (see the short demo after this list).

4) Cannot model temporal dynamics. The temporal order of samples is very important for applications such as natural language processing, speech recognition, and handwriting recognition.
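
A minimal sketch of the vanishing-gradient effect described in 3): a deep stack of sigmoid layers is run forward and backward once, and the mean gradient magnitude of the first layer is compared with that of the last layer. The depth, width, and batch size are arbitrary assumptions for the demo.

import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 20, 32                          # arbitrary depth/width for the demo
layers = [nn.Linear(width, width) for _ in range(depth)]
net = nn.Sequential(*[m for layer in layers for m in (layer, nn.Sigmoid())])

x = torch.randn(8, width)
loss = net(x).sum()
loss.backward()

# Gradient magnitude shrinks dramatically from the top layer to the bottom layer.
print(layers[-1].weight.grad.abs().mean())     # near the output: relatively large
print(layers[0].weight.grad.abs().mean())      # near the input: close to 0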

DNN code implementation (handwritten digit recognition)

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

# Training settings
batch_size = 16

# MNIST dataset
train_dataset = datasets.MNIST(root='./mnist_data/',
                               train=True,
                               transform = transforms.ToTensor(),
                               download=True)

test_dataset = datasets.MNIST(root='./mnist_data/',
                              train=False,
                              transform = transforms.ToTensor())

# Data Loader (Input Pipeline)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                           batch_size=batch_size,
                                           shuffle=False)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = nn.Linear(784,520)
        self.l2 = nn.Linear(520, 320)
        self.l3 = nn.Linear(320, 240)
        self.l4 = nn.Linear(240, 120)
        self.l5 = nn.Linear(120, 10)

    def forward(self, x):
        x = x.view(-1, 784)  # flatten (n, 1, 28, 28) to (n, 784)
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        x = F.relu(self.l4(x))

        return self.l5(x)

model = Net()
criterion = nn.CrossEntropyLoss()  # cross-entropy loss for the 10 digit classes
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 10 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test():
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # no gradients needed during evaluation
        for data, target in test_loader:
            output = model(data)
            # sum up the batch losses
            test_loss += criterion(output, target).item()
            # take the index of the max logit as the prediction
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader)  # average of the per-batch losses
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

for epoch in range(1, 10):
    train(epoch)
    test()