Activation function and gradient of loss

Activation function:
Biologists first studied the neural mechanism of the frog and found that a frog neuron receives multiple inputs x0, x1, x2, and that its response is a weighted sum of those inputs. If the weighted sum is below a threshold, the neuron does not respond; only when it exceeds the threshold does it produce a fixed response.

This behavior resembles a step function.

However, the step function is discontinuous and therefore not differentiable, so researchers proposed a continuous, smooth activation function: the sigmoid (logistic) function.

sigmoid

It is widely applicable because it compresses output values into the range 0-1, which suits quantities such as probabilities and RGB pixel values.

When the input is very small the response is close to 0; when the input is very large the response is close to 1.

Derivative of the sigmoid function: σ'(x) = σ(x)(1 - σ(x))

But the sigmoid function has a flaw:
as the input approaches infinity, the derivative of σ approaches 0, so the parameter θ is essentially unchanged after an update. This situation, in which the parameters cannot be updated for a long time, is known as the vanishing gradient problem.
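A minimal sketch of this saturation effect (input values chosen for illustration): the gradient of sigmoid shrinks toward 0 as the input grows.

```python
import torch

# sigmoid saturates: its derivative sigma * (1 - sigma) goes to 0 for large |x|
x = torch.tensor([0.0, 10.0, 100.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~0.25 at x=0, nearly 0 at x=10, ~0 at x=100
```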

Implementation of sigmoid in PyTorch

import torch
a = torch.linspace(-100, 100, 10)
a

torch.sigmoid(a)
# You can also use F.sigmoid(a), though it is deprecated in favor of torch.sigmoid
# from torch.nn import functional as F


tanh

This activation function is often used in RNNs.

It can be obtained from the sigmoid by scaling and shifting, tanh(x) = 2σ(2x) - 1, so its range is -1 to 1.

Derivative of tanh: tanh'(x) = 1 - tanh²(x)

As with sigmoid, once the value of the activation is known, the derivative can be computed directly from it.

a = torch.linspace(-1, 1, 10)

torch.tanh(a)
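This relationship can be checked with autograd; a small sketch:

```python
import torch

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
t = torch.tanh(x)
t.sum().backward()
# the derivative follows directly from the activation value: 1 - tanh(x)^2
print(torch.allclose(x.grad, 1 - t.detach() ** 2))  # True
```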

ReLU

The classic activation function of deep learning, relatively simple and basic

When x < 0 the gradient is 0, and when x > 0 the gradient is 1. This makes gradient computation during backpropagation very convenient: the gradient is neither amplified nor attenuated, which helps avoid both vanishing and exploding gradients.

a = torch.linspace(-1, 1, 10)

torch.relu(a)

F.relu(a)
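A quick check of the piecewise gradient (inputs chosen to avoid the kink at 0):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1., 1.])
```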

MSE Loss

The mean squared error (MSE) is generally used for regression; classification problems use a different loss, which is left for later.

The basic form of MSE, taking the linear perceptron y = xw + b as an example: loss = Σ (y - (xw + b))²

In torch, the loss can be obtained by squaring the L2 norm: torch.norm(y - (x*w + b), 2) ** 2
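One detail worth noting: F.mse_loss averages over elements by default (reduction='mean'), while squaring the L2 norm gives the sum of squared errors. A minimal sketch with made-up values:

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 2.5, 2.5])

sum_sq = torch.norm(pred - target, 2) ** 2   # sum of squared errors: ~0.75
mse = F.mse_loss(pred, target)               # mean of squared errors: 0.25
print(sum_sq.item(), mse.item())
```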

Derivative of the loss with respect to w: ∂loss/∂w = 2 Σ (y - (xw + b)) · (-x)

Automatic differentiation with PyTorch:
first compute the MSE.
The simplest linear perceptron is used, with b = 0; x is initialized to 1, and w (dim = 1) is initialized to 2.


Automatic derivation

Taking the derivative directly raises an error: w was created without requires_grad, so no gradient information is tracked for it, and w must be updated first.

After w is updated, the computation graph must also be rebuilt, so mse has to be recomputed as well; otherwise an error is still raised.

w.requires_grad_()

mse = F.mse_loss(torch.ones(1), x*w)

torch.autograd.grad(mse, [w])


loss = (1 - 2)² = 1
grad is the derivative of the loss with respect to w:
2(1 - x·w) · (-x) = 2 · (1 - 2) · (-1) = 2
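Putting the snippet above together into a self-contained sketch (x initialized to 1, w to 2, b = 0):

```python
import torch
import torch.nn.functional as F

x = torch.ones(1)
w = torch.full([1], 2.0, requires_grad=True)

mse = F.mse_loss(torch.ones(1), x * w)   # loss = (1 - 2)^2 = 1
(grad,) = torch.autograd.grad(mse, [w])
print(grad)  # tensor([2.]), matching the manual result 2*(1 - x*w)*(-x)
```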

Supplement:
calling backward() on the loss computes the gradients along the whole path from back to front; they can then be inspected manually, e.g. w.grad.

# w and the computation graph also need to be updated
w.requires_grad_()

mse = F.mse_loss(torch.ones(1), x*w)

mse.backward()

w.grad

Comparison of the autograd.grad method and the backward method: autograd.grad returns the gradients directly, while backward() stores them in each parameter's .grad attribute.
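A side-by-side sketch of the two interfaces, assuming the same toy setup as earlier:

```python
import torch
import torch.nn.functional as F

x = torch.ones(1)
w = torch.full([1], 2.0, requires_grad=True)

# method 1: autograd.grad returns the gradient as a return value
mse = F.mse_loss(torch.ones(1), x * w)
(g,) = torch.autograd.grad(mse, [w])

# method 2: backward() accumulates the gradient into w.grad instead
mse = F.mse_loss(torch.ones(1), x * w)
mse.backward()
print(g, w.grad)  # both tensor([2.])
```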

Cross Entropy Loss

It is used as the loss for classification and will be studied in depth later. This time we mainly look at the activation function used closely with it: softmax.

Function: taking the three final output values of a model as an example, softmax converts them into probability values that sum to 1. Notice that the gap between the larger and smaller values is wider after the transformation than before it: larger inputs receive a disproportionately large share of the probability.
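A small illustration with made-up logits: the softmax outputs sum to 1, and the ratio between the two largest outputs (e^(2-1) ≈ 2.72) exceeds the ratio of the raw values (2/1 = 2).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
p = F.softmax(logits, dim=0)
print(p, p.sum())            # probabilities, summing to 1
print((p[0] / p[1]).item())  # ~2.72 > 2.0: the gap is amplified
```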

Derivative of the softmax function: ∂p_i/∂a_j

Here i and j range over the same set of indices: i indicates which output probability, and j indicates which input.

Two cases:
1. When i = j,
the derivative is positive: ∂p_i/∂a_i = p_i(1 - p_i)

2. When i ≠ j,
the derivative is negative: ∂p_i/∂a_j = -p_i p_j

To summarize: ∂p_i/∂a_j = p_i(δ_ij - p_j), where δ_ij = 1 if i = j and 0 otherwise.

PyTorch implementation

a = torch.rand(3)
a.requires_grad_()

p = F.softmax(a, dim=0)

torch.autograd.grad(p[0], [a], retain_graph=True)
# retain_graph keeps the computation graph; without it the graph is freed
# after the first call, and the next grad call would raise an error
torch.autograd.grad(p[1], [a])
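The two cases above can be checked numerically: stacking the row gradients gives the full Jacobian, which should match the closed form p_i * (delta_ij - p_j), i.e. diag(p) - p p^T. A sketch:

```python
import torch
import torch.nn.functional as F

a = torch.rand(3, requires_grad=True)
p = F.softmax(a, dim=0)

# Jacobian row by row via autograd (retain_graph keeps the graph alive)
J = torch.stack([torch.autograd.grad(p[i], [a], retain_graph=True)[0]
                 for i in range(3)])

# closed form: dp_i/da_j = p_i * (delta_ij - p_j)
pd = p.detach()
J_closed = torch.diag(pd) - torch.outer(pd, pd)
print(torch.allclose(J, J_closed, atol=1e-6))  # True
```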