pytorch series 3 – dynamic calculation graph and automatic differentiation

This article is based on PyTorch 1.10 (see the torch page of the PyTorch 1.10 documentation).

The code for this article is available on GitHub at Shirley-Xie/pytorch_exercise, together with the running results.

1. Dynamic calculation graph

Problems solved: Deep learning involves a growing variety and number of tensor operations, which raises practical questions: should several operations run in parallel or sequentially, how should the underlying devices cooperate, how can redundant operations be avoided, and so on. These issues affect the execution efficiency of an algorithm and can even introduce bugs.

Computational graph: It is composed of nodes and edges. Nodes represent tensors or functions, and edges represent the dependence between tensors and functions. For example, the figure below represents y=wx + b.
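The graph for y = wx + b can also be inspected from Python. The sketch below is only an illustration with made-up scalar values: it reads the grad_fn attribute of the result and the next_functions attribute of that node to show which Function nodes the result depends on.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)                      # input, no gradient needed
b = torch.tensor(1.0, requires_grad=True)
y = w * x + b

# The node that produced y is the addition
root_name = type(y.grad_fn).__name__
# Its inputs are the multiplication w * x and the leaf tensor b
child_names = [type(fn).__name__ for fn, _ in y.grad_fn.next_functions]

print(root_name)    # AddBackward0
print(child_names)  # ['MulBackward0', 'AccumulateGrad']
```

AccumulateGrad is the node that writes the gradient into the grad attribute of a leaf tensor, here b.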


The advantage of using calculation graphs is that they make the calculation look more concise and make gradient derivation more convenient. The gradient accumulation rule that follows from the chain rule can be read directly from the graph. Note that the grad attribute of a tensor is not cleared automatically: every backward pass adds to it, so it must be set to zero manually when needed.
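The accumulation behaviour is easy to check on a tiny example. The following is a minimal sketch with an illustrative scalar x: two backward passes over y = x**2 are run without clearing, and then grad is reset by hand.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 4
(x ** 2).backward()
first = x.grad.clone()

# Second pass on a freshly built graph: the new gradient is added to x.grad
(x ** 2).backward()
accumulated = x.grad.clone()

# Manual reset before the next pass
x.grad.zero_()
cleared = x.grad.clone()

print(first, accumulated, cleared)  # tensor(4.) tensor(8.) tensor(0.)
```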

The dynamics here have two main meanings:

The first meaning is: The forward propagation of the calculation graph is performed immediately. There is no need to wait for the complete calculation graph to be created. Each statement will dynamically add nodes and edges to the calculation graph, and forward propagation will be performed immediately to obtain the calculation results.

The second meaning is: The calculation graph is destroyed immediately after backpropagation. Once the backward method or torch.autograd.grad has run, the graph that was created is freed to release storage space, so the next backward pass needs a graph rebuilt by a new forward pass.

This is discussed below in different contexts.

1.1 Calculating tensors in graphs

Forward propagation is performed immediately

import torch
w = torch.tensor([[3.0,1.0]],requires_grad=True)
b = torch.tensor([[3.0]],requires_grad=True)
X = torch.randn(10,2)
Y = torch.randn(10,1)
Y_hat = X@w.t() + b # Once Y_hat is defined, its forward propagation is executed immediately, regardless of the loss statement that follows.
loss = torch.mean(torch.pow(Y_hat-Y,2))




Destroyed immediately after backpropagation

# The calculation graph is destroyed immediately after backpropagation. To keep it, set retain_graph=True
loss.backward() # or: loss.backward(retain_graph=True)

#loss.backward() # Running backpropagation a second time raises an error
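The error and the effect of retain_graph can be observed directly. This is a small sketch on a simplified loss (no bias term, chosen only for brevity): the first call keeps the graph, the second call succeeds and frees it, and the third call fails.

```python
import torch

w = torch.tensor([[3.0, 1.0]], requires_grad=True)
X = torch.randn(10, 2)
loss = torch.mean(torch.pow(X @ w.t(), 2))

loss.backward(retain_graph=True)  # graph is kept
loss.backward()                   # works, and the graph is freed afterwards

try:
    loss.backward()               # the saved intermediate values are gone
    third_failed = False
except RuntimeError as err:
    third_failed = True
    print("RuntimeError:", str(err)[:60])

print(third_failed)  # True
```

Keep in mind that the two successful calls both add to w.grad, so it holds twice the gradient at this point.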

1.2 Function in calculation graph

We are already familiar with tensors in calculation graphs. The other type of node is the Function, which covers the various Pytorch functions that operate on tensors. The big difference between these Functions and ordinary Python functions is that a Function contains both forward calculation logic and backpropagation logic. We can create a Function that supports backpropagation by inheriting from torch.autograd.Function.

class MyReLU(torch.autograd.Function):
    # Forward propagation logic; ctx can store values needed for backpropagation.
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input) # saved so that backward can read ctx.saved_tensors
        return input.clamp(min=0)

    # Backpropagation logic
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

w = torch.tensor([[3.0,1.0]],requires_grad=True)
b = torch.tensor([[3.0]],requires_grad=True)
X = torch.tensor([[-1.0,-1.0],[1.0,1.0]])
Y = torch.tensor([[2.0,3.0]])

relu = MyReLU.apply # relu now supports both forward and backward propagation
Y_hat = relu(X@w.t() + b)
loss = torch.mean(torch.pow(Y_hat-Y,2))
loss.backward()

print(w.grad)
# The gradient function of Y_hat is the MyReLU.backward defined by ourselves
print(Y_hat.grad_fn)

The result is:

tensor([[4.5000, 4.5000]])
<torch.autograd.function.MyReLUBackward object at 0x1162c8798>
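One way to sanity-check a custom Function is to compare its gradient with the built-in operator. The sketch below repeats the MyReLU class so that it runs on its own, and uses an illustrative input without zeros (the two implementations may choose different subgradients exactly at 0).

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)   # keep the input for the backward pass
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0      # no gradient flows where the input was negative
        return grad_input

z = torch.tensor([-2.0, -0.5, 1.0, 3.0])

z_custom = z.clone().requires_grad_()
MyReLU.apply(z_custom).sum().backward()

z_builtin = z.clone().requires_grad_()
torch.relu(z_builtin).sum().backward()

print(z_custom.grad)                                # tensor([0., 0., 1., 1.])
print(torch.equal(z_custom.grad, z_builtin.grad))   # True
```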

1.3 Leaf nodes

1.3.1 The relationship between leaf nodes and grad

If we inspect the tensors from the previous examples after backpropagation, the grad of Y_hat and loss is None, which may not be what we expected. Why is this happening? Because they are not leaf node tensors. tensor.is_leaf tells whether a tensor is a leaf node, and during backpropagation only the gradient values of leaf nodes are retained. The gradients of non-leaf nodes are released by default once backpropagation is completed. Pytorch designs the rule this way mainly to save memory or video memory.

In a Pytorch neural network, backward() is used to find the gradients of leaf nodes. The weight tensors w of the network layers are all leaf nodes: their requires_grad is True and they are created by the user rather than produced by an operation. Backpropagation with backward() exists to find their gradients.

When backward() is called, the gradient of a node is stored only when its requires_grad and is_leaf are both True. So in the following example, only w and b end up with a gradient. By default, the gradient of a leaf node is saved in the grad attribute of the tensor.

w = torch.tensor([[3.0,1.0]],requires_grad=True)
b = torch.tensor([[3.0]],requires_grad=True)
X = torch.randn(10,2)
Y = torch.randn(10,1)
Y_hat = X@w.t() + b
loss = torch.mean(torch.pow(Y_hat-Y,2))
loss.backward()

print(w.is_leaf, X.is_leaf, b.is_leaf, Y.is_leaf, Y_hat.is_leaf, loss.is_leaf)
print(w.grad, X.grad, b.grad, Y.grad, Y_hat.grad, loss.grad)


 True True True True False False
 tensor([[14.0855, 5.2406]]) None tensor([[8.6295]]) None None None

1.3.2 Judgment of leaf nodes

1. Tensors whose requires_grad is False are leaf tensors by convention.

The inputs used to train a model are an example: they have requires_grad=False because no gradient needs to be calculated for them (training updates the weights of the network, not the input). They are the starting point of the computational graph. The result of an operation such as flipping on such a tensor still has requires_grad=False and is therefore still reported as a leaf; a result stops being a leaf only when a tensor with requires_grad=True takes part in the operation.

a = torch.randn(10,2)
print(a.is_leaf) # True
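The is_leaf flag of tensors that do not require gradients can be checked after an operation as well. A minimal sketch, with tensor names chosen only for illustration:

```python
import torch

a = torch.randn(10, 2)             # requires_grad defaults to False
flipped = a.flip(0)                # result of an operation on a
print(flipped.requires_grad, flipped.is_leaf, flipped.grad_fn)  # False True None

w = torch.ones(2, 1, requires_grad=True)
out = a @ w                        # a tensor that requires grad is now involved
print(out.requires_grad, out.is_leaf)  # True False
```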

2. Tensors whose requires_grad is True are leaf tensors if they are created by the user.

For example, the parameters of network layers such as nn.Linear() and nn.Conv2d() are created when the user creates the layer, and they need to be trained, so requires_grad=True. They are not the result of an operation, so their grad_fn is None.
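This can be confirmed by listing the parameters of a layer. The sketch below uses a small nn.Linear layer whose sizes are arbitrary:

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)
for name, p in layer.named_parameters():
    print(name, p.requires_grad, p.is_leaf, p.grad_fn)
# weight True True None
# bias True True None
```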

However, leaf nodes can easily become non-leaf nodes.

a= torch.randn(10,2,requires_grad=True)
print(a.is_leaf) #True
c= torch.randn(10,2,requires_grad=True).reshape(4,5)
print(c.is_leaf) # False
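If a reshaped tensor is meant to be a trainable leaf, one possible approach (an assumption about the intent, not something the example above prescribes) is to reshape first and enable gradients afterwards with requires_grad_():

```python
import torch

# Reshaping a tensor that already requires grad yields a non-leaf result
c = torch.randn(10, 2, requires_grad=True).reshape(4, 5)
print(c.is_leaf, c.grad_fn is not None)   # False True

# Reshaping first and enabling gradients afterwards keeps the tensor a leaf
c_leaf = torch.randn(10, 2).reshape(4, 5).requires_grad_()
print(c_leaf.is_leaf, c_leaf.grad_fn)     # True None

c_leaf.sum().backward()
print(c_leaf.grad.shape)                  # torch.Size([4, 5])
```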

1.3.3 Non-leaf nodes save gradients

If you need to retain the gradient of the intermediate calculation result into the grad attribute, you can use the retain_grad method. If you just want to view the gradient value for debugging the code, you can use register_hook to print the log.

For example, to keep the gradient of the non-leaf tensor loss, call retain_grad() on it before running backward().

# Forward propagation
w = torch.tensor([[3.0,1.0]],requires_grad=True)
b = torch.tensor([[3.0]],requires_grad=True)
X = torch.randn(10,2)
Y = torch.randn(10,1)
Y_hat = X@w.t() + b

#Non-leaf node gradient display control
Y_hat.register_hook(lambda grad: print('Y_hat grad: ', grad))

loss = torch.mean(torch.pow(Y_hat-Y,2))
loss.retain_grad() # keep the gradient of the non-leaf tensor loss in loss.grad
# Backpropagation
loss.backward()

print(w.is_leaf, X.is_leaf, b.is_leaf, Y.is_leaf, Y_hat.is_leaf, loss.is_leaf)
print(w.grad, X.grad, b.grad, Y.grad, Y_hat.grad, loss.grad)


Y_hat grad:  tensor([[-4.0340e-01],
        [ 1.6016e+00],
        [ 1.4119e+00],
        ... (remaining rows omitted)
 True True True True False False
 tensor([[5.0101, 1.2299]]) None tensor([[4.3500]]) None None tensor(1.)
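The same method works for Y_hat. As a minimal sketch, the retained gradient can be compared with the analytic value 2 * (Y_hat - Y) / N for a mean squared error over N elements:

```python
import torch

w = torch.tensor([[3.0, 1.0]], requires_grad=True)
b = torch.tensor([[3.0]], requires_grad=True)
X = torch.randn(10, 2)
Y = torch.randn(10, 1)

Y_hat = X @ w.t() + b
Y_hat.retain_grad()                 # must be called before backward()
loss = torch.mean(torch.pow(Y_hat - Y, 2))
loss.backward()

# Analytic gradient of the mean squared error with respect to Y_hat
expected = 2 * (Y_hat.detach() - Y) / Y_hat.numel()
matches = torch.allclose(Y_hat.grad, expected)
print(Y_hat.is_leaf, matches)       # False True
```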

2. Automatic differentiation

Neural networks usually rely on backpropagation to compute the gradients used to update network parameters. Deriving gradients by hand is complex and error-prone, so the deep learning framework completes this gradient calculation for us automatically.

Pytorch generally implements this gradient calculation through the backward method. The gradient obtained by this method is stored in the grad attribute of the corresponding independent variable tensor. In addition, the torch.autograd.grad function can be called to calculate gradients. This is Pytorch’s automatic differentiation mechanism.

2.1 Use the backward method to derive tensor derivatives

The backward method is usually called on a scalar tensor, and the gradient obtained by this method will be stored under the grad attribute of the corresponding independent variable tensor.

torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)

tensors: the tensors to differentiate, such as loss.
retain_graph: whether to keep the calculation graph. Since Pytorch uses a dynamic graph mechanism, the calculation graph is released after each backpropagation. To prevent this, set the parameter to True.
create_graph: whether to build a calculation graph of the derivative itself, which is needed for higher-order derivatives.
grad_tensors: the weights applied to each element when the tensor being differentiated is not a scalar. If several losses need gradients, the weight of each loss is set here.

Calling backward() on a tensor invokes this function internally.
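The equivalence can be checked on a small example. In this sketch, make_loss is a hypothetical helper introduced only so that the same graph can be built twice:

```python
import torch

def make_loss(w):
    # Hypothetical helper: loss = (w * x)^2, so d(loss)/dw = 2 * w * x^2
    x = torch.tensor([2.0])
    return ((w * x) ** 2).sum()

w1 = torch.tensor([1.0], requires_grad=True)
make_loss(w1).backward()                 # method form

w2 = torch.tensor([1.0], requires_grad=True)
torch.autograd.backward(make_loss(w2))   # function form

print(w1.grad, w2.grad)  # tensor([8.]) tensor([8.])
```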

w = torch.tensor([1.], requires_grad=True)
x = torch.tensor([2.], requires_grad=True)

a = torch.add(w, x)
b = torch.add(w, 1)

y0 = torch.mul(a, b) #dy0/dw = 5
y1 = torch.add(a, b) #dy1 /dw = 2
# y0 and y1 are concatenated into a non-scalar loss, so backward() needs one weight per element. The final gradient of w is the weighted sum of the two gradients. Without the weights, an error is reported.
loss = torch.cat([y0, y1], dim=0)

grad_tensors = torch.tensor([1.,2.])
loss.backward(gradient=grad_tensors) # 5*1 + 2*2 = 9
print(w.grad)
# Result: tensor([9.])
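Another way to read the gradient argument is as the weights of a weighted sum: passing it to backward() gives the same result as multiplying the outputs by the weights, summing, and differentiating the resulting scalar. A minimal sketch, where build is a hypothetical helper that recreates the graph above:

```python
import torch

def build(w):
    # Hypothetical helper recreating y0 = a * b and y1 = a + b from the example
    x = torch.tensor([2.0])
    a = w + x
    b = w + 1
    return torch.cat([a * b, a + b], dim=0)

weights = torch.tensor([1.0, 2.0])

w1 = torch.tensor([1.0], requires_grad=True)
build(w1).backward(gradient=weights)

w2 = torch.tensor([1.0], requires_grad=True)
(build(w2) * weights).sum().backward()

print(w1.grad, w2.grad)  # tensor([9.]) tensor([9.])
```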

2.2 Higher-order derivatives with the autograd.grad method

torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)

1. Higher-order derivatives

# Derivative of f(x) = a*x**2 + b*x + c

x = torch.tensor(0.0,requires_grad = True) # x needs to be differentiated
a = torch.tensor(1.0)
b = torch.tensor(-2.0)
c = torch.tensor(1.0)
y = a*torch.pow(x,2) + b*x + c

# Setting create_graph=True builds a graph for the derivative, which allows higher-order derivatives
dy_dx = torch.autograd.grad(y,x,create_graph=True)[0]
print(dy_dx.data) # tensor(-2.)

# Find the second derivative
dy2_dx2 = torch.autograd.grad(dy_dx,x)[0]
print(dy2_dx2.data) # tensor(2.)
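The role of create_graph can be seen by comparing the first derivative computed with and without it. The sketch below uses an illustrative function y = x**3, which is not part of the example above:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# Without create_graph the derivative is a plain value with no graph behind it
(g_plain,) = torch.autograd.grad(y, x, retain_graph=True)
# With create_graph the derivative is itself part of a graph
(g_graph,) = torch.autograd.grad(y, x, create_graph=True)
print(g_plain.requires_grad, g_graph.requires_grad)  # False True

# Second derivative: d(3x^2)/dx = 6x
(d2,) = torch.autograd.grad(g_graph, x)
print(g_graph.item(), d2.item())  # 27.0 18.0
```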


2. Derivatives of multiple independent variables

x1 = torch.tensor(1.0,requires_grad = True) # x needs to be differentiated
x2 = torch.tensor(2.0,requires_grad = True)

y1 = x1*x2
y2 = x1 + x2

# Allows derivatives to be calculated on multiple independent variables at the same time
(dy1_dx1,dy1_dx2) = torch.autograd.grad(outputs=y1,inputs = [x1,x2],retain_graph = True)
print(dy1_dx1,dy1_dx2) # tensor(2.) tensor(1.)

# If there are multiple dependent variables, it is equivalent to summing the gradient results of multiple dependent variables.
(dy12_dx1,dy12_dx2) = torch.autograd.grad(outputs=[y1,y2],inputs = [x1,x2])
print(dy12_dx1,dy12_dx2) # tensor(3.) tensor(2.)


tensor(2.) tensor(1.)
tensor(3.) tensor(2.)
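The summing behaviour for multiple dependent variables can be verified by differentiating y1 + y2 directly. A minimal sketch; it also shows that torch.autograd.grad returns the gradients without writing them to the grad attribute:

```python
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
y1 = x1 * x2
y2 = x1 + x2

# Gradients for a list of outputs (graph kept for the second call)
g_list = torch.autograd.grad(outputs=[y1, y2], inputs=[x1, x2], retain_graph=True)
# Gradients of the explicit sum of the outputs
g_sum = torch.autograd.grad(outputs=y1 + y2, inputs=[x1, x2])

print(g_list)            # (tensor(3.), tensor(2.))
print(g_sum)             # (tensor(3.), tensor(2.))
print(x1.grad, x2.grad)  # None None
```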

References:

Pytorch leaf tensor (leaf node) (detach), hxxjxw’s blog, CSDN Blog

GitHub – lyhue1991/eat_pytorch_in_20_days: Pytorch is delicious, just eat it!

System Learning Pytorch Notes 2: Pytorch’s dynamic graphs, automatic derivation and logistic regression – CSDN Blog