Analysis of GPU memory usage of CNN (convolutional neural network) models

1. Reference materials

A brief discussion on deep learning: how to calculate the memory usage of models and intermediate variables
How to make precise use of video memory in PyTorch

2. Related introduction

0. Preliminary knowledge

For the convenience of calculation, this article performs unit conversion according to the following standards:

1 GB = 1000 MB
1 MB = 1000 KB
1 KB = 1000 B (bytes)
1 B = 8 bits

1. Calculation method of model parameters

Reference blog: Calculating the parameter count and computation cost of CNN models (concept version)

2. Data type of tensor

Data type	dtype	CPU tensor	GPU tensor
32-bit floating point	`torch.float32` or `torch.float`	`torch.FloatTensor`	`torch.cuda.FloatTensor`
64-bit floating point	`torch.float64` or `torch.double`	`torch.DoubleTensor`	`torch.cuda.DoubleTensor`
16-bit floating point	`torch.float16` or `torch.half`	`torch.HalfTensor`	`torch.cuda.HalfTensor`
8-bit integer (unsigned)	`torch.uint8`	`torch.ByteTensor`	`torch.cuda.ByteTensor`
8-bit integer (signed)	`torch.int8`	`torch.CharTensor`	`torch.cuda.CharTensor`
16-bit integer (signed)	`torch.int16` or `torch.short`	`torch.ShortTensor`	`torch.cuda.ShortTensor`
32-bit integer (signed)	`torch.int32` or `torch.int`	`torch.IntTensor`	`torch.cuda.IntTensor`
64-bit integer (signed)	`torch.int64` or `torch.long`	`torch.LongTensor`	`torch.cuda.LongTensor`

Typically, model training uses the following two data types:

float32: single-precision floating point;
int32: 32-bit integer.

An 8-bit integer occupies 1 B, and a 32-bit float occupies 4 B. The double-precision floating-point type (double) and the 64-bit integer type (long) are generally not used in ordinary model training.

Consumer-grade graphics cards are optimized for single-precision computation, whereas server-grade graphics cards typically also provide strong double-precision performance.

3. About `inplace=False`

The activation layer nn.ReLU() has a parameter inplace, which is False by default. When it is set to True, the values computed by the ReLU are not written to newly allocated memory but directly overwrite the input values. This is why setting inplace=True saves some video memory.

3. Introduction to video memory usage

0. Introduction

torch.FatalError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58

An error like the one above means the program crashed because it ran out of video memory. To avoid it, we need to know how to estimate the video memory occupied by the model and its intermediate variables.

1. Video memory usage of images

Take an RGB three-channel true-color image of 500×500 pixels stored as single-precision floats. It occupies 500 × 500 × 3 × 4 B = 3 MB of video memory. Likewise, a FloatTensor of shape (256, 3, 100, 100), in (N, C, H, W) layout, occupies 256 × 3 × 100 × 100 × 4 B = 30.72 MB ≈ 31 MB.
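
The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name tensor_megabytes and the BYTES_PER_ELEMENT table are made up for illustration, and the unit convention is the one used in this article (1 MB = 1000 × 1000 B).

```python
from math import prod

# Bytes per element for a few common dtypes (sizes follow from the bit widths).
BYTES_PER_ELEMENT = {
    "uint8": 1, "float16": 2, "float32": 4, "int32": 4, "float64": 8, "int64": 8,
}

def tensor_megabytes(shape, dtype="float32"):
    """Size of a dense tensor in MB, with 1 MB = 1000 * 1000 bytes."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype] / 1_000_000

print(tensor_megabytes((3, 500, 500)))        # 3.0
print(tensor_megabytes((256, 3, 100, 100)))   # 30.72
```

Switching dtype to "float16" in the same call halves the result, which is why the element type matters as much as the shape.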

2. Video memory usage of the model

Usually, the video memory occupied by the model comes from two parts:

The parameters of the model itself (params), that is, the weights of the network layers that have parameters.
Intermediate parameters (memory) generated while the model is computing.

Generally speaking, the parameters of the model itself do not occupy much video memory. Most of the video memory is taken up by the intermediate parameters generated during computation.

2.1 Parameters of the model itself (params)

Network layers with parameters, including:

Convolutional layer: Conv2d(Cin, Cout, K), the parameter amount is Cin × Cout × K × K (plus Cout if the layer has a bias);
Fully connected layer: Linear(M->N), the parameter amount is M × N (plus N if the layer has a bias);
BatchNorm layer: BatchNorm(N), the parameter amount is 2N (one scale and one shift per channel);
Embedding layer: Embedding(N, W), the parameter amount is N × W.
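
As a rough illustration, the formulas above can be written as small Python functions. The function names are hypothetical, and the optional bias terms are an addition to the formulas listed here (they add one value per output channel or output feature).

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Weights: c_in * c_out * k * k, plus one bias per output channel."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def linear_params(m, n, bias=True):
    """Weights: m * n, plus one bias per output feature."""
    return m * n + (n if bias else 0)

def batchnorm_params(n):
    """One learnable scale and one learnable shift per channel."""
    return 2 * n

def embedding_params(n, w):
    """One vector of width w for each of the n entries."""
    return n * w

def params_megabytes(count, type_size=4):
    """Memory of `count` parameters in MB (1 MB = 1000 * 1000 bytes)."""
    return count * type_size / 1_000_000

# Example: a 3x3 convolution from 256 to 512 channels, weights only
n = conv2d_params(256, 512, 3, bias=False)
print(n, params_megabytes(n))   # 1179648 4.718592
```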

Network layers without parameters, including:

Activation layers (ReLU, etc.);
Pooling layers;
Dropout layers.

2.2 Intermediate parameters of the model (memory)

The inputs and outputs produced by each layer as the input image passes through the model;
Additional intermediate values (gradients) produced during backward propagation;
Additional per-parameter state kept by the optimizer during optimization (for example, momentum buffers).
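
Putting the three sources together gives a back-of-the-envelope estimate of training memory. The sketch below is an assumption-laden simplification, not a PyTorch API: the function name is made up, every value is assumed to have the same element size, and the number of optimizer buffers per parameter is assumed to be 0 for plain SGD, 1 for SGD with momentum and 2 for Adam.

```python
def training_megabytes(param_count, activation_count, optimizer_states=1, type_size=4):
    """Rough estimate of training memory in MB (1 MB = 1000 * 1000 bytes).

    param_count:      number of parameter elements in the model
    activation_count: number of elements in all layer outputs of one forward pass
    optimizer_states: extra per-parameter buffers kept by the optimizer (assumed)
    """
    weights = param_count
    gradients = param_count                      # one gradient per parameter
    optimizer = optimizer_states * param_count   # e.g. momentum buffers
    activations = 2 * activation_count           # forward outputs plus their gradients
    total_elements = weights + gradients + optimizer + activations
    return total_elements * type_size / 1_000_000

# 1M parameters, 10M activation elements, Adam-style optimizer
print(training_megabytes(1_000_000, 10_000_000, optimizer_states=2))   # 96.0
```

In this example the activations account for 80 of the 96 MB, which matches the observation that intermediate results, not the parameters, dominate video memory.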

3. Actual video memory and theoretical video memory

Why does the video memory actually occupied exceed the theoretical calculation?
The likely reason is additional overhead from the deep learning framework. However, the theoretical value calculated with the formulas above will not differ much from the actual value.

4. Calculate video memory usage

4.1 Method 1 (recommended)

Use the torchstat tool to calculate the video memory usage of the model. Reference blog: Parameter calculation method of CNN convolutional neural network model (experience version)

4.2 Method 2

You can also write your own function to estimate the video memory usage. The code is as follows:

# Model memory usage estimation function
# model: input model
# input: tensor that would actually be fed to the model
# type_size: bytes per element; defaults to 4 (float32)
# Note: every entry of model.modules() after the root is called in order,
# so this assumes a flat model whose submodules run sequentially.

import numpy as np
import torch.nn as nn

def modelsize(model, input, type_size=4):
    # Total number of parameter elements in the model
    para = sum(np.prod(list(p.size())) for p in model.parameters())
    print('Model {} : params: {:.4f}M'.format(model._get_name(), para * type_size / 1000 / 1000))

    input_ = input.clone()
    input_.requires_grad_(requires_grad=False)

    mods = list(model.modules())
    out_sizes = []

    # mods[0] is the model itself, so start from index 1
    for i in range(1, len(mods)):
        m = mods[i]
        # An in-place ReLU overwrites its input and allocates no new output
        if isinstance(m, nn.ReLU) and m.inplace:
            continue
        out = m(input_)
        out_sizes.append(np.array(out.size()))
        input_ = out

    # Total number of elements across all layer outputs
    total_nums = 0
    for s in out_sizes:
        total_nums += np.prod(np.array(s))

    print('Model {} : intermediate variables: {:.3f} M (without backward)'
          .format(model._get_name(), total_nums * type_size / 1000 / 1000))
    # Backward stores a gradient for each output, so the amount doubles
    print('Model {} : intermediate variables: {:.3f} M (with backward)'
          .format(model._get_name(), total_nums * type_size * 2 / 1000 / 1000))

Important note: the theoretical video memory usage calculated here is for reference only. Because PyTorch needs additional video memory at run time, the actual usage will be slightly larger than the calculated value.

5. Video memory optimization method

Optimizing video memory in PyTorch is a necessity when processing large amounts of data: video memory is limited, while data is effectively unlimited. Only by using video memory efficiently can we make the most of our data.

Apart from algorithm-level optimizations, the most basic ways to reduce video memory usage are:

Reduce the size of the input image;
Reduce the batch size, that is, the number of images fed in per step;
Use more downsampling and pooling layers;
Set inplace=True on layers that support it, such as ReLU, for a small saving;
Purchase a graphics card with more memory;
Optimize at the deep learning framework level.
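
To see why the first two items help, note that the memory of an (N, C, H, W) tensor scales linearly with the batch size and quadratically with the image side. A minimal sketch, with a hypothetical helper name and float32 assumed:

```python
def input_megabytes(batch, channels, height, width, type_size=4):
    """Memory of one (N, C, H, W) tensor in MB (1 MB = 1000 * 1000 bytes)."""
    return batch * channels * height * width * type_size / 1_000_000

base = input_megabytes(64, 3, 512, 512)
half_batch = input_megabytes(32, 3, 512, 512)   # half as many images per step
half_side = input_megabytes(64, 3, 256, 256)    # each image side halved

print(base / half_batch)   # 2.0
print(base / half_side)    # 4.0
```

The same scaling applies to the output of every convolutional layer, so the saving carries through the intermediate results as well.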

5.1 Sacrificing computing speed to reduce video memory usage

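In PyTorch, if a model occupies too much video memory, one computation can be split into two halves. The first half is computed first and only the results needed by the second half are saved, then the second half is computed. The intermediate results inside each half are not kept; they are recomputed during the backward pass, which is where the extra computing time goes.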

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# First set requires_grad=True on the input.
# If it is not set, the resulting gradient may be 0.
input = torch.rand(1, 10, requires_grad=True)

# Let's say we have a very deep network
layers = [nn.Linear(10, 10) for _ in range(1000)]

# Define the functions to be checkpointed:
# one runs the first 500 layers, the other runs layers 500 to 998
# (the last layer is run separately below)

def run_first_half(*args):
    x = args[0]
    for layer in layers[:500]:
        x = layer(x)
    return x

def run_second_half(*args):
    x = args[0]
    for layer in layers[500:-1]:
        x = layer(x)
    return x

# Each checkpoint call keeps only its input and recomputes
# the intermediate results during the backward pass
x = checkpoint(run_first_half, input)
x = checkpoint(run_second_half, x)

# The last layer is run separately, outside any checkpoint
x = layers[-1](x)
x.sum().backward()

For nn.Sequential models, which can contain many blocks, PyTorch provides a dedicated helper function, checkpoint_sequential:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)

# Split into two parts
num_segments = 2
x = checkpoint_sequential(model, num_segments, input)
x.sum().backward()
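
How much memory a given num_segments saves can be estimated with a simplified counting model. The sketch below does not call PyTorch and does not measure anything; it assumes that every layer output has the same size, that only the input of each segment is kept during the forward pass, and that one segment at a time is recomputed in full during the backward pass. The function name is hypothetical.

```python
import math

def stored_activations(num_layers, num_segments):
    """Peak number of layer outputs held at once under the simplified model."""
    segment_length = math.ceil(num_layers / num_segments)
    # One saved input per segment, plus one fully recomputed segment
    return num_segments + segment_length

print(stored_activations(1000, 1))    # 1001: essentially no saving
print(stored_activations(1000, 2))    # 502: the two-segment split used above
print(stored_activations(1000, 32))   # 64: near the minimum, around sqrt(1000) segments
```

Under these assumptions, more segments reduce memory until the number of segments is roughly the square root of the number of layers. The price is that each segment's forward pass is executed a second time during backward.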

6. Track video memory usage

Reference blog: A further brief discussion of video memory utilization in PyTorch (with improved video memory tracking code)

We use the Pytorch-Memory-Utils tool to track changes in video memory during training and to work out how to release video memory that is no longer needed.

With Pytorch-Memory-Utils, we insert a tracking call at the points in the code where video memory is used, and it outputs information like the log below. "At __main__ <module>: line 13 Total Used Memory:696.5 Mb" means that 696.5 Mb of video memory is in use when line 13 of our code is executed. "At __main__ <module>: line 15 Total Used Memory:1142.0 Mb" means that 1142.0 Mb is in use when execution reaches line 15. The entries listed between two such records are tensors that were allocated (+) or released (-), together with the video memory each occupies.

# 12-Sep-18-21:48:45-gpu_mem_track.txt

GPU Memory Track | 12-Sep-18-21:48:45 | Total Used Memory:696.5 Mb

At __main__ <module>: line 13 Total Used Memory:696.5 Mb

 + | 7 * Size:(512, 512, 3, 3) | Memory: 66.060 M | <class 'torch.nn.parameter.Parameter'>
 + | 1 * Size:(512, 256, 3, 3) | Memory: 4.7185 M | <class 'torch.nn.parameter.Parameter'>
 + | 1 * Size:(64, 64, 3, 3) | Memory: 0.1474 M | <class 'torch.nn.parameter.Parameter'>
 + | 1 * Size:(128, 64, 3, 3) | Memory: 0.2949 M | <class 'torch.nn.parameter.Parameter'>
 + | 1 * Size:(128, 128, 3, 3) | Memory: 0.5898 M | <class 'torch.nn.parameter.Parameter'>
 + | 8 * Size:(512,) | Memory: 0.0163 M | <class 'torch.nn.parameter.Parameter'>
 + | 3 * Size:(256, 256, 3, 3) | Memory: 7.0778 M | <class 'torch.nn.parameter.Parameter'>
 + | 1 * Size:(256, 128, 3, 3) | Memory: 1.1796 M | <class 'torch.nn.parameter.Parameter'>
 + | 2 * Size:(64,) | Memory: 0.0005 M | <class 'torch.nn.parameter.Parameter'>
 + | 4 * Size:(256,) | Memory: 0.0040 M | <class 'torch.nn.parameter.Parameter'>
 + | 2 * Size:(128,) | Memory: 0.0010 M | <class 'torch.nn.parameter.Parameter'>
 + | 1 * Size:(64, 3, 3, 3) | Memory: 0.0069 M | <class 'torch.nn.parameter.Parameter'>

At __main__ <module>: line 15 Total Used Memory:1142.0 Mb

 + | 1 * Size:(60, 3, 512, 512) | Memory: 188.74 M | <class 'torch.Tensor'>
 + | 1 * Size:(30, 3, 512, 512) | Memory: 94.371 M | <class 'torch.Tensor'>
 + | 1 * Size:(40, 3, 512, 512) | Memory: 125.82 M | <class 'torch.Tensor'>

At __main__ <module>: line 21 Total Used Memory:1550.9 Mb

 + | 1 * Size:(120, 3, 512, 512) | Memory: 377.48 M | <class 'torch.Tensor'>
 + | 1 * Size:(80, 3, 512, 512) | Memory: 251.65 M | <class 'torch.Tensor'>

At __main__ <module>: line 26 Total Used Memory:2180.1 Mb

- | 1 * Size:(120, 3, 512, 512) | Memory: 377.48 M | <class 'torch.Tensor'>
- | 1 * Size:(40, 3, 512, 512) | Memory: 125.82 M | <class 'torch.Tensor'>

At __main__ <module>: line 32 Total Used Memory:1676.8 Mb
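
The Memory column in the log above can be checked by hand: each entry equals the tensor count times the number of elements times the element size. The sketch below assumes float32 tensors (4 B per element) and units of 10^6 bytes. The function name is made up and is not part of Pytorch-Memory-Utils.

```python
from math import prod

def entry_megabytes(count, shape, type_size=4):
    """Memory of `count` tensors of the given shape, in units of 10**6 bytes."""
    return count * prod(shape) * type_size / 1_000_000

print(entry_megabytes(7, (512, 512, 3, 3)))    # 66.060288, logged as 66.060 M
print(entry_megabytes(1, (60, 3, 512, 512)))   # 188.74368, logged as 188.74 M
```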

This tracking tool is applicable not only to PyTorch but also to other deep learning frameworks. However, you need to pay attention to the difference between static graphs and dynamic graphs in how the program actually runs.