Exploring the reasons behind CUDA out of memory: how do you release GPU memory?

Contents
  • 1 Problem background
  • 2 Problem exploration
    • 2.1 CUDA's inherent memory footprint
    • 2.2 Active and inactive memory
    • 2.3 Releasing GPU memory
  • 3 Problem summary
  • 4 Say goodbye to bugs

1 Problem background

Anyone who has worked with deep learning will be familiar with CUDA out-of-memory errors like the one below.

RuntimeError: CUDA out of memory. Tried to allocate 916.00 MiB (GPU 0; 6.00 GiB total capacity; 4.47 GiB already allocated; 186.44 MiB free; 4.47 GiB reserved in total by PyTorch)

This article explores CUDA's memory management mechanism and summarizes solutions to this problem.

2 Problem exploration

2.1 CUDA's inherent memory footprint

Before starting the experiment, clear the environment and enter nvidia-smi in the terminal.


Next, create a small tensor on the GPU:

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.randn((2, 3), device=device)

nvidia-smi now shows a total of 448M of GPU memory occupied.


When we increase the size of the tensor, for example

torch.randn((200, 300, 200, 20), device=device)

the GPU memory usage also increases, to a total of 1362M.
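
As a sanity check, the increase over the idle baseline, 1362M - 448M ≈ 914M, matches the size of the tensor: 200 × 300 × 200 × 20 = 240,000,000 float32 values × 4 bytes ≈ 916 MiB.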

This shows that GPU memory usage is positively related to the size of the stored data: the larger the data, the more memory is occupied. That much is obvious, but does the converse hold: the smaller the data, the less memory is occupied? Let's run an experiment:

torch.randn((1, 1), device=device)

Still occupying 448M


In fact, this is because the CUDA runtime itself occupies a certain amount of GPU memory once it is initialized; in the local software and hardware environment this is 448M. Different CUDA versions and graphics card models have different amounts of this fixed overhead. In other words, as long as the GPU is used at all, it will occupy at least x MB of GPU memory, and this part of the memory cannot be released.
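
To confirm that this fixed overhead does not come from the tensors themselves, we can compare what PyTorch's allocator actually holds for tensors against what the device reports overall. A minimal sketch (torch.cuda.mem_get_info requires a reasonably recent PyTorch, and the base figure itself differs between setups):

import torch

device = torch.device('cuda')
x = torch.randn((1, 1), device=device)

# Bytes actually allocated for tensors by PyTorch: a few hundred bytes at most
print(torch.cuda.memory_allocated(device), 'bytes allocated for tensors')

# Total memory in use on the device (what nvidia-smi reports), which includes
# the fixed CUDA context overhead -- roughly 448 MiB in the author's environment
free, total = torch.cuda.mem_get_info(device)
print((total - free) / 1024**2, 'MiB in use on the device')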

2.2 Active and inactive memory

Given the following two code snippets, which one will report an error?

  • Code A
    x1 = torch.randn((200, 300, 200, 20), device=device)
    x2 = torch.randn((200, 300, 200, 20), device=device)
    x3 = torch.randn((200, 300, 200, 20), device=device)
    x4 = torch.randn((200, 300, 200, 20), device=device)
    x5 = torch.randn((200, 300, 200, 20), device=device)
    x6 = torch.randn((200, 300, 200, 20), device=device)
  • Code B
    x = torch.randn((200, 300, 200, 20), device=device)
    x = torch.randn((200, 300, 200, 20), device=device)
    x = torch.randn((200, 300, 200, 20), device=device)
    x = torch.randn((200, 300, 200, 20), device=device)
    x = torch.randn((200, 300, 200, 20), device=device)
    x = torch.randn((200, 300, 200, 20), device=device)

The answer, as you may have guessed, is that code A reports the error, and this is related to CUDA's memory activation mechanism. In code A all six tensors are bound to different variables, so roughly 6 × 916 MiB must stay allocated at the same time, which exceeds the 6 GiB card. CUDA's current data space can be regarded as a queue containing two kinds of memory: active memory and inactive memory. When a piece of memory is no longer referenced by any variable, it is converted from active memory to inactive memory, but it still exists in the data queue.
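
In PyTorch these two kinds of memory can be inspected directly: torch.cuda.memory_allocated() roughly corresponds to the active memory, and torch.cuda.memory_reserved() to the whole data queue (active plus inactive). A minimal sketch of a block flipping from active to inactive (exact figures depend on the allocator):

import torch

device = torch.device('cuda')

x = torch.randn((200, 300, 200, 20), device=device)
print(torch.cuda.memory_allocated() / 1024**2)  # ~916 MiB active
print(torch.cuda.memory_reserved() / 1024**2)   # ~916 MiB held in the queue

del x  # no variable references the block any more: it becomes inactive ...
print(torch.cuda.memory_allocated() / 1024**2)  # ~0 MiB active
print(torch.cuda.memory_reserved() / 1024**2)   # ... but the queue still holds ~916 MiB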

Next, when a new piece of data is added, CUDA will first reuse inactive memory in the queue to store it. If the new data needs more space than all of the inactive memory in the queue, additional space is requested from the GPU and added to the queue, which effectively expands the queue's capacity. If the new data fits roughly within the inactive memory already in the queue, the overall GPU memory usage stays almost unchanged.

This can be verified experimentally. Running

x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300), device=device)

occupies 1364M of GPU memory, while running only

x = torch.randn((200, 300, 200, 20), device=device)

occupies 1362M, which is almost the same. But when the new data needs more space than all of the inactive memory in the queue, as in

x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((300, 300, 300, 20), device=device)

the GPU memory usage soars to 3422M. When the data queue reaches a certain threshold, CUDA triggers its garbage collection mechanism to clean up the inactive memory.
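
The same reuse-versus-expansion behaviour can also be watched from inside Python via the allocator counter torch.cuda.memory_reserved(), rather than through nvidia-smi. A minimal sketch (the numbers are approximate):

import torch

device = torch.device('cuda')

x = torch.randn((200, 300, 200, 20), device=device)  # ~916 MiB block enters the queue
print(torch.cuda.memory_reserved() / 1024**2)

x = torch.randn((200, 300), device=device)           # old block becomes inactive but stays in the queue
print(torch.cuda.memory_reserved() / 1024**2)        # roughly unchanged

x = torch.randn((300, 300, 300, 20), device=device)  # ~2060 MiB: no inactive block is big enough
print(torch.cuda.memory_reserved() / 1024**2)        # the queue has to expand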

The above experiments also explain a very common pattern in deep learning training code:

for images, labels in train_bar:
    images, labels = images.to(config.device), labels.to(config.device)
    # Clear gradients
    opt.zero_grad()
    # Forward pass
    outputs = model(images)
    # Compute the loss
    loss = F.cross_entropy(outputs, labels)
    # Backward pass
    loss.backward()
    # Update the model
    opt.step()

Why does the GPU memory usage stay unchanged from iteration to iteration? Essentially, this loop does the same thing as code B above: each iteration rebinds the same variable names, so the previous iteration's tensors become inactive memory that can be reused by the next iteration.
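
This can be confirmed by logging the allocator counter inside such a loop. A minimal sketch, with a dummy linear model and random batches standing in for the model, train_bar, and config.device used above:

import torch
import torch.nn.functional as F

device = torch.device('cuda')
model = torch.nn.Linear(1000, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(5):
    # Random batch standing in for (images, labels) from train_bar
    images = torch.randn((64, 1000), device=device)
    labels = torch.randint(0, 10, (64,), device=device)
    opt.zero_grad()
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)
    loss.backward()
    opt.step()
    # Each iteration rebinds images/labels/outputs/loss, so the previous step's
    # tensors become inactive memory that the next step reuses -- this stays flat
    print(step, round(torch.cuda.memory_allocated() / 1024**2, 1), 'MiB active')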

2.3 Releasing GPU memory

Run the following command to manually clear the inactive memory in the GPU data queue:

torch.cuda.empty_cache()

Note that the above command may need to be run more than once before the space is actually released. For example, after running

x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = 1

x now refers to a Python int, so none of the memory in the GPU data queue is referenced by any variable; the whole queue is inactive memory. Yet nvidia-smi still reports 2278M occupied at this point. Only after additionally running torch.cuda.empty_cache() does it drop back to 448M, the base occupation: although there is no longer any data on the GPU, the CUDA runtime is still initialized, so that part cannot be released.
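
The effect of torch.cuda.empty_cache() can also be measured without leaving Python, again via torch.cuda.memory_reserved(). A minimal sketch:

import torch

device = torch.device('cuda')

x = torch.randn((200, 300, 200, 20), device=device)
x = 1                                          # the ~916 MiB block is now inactive but still reserved
print(torch.cuda.memory_reserved() / 1024**2)  # ~916 MiB still held in the queue

torch.cuda.empty_cache()                       # hand the inactive blocks back to the driver
print(torch.cuda.memory_reserved() / 1024**2)  # ~0 MiB; nvidia-smi drops back to the base occupation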

3 Problem summary

Summary of CUDA GPU memory management:

  • GPU memory usage is positively related to the size of the stored data: the larger the data, the more memory it takes up.
  • As long as the GPU is used, at least x MB of GPU memory (the base occupation, 448M in this environment) is taken up, and this part cannot be released.
  • When a piece of memory is no longer referenced by any variable, it is converted from active memory to inactive memory, but it still exists in the data queue.
  • When the data queue reaches a certain threshold, CUDA triggers its garbage collection mechanism to clean up the inactive memory.
  • Run torch.cuda.empty_cache() to manually clean up the inactive memory.

Based on the above, we can derive the corresponding solutions to the problem:

  • Reduce batch_size

    Essentially, this keeps the GPU data queue from having to request more space than the GPU memory can provide.

  • Check whether any data is kept on the GPU without being released

    for example:

    app = []
    for _ in range(1000):
        app.append(torch.randn((200, 300, 200, 20), device=device))

    Here app.append keeps a reference to each tensor torch.randn((200, 300, 200, 20), device=device) that is created and stores it in the list, so every saved tensor remains implicitly referenced. The GPU therefore keeps accumulating active memory that is never released, which eventually causes a crash. A fix is sketched at the end of this section.

  • Wrap the test and validation phases during training in a with torch.no_grad() block

    The principle is that no gradients are computed inside the block, so the intermediate results needed for backpropagation are not kept and far less data is added to the data queue, as shown in the sketch below.
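
As a concrete illustration of the last two points, the sketch below accumulates only Python floats instead of GPU tensors and wraps evaluation in torch.no_grad(). The linear model and random validation batches are placeholders for whatever model and val_loader a real project defines:

import torch
import torch.nn.functional as F

device = torch.device('cuda')
model = torch.nn.Linear(1000, 10).to(device)
# Placeholder validation data; in practice this would be a DataLoader
val_loader = [(torch.randn(64, 1000), torch.randint(0, 10, (64,))) for _ in range(10)]

losses = []
model.eval()
with torch.no_grad():                # no graph is built, so intermediate results are not kept
    for images, labels in val_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = F.cross_entropy(outputs, labels)
        losses.append(loss.item())   # .item() copies the value to a Python float,
                                     # so the list never keeps GPU tensors alive
print('mean validation loss:', sum(losses) / len(losses))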