GPU activation: exploring the causes of CUDA out of memory and how to release GPU memory

Contents
1 Problem background
2 Problem exploration
2.1 CUDA inherent video memory
2.2 Memory activation and deactivation
2.3 Release GPU memory
3 Problem summary
4 Say goodbye to bugs

1 Problem background

Anyone who has studied deep learning will be familiar with CUDA out-of-memory errors like the one below.

```
RuntimeError: CUDA out of memory. Tried to allocate 916.00 MiB (GPU 0; 6.00 GiB total capacity; 4.47 GiB already allocated; 186.44 MiB free; 4.47 GiB reserved in total by PyTorch)
```

This article explores how GPU memory is managed when PyTorch runs on CUDA and summarizes the solutions to this problem.

2 Problem exploration

2.1 CUDA inherent video memory

Before starting the experiment, clear the environment and run nvidia-smi in the terminal to check the current video memory usage. Next, store a small tensor on the GPU:

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.randn((2, 3), device=device)
```

nvidia-smi now reports 448M of video memory in use. If we increase the size of the tensor, for example

```python
torch.randn((200, 300, 200, 20), device=device)
```

the GPU usage rises as well, to 1362M. This shows that GPU video memory usage is positively related to the size of the stored data: the larger the data, the more video memory is occupied.

That much is obvious, but does the converse hold? Does smaller data always occupy less video memory? Try an experiment:

```python
torch.randn((1, 1), device=device)
```

Usage is still 448M. The reason is that the first use of CUDA in a process creates a CUDA context on the device, and this context, including the GPU kernels loaded into it, occupies a certain amount of video memory before any tensor data is stored. In the local software and hardware environment this baseline is 448M; different CUDA versions or graphics card models have different baselines.
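The jump from 448M to 1362M can be sanity-checked with a little arithmetic. The sketch below assumes the default float32 dtype (4 bytes per element) for `torch.randn`; the helper name `tensor_mib` is hypothetical and only multiplies the shape out, it never touches the GPU.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: torch.randn uses its default dtype, float32


def tensor_mib(shape, itemsize=BYTES_PER_FLOAT32):
    """Return the raw data size of a dense tensor with this shape, in MiB."""
    return math.prod(shape) * itemsize / 2**20


# The large tensor from the experiment above.
large = tensor_mib((200, 300, 200, 20))
print(f"{large:.2f} MiB")  # 915.53 MiB
```

The result is close to both the 914M increase seen in nvidia-smi (1362M minus 448M) and the 916.00 MiB request in the error message, while a (2, 3) or (1, 1) tensor is only a few bytes and disappears next to the baseline.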
Put simply, once the GPU is in use it occupies at least $x$ M of video memory (448M in this environment), and this part of the video memory cannot be released while the process is running.

2.2 Memory activation and deactivation

Given the following two snippets, which one will report an error?

Code A

```python
x1 = torch.randn((200, 300, 200, 20), device=device)
x2 = torch.randn((200, 300, 200, 20), device=device)
x3 = torch.randn((200, 300, 200, 20), device=device)
x4 = torch.randn((200, 300, 200, 20), device=device)
x5 = torch.randn((200, 300, 200, 20), device=device)
x6 = torch.randn((200, 300, 200, 20), device=device)
```

Code B

```python
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
```

As you might guess, Code A reports the error: it keeps six tensors of roughly 916 MiB each referenced at the same time, which is more than the 6 GiB card in this environment can provide. Code B survives because of the way GPU memory is activated and deactivated.

The GPU memory currently held by the process can be regarded as a queue (in PyTorch this bookkeeping is done by its caching allocator). The queue contains two kinds of memory: active memory and inactive memory. When a piece of memory is no longer referenced by any variable, it changes from active to inactive, but it still remains in the queue. When a new piece of data arrives, part of the inactive memory is reused to store it.
If the new data needs more space than the inactive memory in the queue can provide, additional space is requested from the video memory and added to the queue, which is equivalent to expanding the queue's capacity. If the new data fits into the inactive memory already in the queue, the video memory usage stays almost unchanged.

This can be verified experimentally. Running

```python
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300), device=device)
```

occupies 1364M, almost the same as the 1362M of running

```python
x = torch.randn((200, 300, 200, 20), device=device)
```

alone. But when the new data takes up more space than the queue can offer,

```python
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((300, 300, 300, 20), device=device)
```

the video memory usage soars to 3422M. Note that Python evaluates the right-hand side before rebinding x, so the first tensor is still active while the second, roughly 2060 MiB, tensor is being allocated; the queue has to grow by the full size of the new tensor.

When the data queue reaches a certain threshold (by default, when a request for more video memory fails), PyTorch's caching allocator cleans up inactive memory before giving up with an out-of-memory error.

The experiments above explain a very common piece of deep learning code:

```python
import torch.nn.functional as F

# model, opt, config and train_bar are assumed to be defined elsewhere
for images, labels in train_bar:
    images, labels = images.to(config.device), labels.to(config.device)
    # Clear gradients
    opt.zero_grad()
    # Forward propagation
    outputs = model(images)
    # Compute loss
    loss = F.cross_entropy(outputs, labels)
    # Backpropagation
    loss.backward()
    # Update model
    opt.step()
```

Why does the GPU memory stay roughly constant from one iteration to the next? Essentially, the loop does what Code B does: every iteration rebinds the same variable names, so the tensors of the previous iteration become inactive and their memory is reused.
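The queue picture can be made concrete with a small pure-Python simulation. This is a toy model written for this article, not PyTorch's real allocator: the class `ToyCachingAllocator`, its methods, and the 5200 MiB usable capacity are all assumptions chosen for illustration (the real allocator also splits and rounds blocks, which is ignored here). Sizes are in MiB, with 448 as the baseline and 916 per tensor, as measured above.

```python
class ToyCachingAllocator:
    """Toy model of the 'data queue'; hypothetical, not PyTorch's implementation."""

    def __init__(self, capacity_mib, baseline_mib=448):
        self.capacity = capacity_mib   # assumed usable video memory
        self.baseline = baseline_mib   # fixed cost of using the GPU at all
        self.active = []               # blocks still referenced by a variable
        self.inactive = []             # blocks kept in the queue for reuse

    @property
    def reserved(self):
        """Total video memory held, roughly what nvidia-smi would show."""
        return self.baseline + sum(self.active) + sum(self.inactive)

    def alloc(self, size):
        # Reuse the smallest inactive block that is large enough.
        fits = sorted(b for b in self.inactive if b >= size)
        if fits:
            self.inactive.remove(fits[0])
            self.active.append(fits[0])
            return fits[0]
        # Otherwise the queue must grow; clean up inactive blocks if needed.
        if self.reserved + size > self.capacity:
            self.inactive.clear()
            if self.reserved + size > self.capacity:
                raise MemoryError("out of memory")
        self.active.append(size)
        return size

    def free(self, block):
        # The last reference is gone: active becomes inactive, still in the queue.
        self.active.remove(block)
        self.inactive.append(block)


def run_code_a(gpu, n=6, size=916):
    """Code A: six different names, so every block stays active."""
    held = [gpu.alloc(size) for _ in range(n)]
    return gpu.reserved


def run_code_b(gpu, n=6, size=916):
    """Code B: one name rebound each time, so the old block turns inactive."""
    x = gpu.alloc(size)
    for _ in range(n - 1):
        new = gpu.alloc(size)  # the right-hand side is evaluated first
        gpu.free(x)            # then the old tensor loses its reference
        x = new
    return gpu.reserved


print("Code B:", run_code_b(ToyCachingAllocator(capacity_mib=5200)))  # Code B: 2280
try:
    run_code_a(ToyCachingAllocator(capacity_mib=5200))
except MemoryError as err:
    print("Code A:", err)  # Code A: out of memory
```

In this model Code B settles at two blocks plus the baseline, because the old tensor is still active while its replacement is allocated, whereas Code A fails on the sixth allocation.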
2.3 Release GPU memory

Run the following command to manually clear the inactive memory in the GPU data queue:

```python
torch.cuda.empty_cache()
```

Note that memory is only released once nothing references it any more, and the command may need to be run more than once before the space is actually returned. For example, after running

```python
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = torch.randn((200, 300, 200, 20), device=device)
x = 1
```

x refers to an int, so nothing in the GPU data queue is referenced by a variable any more and the whole queue is inactive memory. Nevertheless, nvidia-smi still reports 2278M in use. After running torch.cuda.empty_cache(), usage returns to 448M. This is the base occupation: although there is no longer any data on the GPU, the CUDA context created when the GPU was first used is still alive, so this part cannot be released.

3 Problem summary

Summary of CUDA GPU video memory management:

- GPU memory usage is positively related to the size of the stored data. The larger the data, the more memory it takes up.
- As long as the GPU is used, it occupies at least $x$ M of video memory, and this part of the video memory cannot be released.
- When a piece of memory is no longer referenced by any variable, it changes from active memory to inactive memory, but it still remains in the data queue.
- When the data queue reaches a certain threshold, the inactive memory is cleaned up automatically.
- Run `torch.cuda.empty_cache()` to manually clean up the inactive memory.

Based on the above, we can derive the corresponding solutions to the out-of-memory problem.

Reduce batch_size

Essentially, this keeps the GPU data queue from requesting more space than the video memory can provide.

Check whether data is being kept on the GPU without being released

For example:

```python
app = []
for _ in range(1000):
    app.append(torch.randn((200, 300, 200, 20), device=device))
```

Here append stores a reference to each newly created tensor in the list (not a copy), so every tensor remains referenced through the list. The active memory keeps growing without ever being released, and the program eventually crashes with an out-of-memory error.

Wrap the test and validation phases of training in with torch.no_grad()

The principle is that no gradients are computed inside the block, so the intermediate results that backpropagation would need are not kept on the GPU and do not enlarge the data queue.
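The list example above can be reproduced without a GPU, since what matters is Python's reference handling rather than the tensor itself. The following is a minimal sketch under that assumption: `FakeTensor` is a hypothetical stand-in class, `lifetime_demo` is an illustrative helper, and the immediate release shown relies on CPython's reference counting.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a GPU tensor; only its lifetime matters here."""


def lifetime_demo():
    t = FakeTensor()
    held = []
    held.append(t)              # stores a reference to t, not a copy
    same_object = held[0] is t

    probe = weakref.ref(t)      # observes the object without keeping it alive
    del t                       # the name is gone, but the list still refers to it
    alive_in_list = probe() is not None

    held.clear()                # drop the last reference
    alive_after_clear = probe() is not None
    return same_object, alive_in_list, alive_after_clear


print(lifetime_demo())  # (True, True, False) on CPython
```

For real tensors the consequence is the same: as long as the list holds them they count as active memory, and only after the entries are removed (for example with `app.clear()` or `del app`) can the memory become inactive and be returned by `torch.cuda.empty_cache()`.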