[Resolved] RuntimeError: CUDA error: device-side assert triggered (CUDA kernel errors might be asynchronously reported)

Problem description

The specific error message is

  • ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.

  • ../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
    Traceback (most recent call last):
    File "/home/visionx/EXT-3/qfy/project/temp/SimCLR/linear_evaluation.py", line 207, in <module>
    loss_epoch, accuracy_epoch = train(
    ^^^^^^
    File "/home/visionx/EXT-3/qfy/project/temp/SimCLR/linear_evaluation.py", line 75, in train
    acc = (predicted == y).sum().item() / y.size(0)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Record of pitfalls:

Pit 1:

I tried this one, but it didn't work!

Pit 2:

This one is also very general, and it didn't solve my problem either, but it did spark some thinking!

So in the final analysis, you have to analyze the problem yourself!

Cause analysis and solutions

1.1. Cause analysis

This error is not one I run into often. Why was it reported? I searched around and got the following explanation:

“RuntimeError: CUDA error: device-side assert triggered” is a runtime error caused by an assertion error triggering in the CUDA kernel. Assertions in the CUDA kernel are often used to detect errors or inconsistencies in the code, and when these assertions fail, the CUDA runtime throws such errors. This error can occur for a variety of reasons, including the following possibilities:

  1. Program error: The most common cause is an error in your CUDA program. This may be due to incorrect device memory access, out-of-bounds access, incorrect synchronization of threads, etc. You need to check your CUDA code and make sure it is correct.

  2. GPU Hardware Issue: This error can also be caused by GPU hardware issues, such as GPU memory failure or other hardware failures. If other CUDA applications are also having problems on the same GPU, a hardware issue may be the cause.

  3. Unstable GPU driver: Some GPU driver versions may be unstable, causing CUDA errors. Try updating or rolling back the GPU driver to see if that resolves the issue.

  4. Memory exhaustion: If your CUDA program uses more GPU memory than is available, it may also cause CUDA errors. Make sure your program stays within the GPU's memory limits.

To better determine the source of the problem, the error message itself mentions some options. You can try the following for further debugging (a short usage sketch follows this list):

  • Set CUDA_LAUNCH_BLOCKING=1: this makes CUDA kernel launches synchronous, so the error is reported at the call that actually caused it and the stack trace points at the right place.
  • Compile with TORCH_USE_CUDA_DSA enabled: this enables device-side assertions, which can provide more information about the problem but may reduce performance.
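
As a concrete example, here is a minimal sketch of my own (using the linear_evaluation.py script name from the traceback above) of how to apply the first switch. CUDA_LAUNCH_BLOCKING must be set before the first CUDA call, so the safest places are the shell command line or the very top of the script, before importing torch. TORCH_USE_CUDA_DSA, by contrast, is a compile-time option, so it only helps if your PyTorch build was compiled with it.

 # Option A: set the variable in the shell when launching the script
 #   CUDA_LAUNCH_BLOCKING=1 python linear_evaluation.py

 # Option B: set it in Python before torch is imported or any GPU work is done
 import os
 os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # kernel launches become synchronous

 import torch  # import only after the environment variable is set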

Most importantly, double check your CUDA code to make sure it is correct and doesn’t throw errors. If the problem persists, you may need more detailed debugging and diagnostics to determine the root cause of the problem. If hardware failure is the likely cause, consider checking the health of your GPU.

1.2. Further thoughts

Unfortunately, this didn't tell me what to do, because the four causes listed above most likely don't apply to my setup. It had to be a software problem in my own code, but why would it surface as an error inside a CUDA kernel? I thought about what was different from my previous, working run:

For my program, the only difference was switching from CIFAR-10 to CIFAR-100, that is, a different number of classes.

In other words, the two runs differed only in the number of classes, but I had not updated the class count in my code, and that is why the error occurred. With this idea in mind I searched again and found the following:

(1) At first, I looked for solutions on the Internet, and most of the netizens’ solutions were similar to this:

Some people say the reason for this problem is that, when doing a classification task, the labels in the training data go beyond the number of classes. For example, if you configure 8 classes in total but a label of 9 appears in the training data, this error is reported. But there is a trap here: according to these posts, if the labels in the training data contain 0, the same error is also reported. This is strange, because we usually start counting from 0, yet they claim that PyTorch reports an error for class labels below 0, so their advice is that if the class labels start from 0, you should add 1 to all of them.

They explain it like this: PyTorch scans every folder under train_path (each class's images sit in the folder named after that class) and maps each class to a numeric label. For example, with 4 classes the labels should be [0, 1, 2, 3]. For binary classification the labels were indeed mapped to [0, 1], but in the 4-class case they were mapped to [1, 2, 3, 4], so the error was reported.
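
To make the "label out of range" explanation concrete, here is a minimal sketch of my own (not from the quoted post): with n_classes outputs, valid targets are 0 to n_classes - 1. A larger target trips exactly the nll_loss assertion shown above when the loss runs on the GPU, while the same call on the CPU usually produces a much clearer error that names the offending target, which makes the CPU a handy place to debug.

 import torch
 import torch.nn.functional as F

 n_classes = 10
 logits = torch.randn(4, n_classes)       # fake model output for 4 samples
 targets = torch.tensor([1, 3, 9, 10])    # 10 is invalid: valid labels are 0..9

 # On the CPU this raises a readable error naming the bad target:
 #   F.cross_entropy(logits, targets)
 # On the GPU the same call trips the device-side assert from Loss.cu, and the
 # error may only surface at a later call, e.g. the .item() in my traceback:
 #   F.cross_entropy(logits.cuda(), targets.cuda())

 # A cheap sanity check before training; with the deliberately bad label above it fires:
 assert targets.min() >= 0 and targets.max() < n_classes, "labels out of range"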

(2) Following this idea turned out to be useless for me; I still got the same error. Later, I went through the code carefully and found that the problem was not a mismatch between labels and classes, but the last layer of the network: its output size must be the number of classes you actually want to predict.

 self.outlayer = nn.Linear(256 * 1 * 1, 3)  # final fully connected layer (reference code: 3 classes)

# The reference code has 3 classes, but my task has 5. Correcting the output size here solves the problem:

 self.outlayer = nn.Linear(256 * 1 * 1, 5)  # final fully connected layer (5 output classes)
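
One way to avoid this kind of mismatch altogether (my own suggestion, not part of the quoted post) is to derive the size of the last layer from the dataset instead of hardcoding it. torchvision datasets such as CIFAR-10/100 expose their class names via the classes attribute, so the sketch below assumes a torchvision dataset:

 import torch.nn as nn
 from torchvision import datasets, transforms

 train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
 n_classes = len(train_set.classes)            # 100 for CIFAR-100, 10 for CIFAR-10

 outlayer = nn.Linear(256 * 1 * 1, n_classes)  # output size always matches the data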

(3) In fact, it is a small problem, but I spent a long time on it, so I am recording it here.
—————-
Copyright statement: the quoted article above is an original article by CSDN blogger "Penta_Kill_5".

1.3. Hands-on solution

This content inspired me, so I made the following change in my own code:

n_classes = 10 # CIFAR-10 / STL-10

Change the line above to the following:

n_classes = 100 # CIFAR-100

Run it again and it works.
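
To make this fix harder to forget next time, a small lookup table can replace the hand-edited constant. This is just a sketch under the assumption that the script selects the dataset by a name string (for example from a command-line argument):

 # Hypothetical helper: look the class count up instead of editing n_classes by hand.
 NUM_CLASSES = {"CIFAR10": 10, "CIFAR100": 100, "STL10": 10}

 dataset_name = "CIFAR100"                 # e.g. taken from a command-line argument
 n_classes = NUM_CLASSES[dataset_name]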

Wrapping up

As I write this, I suddenly feel quite emotional. I have been recording my bug-fixing journey for almost four months now. Maybe because I am still so new, not many people read these posts. Some people also say that technical blogs don't belong on CSDN, and almost nobody around me is on it; compared with the feedback mechanisms of other platforms, there is no way to get that kind of feedback here. Even so, I have kept writing a lot and have produced several posts I consider high quality, which is rare.

But my own growth cannot be ignored. I am getting better and better at handling bugs; at the very least, I now know how to find a solution and what the debugging process should look like.

The remaining weakness is that I am still unfamiliar with problems that involve the kernel or imported packages. I have only solved a problem once or twice by changing code inside a package, and those times were very refreshing: it turns out everything can be changed!

That’s probably all, see you next time!