CUDA Programming (4) Program Error Detection

Some errors can be caught by the compiler during compilation; these are called compile-time errors.

Other errors are not found at compile time and only appear when the program runs; these are called run-time errors.

Run-time errors are much harder to troubleshoot, so knowing how to detect them is very important.

1. A macro function to detect CUDA runtime errors

The functions covered so far, cudaMalloc(), cudaFree(), and cudaMemcpy(), all start with the cuda prefix and all return a value of type cudaError_t, which encodes an error message. Only when the return value is cudaSuccess has the API function been called successfully.
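Before introducing the macro, the return value can of course be checked by hand. Here is a minimal sketch (my own illustration, not from the original text): the call compiles fine but fails at run time, and the failure is only visible through the returned cudaError_t. Checking every call this way quickly becomes repetitive, which is what the macro below solves.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    double* d_x = nullptr;
    // Deliberately request an absurd amount of memory: the call compiles,
    // but fails at run time, and the failure only shows up in the return value.
    cudaError_t err = cudaMalloc((void**)&d_x, (size_t)1 << 62);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_x);
    return 0;
}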

Next, define a CUDA macro header file in Visual Studio. The file suffix is .cuh and the content is as follows:

#pragma once //Preprocessing directive that ensures this file is included only once per compilation unit
//It serves the same purpose as the classic ifndef/define/endif include guards, commented out below
//#ifndef ERROR_CUH
//#define ERROR_CUH


//Required header files
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h> //for exit()


//Define the macro function CHECK; its argument call is a CUDA runtime API call.
//When a macro definition does not fit on one line, a \ must be added at the end of each line as a line continuation. This is very important: if it is missing, a compilation error will be reported.


#define CHECK(call) \
do { \
    const cudaError_t error_code = call; \
    if (error_code != cudaSuccess) { \
        printf("CUDA Error:\n"); \
        printf("File: %s\n", __FILE__); \
        printf("Line: %d\n", __LINE__); \
        printf("Error code: %d\n", error_code); \
        printf("Error text: %s\n", cudaGetErrorString(error_code)); \
        exit(1); \
    } \
} while (0)

//#endif

//The CHECK macro wraps its body in a do { ... } while (0) block
//It defines a cudaError_t variable error_code and initializes it with the return value of call
//If the return value is not cudaSuccess, an error message is printed and the program exits
//cudaGetErrorString converts the error code into a text description of the error
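A side note on the do { ... } while (0) wrapper (my own explanation, not from the original text): it makes the whole macro expand to a single statement, so CHECK(...) can be followed by a semicolon and still nest safely inside an if/else. A small sketch, assuming the error.cuh header above:

#include "error.cuh"

int main(void) {
    double* d_x = nullptr;
    const int use_device = 1;   // hypothetical flag, only for illustration
    // Because CHECK expands to one do { ... } while (0) statement, the trailing
    // semicolon is legal here and the if/else still pairs up correctly.
    if (use_device)
        CHECK(cudaMalloc((void**)&d_x, sizeof(double) * 1000));
    else
        printf("running on the host only\n");
    if (d_x) CHECK(cudaFree(d_x));
    return 0;
}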


Some precautions are marked in the comments. When I was learning this, I kept getting errors; the problem was that the line-continuation "\" was missing after the macro name line. A small problem, but really hard to spot.

With the error.cuh file in place, the next step is to use the macro function in the sum-of-two-arrays code from the previous section to check for errors. A deliberate error has been introduced, marked with a comment in the code.

This is the main code, the .cu file.

#include "error.cuh"

#include <stdio.h>

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include<math.h>

#include<stdlib.h>


const double EPSILON = 1.0e-15;
const double a = 1.23;
const double b = 2.34;
const double c = 3.75;
void __global__ add(const double* x, const double* y, double* z, int N);
void check(const double* z, const int N);


int main(void) {
    //Host variables, arrays and initialization
    const int N = 100000;
    const int M = sizeof(double) * N;
    double* h_x = (double*)malloc(M);
    double* h_y = (double*)malloc(M);
    double* h_z = (double*)malloc(M);

    for (int n = 0; n < N; ++n) {
        h_x[n] = a;
        h_y[n] = b;
    }

    double* d_x, * d_y, * d_z;
    //Use the CHECK macro to wrap the CUDA runtime API calls
    CHECK(cudaMalloc((void**)&d_x, M));
    CHECK(cudaMalloc((void**)&d_y, M));
    CHECK(cudaMalloc((void**)&d_z, M));

    //CHECK(cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice)); //correct direction
    CHECK(cudaMemcpy(d_x, h_x, M, cudaMemcpyDeviceToHost));   //deliberately wrong direction
    CHECK(cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice));

    //Kernel execution configuration parameters
    const int block_size = 128;                               //one-dimensional thread blocks of 128 threads
    const int grid_size = (N + block_size - 1) / block_size;  //enough blocks to cover all N elements
    //Call the kernel function
    add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);

    CHECK(cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost));
    check(h_z, N);

    free(h_x);
    free(h_y);
    free(h_z);

    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    CHECK(cudaFree(d_z));

    return 0;
}




void __global__ add(const double* x, const double* y, double* z, int N) {
    //The correspondence between data and threads in the kernel function
    //Single instruction, multiple threads
    //n is the index of the array element handled by this thread
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) {
        z[n] = x[n] + y[n];
    }
}


void check(const double* z, const int N) {
    bool has_error = false;
    for (int n = 0; n < N; ++n) {
        if (fabs(z[n] - c) > EPSILON) {
            has_error = true;
        }
    }
    printf("%s\n", has_error ? "Has error" : "No errors");
}

Running result:

The CHECK macro reports the error at the offending cudaMemcpy line, printing the file name, line number, error code, and error text. I can only say that it is very useful. If you run the program in a Linux terminal, this kind of error checking is very important: it tells you exactly where the code went wrong and what the problem was.

The error here is that cudaMemcpyHostToDevice was changed to cudaMemcpyDeviceToHost, while the destination address is still on the device and the source address is still on the host, so the stated transfer direction is the opposite of the actual one, and the call fails.

One more note: from here on, this macro function will be used to wrap most CUDA runtime API calls. One exception is the cudaEventQuery function, because it can legitimately return cudaErrorNotReady, which does not mean that the program has gone wrong.
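For example, a small sketch of that exception (my own illustration): cudaEventQuery is not wrapped in CHECK, because cudaErrorNotReady only means the work recorded before the event has not finished yet.

#include "error.cuh"

int main(void) {
    cudaEvent_t event;
    CHECK(cudaEventCreate(&event));
    CHECK(cudaEventRecord(event));

    // Do not wrap cudaEventQuery in CHECK: cudaErrorNotReady is a normal,
    // expected return value and does not mean the program has gone wrong.
    const cudaError_t state = cudaEventQuery(event);
    if (state == cudaSuccess) {
        printf("work recorded before the event has finished\n");
    } else if (state == cudaErrorNotReady) {
        printf("work recorded before the event is still in progress\n");
    } else {
        CHECK(state);   // anything else is a genuine error
    }

    CHECK(cudaEventDestroy(event));
    return 0;
}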

This raises another question: how are errors in a kernel function checked? A kernel function has no return value. One way to capture errors that may occur when launching a kernel is to add the following two statements after the kernel call:

CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());
//The first statement captures the last error that occurred before the second statement,
//The second statement synchronizes the host and the device.
//The host and device need to be synchronized because kernel launches are asynchronous:
//after the host issues the launch, it immediately executes the following statements and does not wait for the kernel to finish.

The above synchronization function is fairly expensive. If it is called inside the inner loops of a program, it is likely to seriously reduce the program's performance.

Therefore, the synchronization function is generally not called inside the inner loops of a program. As long as the kernel launch is followed by a synchronous (blocking) call to some other API function that can return an error value, that call will synchronize the host and device, and errors that occurred during the kernel launch can still be captured, as sketched below.
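A sketch of this pattern (my own illustration, reusing the names from the add example above): cudaGetLastError catches launch-configuration problems immediately, and the blocking cudaMemcpy that follows acts as the synchronization point, so errors that occur while the kernel runs surface through its CHECK without an explicit cudaDeviceSynchronize.

// Fragment from main(), assuming d_x, d_y, d_z, h_z, M, N, grid_size, block_size as above
add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
CHECK(cudaGetLastError());   // catches an invalid execution configuration right away
// No explicit cudaDeviceSynchronize() here: the cudaMemcpy below blocks until the
// kernel has finished, so errors during kernel execution surface through its CHECK.
CHECK(cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost));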

Question: why must the host and device be synchronized before checking kernel errors in CUDA? (answer from GPT below)

In CUDA, to properly check for kernel errors, you usually need to ensure synchronization between the host and the device, i.e. wait for the device to complete its work before checking for errors. This is because of CUDA’s asynchronous execution model, where the host and device can execute in parallel without waiting for device operations to complete.

Here are some key reasons why synchronization is necessary:

  1. Asynchronous execution model: CUDA allows parallel execution between the host and the device, which means that the host can continue to perform other operations before the device executes the kernel function. If you check for device errors immediately on the host, you may be trying to check for errors while the device has not yet completed kernel execution, which will lead to inaccurate results.

  2. Error reporting delay: CUDA runtime errors usually occur asynchronously, which means that the error may not be triggered until after the kernel function has finished executing. If you check for an error immediately after the kernel function, you may not catch the error because it may not have occurred yet. Synchronously waiting for the device to execute ensures that the error has occurred and has been triggered.

  3. Error message validity: During kernel function execution, error messages may not be fully ready. Synchronous waits ensure that error messages are valid at the time of checking.

To synchronize host and device in CUDA, you can use the cudaDeviceSynchronize() function. This function will wait for all tasks on the current device to complete, including the execution of kernel functions, and then return. After syncing, you can safely check for CUDA runtime errors to make sure you caught any errors that may have occurred.

In summary, because CUDA kernels execute asynchronously, the host does not wait for the device to finish before moving on to the next statement. If you want to check for errors at that point, you have to make the host wait for the device: the device runs, then the host checks. It is like cooking: you can only tell when and where something went wrong by watching whether the food in the pot is burning; if you never look at the pot and just let it cook by itself, you will not know what went wrong until the dish is ruined.
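To make the point concrete, here is a minimal sketch (my own example, not from the original article): the kernel below deliberately accesses memory out of bounds. The launch itself is valid, so CHECK(cudaGetLastError()) right after it usually still passes, and it is the CHECK around cudaDeviceSynchronize() that typically reports the illegal memory access.

#include "error.cuh"

__global__ void read_out_of_bounds(const double* x, double* y, const int N) {
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    y[n] = x[n];   // deliberate bug: no bounds check, many threads go far past N
}

int main(void) {
    const int N = 1000;
    double *d_x, *d_y;
    CHECK(cudaMalloc((void**)&d_x, sizeof(double) * N));
    CHECK(cudaMalloc((void**)&d_y, sizeof(double) * N));

    // 1024 * 128 threads, far more than N: the overrun is large enough that it
    // will typically hit unmapped memory and trigger an illegal memory access.
    read_out_of_bounds<<<1024, 128>>>(d_x, d_y, N);
    CHECK(cudaGetLastError());        // the launch configuration is valid, so this usually passes
    CHECK(cudaDeviceSynchronize());   // the execution error is caught here

    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    return 0;
}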

Next, change the kernel execution configuration parameters: set the thread block size block_size to 1280. The maximum thread block size is generally 1024, so this configuration is invalid.

Running it then gives:

invalid configuration argument, i.e. an illegal execution configuration.
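A sketch of that experiment (my own fragment, reusing the add example above): with block_size = 1280 the launch itself is rejected, so the error is reported by CHECK(cudaGetLastError()) immediately after the launch.

const int block_size = 1280;   // exceeds the usual limit of 1024 threads per block
const int grid_size = (N + block_size - 1) / block_size;
add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
CHECK(cudaGetLastError());          // reports: invalid configuration argument
CHECK(cudaDeviceSynchronize());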

2. Use the CUDA-MEMCHECK tool set to check for memory errors.

This tool set is used from the Linux command line.
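For example (general usage, not from the original article), running `cuda-memcheck ./a.out` in a terminal executes the compiled program under the memcheck tool and reports out-of-bounds and misaligned memory accesses; here `a.out` stands for whatever the executable happens to be named.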

I will give it a separate section later, when I get to learning Linux.
