The CUDA thread hierarchy, and how a single thread uses its threadIdx together with a stride to skip ahead and process multiple data elements

The concept of the thread hierarchy:

Simply put, a grid has multiple blocks, and a block has multiple threads.

How big is the grid? gridDim indicates how many blocks it contains, broken down into gridDim.x, gridDim.y, and gridDim.z.

How big is the block? blockDim indicates how many threads it contains, broken down into blockDim.x, blockDim.y, and blockDim.z.

How is a thread's position within its block expressed? By threadIdx.x, threadIdx.y, and threadIdx.z.

How is a block's position within the grid expressed? By blockIdx.x, blockIdx.y, and blockIdx.z.

By the way, what do x and y mean in hello_from_gpu<<<x, y>>>() from the previous post? They represent gridDim and blockDim, respectively.

For a call such as hello_from_gpu<<<1, 4>>>():

gridDim is 1, meaning the grid has 1 block; blockDim is 4, meaning the block has 4 threads.

So for the kernel function above, this is equivalent to having 4 threads each perform c[n] = a[n] + b[n], with n = threadIdx.x.
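A minimal, self-contained sketch of what such a program might look like (the kernel name add and the sample values are illustrative, not from the original post):

#include <stdio.h>

// With <<<1, 4>>>, threadIdx.x runs from 0 to 3, so each of the
// four threads handles exactly one element.
__global__ void add(const double *a, const double *b, double *c)
{
    const int n = threadIdx.x;
    c[n] = a[n] + b[n];
}

int main(void)
{
    double h_a[4] = {1, 2, 3, 4}, h_b[4] = {10, 20, 30, 40}, h_c[4];

    double *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, sizeof(h_a));
    cudaMalloc((void **)&d_b, sizeof(h_b));
    cudaMalloc((void **)&d_c, sizeof(h_c));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    add<<<1, 4>>>(d_a, d_b, d_c);   // 1 block, 4 threads

    cudaMemcpy(h_c, d_c, sizeof(h_c), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; ++i)
        printf("%g\n", h_c[i]);     // expect 11 22 33 44

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}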

When the kernel is launched, every thread executes the same function. This differs from CPU multithreading, where different threads may perform different tasks.

As shown in the figure above, a thread executes on a CUDA core, a block executes on an SM (streaming multiprocessor), and a grid executes on the device.

So how does CUDA execute this? See the figure below:

Without the concept of a block, any synchronization, communication, or cooperation would have to involve every core at once, forcing them to wait on each other; the more the hardware scales up, the more waiting there is, and performance suffers.

With the concept of a block, however, scalability is easy to achieve: synchronization and cooperation are scoped to a block (or a warp), so the hardware can grow without introducing global waiting.

How does a thread find the data it should process? This is where the concept of the thread index comes in.

Figure above: assume 8 threads per block.

The specific formula is as follows:

Specific index position index = blockDim.x * blockIdx.x + threadIdx.x
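For example, with 8 threads per block as in the figure (blockDim.x = 8), the thread with blockIdx.x = 2 and threadIdx.x = 3 handles element 8 * 2 + 3 = 19.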

So how should a CUDA program be written?

Take converting a CPU implementation into a GPU implementation as an example:

The CPU implementation goes roughly as follows:

(1) The main program main:

First allocate the source arrays a and b and the destination array c, fill a and b with random numbers, and then call the CPU function for one-dimensional vector addition.

(2) The CPU function for one-dimensional vector addition:

Traverse the a and b arrays, add a[i] and b[i] element by element, and write the result to c[i].

Note that on the CPU this traversal has to be written as an explicit for loop.
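A minimal sketch of this CPU version, assuming the function name cpu_add and simple random initialization (both are illustrative, not from the original):

#include <stdlib.h>

// CPU version: an explicit loop walks over every element.
void cpu_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; ++i)
    {
        c[i] = a[i] + b[i];
    }
}

int main(void)
{
    const int N = 1000;
    double *a = (double*) malloc(sizeof(double) * N);
    double *b = (double*) malloc(sizeof(double) * N);
    double *c = (double*) malloc(sizeof(double) * N);

    // Fill the source arrays with random numbers.
    for (int i = 0; i < N; ++i)
    {
        a[i] = rand() / (double) RAND_MAX;
        b[i] = rand() / (double) RAND_MAX;
    }

    cpu_add(a, b, c, N);

    free(a);
    free(b);
    free(c);
    return 0;
}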

So, how does the GPU do it?

(1) The main program main:

Because a GPU program works with both host and device memory, first allocate host memory h_a and h_b to hold the contents of the one-dimensional vectors a and b (random numbers can be generated here as well), and allocate host memory h_c to hold the result c.

Then allocate device memory: the two source buffers d_a and d_b and the destination buffer d_c (all with cudaMalloc), and copy the contents of h_a and h_b into d_a and d_b (this requires cudaMemcpy with cudaMemcpyHostToDevice);

Then launch the kernel function to carry out the parallel computation on the GPU, writing the result into d_c;

Finally, copy the device memory d_c back to h_c (cudaMemcpy with cudaMemcpyDeviceToHost), and release all host memory (with free) and device memory (with cudaFree).

(2) Kernel function

Here is the key point: the kernel function simply drops the outermost loop and replaces i with index, computed with the indexing formula given earlier.
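A minimal sketch of this transformation (the kernel name gpu_add is illustrative; the full program appears in Example 2 below):

// CPU version: for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
// GPU version: the loop disappears; each thread computes its own index.
__global__ void gpu_add(const double *a, const double *b, double *c, int n)
{
    const int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n)              // guard: some threads may fall past the end
    {
        c[index] = a[index] + b[index];
    }
}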

How should grid_size and block_size be set?

For the one-dimensional case:

block_size = 128;

grid_size = (N + block_size - 1) / block_size;

(No single value is optimal in every case.)
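For example, with N = 1000 and block_size = 128, grid_size = (1000 + 127) / 128 = 8, giving 8 * 128 = 1024 threads, enough to cover all 1000 elements; the surplus threads are screened out by the bounds check inside the kernel.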

How many threads can each block have?

At most 1024 in total, for example (1024, 1, 1) or (512, 2, 1).

The grid size limit is far larger (up to 2^31 - 1 blocks in the x dimension on current GPUs), so in practice it is rarely a constraint.

At the hardware level, threads are allocated in units of warps (32 threads each). If blockDim is 160, exactly 5 warps are allocated; if blockDim is 161, 6 warps are required.

What if the data set is too large and there are not enough threads?

In that case, each thread has to process multiple data elements.

For example, in the figure above, thread 0 has to process the four elements 0, 8, 16, and 24: after finishing one grid's worth of elements, the kernel steps forward by a whole grid and runs again. The corresponding code is sketched below.

This introduces the concept of a stride, whose size is blockDim.x * gridDim.x (the total number of threads in the grid). The kernel must handle every element whose position is index + stride * k for k = 0, 1, 2, ..., as long as that position stays within the data.
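A minimal sketch of such a grid-stride kernel, reusing the element-wise addition from before (the kernel name add_stride is illustrative):

__global__ void add_stride(const double *x, const double *y, double *z, int count)
{
    const int index  = blockDim.x * blockIdx.x + threadIdx.x;
    const int stride = blockDim.x * gridDim.x;   // total threads in the grid

    // Each thread starts at its own index and hops forward by one full
    // grid per iteration, covering index, index + stride, index + 2*stride, ...
    for (int n = index; n < count; n += stride)
    {
        z[n] = x[n] + y[n];
    }
}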

Example 1: Get a feel for the thread index

Index_of_thread.cu

#include <stdio.h>
 
__global__ void hello_from_gpu()
{
   //Just print blockIdx.x and threadIdx.x based on the original code
    const int bid = blockIdx.x;
    const int tid = threadIdx.x;
    printf("Hello World from block %d and thread %d!\
", bid, tid);
}
 
int main(void)
{
    hello_from_gpu<<<5, 5>>>();
    
    //Remember to synchronize, otherwise the program may exit before the kernel's printf output appears.
    cudaDeviceSynchronize();
    return 0;
}

Makefile:

TEST_SOURCE = Index_of_thread.cu
 
TARGETBIN := ./Index_of_thread
 
CC = /usr/local/cuda/bin/nvcc
 
$(TARGETBIN):$(TEST_SOURCE)
	$(CC) $(TEST_SOURCE) -o $(TARGETBIN)
 
.PHONY:clean
clean:
	-rm -rf $(TARGETBIN)

Compile and execute:
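If the build succeeds, running ./Index_of_thread should print 25 lines of the form "Hello World from block b and thread t!" (5 blocks x 5 threads); the order in which the blocks appear is not deterministic.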

Example 2: A complete one-dimensional vector addition

vectorAdd.cu

#include <math.h>
#include <stdio.h>
 
void __global__ add(const double *x, const double *y, double *z, int count)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    
    //The bounds check prevents out-of-range access when the grid has more threads than elements.
    if (n < count)
    {
        z[n] = x[n] + y[n];
    }
 
}
void check(const double *z, const int N)
{
    bool error = false;
    for (int n = 0; n < N; ++n)
    {
        //Every element should be 1 + 2 = 3; flag an error if it differs beyond the tolerance.
        if (fabs(z[n] - 3) > (1.0e-10))
        {
            error = true;
        }
    }
    printf("%s\
", error ? "Errors" : "Pass");
}
 
 
int main(void)
{
    const int N = 1000;
    const int M = sizeof(double) * N;
    
    //Allocate host memory
    double *h_x = (double*) malloc(M);
    double *h_y = (double*) malloc(M);
    double *h_z = (double*) malloc(M);
 
    //Initialize the value of the one-dimensional vector
    for (int n = 0; n < N; ++n)
    {
        h_x[n] = 1;
        h_y[n] = 2;
    }
 
    double *d_x, *d_y, *d_z;
 
    //Allocate device memory
    cudaMalloc((void **)&d_x, M);
    cudaMalloc((void **)&d_y, M);
    cudaMalloc((void **)&d_z, M);
    
    //host->device
    cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);
 
    //Standard formula: round up so that grid_size * block_size covers all N elements.
    const int block_size = 128;
    const int grid_size = (N + block_size - 1) / block_size;
    
    // Kernel function calculation
    add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
 
    //device->host
    cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost);
    
    //test result
    check(h_z, N);
 
    //Release host memory
    free(h_x);
    free(h_y);
    free(h_z);
    
    //Release device memory
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    return 0;
}

Makefile:

TEST_SOURCE = vectorAdd.cu
 
TARGETBIN := ./vectorAdd
 
CC = /usr/local/cuda/bin/nvcc
 
$(TARGETBIN):$(TEST_SOURCE)
	$(CC) $(TEST_SOURCE) -o $(TARGETBIN)
 
.PHONY:clean
clean:
	-rm -rf $(TARGETBIN)

Execute after compilation:
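If the kernel ran correctly, the check should print Pass, since every element of z equals 1 + 2 = 3.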