CUDA Programming (9): Reasonable use of atomic functions

Functions that implement atomic operations are referred to as atomic functions.

In CUDA, an atomic operation performed by one thread completes a "read-modify-write" sequence on a piece of data (in global or shared memory) without being affected by the operations of any other thread. The sequence is indivisible.

1. Completing the reduction entirely on the GPU

In the earlier array-reduction program, the kernel did not finish the whole calculation; it merely turned a long array d_x into a shorter array d_y, each element of which is the sum of several elements of the former. After the kernel call, the shorter array was copied to the host, where the remaining summation was done. All of these operations together take about 5.8 ms (single precision). Profiling with nvprof shows that the kernel itself takes about 3.2 ms, roughly half of the total time. If the final result could be computed entirely on the GPU, the overall time should drop significantly. There are two ways to obtain the final result on the GPU: one is to use a second kernel to further reduce the shorter array to the final result (a single number); the other is to use an atomic function at the end of the original kernel so that the final result is obtained directly.
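For reference, the first approach can be sketched as follows. This is a minimal sketch of our own (the kernel name reduce_final is not from the book), assuming the short array d_y produced by the previous kernel has length M no larger than blockDim.x, that blockDim.x is a power of two, and that real is the floating-point typedef used in the listing below. The rest of this section develops the second approach.

void __global__ reduce_final(real* d_y, const int M)
{
    const int tid = threadIdx.x;
    extern __shared__ real s_y[];
    s_y[tid] = (tid < M) ? d_y[tid] : 0.0;
    __syncthreads();

    // the same halving reduction as before, now over the partial sums
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1)
    {
        if (tid < offset)
        {
            s_y[tid] += s_y[tid + offset];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        d_y[0] = s_y[0]; // final result
    }
}
// possible launch: reduce_final<<<1, 128, sizeof(real) * 128>>>(d_y, grid_size);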

1. Use atomic functions for reduction

if (tid == 0)
{
    d_y[bid] = s_y[0];
}

This if block copies the reduction result of each thread block from shared memory s_y[0] to global memory d_y[bid]. To instead accumulate the partial sums s_y[0] of all thread blocks into a single global memory address, we might try rewriting the code as follows:

if (tid == 0)
{
    d_y[0] += s_y[0];
}

Unfortunately, this statement does not serve our purpose. It is executed by thread 0 of every thread block, but the order in which those threads execute it is undefined. In each thread, the statement actually decomposes into two steps: first read the value of d_y[0] and add s_y[0] to it, then write the result back to d_y[0]. Regardless of the order of the threads, a correct result is obtained only if each thread's read-modify-write is not interfered with by the others. If one thread reads d_y[0] before another thread has written its result back, the two threads read the same value of d_y[0], which inevitably gives a wrong result. Consider thread 0 in the two thread blocks with bid 0 and bid 1. Both will perform the following calculation:

d_y[0] += s_y[0]

Suppose the thread with bid 0 reads the value of d_y[0] first and computes d_y[0] + s_y[0], but before it has time to write the result back, the thread with bid 1 also reads the value of d_y[0]. Then no matter which thread writes its result first, the final value of d_y[0] is wrong. To obtain the sum of s_y[0] over all thread blocks, an atomic function must be used, as follows:

if (tid == 0)
{
    atomicAdd(&d_y[0], s_y[0]);
}

The first parameter of the atomic function atomicAdd(address, val) is the address of the variable to be accumulated, and the second parameter is the value val to add. The function reads the old value old from address, computes old + val, and stores the result back to address. These steps form one atomic transaction and cannot be interfered with by atomic operations in other threads. Atomic functions do not guarantee any particular execution order among threads, but they do guarantee that each thread's operation completes as an indivisible whole, which is what makes the result correct. Note that the if statement ensures that each thread block's s_y[0] is accumulated exactly once; removing it would let all 128 threads of a block perform the addition, magnifying the final output by a factor of 128. Also note that the first parameter of atomicAdd is a pointer to the variable being accumulated, so &d_y[0] can equivalently be written as d_y:

if (tid == 0)
{
    atomicAdd(d_y, s_y[0]);
}

A final note: the value of d_y[0] must be initialized to 0 before calling this version of the kernel. In the program reduce.cu of this chapter, we use the "wrapper function" shown in Listing 9.1 to call the kernel and do the corresponding pre- and post-processing. Readers should be able to follow the code; two points are worth noting:

- The host array h_y here uses stack memory instead of heap memory; in other words, the host buffer passed to cudaMemcpy can be stack memory. This is fine when transferring small amounts of data, but it is not safe for large transfers, because the stack size is very limited.

- The "wrapper function" has the same name as the kernel function (both are called reduce). This uses the function-overloading feature of C++: two functions may share a name as long as their parameter lists are not identical.

#include "error.cuh"
#include <stdio.h>
#include "cuda_runtime.h"
#include <cuda.h>
#include "device_launch_parameters.h"
#include <math.h>

typedef float real;

const int NUM_REPEATS = 100;    // number of timed repetitions
const int N = 1000;             // number of data elements
const int M = sizeof(real) * N; // data size in bytes
const int BLOCK_SIZE = 128;     // block size

void timing(const real* d_x);

int main(void)
{
    // host variable initialization
    real* h_x = (real*)malloc(M);
    for (int n = 0; n < N; ++n) {
        h_x[n] = 1.23;
    }

    // device variable initialization
    real* d_x;
    CHECK(cudaMalloc(&d_x, M));
    CHECK(cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice));

    // run the kernel
    printf("\nusing atomicAdd:\n");
    timing(d_x);

    // free memory
    free(h_x);
    CHECK(cudaFree(d_x));
    return 0;
}

void __global__ reduce(const real* d_x, real* d_y, const int N)
{
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int n = bid * blockDim.x + tid;

    // dynamic shared memory
    extern __shared__ real s_y[];
    s_y[tid] = (n < N) ? d_x[n] : 0.0;
    __syncthreads();

    // halving reduction within the block
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1) {
        if (tid < offset) {
            s_y[tid] += s_y[tid + offset];
        }
        __syncthreads();
    }

    // use an atomic operation to accumulate the per-block results
    if (tid == 0) {
        atomicAdd(d_y, s_y[0]);
    }
}

real reduce(const real* d_x)
{
    const int grid_size = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    const int smem = sizeof(real) * BLOCK_SIZE;

    real h_y[1] = { 0 }; // initialization: d_y must start at zero
    real* d_y;
    CHECK(cudaMalloc(&d_y, sizeof(real)));
    CHECK(cudaMemcpy(d_y, h_y, sizeof(real), cudaMemcpyHostToDevice));

    reduce<<<grid_size, BLOCK_SIZE, smem>>>(d_x, d_y, N);

    CHECK(cudaMemcpy(h_y, d_y, sizeof(real), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_y));

    return h_y[0];
}

void timing(const real* d_x)
{
    real sum = 0;

    for (int repeat = 0; repeat < NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);

        sum = reduce(d_x);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    printf("sum = %f.\n", sum);
}

It is easy to run into problems on Windows: earlier the __syncthreads() function was reported as undeclared, and now the atomicAdd() function is reported as undeclared too.

Although it runs, there is a problem: it only works with float; switching to double produces an error. If anyone knows the reason, please share.

Moreover, the float results also differ noticeably from the exact value.

(The following is an explanation from GPT.)

In many programming languages and operating systems, atomic operations are usually designed to handle integer data, such as ints or longs. The purpose of atomic operations is to ensure that multiple threads or processes can access and modify shared data simultaneously without race conditions, thereby preventing data corruption or inconsistency.

For floating-point types such as double, atomic operations are generally not recommended, because floating-point values span multiple bytes while atomic operations are usually designed for fixed-size data units (typically bytes, words, or machine words). Additionally, floating-point arithmetic may not be atomic, since it often requires multiple CPU instructions, which can lead to race conditions.

If you need to process floating-point data in a multi-threaded environment, it is usually better to use locks or other synchronization mechanisms to ensure safe access between threads. This avoids race conditions and ensures data integrity. Note, however, that locks and synchronization mechanisms introduce overhead, so performance and data consistency need to be weighed against each other.

In short, floating-point data is generally not well suited to atomic operations, and other synchronization methods should be used to ensure data consistency in a multi-threaded environment.
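A more specific cause in the CUDA case (this is our reading, consistent with the architecture requirements discussed in the next section): atomicAdd(double*, double) only exists on GPUs of compute capability 6.0 or higher, and nvcc must be told to target such an architecture, otherwise it reports that no overload of atomicAdd matches the argument list. On a Pascal-or-later GPU, compiling along these lines should make the double version visible:

nvcc -arch=sm_60 reduce.cu -o reduce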

2. Atomic functions

An atomic function performs an atomic "read-modify-write" operation on the data pointed to by its first parameter: an operation completed in one go, indivisibly. The first parameter can point to global memory or to shared memory. For all participating threads, the read-modify-write operations are performed one thread at a time, but in no definite order. In addition, atomic functions have no synchronization capabilities.

Starting from the Pascal architecture, two new variants were introduced for each of the original atomic functions. For example, alongside atomicAdd, the Pascal architecture introduced atomicAdd_system and atomicAdd_block: the former extends the scope of the atomic operation to the entire heterogeneous system on the same node (the host and all its devices), while the latter narrows the scope to a single thread block.
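For illustration, here is a minimal sketch of the scoped variants (the kernel and variable names are ours; it assumes a GPU of compute capability 6.0 or higher):

void __global__ scoped_add(float* system_sum, float* block_sum)
{
    // Same signature as atomicAdd; only the scope of the guarantee differs.
    atomicAdd_system(system_sum, 1.0f); // atomic with respect to the host and all devices on the node
    atomicAdd_block(block_sum, 1.0f);   // atomic only with respect to threads in the same block
}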

Below, we list the prototypes of all atomic functions and describe what they do. We adopt the convention that, for a given thread, the value of the variable pointed to by address is old just before that thread's atomic operation is applied, and new just after. For every atomic function, the return value is old. Note also that all the atomic functions introduced here are __device__ functions and can only be used in kernel functions.

Addition:          T atomicAdd(T *address, T val);
Function:          new = old + val;

Subtraction:       T atomicSub(T *address, T val);
Function:          new = old - val;

Exchange:          T atomicExch(T *address, T val);
Function:          new = val;

Minimum:           T atomicMin(T *address, T val);
Function:          new = (old < val) ? old : val;

Maximum:           T atomicMax(T *address, T val);
Function:          new = (old > val) ? old : val;

Increment:         T atomicInc(T *address, T val);
Function:          new = (old >= val) ? 0 : (old + 1);

Decrement:         T atomicDec(T *address, T val);
Function:          new = ((old == 0) || (old > val)) ? val : (old - 1);

Compare-and-swap:  T atomicCAS(T *address, T compare, T val);
Function:          new = (old == compare) ? val : old;

Bitwise AND:       T atomicAnd(T *address, T val);
Function:          new = old & val;

Bitwise OR:        T atomicOr(T *address, T val);
Function:          new = old | val;

Bitwise XOR:       T atomicXor(T *address, T val);
Function:          new = old ^ val;

In the prototypes above, T denotes the data type of the variable involved. The data types supported by each atomic function are listed in Table 9.1, from which it can be seen that atomicAdd supports the widest range of types. Its support for the double-precision type double began with the Pascal architecture, as did its support for __half2, a structure holding two half-precision values (in fact that support exists from compute capability 5.2 onward, though it is not reflected in the CUDA runtime API documentation); support for the half-precision type __half begins with the Volta architecture.
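Because every atomic function returns the old value, it can be used to hand each thread a distinct slot in an output array; the neighbor-list example in Section 3 relies on exactly this pattern. A minimal sketch (the kernel and variable names are ours):

void __global__ append_one(int* count, int* out, int val)
{
    // atomicAdd returns the value *before* the addition,
    // so every thread obtains a unique index into out.
    int slot = atomicAdd(count, 1);
    out[slot] = val;
}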

Among all atomic functions, atomicCAS is special: every other atomic function can be implemented with it. For example, before the Pascal architecture appeared, atomicAdd did not support double-precision floating-point numbers, and people used atomicCAS to implement an atomicAdd that did. That implementation comes from the CUDA C++ Programming Guide; see Listing 9.2. We will not go through it in detail, but it is worth emphasizing that it is much slower than the built-in atomicAdd provided by Pascal and later architectures, so it should never be used on Pascal or newer GPUs. Even on Kepler and Maxwell GPUs, try to avoid it: consider using the single-precision version of atomicAdd directly (if acceptable), or change the algorithm to avoid atomic functions altogether.

The idea is to reinterpret the double as a long long for the compare-and-swap, and convert back to double at the end.
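For reference, the implementation in the CUDA C++ Programming Guide is essentially the following (shown here as a sketch; we rename it atomicAddDouble so it cannot clash with the built-in atomicAdd on Pascal and later):

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long* address_as_ull = (unsigned long long*)address;
    unsigned long long old = *address_as_ull;
    unsigned long long assumed;
    do {
        assumed = old;
        // Reinterpret the bits as a double, add val, reinterpret back,
        // and try to swap; retry if another thread modified the value meanwhile.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Note: the integer comparison avoids hanging on NaN (since NaN != NaN).
    } while (assumed != old);
    return __longlong_as_double(old); // return the old value, like other atomic functions
}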

3. Example: building a neighbor list

Having systematically introduced the atomic functions in CUDA, we further illustrate their use with an example: building a neighbor list. Neighbor lists are used in many computer simulations, such as molecular dynamics and Monte Carlo simulations.

Suppose we have the coordinates of a set of particles. Given the coordinate r_i of each particle, how do we determine the number of neighbors of every particle and the index of each of those neighbors? The criterion for two particles being neighbors is that their distance is less than or equal to a given cutoff distance c.

The text file xy.txt containing all the coordinate data, together with the MATLAB plotting script plot_points.m, can be found at https://github.com/brucefan1983/CUDA-Programming/tree/master/src/09-atomic.

Our basic algorithm is: for each given particle, decide whether every other particle is its neighbor by computing the distance between the pair. The amount of computation is proportional to the square of the number of particles N; that is, this is an O(N^2) algorithm.

Two versions of the code are given here: the C++ version and the CUDA version.

The C++ code is as follows, with comments to study on your own.

#include "error.cuh"
#include <cmath>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#ifdef USE_DP
typedef double real;
#else
typedef float real;
#endif

int N;                      // total number of atoms
const int NUM_REPEATS = 20; // number of timed repetitions
const int MN = 10;          // maximum number of neighbors per atom
const real cutoff = 1.9;    // cutoff distance
const real cutoff_square = cutoff * cutoff; // squared cutoff distance for the neighbor test

void read_xy(std::vector<real>& x, std::vector<real>& y);
void timing(int* NN, int* NL, std::vector<real> x, std::vector<real> y);
void print_neighbor(const int* NN, const int* NL);

int main(void)
{
    std::vector<real> x, y;
    read_xy(x, y);
    N = x.size();

    int* NN = (int*)malloc(N * sizeof(int));      // NN[n] is the number of neighbors of particle n
    int* NL = (int*)malloc(N * MN * sizeof(int)); // NL[n * MN + k] is the index of the k-th neighbor of particle n

    timing(NN, NL, x, y);
    print_neighbor(NN, NL);

    free(NN);
    free(NL);
    return 0;
}

void read_xy(std::vector<real>& v_x, std::vector<real>& v_y)
{
    std::ifstream infile("xy.txt");
    std::string line, word;
    if (!infile) {
        std::cout << "Cannot open xy.txt" << std::endl;
        exit(1);
    }
    while (std::getline(infile, line)) {
        std::istringstream words(line);
        if (line.length() == 0) {
            continue;
        }
        for (int i = 0; i < 2; i++) {
            if (words >> word) {
                if (i == 0) {
                    v_x.push_back(std::stod(word));
                } else {
                    v_y.push_back(std::stod(word));
                }
            } else {
                std::cout << "Error for reading xy.txt" << std::endl;
                exit(1);
            }
        }
    }
    infile.close();
}

void find_neighbor(int* NN, int* NL, const real* x, const real* y)
{
    for (int n = 0; n < N; n++) {
        NN[n] = 0; // initialize all neighbor counts to zero; they are accumulated below
    }
    /*
    The loop variables n1 and n2 denote two atoms that may be neighbors of each other.
    If n2 is a neighbor of n1, then n1 must in turn be a neighbor of n2.
    Therefore we only consider the case n2 > n1, saving half of the calculation.
    */
    for (int n1 = 0; n1 < N; ++n1) {
        real x1 = x[n1];
        real y1 = y[n1];
        for (int n2 = n1 + 1; n2 < N; ++n2) {
            real x12 = x[n2] - x1;
            real y12 = y[n2] - y1;
            real distance_square = x12 * x12 + y12 * y12;
            /*
            If the squared distance between the two atoms is less than the squared
            cutoff distance, record n2 as a neighbor of atom n1 and increase the
            neighbor count NN[n1] by 1; then record n1 as a neighbor of atom n2 and
            increase NN[n2] by 1. The two statements below rely on the post-increment
            operator: NN[n1]++ uses the current value of NN[n1] as the index first and
            only then adds 1 to it. The pre-increment ++NN[n1] must not be used here,
            because it would add 1 before the value is used and index the wrong slot.
            */
            if (distance_square < cutoff_square) {
                NL[n1 * MN + NN[n1]++] = n2;
                NL[n2 * MN + NN[n2]++] = n1;
                // NL stores neighbors row by row: elements 0 to MN-1 are the (at most
                // MN = 10) neighbor indices of atom 0, elements MN to 2*MN-1 those of
                // atom 1, and so on. Whenever a neighbor of atom n is found, it is
                // stored at NL[n * MN + NN[n]].
            }
        }
    }
}

void timing(int* NN, int* NL, std::vector<real> x, std::vector<real> y)
{
    for (int repeat = 0; repeat < NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        while (cudaEventQuery(start) != cudaSuccess) {}

        find_neighbor(NN, NL, x.data(), y.data());

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        std::cout << "Time = " << elapsed_time << " ms." << std::endl;

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }
}

void print_neighbor(const int* NN, const int* NL)
{
    std::ofstream outfile("neighbor.txt");
    if (!outfile) {
        std::cout << "Cannot open neighbor.txt" << std::endl;
    }

    for (int n = 0; n < N; ++n) {
        if (NN[n] > MN) {
            std::cout << "Error: MN is too small." << std::endl;
            exit(1);
        }
        outfile << NN[n];
        for (int k = 0; k < MN; ++k) {
            if (k < NN[n]) {
                outfile << " " << NL[n * MN + k];
            } else {
                outfile << " NaN";
            }
        }
        outfile << std::endl;
    }
    outfile.close();
}

This is quite a classic piece of code, drawing on a lot of C++: file reading and writing, the increment operators, and clever use of indexing. Take your time to digest it.

Now let us look at the CUDA version.

The CUDA computation mainly amounts to rewriting find_neighbor from the CPU code as a corresponding kernel function.

We first try to write such a kernel starting from the C++ find_neighbor function above. Listing 9.4 gives the kernel find_neighbor_atomic in the program neighbor2gpu.cu; it is essentially a direct translation of the C++ find_neighbor. In this kernel, we first map the thread index

blockIdx.x * blockDim.x + threadIdx.x

to the atom index n1. Then we can remove the loop over n1 and replace it with the judgment statement if (n1 < N). Next, consider the following two statements from the C++ version:

d_NL[n1 * MN + d_NN[n1]++] = n2;
d_NL[n2 * MN + d_NN[n2]++] = n1;

The first statement records atom n2 as the d_NN[n1]-th neighbor of atom n1 and then increases d_NN[n1] by 1; the second records atom n1 as the d_NN[n2]-th neighbor of atom n2 and then increases d_NN[n2] by 1. Note, however, that each thread handles one atom. In the thread for n1, the second statement tries to accumulate into d_NN[n2]; but in the thread for n2, the first statement also accumulates into d_NN[n2]. From what we learned earlier in this chapter, we know that atomic functions are needed here, i.e., the statements above must be changed to

d_NL[n1 * MN + atomicAdd(&d_NN[n1], 1)] = n2;
d_NL[n2 * MN + atomicAdd(&d_NN[n2], 1)] = n1;

CUDA version without atomic operations

For a problem solved with atomic functions, it is sometimes possible to change the algorithm so that atomic functions are no longer needed. In the problem above, atomic functions were required because we saved half of the distance tests, with the consequence that different threads may write to the same global memory address at the same time. If we do not skip half of the distance tests, each global memory address is written by only one thread, and no atomic function is needed. Based on this observation, we can write a kernel that uses no atomic functions.

Code:

#include "error.cuh"
#include <cmath>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#ifdef USE_DP
typedef double real;
#else
typedef float real;
#endif

int N;                      // total number of atoms
const int NUM_REPEATS = 20; // number of timed repetitions
const int MN = 10;          // maximum number of neighbors per atom
const real cutoff = 1.9;    // cutoff distance
const real cutoff_square = cutoff * cutoff; // squared cutoff distance for the neighbor test

void read_xy(std::vector<real>& x, std::vector<real>& y);
void timing(int*, int*, const real*, const real*, const bool);
void print_neighbor(const int*, const int*, const bool);

int main(void)
{
    std::vector<real> v_x, v_y;
    read_xy(v_x, v_y);
    N = v_x.size();

    int mem1 = sizeof(int) * N;
    int mem2 = sizeof(int) * N * MN;
    int mem3 = sizeof(real) * N;

    int* h_NN = (int*)malloc(mem1);
    int* h_NL = (int*)malloc(mem2);
    int* d_NN, * d_NL;
    real* d_x, * d_y;

    CHECK(cudaMalloc(&d_NN, mem1));
    CHECK(cudaMalloc(&d_NL, mem2));
    CHECK(cudaMalloc(&d_x, mem3));
    CHECK(cudaMalloc(&d_y, mem3));
    CHECK(cudaMemcpy(d_x, v_x.data(), mem3, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, v_y.data(), mem3, cudaMemcpyHostToDevice));

    std::cout << std::endl << "using atomicAdd:" << std::endl;
    timing(d_NN, d_NL, d_x, d_y, true);
    std::cout << std::endl << "not using atomicAdd:" << std::endl;
    timing(d_NN, d_NL, d_x, d_y, false);

    CHECK(cudaMemcpy(h_NN, d_NN, mem1, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(h_NL, d_NL, mem2, cudaMemcpyDeviceToHost));

    print_neighbor(h_NN, h_NL, false);

    CHECK(cudaFree(d_NN));
    CHECK(cudaFree(d_NL));
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_NN);
    free(h_NL);
    return 0;
}

void read_xy(std::vector<real>& v_x, std::vector<real>& v_y)
{
    std::ifstream infile("xy.txt");
    std::string line, word;
    if (!infile) {
        std::cout << "Cannot open xy.txt" << std::endl;
        exit(1);
    }
    while (std::getline(infile, line)) {
        std::istringstream words(line);
        if (line.length() == 0) {
            continue;
        }
        for (int i = 0; i < 2; i++) {
            if (words >> word) {
                if (i == 0) {
                    v_x.push_back(std::stod(word));
                } else {
                    v_y.push_back(std::stod(word));
                }
            } else {
                std::cout << "Error for reading xy.txt" << std::endl;
                exit(1);
            }
        }
    }
    infile.close();
}

void __global__ find_neighbor_atomic(
    int* d_NN, int* d_NL, const real* d_x, const real* d_y,
    const int N, const real cutoff_square)
{
    const int n1 = blockIdx.x * blockDim.x + threadIdx.x;
    if (n1 < N) {
        d_NN[n1] = 0;
        const real x1 = d_x[n1];
        const real y1 = d_y[n1];
        for (int n2 = n1 + 1; n2 < N; ++n2) {
            const real x12 = d_x[n2] - x1;
            const real y12 = d_y[n2] - y1;
            const real distance_square = x12 * x12 + y12 * y12;
            if (distance_square < cutoff_square) {
                // atomicAdd returns the old count, which serves as the slot index
                d_NL[n1 * MN + atomicAdd(&d_NN[n1], 1)] = n2;
                d_NL[n2 * MN + atomicAdd(&d_NN[n2], 1)] = n1;
            }
        }
    }
}

void __global__ find_neighbor_no_atomic(
    int* d_NN, int* d_NL, const real* d_x, const real* d_y,
    const int N, const real cutoff_square)
{
    const int n1 = blockIdx.x * blockDim.x + threadIdx.x;
    if (n1 < N)
    {
        int count = 0;
        const real x1 = d_x[n1];
        const real y1 = d_y[n1];
        for (int n2 = 0; n2 < N; ++n2)
        {
            const real x12 = d_x[n2] - x1;
            const real y12 = d_y[n2] - y1;
            const real distance_square = x12 * x12 + y12 * y12;
            if ((distance_square < cutoff_square) && (n2 != n1))
            {
                // with the transposed layout NL[k * N + n1], each address is written
                // by exactly one thread, so no atomic function is needed
                d_NL[(count++) * N + n1] = n2;
            }
        }
        d_NN[n1] = count;
    }
}

void timing(int* d_NN, int* d_NL, const real* d_x, const real* d_y, const bool atomic)
{
    for (int repeat = 0; repeat < NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        while (cudaEventQuery(start) != cudaSuccess) {}

        int block_size = 128;
        int grid_size = (N + block_size - 1) / block_size;

        if (atomic) {
            find_neighbor_atomic<<<grid_size, block_size>>>
                (d_NN, d_NL, d_x, d_y, N, cutoff_square);
        }
        else {
            find_neighbor_no_atomic<<<grid_size, block_size>>>
                (d_NN, d_NL, d_x, d_y, N, cutoff_square);
        }

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        std::cout << "Time = " << elapsed_time << " ms." << std::endl;

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }
}

void print_neighbor(const int* NN, const int* NL, const bool atomic)
{
    std::ofstream outfile("neighbor.txt");
    if (!outfile) {
        std::cout << "Cannot open neighbor.txt" << std::endl;
    }

    for (int n = 0; n < N; ++n) {
        if (NN[n] > MN) {
            std::cout << "Error: MN is too small." << std::endl;
            exit(1);
        }
        outfile << NN[n];
        for (int k = 0; k < MN; ++k) {
            if (k < NN[n]) {
                // the two kernels store the neighbor list in different layouts
                int tmp = atomic ? NL[n * MN + k] : NL[k * N + n];
                outfile << " " << tmp;
            } else {
                outfile << " NaN";
            }
        }
        outfile << std::endl;
    }
    outfile.close();
}

That is all the code, but it still takes a while to fully understand.

Let's leave it at this big revision for now; I will continue to update it later.

To sum up, an atomic operation lets a single thread complete a relatively complex read-modify-write sequence without interference from other threads. In effect it says: let me finish this step by myself first; everyone else waits their turn.
