CUDA tuning with instruction-level primitives (1): floating-point operations

In order to achieve high throughput on the GPU, you need to understand what factors limit peak performance. Based on this, applications can be divided into two categories:

I/O intensive

Computationally intensive

In this chapter we focus on the computationally intensive ones.

Three major factors significantly affect the instructions generated for a CUDA kernel: floating-point operations, built-in (intrinsic) and standard functions, and atomic operations.

Floating-point arithmetic:

The figure above (an IEEE 754 format diagram, omitted here) shows the standard layouts of float and double.

For a float, the sign bit s occupies 1 bit, the exponent e occupies 8 bits, and the mantissa occupies 23 bits; for a double, the sign occupies 1 bit, the exponent 11 bits, and the mantissa 52 bits.
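As a quick sanity check (a little host-side sketch of mine, not part of the book's code), you can pull the three fields out of a float's bit pattern and print them:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Print the sign, exponent, and mantissa fields of an IEEE 754 float. */
static void print_float_fields(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));          /* reinterpret the 32 bits */

    uint32_t sign     = bits >> 31;           /* 1 bit                    */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits (biased by 127)   */
    uint32_t mantissa = bits & 0x7FFFFF;      /* 23 bits                  */

    printf("%f -> sign=%u exponent=%u mantissa=0x%06X\n",
        x, sign, exponent, mantissa);
}

int main(void)
{
    print_float_fields(1.0f);   /* sign=0 exponent=127 mantissa=0x000000 */
    print_float_fields(-0.5f);  /* sign=1 exponent=126 mantissa=0x000000 */
    return 0;
}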

To see the difference between single-precision and double-precision floating-point numbers, we run the following program, which performs a long chain of floating-point operations in both precisions:

#include "freshman.h"
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

 /**
  * The computational kernel for single-precision floating-point
  **/
__global__ void lots_of_float_compute(float* inputs, int N, size_t niters,
    float* outputs)
{
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    size_t nthreads = gridDim.x * blockDim.x;

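    /* Grid-stride loop: each thread starts at its global index and strides by
     * the total thread count, so any grid size can cover all N elements. */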
    for (; tid < N; tid += nthreads)
    {
        size_t iter;
        float val = inputs[tid];

        for (iter = 0; iter < niters; iter++)
        {
            val = (val + 5.0f) - 101.0f;
            val = (val / 3.0f) + 102.0f;
            val = (val + 1.07f) - 103.0f;
            val = (val / 1.037f) + 104.0f;
            val = (val + 3.00f) - 105.0f;
            val = (val / 0.22f) + 106.0f;
        }

        outputs[tid] = val;
    }
}

/**
 * The computational kernel for double-precision floating-point
 **/
__global__ void lots_of_double_compute(double* inputs, int N, size_t niters,
    double* outputs)
{
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    size_t nthreads = gridDim.x * blockDim.x;

    for (; tid < N; tid += nthreads)
    {
        size_t iter;
        double val = inputs[tid];

        for (iter = 0; iter < niters; iter++)
        {
            val = (val + 5.0) - 101.0;
            val = (val / 3.0) + 102.0;
            val = (val + 1.07) - 103.0;
            val = (val / 1.037) + 104.0;
            val = (val + 3.00) - 105.0;
            val = (val / 0.22) + 106.0;
        }

        outputs[tid] = val;
    }
}

/**
 * Runs a full test of single-precision floating-point, including transferring
 * inputs to the device, running the single-precision kernel, and copying
 * outputs back.
 **/
static void run_float_test(size_t N, int niters, int blocksPerGrid,
    int threadsPerBlock, double* toDeviceTime,
    double* kernelTime, double* fromDeviceTime,
    float* sample, int sampleLength)
{
    int i;
    float* h_floatInputs, * h_floatOutputs;
    float* d_floatInputs, * d_floatOutputs;

    h_floatInputs = (float*)malloc(sizeof(float) * N);
    h_floatOutputs = (float*)malloc(sizeof(float) * N);
    CHECK(cudaMalloc((void **)&d_floatInputs, sizeof(float) * N));
    CHECK(cudaMalloc((void **)&d_floatOutputs, sizeof(float) * N));

    for (i = 0; i < N; i++)
    {
        h_floatInputs[i] = (float)i;
    }

    double toDeviceStart = cpuSecond();
    CHECK(cudaMemcpy(d_floatInputs, h_floatInputs, sizeof(float) * N,
        cudaMemcpyHostToDevice));
    *toDeviceTime = cpuSecond() - toDeviceStart;

    double kernelStart = cpuSecond();
    lots_of_float_compute<<<blocksPerGrid, threadsPerBlock>>>(d_floatInputs,
        N, niters, d_floatOutputs);
    CHECK(cudaDeviceSynchronize());
    *kernelTime = cpuSecond() - kernelStart;

    double fromDeviceStart = cpuSecond();
    CHECK(cudaMemcpy(h_floatOutputs, d_floatOutputs, sizeof(float) * N,
        cudaMemcpyDeviceToHost));
    *fromDeviceTime = cpuSecond() - fromDeviceStart;

    for (i = 0; i < sampleLength; i++)
    {
        sample[i] = h_floatOutputs[i];
    }

    CHECK(cudaFree(d_floatInputs));
    CHECK(cudaFree(d_floatOutputs));
    free(h_floatInputs);
    free(h_floatOutputs);
}

/**
 * Runs a full test of double-precision floating-point, including transferring
 * inputs to the device, running the double-precision kernel, and copying
 * outputs back.
 **/
static void run_double_test(size_t N, int niters, int blocksPerGrid,
    int threadsPerBlock, double* toDeviceTime,
    double* kernelTime, double* fromDeviceTime,
    double* sample, int sampleLength)
{
    int i;
    double* h_doubleInputs, * h_doubleOutputs;
    double* d_doubleInputs, * d_doubleOutputs;

    h_doubleInputs = (double*)malloc(sizeof(double) * N);
    h_doubleOutputs = (double*)malloc(sizeof(double) * N);
    CHECK(cudaMalloc((void **)&d_doubleInputs, sizeof(double) * N));
    CHECK(cudaMalloc((void **)&d_doubleOutputs, sizeof(double) * N));

    for (i = 0; i < N; i++)
    {
        h_doubleInputs[i] = (double)i;
    }

    double toDeviceStart = cpuSecond();
    CHECK(cudaMemcpy(d_doubleInputs, h_doubleInputs, sizeof(double) * N,
        cudaMemcpyHostToDevice));
    *toDeviceTime = cpuSecond() - toDeviceStart;

    double kernelStart = cpuSecond();
    lots_of_double_compute<<<blocksPerGrid, threadsPerBlock>>>(d_doubleInputs,
        N, niters, d_doubleOutputs);
    CHECK(cudaDeviceSynchronize());
    *kernelTime = cpuSecond() - kernelStart;

    double fromDeviceStart = cpuSecond();
    CHECK(cudaMemcpy(h_doubleOutputs, d_doubleOutputs, sizeof(double) * N,
        cudaMemcpyDeviceToHost));
    *fromDeviceTime = cpuSecond() - fromDeviceStart;

    for (i = 0; i < sampleLength; i++)
    {
        sample[i] = h_doubleOutputs[i];
    }

    CHECK(cudaFree(d_doubleInputs));
    CHECK(cudaFree(d_doubleOutputs));
    free(h_doubleInputs);
    free(h_doubleOutputs);
}

int main(int argc, char** argv)
{
    int i;
    double meanFloatToDeviceTime, meanFloatKernelTime, meanFloatFromDeviceTime;
    double meanDoubleToDeviceTime, meanDoubleKernelTime,
        meanDoubleFromDeviceTime;
    struct cudaDeviceProp deviceProperties;
    size_t totalMem, freeMem;
    float* floatSample;
    double* doubleSample;
    int sampleLength = 10;
    int nRuns = 5;
    int nKernelIters = 20;

    meanFloatToDeviceTime = meanFloatKernelTime = meanFloatFromDeviceTime = 0.0;
    meanDoubleToDeviceTime = meanDoubleKernelTime =
        meanDoubleFromDeviceTime = 0.0;

    CHECK(cudaMemGetInfo(&freeMem, &totalMem));
    CHECK(cudaGetDeviceProperties(&deviceProperties, 0));

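    /* Size N so that one double input array plus one double output array fit
     * in roughly 90% of the currently free device memory. */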
    size_t N = (freeMem * 0.9 / 2) / sizeof(double);
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    if (blocksPerGrid > deviceProperties.maxGridSize[0])
    {
        blocksPerGrid = deviceProperties.maxGridSize[0];
    }

    printf("Running %d blocks with %d threads/block over %lu elements\n",
        blocksPerGrid, threadsPerBlock, N);

    floatSample = (float*)malloc(sizeof(float) * sampleLength);
    doubleSample = (double*)malloc(sizeof(double) * sampleLength);

    for (i = 0; i < nRuns; i++)
    {
        double toDeviceTime, kernelTime, fromDeviceTime;

        run_float_test(N, nKernelIters, blocksPerGrid, threadsPerBlock,
             &toDeviceTime, &kernelTime, &fromDeviceTime,
            floatSample, sampleLength);
        meanFloatToDeviceTime += toDeviceTime;
        meanFloatKernelTime += kernelTime;
        meanFloatFromDeviceTime += fromDeviceTime;

        run_double_test(N, nKernelIters, blocksPerGrid, threadsPerBlock,
             &toDeviceTime, &kernelTime, &fromDeviceTime,
            doubleSample, sampleLength);
        meanDoubleToDeviceTime += toDeviceTime;
        meanDoubleKernelTime += kernelTime;
        meanDoubleFromDeviceTime += fromDeviceTime;

        if (i == 0)
        {
            int j;
            printf("Input\tDiff Between Single- and Double-Precision\n");
            printf("------\t------\n");

            for (j = 0; j < sampleLength; j++)
            {
                printf("%d\t%.20e\n", j,
                    fabs(doubleSample[j] - (double)floatSample[j]));
            }

            printf("\n");
        }
    }

    meanFloatToDeviceTime /= nRuns;
    meanFloatKernelTime /= nRuns;
    meanFloatFromDeviceTime /= nRuns;
    meanDoubleToDeviceTime /= nRuns;
    meanDoubleKernelTime /= nRuns;
    meanDoubleFromDeviceTime /= nRuns;

    printf("For single-precision floating point, mean times for:\n");
    printf("Copy to device: %f s\n", meanFloatToDeviceTime);
    printf("Kernel execution: %f s\n", meanFloatKernelTime);
    printf("Copy from device: %f s\n", meanFloatFromDeviceTime);
    printf("For double-precision floating point, mean times for:\n");
    printf("Copy to device: %f s (%.2fx slower than single-precision)\n",
        meanDoubleToDeviceTime,
        meanDoubleToDeviceTime / meanFloatToDeviceTime);
    printf("Kernel execution: %f s (%.2fx slower than single-precision)\n",
        meanDoubleKernelTime,
        meanDoubleKernelTime / meanFloatKernelTime);
    printf("Copy from device: %f s (%.2fx slower than single-precision)\n",
        meanDoubleFromDeviceTime,
        meanDoubleFromDeviceTime / meanFloatFromDeviceTime);

    return 0;
}

I got the following results. The gap between single and double precision is quite different from the difference reported in the book; on my machine it reaches about 12x.

You can clearly see the numerical difference between the single- and double-precision results.

There is also a more interesting observation.

Notice that the copy from the GPU back to the host takes significantly longer than the copy from the host to the GPU. The same phenomenon appears in the book, so I asked GPT about it, and it gave the following explanation:

This phenomenon may be related to the bandwidth and latency of the PCIe bus, since all communication between the GPU and the CPU has to go through the PCIe bus on the motherboard. Because the PCIe bus is shared, its bandwidth and latency can be affected by other devices, especially when several devices are transferring data to the CPU or GPU at the same time. This can make the transfer from the GPU back to the CPU take longer than the transfer from the CPU to the GPU.

In addition, this phenomenon may also be related to the size of the data. Transferring large amounts of data simply takes longer, and you may see this effect when processing large datasets on the GPU. In that case, more PCIe bandwidth is needed and more latency can accumulate during the transfer, so the copy from the GPU back to the CPU can end up taking longer than the copy from the CPU to the GPU.

Finally, this phenomenon may also be related to how the operating system and applications manage I/O. For example, the operating system may prioritize I/O requests from the CPU, so the GPU experiences more contention when transferring data back, which adds delay. Other running applications can also keep PCIe bus usage high.

Summary

The impact of floating-point operations on application performance and numerical accuracy is not unique to the GPU. When using other architectures, you will face the same problem.

For numerical-precision reasons, it may be preferable to use double-precision variables in iterative applications.
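As a minimal illustration (my own host-side sketch, not from the book): once a float accumulator reaches 2^24 = 16777216, adding 1.0f no longer changes it, while a double accumulator keeps counting.

#include <stdio.h>

int main(void)
{
    float  fsum = 0.0f;
    double dsum = 0.0;

    /* Add 1 twenty million times. The float sum stalls at 2^24 because
     * 16777217 is not representable in single precision. */
    for (long i = 0; i < 20000000; i++)
    {
        fsum += 1.0f;
        dsum += 1.0;
    }

    printf("float  sum: %.1f\n", fsum);  /* 16777216.0 */
    printf("double sum: %.1f\n", dsum);  /* 20000000.0 */
    return 0;
}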

Great care must be taken when declaring single-precision floating-point literals: any constant that omits the trailing f suffix is treated as double precision by the NVCC compiler.
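For example (a sketch of mine, not the book's code), in the hypothetical kernel below the literal without the f suffix silently drags the whole expression into double precision:

__global__ void suffix_example(float *out, float x)
{
    /* 0.1 is a double literal: x is promoted to double, the multiply runs as a
     * double-precision operation, and the result is narrowed back to float. */
    out[0] = x * 0.1;

    /* 0.1f keeps the whole expression in single precision. */
    out[1] = x * 0.1f;
}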

The following features are unique to CUDA and GPUs:

Using double-precision values increases the amount of data communicated between the host and the device.

Using double-precision values increases global memory I/O.

Numerical precision can be lost when the CUDA compiler applies aggressive floating-point optimizations.

In general, use double-precision values if the application requires high accuracy; otherwise, use single-precision values for better performance.