CUDA: Tuning Instruction-Level Primitives (2) - Standard Functions vs. Intrinsic Functions

Standard and intrinsic functions differ in numerical accuracy and performance. Standard functions cover most common mathematical operations. Many intrinsics implement the same functionality with fewer instructions, which makes them faster but less numerically precise.

Consider the following program:

#include "freshman.h"
#include <stdio.h>
#include <stdlib.h>
#include <cmath>
/**
 * This example demonstrates the relative performance and accuracy of CUDA
 * standard and intrinsic functions.
 *
 * The computational kernel of this example is the iterative calculation of a
 * value squared. This computation is done on the host, on the device with a
 * standard function, and on the device with an intrinsic function. The results
 * from all three are compared for numerical accuracy (with the host as the
 * baseline), and the performance of standard and intrinsic functions is also
 * compared.
 **/

 /**
  * Perform iters power operations using the standard powf function.
  **/
__global__ void standard_kernel(float a, float* out, int iters)
{
    int i;
    int tid = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (tid == 0)
    {
        float tmp;

        for (i = 0; i < iters; i++)
        {
            tmp = powf(a, 2.0f);
        }

        *out = tmp;
    }
}

/**
 * Perform iters power operations using the intrinsic __powf function.
 **/
__global__ void intrinsic_kernel(float a, float* out, int iters)
{
    int i;
    int tid = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (tid == 0)
    {
        float tmp;

        for (i = 0; i < iters; i++)
        {
            tmp = __powf(a, 2.0f);
        }

        *out = tmp;
    }
}

int main(int argc, char** argv)
{
    int i;
    int runs = 30;
    int iters = 1000;

    float *d_standard_out;
    float h_standard_out;
    CHECK(cudaMalloc((void **)&d_standard_out, sizeof(float)));

    float *d_intrinsic_out;
    float h_intrinsic_out;
    CHECK(cudaMalloc((void **)&d_intrinsic_out, sizeof(float)));

    float input_value = 8181.25f;

    double mean_intrinsic_time = 0.0;
    double mean_standard_time = 0.0;

    for (i = 0; i < runs; i ++ )
    {
        double start_standard = cpuSecond();
        standard_kernel<<<1, 32>>>(input_value, d_standard_out, iters);
        CHECK(cudaDeviceSynchronize());
        mean_standard_time += cpuSecond() - start_standard;

        double start_intrinsic = cpuSecond();
        intrinsic_kernel<<<1, 32>>>(input_value, d_intrinsic_out, iters);
        CHECK(cudaDeviceSynchronize());
        mean_intrinsic_time += cpuSecond() - start_intrinsic;
    }

    CHECK(cudaMemcpy(&h_standard_out, d_standard_out, sizeof(float),
        cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(&h_intrinsic_out, d_intrinsic_out, sizeof(float),
        cudaMemcpyDeviceToHost));
    float host_value = powf(input_value, 2.0f);

    printf("Host calculated\t\t\t%f\\
", host_value);
    printf("Standard Device calculated\t%f\\
", h_standard_out);
    printf("Intrinsic Device calculated\t%f\\
", h_intrinsic_out);
    printf("Host equals Standard?\t\t%s diff=%e\\
",
        host_value == h_standard_out ? "Yes" : "No",
        fabs(host_value - h_standard_out));
    printf("Host equals Intrinsic?\t\t%s diff=%e\\
",
        host_value == h_intrinsic_out ? "Yes" : "No",
        fabs(host_value - h_intrinsic_out));
    printf("Standard equals Intrinsic?\t%s diff=%e\\
",
        h_standard_out == h_intrinsic_out ? "Yes" : "No",
        fabs(h_standard_out - h_intrinsic_out));
    printf("\\
");
    printf("Mean execution time for standard function powf: %f s\\
",
        mean_standard_time);
    printf("Mean execution time for intrinsic function __powf: %f s\\
",
        mean_intrinsic_time);

    CHECK(cudaFree(d_standard_out));
    CHECK(cudaFree(d_intrinsic_out));

    return 0;
}
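
Assuming the file is saved as intrinsic_test.cu (the file name is my own) and freshman.h, the helper header providing CHECK and cpuSecond, is on the include path, it can be compiled and run with:

nvcc -O2 -o intrinsic_test intrinsic_test.cu
./intrinsic_test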

The output of a sample run is shown in the figure below:

Both the standard function powf and the intrinsic function __powf produce results that differ from the host's powf, but the standard function's error is significantly smaller, while the intrinsic function runs noticeably faster.

Even with the numerically stable CUDA standard functions, GPU results can still differ from those of a traditional CPU-only application. Because floating-point arithmetic is inherently inexact on both the host and the device, it is sometimes hard to say which output is actually the more accurate one.

When migrating an application from CPU to GPU, the acceptable error tolerance therefore needs to be stated explicitly.
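
As a concrete illustration, here is a minimal sketch of a tolerance-based comparison; the helper name close_enough and the 1e-3f threshold are my own choices, not part of the original code:

#include <math.h>
#include <stdbool.h>

/* Treat two results as equal if their relative difference is within tol. */
bool close_enough(float host, float device, float tol)
{
    float scale = fmaxf(fabsf(host), fabsf(device));
    return fabsf(host - device) <= tol * fmaxf(scale, 1.0f);
}

/* Usage: close_enough(host_value, h_intrinsic_out, 1e-3f) instead of == */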

Manipulating instruction generation:

In many cases, performance can be improved by replacing a standard function with its intrinsic equivalent, as the __powf kernel above demonstrates.

Performance can be improved this way, but rewriting every call site by hand is slow and tedious, so nvcc also provides compiler flags that adjust and optimize instruction generation for the whole program.

For example, compiling with nvcc --fmad=true and inspecting the generated PTX file, a multiply followed by an add is emitted as a single arithmetic instruction:
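
A minimal sketch of the PTX one might see for an expression like a * b + c (my illustrative expression; register names and exact form vary by GPU architecture and CUDA version):

// --fmad=true: multiply and add contracted into one instruction
fma.rn.f32   %f4, %f1, %f2, %f3;   // %f4 = %f1 * %f2 + %f3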

If nvcc --fmad=false is used instead, the result is:
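
Again an illustrative sketch, under the same assumptions as above:

// --fmad=false: multiply and add kept as two separately rounded instructions
mul.rn.f32   %f4, %f1, %f2;        // %f4 = %f1 * %f2
add.rn.f32   %f5, %f4, %f3;        // %f5 = %f4 + %f3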

The difference between the two is clear: with --fmad=true the computation takes one instruction fewer than with --fmad=false. The fused form also rounds only once (the intermediate product is not rounded before the add), so the two versions can produce slightly different results.

nvcc provides a set of compiler flags for controlling floating-point instruction generation:
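
From the nvcc documentation, the flags most relevant here are:

--ftz={true|false}        flush single-precision denormals to zero (true) or preserve them (false)
--prec-div={true|false}   IEEE-rounded single-precision division (true) or a faster approximation (false)
--prec-sqrt={true|false}  IEEE-rounded single-precision square root (true) or a faster approximation (false)
--fmad={true|false}       allow (true) or forbid (false) contraction of multiplies and adds into FMAD/FFMA
--use_fast_math           implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true, and maps some standard functions to faster intrinsics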

As a comparison with the original code, here I used __fmul_rn; you can see that there is still a deviation from the original data.
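
__fmul_rn(a, b) is the CUDA intrinsic for a single-precision multiply with round-to-nearest-even, and the compiler will not contract it into an FMA. A minimal sketch of the substitution (kernel and variable names are my own):

__global__ void mul_plain(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i];            // may be fused with a neighboring add
}

__global__ void mul_rn(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __fmul_rn(a[i], b[i]);  // always a standalone IEEE-rounded multiply
}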

Profiling with Nsight Systems:

With plain multiplication (__fmul_rn disabled):

With __fmul_rn enabled:

Why did this instruction-level change increase the runtime instead of lowering it? It may be that __fmul_rn involves additional floating-point conversions; another likely factor is that __fmul_rn blocks FMA contraction, forcing the compiler to emit extra instructions.

Not quite believing it, I put the operation inside a loop to magnify the difference:
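
A minimal sketch of the two loop kernels, reconstructed from the description (names and the accumulation scheme are mine):

__global__ void loop_plain(float a, float b, float *out, int iters)
{
    float acc = 0.0f;
    for (int i = 0; i < iters; i++)
        acc += a * b;                  // mul + add, eligible for FMA contraction
    *out = acc;
}

__global__ void loop_fmul_rn(float a, float b, float *out, int iters)
{
    float acc = 0.0f;
    for (int i = 0; i < iters; i++)
        acc += __fmul_rn(a, b);        // multiply forced to stay separate from the add
    *out = acc;
}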

Plain multiplication (*):

__fmul_rn:

Well, it seems the extra work really is time-consuming.

The experiment did not go as expected; I will dig into the reason after class.
