04. Compiler optimization and SIMD instruction set

64-bit x86 (x86-64) adds eight new general-purpose registers, r8-r15.

AT&T (GCC) assembly syntax differs slightly from the more commonly taught Intel syntax.

Operands tend to read left to right: source first, then destination.

Enable optimization: -O3

A simple arithmetic function can be optimized down to a single leal instruction.
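For example, a small function like this (the name is mine) typically compiles at -O3 to a single lea:

int scale_add(int a, int b) {
    return a + b * 2;   // GCC -O3 typically emits just: leal (%rdi,%rsi,2), %eax
}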

1. Pointer indexing: prefer size_t

#include <cstddef>

int func(int *a, std::size_t b) {
    return a[b];
}

for (std::size_t i = 0; i < n; i++)   // n is the array length
    a[i] *= 2;

On 64-bit systems, size_t is equivalent to uint64_t.

When generating the assembly for the address calculation, the compiler therefore does not need to sign-extend a 32-bit int to the 64-bit pointer width, which is more efficient.
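A minimal sketch of the contrast (function names are mine; the comments describe typical GCC -O3 output):

#include <cstddef>

int get_int(int *a, int i) {
    return a[i];        // i must first be sign-extended to 64 bits (movslq) before addressing
}

int get_sizet(int *a, std::size_t i) {
    return a[i];        // indexes directly with the 64-bit register, no extension needed
}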

Also: floating-point parameters and return values go through the xmm registers, which are 128 bits wide (they can hold two doubles or four floats).

addss // example assembly instruction

add means addition.

The first suffix: s means scalar, operating only on the lowest element of the xmm register; p means packed, operating on all elements of the xmm register.

The second suffix: s means single-precision floating point (float); d means double-precision floating point (double).
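The same scalar/packed distinction shows up through SSE intrinsics; a small sketch of mine (not from the course):

#include <xmmintrin.h>   // SSE intrinsics; __m128 corresponds to one xmm register (4 floats)

__m128 add_packed(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);   // compiles to addps: adds all four float lanes
}

__m128 add_scalar(__m128 a, __m128 b) {
    return _mm_add_ss(a, b);   // compiles to addss: adds only the lowest lane, upper lanes taken from a
}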

SIMD (single instruction, multiple data)

This approach can greatly increase the throughput of computationally intensive programs.

Compiler optimization cannot do everything:

Try not to over-complicate the code, and avoid containers that trigger new/delete.

The constexpr keyword: forces the compiler to evaluate the expression at compile time (when used in a constant-expression context).

This also requires that no heap-allocated data is involved.
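A minimal constexpr sketch (function name is mine):

constexpr int square(int x) {
    return x * x;                      // evaluated by the compiler when used in a constant expression
}

static_assert(square(12) == 144);      // checked entirely at compile time (C++17 single-argument form)
int buffer[square(8)];                 // array size must be a compile-time constant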

When source files are compiled separately, calls to functions defined in other files cannot be optimized (e.g. inlined) by the compiler.

A function declared with the static keyword is visible only within its own translation unit. After optimization, the compiler may not emit a standalone definition for it at all and instead inlines it directly.

By contrast, the inline keyword has essentially nothing to do with whether modern compilers actually inline a function.
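A sketch of the static case (names are mine; the behavior described is typical of GCC -O3):

static int helper(int x) {     // not visible outside this translation unit
    return x + 1;
}

int call_helper(int x) {
    return helper(x);          // at -O3 the call is typically inlined and no standalone helper is emitted
}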

godbolt.org // Compiler Explorer, handy for these compiler experiments

Pointer aliasing phenomenon:

void func(int *a, int *b, int *c) {
    *c = *a;   // cannot be dropped: b and c might alias, so this write could change *b
    *c = *b;
}

int main() {
    int a = 1, b = 2;
    func(&a, &b, &b);   // here b and c really do alias
}

// The compiler would rather be slow than wrong.

Solution: the __restrict keyword tells the compiler that the pointers never overlap, so it can optimize boldly.

void func(int *__restrict a, int *__restrict b, int *__restrict c) {
    *c = *a;   // with __restrict there is no aliasing, so the compiler can drop this dead store
    *c = *b;
}

Suppressing optimization: the volatile keyword (every access really reads or writes memory).

int func(int volatile *a) {
    *a = 42;
    return *a;   // the compiler must actually reload from the address instead of just returning 42
}

for (volatile int i = 0; i < 1000; i++);   // the compiler will not remove this empty loop; it runs it honestly, which can serve as a crude busy-wait delay

Compiler Optimization: Coalescing Writes

void func(int *a) {
    a[0] = 123;
    a[1] = 456;
}

For consecutive addresses, two 32-bit writes can be directly optimized to one 64-bit write by the compiler.
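Conceptually (a sketch of what the optimization amounts to, not the compiler's actual code path), the two stores become one:

#include <cstdint>
#include <cstring>

void func_merged(int *a) {
    std::uint64_t v = (std::uint64_t(456) << 32) | 123u;  // low half = 123, high half = 456 (little-endian x86)
    std::memcpy(a, &v, sizeof v);                          // a single 64-bit store instead of two 32-bit stores
}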

Even wider coalesced writes (128-bit, via an xmm register) use vectorized instructions.

SIMD writes elements in parallel in groups of 4; for lengths that are not a multiple of 4, the compiler also emits boundary handling automatically. Try to make array lengths a multiple of 16 to spare the compiler these special cases.

Loops:

// With __restrict: the compiler can emit a single SIMD loop.
void func(float *__restrict a, float *__restrict b) {
    for (int i = 0; i < 1024; i++) {
        a[i] = b[i] + 1;
    }
}

// Without __restrict: see below.
void func(float *a, float *b) {
    for (int i = 0; i < 1024; i++) {
        a[i] = b[i] + 1;
    }
}

Similarly, to guard against pointer aliasing, the compiler generates two versions of the loop (a scalar version and a SIMD version) and inserts a runtime pointer-difference check to decide which one to run. To avoid the cost of that check, add the __restrict keyword here as well.

void func(float *a, float *b) {
#pragma omp simd   // alternatively, this OpenMP directive forces the SIMD version (needs -fopenmp-simd or -fopenmp on GCC)
    for (int i = 0; i < 1024; i++) {
        a[i] = b[i] + 1;
    }
}

For GCC there is also:

#pragma GCC ivdep   // tells GCC to ignore assumed vector dependencies, achieving a similar effect

If a branch condition inside the loop does not change during the loop, the compiler will hoist the branch outside the loop (loop unswitching).
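A sketch of that transformation (function names are mine):

void fill(float *a, bool flag) {
    for (int i = 0; i < 1024; i++) {
        if (flag)                    // the condition never changes inside the loop...
            a[i] = 1;
        else
            a[i] = 2;
    }
}

// ...so the compiler can effectively rewrite it as two branch-free loops:
void fill_unswitched(float *a, bool flag) {
    if (flag)
        for (int i = 0; i < 1024; i++) a[i] = 1;
    else
        for (int i = 0; i < 1024; i++) a[i] = 2;
}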

For loop-invariant (constant) expressions, the compiler will likewise hoist the computation outside the loop to reduce repeated work.

void func(float *__restrict a, float *__restrict b, float dt) {
    for (int i = 0; i < 1024; i++) {
        a[i] = a[i] + b[i] * (dt * dt);   // the parentheses matter: * is left-associative, so without them b[i]*dt*dt cannot be hoisted as a loop-invariant constant
    }
}

Note: calling a function defined in another file also defeats these optimizations, so avoid external function calls inside hot loops.

Loop unrolling:

void func(float *a) {
#pragma GCC unroll 4   // this directive unrolls the loop body 4 times
    for (int i = 0; i < 1024; i++) {
        a[i] = 1;
    }
}


// Equivalent to the following:

void func(float *a) {
    for (int i = 0; i < 1024; i += 4) {
        a[i + 0] = 1;
        a[i + 1] = 1;
        a[i + 2] = 1;
        a[i + 3] = 1;
    }
}

The goal is to raise the proportion of each iteration spent on actual computation (rather than loop bookkeeping), improving efficiency.

(Figures: instruction stream before vs. after loop unrolling)

However, if the loop body is already large, unrolling is not worthwhile: it puts pressure on the instruction cache and can actually slow things down.

Structs:

Making the struct size a power of two helps SIMD optimization; unused 4-byte members can be added as padding to achieve this.

Or use alignas(number of bytes to align to):

struct alignas(16) MyVec {   // alignas is a C++11 keyword
    float x;
    float y;
    float z;
};

But it can also turn out slower, because the padding consumes extra memory bandwidth; in short, weigh it against the actual workload.

AOS conforms to object-oriented thinking, but is often not conducive to performance.

// AOS (array of structures) form:
struct MyVec {
    float x;
    float y;
    float z;
};

MyVec a[1024];


// SOA (structure of arrays) form: more efficient; the data-oriented programming (DOP) mindset
struct MyVec{
    float x[1024];
    float y[1024];
    float z[1024];
};

MyVec a;


//AOSOA: an intermediate solution, which has both the intuitiveness of AOS and the efficiency of SOA
struct MyVec{
    float x[4];
    float y[4];
    float z[4];
};

MyVec a[1024/4];

Disadvantage of AOSOA: access requires a double loop, and it is not friendly to random access.
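A sketch of that double-loop access pattern (the scale function is mine; the AOSOA definition from above is repeated so the snippet stands alone):

struct MyVec {
    float x[4];
    float y[4];
    float z[4];
};

MyVec a[1024 / 4];

void scale(float k) {
    for (int i = 0; i < 1024 / 4; i++) {   // outer loop over blocks
        for (int j = 0; j < 4; j++) {      // inner loop over the 4 elements inside a block
            a[i].x[j] *= k;
            a[i].y[j] *= k;
            a[i].z[j] *= k;
        }
    }
}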

STL

vector also has pointer aliasing issues:

Workaround: #pragma omp simd or #pragma GCC ivdep
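A sketch of the workaround with std::vector (function name is mine; as above, #pragma omp simd needs -fopenmp-simd or -fopenmp on GCC):

#include <cstddef>
#include <vector>

void add_one(std::vector<float> &a, const std::vector<float> &b) {
#pragma omp simd   // vectorize despite possible aliasing through the vectors' data pointers
    for (std::size_t i = 0; i < a.size(); i++) {
        a[i] = b[i] + 1;
    }
}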

Math optimization:

The compiler can optimize division into multiplication by the reciprocal.

// But here the compiler gives up that optimization: it cannot prove b is nonzero, and a/b and a*(1/b) may round differently.
void func(float *a, float b) {
    for (int i = 0; i < 1024; i++) {
        a[i] /= b;
    }
}

Solution 1: Manual optimization

void func(float *a, float b) {
    float inv_b = 1 / b;   // compute the reciprocal once, outside the loop
    for (int i = 0; i < 1024; i++) {
        a[i] *= inv_b;
    }
}

Solution 2: gcc -ffast-math -O3

This flag lets GCC optimize floating-point operations aggressively, but the program must be guaranteed never to produce NaN or infinity.

Keep the std:: prefix on math functions! Some of the C versions only take and return double, which wastes performance on float arguments.
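A sketch of the difference (function name is mine):

#include <cmath>

float length(float x, float y) {
    return std::sqrt(x * x + y * y);   // std::sqrt has a float overload: stays in single precision
    // ::sqrt from <math.h> takes and returns double, forcing float -> double -> float conversions
}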

Course Link: [Open Class] Compiler Optimization and SIMD Instruction Set (#4)_哔哩哔哩_bilibili