64-bit x86 adds eight new general-purpose registers, r8-r15
AT&T (GCC) assembly syntax is slightly different from the more familiar Intel syntax
Operands tend to read left to right: source first, then destination
Enable optimization: -O3
Simple arithmetic functions can be optimized down to a single leal instruction
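A minimal sketch of this (function name is illustrative); the assembly shown in the comment is what GCC typically emits at -O3:

```cpp
// With -O3, GCC compiles this to a single
// "leal (%rdi,%rsi,4), %eax" -- no imul, no separate add.
int scale_add(int a, int b) {
    return a + b * 4;
}
```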
1. Pointer/array indices: prefer size_t
int func(int* a, std::size_t b) { return a[b]; }
for (std::size_t i = 0; ; i++) a[i] *= 2;
Because size_t is equivalent to uint64_t on 64-bit systems, the compiler does not need to extend a 32-bit int index to a 64-bit pointer width before the address calculation, which is more efficient.
Floating-point parameters and return values use the xmm registers: 128 bits each (room for two doubles or four floats)
addss // assembly instruction
add means addition
The first suffix letter: s = scalar, computing only on the lowest lane of xmm; p = packed (vector), computing on all lanes of xmm
The second suffix letter: s = single-precision float, d = double-precision double (so addss = add scalar single, addpd = add packed double)
SIMD (single instruction, multiple data)
This approach can greatly increase the throughput of computationally intensive programs.
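The scalar/packed distinction can be seen directly with SSE intrinsics — a minimal sketch (x86 only; function names are illustrative):

```cpp
#include <immintrin.h>

// addss: add only the lowest lane; upper lanes are copied from x
void scalar_add(const float* x, const float* y, float* out) {
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}

// addps: add all four lanes in parallel
void packed_add(const float* x, const float* y, float* out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}
```

With inputs {1, 2, 3, 4} and {10, 20, 30, 40}, scalar_add produces {11, 2, 3, 4} while packed_add produces {11, 22, 33, 44}.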
Compiler optimizations: they cannot do everything
Try to avoid complicating the code; avoid containers that allocate with new/delete.
The constexpr keyword: forces the compiler to evaluate the expression at compile time
Such evaluation cannot involve data on the heap
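A minimal sketch of compile-time evaluation (function name is illustrative):

```cpp
// Evaluated entirely at compile time; the result is baked into the binary.
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// static_assert proves the value is known to the compiler, not computed at runtime
static_assert(factorial(5) == 120, "computed by the compiler");
```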
With separate compilation, calls to functions in other translation units cannot be optimized by the compiler.
A function declared with the static keyword gets internal linkage and is not exposed to other translation units. After optimization, the compiler may not emit a definition for it at all and simply inline it at every call site.
So the inline keyword does not actually control inlining in modern compilers.
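The internal-linkage point above can be sketched like this (names are illustrative; the assembly in the comment is what GCC typically produces at -O3):

```cpp
// Internal linkage: no other translation unit can call this, so after
// inlining the compiler may drop the out-of-line definition entirely.
static int twice(int x) { return x + x; }

int answer() { return twice(21); } // compiles to "movl $42, %eax; ret" at -O3
```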
godbolt.org // Compiler Explorer: handy for these compiler experiments
Pointer aliasing phenomenon:
void func(int* a, int* b, int* c) {
    *c = *a; // fearing pointer aliasing (b and c may point to the same address,
             // in which case this store changes *b), the compiler cannot
             // eliminate this seemingly dead store
    *c = *b;
}
int main() {
    int a, b;
    func(&a, &b, &b); // "better slow than wrong"
}
Solution: __restrict keyword, which tells the compiler that pointers will not overlap, boldly optimize!
void func(int* __restrict a, int* __restrict b, int* __restrict c) {
    *c = *a; // with __restrict the compiler knows the pointers don't alias,
             // so it can safely drop this dead store
    *c = *b;
}
Preventing optimization: the volatile keyword (guarantees every access really goes to memory)
int func(int volatile* a) {
    *a = 42;
    return *a; // the compiler honestly re-reads the address
               // instead of just returning 42 directly
}

for (volatile int i = 0; i < 1000; i++)
    ; // the compiler will not remove this empty loop but executes it
      // honestly, which gives a crude busy-wait timer
Compiler Optimization: Coalescing Writes
void func(int* a) {
    a[0] = 123;
    a[1] = 456;
}
For consecutive addresses, the compiler can merge the two 32-bit writes into a single 64-bit write.
Wider merged writes, 128-bit (xmm), use vectorized instructions.
SIMD writes elements in parallel in groups of four; for counts that are not a multiple of four, the compiler automatically generates boundary handling. Try to make all array sizes a multiple of 16 to spare the compiler these special cases.
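A hypothetical helper (not from the course) for keeping sizes at a multiple of 16, so the vectorized loop needs no scalar tail:

```cpp
#include <cstddef>

// Hypothetical helper: round an element count up to a multiple of 16
// so the compiler's vectorized loop needs no boundary special-casing.
constexpr std::size_t round_up_16(std::size_t n) {
    return (n + 15) / 16 * 16;
}

static_assert(round_up_16(1000) == 1008, "padded to next multiple of 16");
static_assert(round_up_16(1024) == 1024, "already a multiple of 16");
```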
Loops
void func(float* __restrict a, float* __restrict b) {
    for (int i = 0; i < 1024; i++) a[i] = b[i] + 1;
}

void func(float* a, float* b) {
    for (int i = 0; i < 1024; i++) a[i] = b[i] + 1;
}
Again to guard against pointer aliasing, for the version without __restrict the compiler generates two sets of assembly (a plain version and a SIMD version) plus a runtime pointer-difference check to decide whether the pointers overlap. To avoid the cost of that check, add the __restrict keyword here as well.
void func(float* a, float* b) {
    #pragma omp simd // alternatively, this OpenMP directive forces the SIMD version
    for (int i = 0; i < 1024; i++) a[i] = b[i] + 1;
}
For the GCC compiler, #pragma GCC ivdep (ignore vector dependencies) achieves a similar effect.
The compiler will hoist an if-statement out of a loop, provided its condition does not change inside the loop.
Loop-invariant constant expressions are likewise moved outside the loop to reduce the number of operations.
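A sketch of the hoisted-branch case (loop unswitching; names are illustrative):

```cpp
// `flag` never changes inside the loop, so at -O3 the compiler
// "unswitches": it tests `flag` once and generates two separate
// branch-free loops, one per case.
void scale_or_shift(float* a, int n, bool flag) {
    for (int i = 0; i < n; i++) {
        if (flag) a[i] *= 2.0f;
        else      a[i] += 1.0f;
    }
}
```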
void func(float* __restrict a, float* __restrict b, float dt) {
    for (int i = 0; i < 1024; i++)
        a[i] = a[i] + b[i] * (dt * dt); // parenthesize the constant product:
        // * is left-associative, so without () the compiler evaluates
        // (a[i] + b[i]*dt)*dt-style groupings and cannot hoist dt*dt
}
Note: calling a function from another file also defeats this optimization, so avoid external function calls in hot loops.
Loop unrolling:
void func(float* a) {
    #pragma GCC unroll 4 // this directive unrolls the loop body 4 times
    for (int i = 0; i < 1024; i++) a[i] = 1;
}

// Equivalent to:
void func(float* a) {
    for (int i = 0; i < 1024; i += 4) {
        a[i + 0] = 1;
        a[i + 1] = 1;
        a[i + 2] = 1;
        a[i + 3] = 1;
    }
}
The purpose is to increase the proportion of actual calculation in each cycle and improve efficiency
(figures: instruction pipeline before vs. after loop unrolling)
However, if the loop body is large, unrolling is not worthwhile: it puts pressure on the instruction cache and can slow things down.
Structures
Padding the structure size to a power of two is conducive to SIMD optimization; unused four-byte members can be added to ensure this.
Or use alignas(N), where N is the alignment in bytes:
struct alignas(16) MyVec { // alignas is a C++11 keyword
    float x;
    float y;
    float z;
};
But it may also be slower, because it may lead to the occupation of memory bandwidth, in short, it should be considered in combination with the actual situation.
AOS conforms to object-oriented thinking, but is often not conducive to performance.
// AOS (array of structures): fits object-oriented thinking
struct MyVec {
    float x;
    float y;
    float z;
};
MyVec a[1024];

// SOA (structure of arrays): more efficient; data-oriented programming (DOP)
struct MyVec {
    float x[1024];
    float y[1024];
    float z[1024];
};
MyVec a;

// AOSOA: a middle ground, with both the intuitiveness of AOS and the efficiency of SOA
struct MyVec {
    float x[4];
    float y[4];
    float z[4];
};
MyVec a[1024 / 4];
Disadvantages of AOSOA: A double loop is required for access, and it is not conducive to random access.
STL
vector also has pointer aliasing issues:
Workaround: #pragma omp simd or #pragma GCC ivdep
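A minimal sketch of the workaround on vectors (function name is illustrative; #pragma omp simd takes effect under -fopenmp-simd and is harmlessly ignored otherwise):

```cpp
#include <cstddef>
#include <vector>

// The compiler cannot prove a.data() and b.data() don't overlap, so
// without a hint it emits a runtime overlap check plus two loop versions.
// The pragma asserts that vectorizing is safe.
void plus_one(std::vector<float>& a, const std::vector<float>& b) {
    #pragma omp simd
    for (std::size_t i = 0; i < a.size(); i++)
        a[i] = b[i] + 1.0f;
}
```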
Math optimization:
The compiler will convert division to multiplication for optimization.
// But in this case the compiler gives up the optimization,
// because it does not know whether b can be 0:
void func(float* a, float b) {
    for (int i = 0; i < 1024; i++) a[i] /= b;
}
Solution 1: Manual optimization
void func(float* a, float b) {
    float inv_b = 1 / b;
    for (int i = 0; i < 1024; i++) a[i] *= inv_b;
}
Solution 2: gcc -ffast-math -O3
This flag lets GCC aggressively optimize floating-point operations, but you must guarantee that NaN and infinity never appear in the program.
Don't drop the std:: prefix on math functions! Some of the C-version math functions only accept and return double, wasting performance on float arguments.
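A minimal sketch of the difference (function name is illustrative):

```cpp
#include <cmath>

// std::sqrt has a float overload, so the computation stays in single
// precision; plain ::sqrt takes and returns double, forcing
// float -> double -> float conversions on every call.
float norm(float x, float y) {
    return std::sqrt(x * x + y * y);
}
```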
Course Link: [Open Class] Compiler Optimization and SIMD Instruction Set (#4)_哔哩哔哩_bilibili