SIMD Instructions Considered Harmful

By David Patterson and Andrew Waterman, September 18, 2017

Original link: SIMD Instructions Considered Harmful | SIGARCH

In the course of writing the RISC-V Handbook, we compared RISC-V vector code to SIMD. We were appalled by the insidiousness of the SIMD instruction extensions for ARM, MIPS, and x86. Based on Chapter 8 of this book, we decided to share these insights in this blog.

Like opioids, SIMD starts off simple. Architects divide existing 64-bit registers and ALUs into many 8-, 16-, or 32-bit sections, and then perform computations on them in parallel. The opcode provides the data width and operation. Data transfers simply load and store a single 64-bit register. How could anyone object to this?
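
To make the idea concrete, the sub-word trick can be sketched in plain C as "SIMD within a register" (SWAR). This is our illustration, not code from any of the ISAs discussed here; the masks keep the carry from one byte from spilling into the neighboring byte.

 #include <stdint.h>

 /* Illustrative sketch: add eight 8-bit lanes packed into one 64-bit word.
    The masks stop a carry in one byte from corrupting the next byte. */
 uint64_t add8x8(uint64_t a, uint64_t b)
 {
   uint64_t low  = (a & 0x7f7f7f7f7f7f7f7fULL) + (b & 0x7f7f7f7f7f7f7f7fULL);
   uint64_t high = (a ^ b) & 0x8080808080808080ULL;
   return low ^ high;   /* each byte now holds (a_byte + b_byte) mod 256 */
 }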

To speed up SIMD, architects then doubled the width of the registers to compute more parts simultaneously. Because the ISA traditionally supports backward binary compatibility, and the opcode specifies the data width, extending the SIMD registers also extends the SIMD instruction set. Each subsequent step of doubling the width of SIMD registers and the number of SIMD instructions led the ISA down a path of increasing complexity, borne by processor designers, compiler writers, and assembly language programmers. Since 1978, the IA-32 instruction set has grown from 80 instructions to about 1400 instructions, largely driven by SIMD.

An older and, in our opinion, more elegant alternative for exploiting data-level parallelism is the vector architecture. Vector computers gather objects from main memory and place them into long, contiguous vector registers. Pipelined execution units compute very efficiently on these vector registers. Vector architectures then scatter the results from the vector registers back to main memory.

While a simple vector processor may execute one vector element at a time, element-wise operations are by definition independent, so the processor can theoretically compute all elements simultaneously. The widest data for RISC-V is 64 bits, and today’s vector processors typically execute two, four, or eight 64-bit elements per clock cycle. The hardware handles edge/remainder cases when the vector length is not an integer multiple of the number of elements executed per clock cycle. Like SIMD, the vector data width is variable. A vector processor with N 64-bit elements per register can also compute vectors with 2N 32-bit, 4N 16-bit, and 8N 8-bit elements.

Even the simple DAXPY function (below) illustrates our argument.

 void daxpy(size_t n, double a, const double x[], double y[])
 {
   for (size_t i = 0; i < n; i++) {
     y[i] = a*x[i] + y[i];
   }
 }

Figure 1. Number of instructions executed (dynamic) and total number of instructions (static) for DAXPY when n = 1000. The bookkeeping code includes: copying the scalar variable a for SIMD execution, edge code to handle the case when n is not a multiple of the SIMD register width, and skipping the main loop when n is 0. Figures 2-4 and the text in the appendix describe the compiled DAXPY code for each ISA.

Figure 1 summarizes the DAXPY instruction counts for the MIPS-32 SIMD Architecture (MSA), IA-32 AVX2, and RV32V programs. (Figures 2 through 4 in the appendix show the compiler output for RISC-V, MIPS-32, and IA-32.) The SIMD computation code pales in comparison to the bookkeeping code: two-thirds to three-quarters of the MIPS-32 MSA and IA-32 AVX2 code is SIMD overhead, either preparing data for the main SIMD loop or handling edge elements when n is not an integer multiple of the number of floating-point numbers in a SIMD register.

The RV32V code in Figure 2 of the appendix does not require such bookkeeping code, which reduces the static instruction count. Unlike SIMD, RV32V has a vector length register vl, which allows vector instructions to work for any value of n.
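
The effect of vl can be sketched in plain C as a strip-mined loop. This is our illustration, not RV32V intrinsics; mvl stands in for the hardware's maximum vector length. Each pass handles min(mvl, remaining) elements, so no separate fringe code is needed.

 #include <stddef.h>

 /* Sketch of strip-mining with a vector length register; assumes mvl = 64. */
 void daxpy_stripmined(size_t n, double a, const double x[], double y[])
 {
   const size_t mvl = 64;                       /* assumed maximum vector length */
   for (size_t i = 0; i < n; ) {
     size_t vl = (n - i < mvl) ? n - i : mvl;   /* what setvl computes */
     for (size_t j = 0; j < vl; j++)            /* one vector instruction's worth of work */
       y[i + j] = a * x[i + j] + y[i + j];
     i += vl;
   }
 }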

However, the most notable difference between SIMD and vector processing is not static code size. The SIMD code executes 10 to 20 times more instructions than RV32V, because each SIMD loop iteration processes only 2 or 4 elements instead of 64 in the vector case. The extra instruction fetches and decodes mean more energy to perform the same task.

Compared to the scalar version of DAXPY, the results in Figures 3 and 4 in the appendix show that SIMD roughly doubles the instruction count and byte size of the static code while the size of the main loop remains about the same, and reduces the number of dynamic instructions executed by a factor of 2 or 4, depending on the width of the SIMD registers. The RV32V vector code, by contrast, increases static code size by only 1.2x while decreasing the dynamic instruction count by 43x.

As large as the difference in dynamic instruction count is, in our opinion it is only the second most important difference between SIMD and vectors. Worst of all is the rapidly ballooning size of the ISA and of the bookkeeping code. ISAs like MIPS-32 and IA-32, which follow the principle of backward binary compatibility, must duplicate all the old SIMD instructions defined for narrower SIMD registers every time the SIMD width doubles. Hundreds of MIPS-32 and IA-32 instructions have been created over multiple generations of SIMD ISAs, with hundreds more to come, as the recent AVX-512 extensions demonstrate yet again. The cognitive burden that this brute-force approach to data-level parallelism places on an assembly language programmer must be staggering. How do we remember what vfmadd213pd means and when to use it?

In contrast, RV32V code is unaffected by the size of the vector registers. Not only does the RV32V code not change if the vector registers grow, you don't even have to recompile it. Since the processor supplies the maximum vector length mvl, the code in Figure 2 remains the same whether the processor designer doubles or halves the vector registers.

In summary, since the SIMD ISA is hardware-dependent, achieving higher data-level parallelism means changing instruction sets and compilers. One could argue that SIMD violates the design principle of isolating architecture from implementation. In contrast, the vector ISA allows processor designers to choose data-parallel resources for their applications without affecting programmers or compilers.

We believe that the strong contrast in cost-energy-performance, complexity, and ease of programming between RV32V's vector approach and the ever-expanding SIMD architectures of ARMv7, MIPS-32, and IA-32 is one of the most convincing arguments in favor of RISC-V.

Appendix: DAXPY Code and Notes for RV32V, MIPS MSA, and x86 AVX

Figure 2 shows the vector code in RV32V, an optional vector extension to RISC-V. RV32V allows vector registers to be disabled when not in use to reduce the cost of context switching, so the first step is to enable two vector registers holding double-precision floating-point data. We assume a maximum vector length (mvl) of 64 elements in this example.

The first instruction in the loop sets the vector length for subsequent vector instructions. The setvl instruction writes the smaller of mvl and n into both the vector length register vl and register t0. If n is greater than mvl, the fastest the code can process data is 64 values at a time, so vl is set to mvl. If n is less than mvl, then we cannot read or write beyond the ends of x and y, so vl is set to n and only the last n elements are computed in this final iteration.

The remaining instructions in the loop are two vector loads (vld), a multiply-add that does the computation (vfmadd), a vector store (vst), four instructions for address arithmetic (slli, add, sub, add), a loop-completion test (bnez), and the return (ret).

The power of the vector architecture is that each iteration of this 10-instruction loop launches 3×64 = 192 memory accesses and 2×64 = 128 floating-point multiplies and adds (assuming n is at least 64). That averages out to about 19 memory accesses and 13 floating-point operations per instruction. These ratios are an order of magnitude worse for SIMD.

# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
  0: li t0, 2<<25
  4: vsetdcfg t0 # enable 2 64b Fl.Pt. registers
loop:
  8: setvl t0, a0 # vl = t0 = min(mvl, n)
  c: vld v0, a1 # load vector x
  10: slli t1, t0, 3 # t1 = vl * 8 (in bytes)
  14: vld v1, a2 # load vector y
  18: add a1, a1, t1 # increment pointer to x by vl*8
  1c: vfmadd v1, v0, fa0, v1 # v1 += v0*fa0 (y = a*x + y)
  20: sub a0, a0, t0 # n -= vl (t0)
  24: vst v1, a2 # store vector y
  28: add a2, a2, t1 # increment pointer to y by vl*8
  2c: bnez a0, loop # repeat if n != 0
  30: ret # return

Figure 2. DAXPY compiled to RV32V. This loop works for any value of n, including 0; when n = 0, setvl sets vl to 0, which turns the vector operations into nops. RV32V associates data types and sizes with the vector registers, reducing the number of distinct vector instructions.

We now show how quickly SIMD code grows. Figure 3 is the MIPS SIMD Architecture (MSA) version and Figure 4 is the IA-32 version using SSE and AVX2 instructions. (ARMv7 NEON does not support double-precision floating-point arithmetic, so it cannot help with DAXPY.)

Since MSA registers are 128 bits wide, each MSA SIMD instruction can operate on two floating-point numbers. Unlike RV32V, MSA has no vector length register, so it needs additional bookkeeping instructions to check for problematic values of n. When n is odd, extra code computes a single scalar multiply-add, since MSA must operate on pairs of operands. In the unlikely but possible case that n is zero, the branch at address 10 skips the main computation loop. If it does not branch around the loop, the instruction at address 18 (splati.d) puts a copy of a into both halves of SIMD register w2; to use scalar data in SIMD, we must replicate it to fill a SIMD register.

Inside the main loop, two load instructions (ld.d) read two elements each of y and x into SIMD registers w0 and w1, two instructions increment the pointers to y and x (addiu, addiu), the multiply-add (fmadd.d) does the computation, the loop test checks for completion (bne), and the store (st.d) in the branch delay slot saves the result. After the main loop, the code checks whether n is odd; if it is, it performs the final multiply-add using scalar instructions. The last instruction (jr) returns to the call site.
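
The structure the MSA code must follow can be sketched in plain C (our illustration, not MSA intrinsics): a main loop that consumes elements two at a time plus a separate fringe step for an odd final element.

 #include <stddef.h>

 /* Sketch of the 2-wide main loop plus fringe that Figure 3 implements. */
 void daxpy_pairs(size_t n, double a, const double x[], double y[])
 {
   size_t even = n & ~(size_t)1;            /* floor(n/2)*2, like the mask at address 4 */
   for (size_t i = 0; i < even; i += 2) {   /* main loop: 2 elements per iteration */
     y[i]     = a * x[i]     + y[i];
     y[i + 1] = a * x[i + 1] + y[i + 1];
   }
   if (n & 1)                               /* fringe: one scalar multiply-add if n is odd */
     y[n - 1] = a * x[n - 1] + y[n - 1];
 }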

Intel has gone through multiple generations of SIMD extensions, and Figure 4 uses several of them. The SSE extensions to 128-bit SIMD brought the xmm registers and the instructions that use them, while the AVX extensions to 256-bit SIMD created the ymm registers and the instructions that use them.

The first set of instructions, at addresses 0 to 25, loads variables from memory, makes four copies of a in the 256-bit ymm2 register, and tests that n is at least 4 before entering the main loop. (The caption of Figure 4 explains these steps in more detail.)

The main loop performs the core of the DAXPY calculation. The AVX instruction vmovapd at address 27 loads 4 elements of x into ymm0. The AVX instruction vfmadd213pd at address 2c multiplies the 4 copies of a (ymm2) by the 4 elements of x (ymm0), adds the 4 elements of y (in memory at address ecx + edx*8), and places the 4 sums into ymm0. The AVX instruction vmovapd at address 32 then stores the 4 results into y. The next three instructions increment the counter and repeat the loop as needed.

Similar to the code for the MIPS MSA, the “edge” code between addresses 3e and 57 handles the case where n is not a multiple of 4.
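
For comparison, the same main-loop-plus-fringe structure can be sketched with AVX2/FMA intrinsics (our sketch, not the compiler output shown in Figure 4; it uses unaligned loads and stores for simplicity and must be compiled with AVX2 and FMA enabled).

 #include <immintrin.h>
 #include <stddef.h>

 /* Sketch only: a 4-wide AVX2/FMA main loop, a broadcast of a, and a scalar fringe loop. */
 void daxpy_avx2(size_t n, double a, const double x[], double y[])
 {
   __m256d va = _mm256_set1_pd(a);            /* 4 copies of a (cf. vmovddup/vinsertf128) */
   size_t i = 0;
   for (; i + 4 <= n; i += 4) {               /* main loop: 4 doubles per iteration */
     __m256d vx = _mm256_loadu_pd(&x[i]);
     __m256d vy = _mm256_loadu_pd(&y[i]);
     _mm256_storeu_pd(&y[i], _mm256_fmadd_pd(va, vx, vy));
   }
   for (; i < n; i++)                         /* fringe loop: any remaining 1-3 elements */
     y[i] = a * x[i] + y[i];
 }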

# a0 is n, a2 is pointer to x[0], a3 is pointer to y[0], $w13 is a
  0: li a1,-2
  4: and a1,a0,a1 # a1 = floor(n/2)*2 (mask bit 0)
  8: sll t0,a1,0x3 # t0 = byte address of a1
  c: addu v1,a3,t0 # v1 = &y[a1]
 10: beq a3,v1,38 # if y== &y[a1] goto Fringe
                                # (t0==0 so n is 0 | 1)
 14: move v0,a2 # (delay slot) v0 = &x[0]
 18: splati.d $w2,$w13[0] # w2 = fill SIMD reg. with copies of a
Main Loop:
 1c: ld.d $w0,0(a3) # w0 = 2 elements of y
 20: addiu a3,a3,16 # incr. pointer to y by 2 FP numbers
 24: ld.d $w1,0(v0) # w1 = 2 elements of x
 28: addiu v0,v0,16 # incr. pointer to x by 2 FP numbers
 2c: fmadd.d $w0,$w1,$w2 # w0 = w0 + w1 * w2
 30: bne v1,a3,1c # if (end of y != ptr to y) go to Loop
 34: st.d $w0,-16(a3) # (delay slot) store 2 elts of y
Fringe:
 38: beq a1,a0,50 # if (n is even) goto Done
 3c: addu a2,a2,t0 # (delay slot) a2 = &x[n-1]
 40: ldc1 $f1,0(v1) # f1 = y[n-1]
 44: ldc1 $f0,0(a2) # f0 = x[n-1]
 48: madd.d $f13,$f1,$f13,$f0 # f13 = f1 + f0*f13 (muladd if n is odd)
 4c: sdc1 $f13,0(v1) # y[n-1] = f13 (store odd result)
Done:
 50: jr ra # return
 54: nop # (delay slot)

Figure 3. DAXPY compiled to MIPS32 MSA. The bookkeeping overhead of SIMD is apparent when comparing this code to the RV32V code in Figure 2. The first part of the MIPS MSA code (addresses 0 to 18) copies the scalar variable a into a SIMD register and checks that n is at least 2 before entering the main loop. The third part of the MIPS MSA code (addresses 38 to 4c) handles the edge case when n is not a multiple of 2. No such bookkeeping code is needed in RV32V, because the vector length register vl and the setvl instruction handle it. This code was generated by gcc with flags set to produce small code.

# eax is i, n is esi, a is xmm1,
# pointer to x[0] is ebx, pointer to y[0] is ecx
 0: push esi
 1: push ebx
 2: mov esi,[esp + 0xc] # esi = n
 6: mov ebx,[esp + 0x18] # ebx = x
 a: vmovsd xmm1,[esp + 0x10] # xmm1 = a
 10: mov ecx,[esp + 0x1c] # ecx = y
 14: vmovddup xmm2,xmm1 # xmm2 = {a,a}
 18: mov eax,esi
 1a: and eax,0xfffffffc # eax = floor(n/4)*4
 1d: vinsertf128 ymm2,ymm2,xmm2,0x1 # ymm2 = {a,a,a,a}
 23: je 3e # if n < 4 goto Fringe
 25: xor edx,edx # edx = 0
Main Loop:
 27: vmovapd ymm0,[ebx + edx*8] # load 4 elements of x
 2c: vfmadd213pd ymm0,ymm2,[ecx + edx*8] # 4 mul adds
 32: vmovapd [ecx + edx*8],ymm0 # store into 4 elements of y
 37: add edx,0x4
 3a: cmp edx,eax # compare to n
 3c: jb 27 # repeat loop if < n
Fringe:
 3e: cmp esi,eax # any fringe elements?
 40: jbe 59 # if (n mod 4) == 0 go to Done
Fringe Loop:
 42: vmovsd xmm0,[ebx + eax*8] # load element of x
 47: vfmadd213sd xmm0,xmm1,[ecx + eax*8] # 1 mul add
 4d: vmovsd [ecx + eax*8],xmm0 # store into element of y
 52: add eax,0x1 # increment Fringe count
 55: cmp esi,eax # compare Loop and Fringe counts
 57: jne 42 # repeat Fringe Loop if i != n
Done:
 59: pop ebx # function epilogue
 5a: pop esi
 5b: ret

Figure 4. DAXPY compiled to IA-32 SSE and AVX2. The SSE instruction vmovsd at address a loads a into half of the 128-bit xmm1 register. The SSE instruction vmovddup at address 14 copies a into both halves of xmm2 for later SIMD computation. The AVX instruction vinsertf128 at address 1d starts with the two copies of a in xmm2 and makes four copies of a in ymm2. The three AVX instructions at addresses 42 to 4d (vmovsd, vfmadd213sd, vmovsd) handle the case when mod(n,4) ≠ 0. They perform the DAXPY calculation one element at a time, and the loop repeats until the function has performed exactly n multiply-add operations. Once again, such code is unnecessary for RV32V, because the vector length register vl and the setvl instruction make the loop work for any value of n.

About the author: David Patterson is a professor at the Graduate School of Computer Science at UC Berkeley and a Distinguished Engineer at Google. Andrew Waterman is Chief Engineer at SiFive and holds a Ph.D. in Computer Science from UC Berkeley.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions expressed in this blog are personal, those of the blog author and do not necessarily represent those of ACM SIGARCH or its parent organization, ACM.