Computer architecture instruction pipeline execution analysis

Topic:

Chp3 Lab: Analysis of Instruction Execution Pipelining

Problem description:

Using the following code fragment:

LOOP: LW R1,0(R2) ; load R1 from address 0 + R2

ADDI R1,R1,#1; R1=R1 + 1

SW 0(R2),R1; store R1 at address 0 + R2

ADDI R2,R2,#4; R2=R2 + 4

SUB R4,R3,R2; R4=R3-R2

BNEZ R4,Loop ; branch to loop if R4!=0

Assume that the initial value of R3 is R2 + 396.

Analyze (a) by showing the timing of the first two iterations and calculate the total cycles needed; implement (b) and (c) in WinMIPS64. Execute the code in WinMIPS64.

(a) Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle “forwards” through the register file. Assume that the branch is handled by flushing the pipeline. If all memory references take 1 cycle, how many cycles does this loop take to execute?

(b) Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?

(c) Assume the RISC pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware. Schedule the instructions in the loop including the branch delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop. Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

Problem analysis:

Before starting the experiment, we need to make it clear that the code in the question is 32-bit code and obviously cannot run on 64-bit WinMIPS64. So we need to convert the above code as follows:

The difference between 64-bit and 32-bit (32-bit w: word —– 64-bit d: double)

Operation class: add becomes dadd, and similarly sub/mul/div becomes dsub/dmul/ddiv

Memory access: lw becomes ld, sw becomes sd

Register name: S0/S1… all become r0/r1/… / r31

But there is still a problem. I don’t see the $ symbol to represent the register, but I see the # symbol to represent immediate numbers. I checked the information and found that the use of # should be relatively rare. To be on the safe side, we removed the # so that WinMIPS64 can run with peace of mind. I’ll add a HALT to stop the program just in case. If it is not a DLX architecture, in the question, the source operand and destination operand of the statement SW 0(R2), R1 are reversed (an error will be reported in WinMIPS64), and it should be SW R1,0 (R2), I checked a lot of information and found that this question is copied everywhere on the Internet and is inverted (except for the real questions of the Ph.D. Entrance Examination of Huazhong University of Science and Technology). They seem to be able to run in WinMIPS64 , but I can only say that the acting is very real. In other words, this is correct in their instruction set, at least it cannot run in WinMIPS64. We all know that assembly code is generally not case-sensitive, so let’s express it in all capital letters, otherwise it will look nondescript.

Pipeline structure: Instruction memory and data memory exist by default (instruction fetching and data fetching in IF and MEM no longer conflict, and the structural risk in this part can be ignored)

The converted code is as follows:

.text
       DADDI R3,R2,396;the initial value of R3 is R2 + 396
LOOP:
       LD R1,0(R2) ; load R1 from address 0 + R2
       DADDI R1,R1,1 ; R1=R1 + 1
       SD R1,0(R2); store R1 at address 0 + R2
       DADDI R2,R2,4; R2=R2 + 4
       DSUB R4,R3,R2; R4=R3-R2
       BNEZ R4,LOOP ; branch to loop if R4!=0
       HALT

(a)Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle “forwards” through the register file . Assume that the branch is handled by flushing the pipeline. If all memory references take 1 cycle, how many cycles does this loop take to execute?

Let’s first assume that all memory accesses are hits (does the simulator have this particularity? Just be more rigorous).

  • Explanation of the problem:

Requirements: There is no forwarding (also known as forwarding) or bypassing (Bypassing). Fundamentally speaking, forwarding and bypassing refer to the same thing. It’s just that the angles of observation and description are different. Forward pass is described from the perspective of instruction execution sequence, while bypass is described from the perspective of circuit structure, and this a register read and a write in the same clock cycle “forwards” through the register file refers to W, There is no contradiction in reading and writing in the register file in phase B, because it is written in the first half of the cycle and read in the second half of the cycle. In this question, we are purely manually simulating using the usual architecture. The EX phase in BNEZ performs judgment, and the MEM phase transfers the transfer address to the PC. This is different from the architecture in the subsequent simulator. In the simulation, the BNEZ instruction ID stage performs two operations, namely judgment and transfer of instructions to the PC.

(1) What are forward push and bypass?

Forwarding: One step further than assembly line stalling. The Stall scheme is like a convoy driving on the road. When they pass through the middle of the intersection, the red light turns on. According to the traffic rules, the cars that have crossed the line can still pass through the intersection, while the cars behind the line can only Being able to wait in a queue (assuming there is only one road), before the light turns green, the car behind will never reach the position of the car in front (unless it flies up), they all have to wait there, and luckily The car that has passed the line may have already crossed the Ten Thousand Mountains by boat. And operand push forward is like a sprint relay race. The latter athlete can jump ahead and the former athlete will run a little longer and take the initiative to pass the baton to him. In the pipeline, each instruction moving in the pipeline is a player, and each wall is the pipeline register of that stage. Pushing forward will

Bypassing: That is, the execution result of the first instruction is directly “forwarded” to the ALU of the second instruction as input. In order to realize the “forwarding” here, we need to pull out a separate signal transmission line in the CPU hardware so that the calculation results of the ALU can be returned to the input of the ALU. Such a line is our “bypass”. It bypasses the process of writing to the register and reading from the register, which also saves us 2 clock cycles.

(2) What is assembly line flushing? (flushing the pipeline)

Branch instructions usually decide whether to jump in the EX stage. The two instructions following the branch instruction will be evaluated and executed. Without intervention, these two subsequent instructions will start executing before the beq instruction jumps, which is what we don’t expect. Therefore, in the EX phase of the BEQ instruction, the last two instructions enter the fetch and decode stages respectively. If you want to jump, you need to clear the last two instructions. Otherwise, letting the two instructions continue to execute will change the value of the register and thus Leading to wrong results, this is the pipeline flushing mechanism.

Flushing out the first two levels of pipeline registers can be achieved by inserting nop instructions (bubbles). When the EX stage of the BEQ instruction is judged to jump, a clear signal is given to the IF_ID and ID_EX pipeline registers so that they output 0 in the next cycle. At the same time, the control signal in the ID_EX pipeline register must also be cleared.

  • Problem Resolution:

The pipeline space-time diagram is as follows:

Interpretation of space-time diagram:

The blue circle represents write first and read later in the register file, which can effectively solve the conflict between the W and B stages in the pipeline register. Other notes are shown in the image above.

It can be seen that the second iteration starts at 18cycles. According to the total calculation, this loop structure will perform 396/4=99 cycles. However, during the calculation process, we regard the previous 98 cycles as each cycle body of 17 cycles, and the last cycle as 18 cycles. This can ensure that the cycles are not repeated or leaked, and the calculation is simple.

Then the total pipeline length is =17*98 + 18=1684 clock cycles

(b) Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?

  • Explanation of the problem:

(1), forward delivery and bypass have been explained above, as above.

(2) Treat every branch as not taken predict-not-taken:

For each transfer instruction, before making a judgment, we will treat it as not transferring, and continue to execute the statements immediately below. Until it can be judged whether to transfer, we will make the corresponding transfer or continue.

(3) In WinMIPS64, branch transfer is predicted not to be transferred by default, so we only need to set the forward function.

  • Problem Resolution:

Manual analysis:

Since the question gives too little information, we have to assume here that the ID stage does not add a small adder and a 0 comparator.

Interpretation of manual space-time diagram:

The blue box represents writing first and then reading in the register file, which can effectively solve the conflict between the W and D phases. The W/D phases of other instructions also overlap, but the registers involved in writing and reading are different, (there are The internal reading and writing of the hardware may be two ports) here there is only time sequence, the performance of data risk is too small, it is more of a structural risk, we have too many marks for each, so we will not mark it here. Other notes are shown in the image above.

WinMIPS64 emulation:

The results of the simulator simulation made me a little puzzled at first. As far as I can roughly understand temporarily, handwriting and simulation are equivalent, but the simulator has its own design ideas. In my own hand-drawn space-time diagram, I want the ADDI instruction to pause for a cycle before entering the EX stage after the ID stage is executed; and the simulator ADDI instruction directly enters the EX stage after the ID stage is executed, but the EX at this time The stage is invalid, and no actual action is performed. It is a state of state latch, and then it is still the EX stage. At this time, the EX stage really waits for the data and executes it. In fact, the EX entered for the first time The stage is a pause, but the simulator classifies it as the EX stage. I think it’s a bit uneconomical because three instructions are paused at the same time, while the space-time diagram I drew by hand only requires two instructions. It’s better to use the timing diagram I pushed by hand.

Manual drawing results:

(BNEZ’s judgment is still in the EX stage, and the address of the next instruction will be available only after the M stage)

It can be seen that the second iteration starts at 11cycles. According to the total calculation, this loop structure will perform 396/4=99 cycles. However, during the calculation process, we regard the previous 98 cycles as each cycle body of 10 cycles, and the last cycle as 11 cycles. This can ensure that the cycles are not repeated or leaked, and the calculation is simple.

Then the total pipeline length is =10*98 + 11=991 clock cycles

Simulator results:

(Obviously, a small adder and a 0 comparator are added to the ID stage of BNEZ in the simulator) This is different from the question requirements and is also the reason why our answer is different, but it does not hinder our analysis.

Plus R3’s preprocessing instructions and HALT instructions: 9*98 + 12 + 1 + 1=896 clock cycles

Ignore R3’s preprocessing instructions and HALT instructions (minus two 1’s): 9*98 + 12=894 clock cycles

The reason for the difference between the two 1s is that the R3 preprocessing instruction and the HALT instruction each cause the pipeline cycle to increase by 1.

The first 98 cycles of the cycle are regarded as 9 cycles, and the last cycle is regarded as 12 cycles, which is a complete cycle.

(c) Assume the RISC pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware. Schedule the instructions in the loop including the branch delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop. Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

  • Explanation of the problem:

(1), forward delivery and bypass have been explained above, as above.

(2), that is, only one statement (single-cycle delayed branch) is placed in the delay slot:

Statements in the delay slot must be executed

(3) In WinMIPS64, branch transfer is predicted not to be transferred by default, so we only need to set the forward function.

  • Problem Solution:

(1) First, reorganize the order of statements, act as the compiler yourself, and put the above statement into the delay slot.

Adjustments are as follows:

This statement under the BNEZ instruction (the statement in the red circle) serves as the content of the delay slot.

This statement in the delay slot is the same as BNEZ statement is not dependent, its improvement is permanent, we found that no matter Whether to judge whether to transfer or not to transfer, the effect will be improved and it is the best strategy (From before).

Manual timing diagram:

In this case, a small adder and a 0 comparator are added to the ID stage, so the branch transfer instruction knows whether to transfer after the ID, and the next transfer address.

Interpretation of manual timing diagram:

WinMIPS64 emulation:

The above-mentioned space-time diagram is in a perfect state. It can be said that it can hardly be optimized anymore. It has no pause and is in the shape of steps.

Manual drawing results:

It can be seen that the second iteration starts at 7cycles. According to the total calculation, this loop structure will perform 396/4=99 cycles. However, during the calculation process, we regard the previous 98 cycles as each cycle body of 6 cycles, and the last cycle as 10 cycles. This can ensure that the cycles are not repeated or leaked, and the calculation is simple.

Then the total pipeline length is =6*98 + 10=598 clock cycles

Simulator results:

(Obviously, a small adder and a 0 comparator are added to the ID stage of BNEZ in the simulator) This is different from the question requirements and is also the reason why our answer is different, but it does not hinder our analysis.

Plus R3’s preprocessing instructions and HALT instructions: 6*98 + 10 + 1 + 1=600 clock cycles

Ignore R3’s preprocessing instructions and HALT instructions (minus two 1’s): 6*98 + 10=598 clock cycles

The picture on the left is a diagram of a single loop, and the picture on the right is a diagram of the complete program operation.

At this time, the simulator and manual results are surprisingly consistent! ! (Of course, this depends on the preprocessing instructions and HALT instructions of R3)