LoongArch instruction set pipeline design

Simple version of the pipeline
  • Overall idea of the pipeline: my own understanding
    Divide instruction execution into several stages (a five-stage pipeline: instruction fetch, decode, execute, memory access, write back), with each stage doing its own job (generating the corresponding control signals and completing its own work). Because instructions have to flow from stage to stage, the four pipeline registers between the stages play two roles: they separate adjacent stages, and they carry information forward. A stage must not only pass on the signals it inherited for later stages but also add the ones it generates itself, because different instructions may generate signals for later stages at different points in the pipeline.
IF-stage

The if_stage module is responsible for fetching instructions in each clock cycle, calculating the address of the next instruction, processing the logic of the branch target, and passing the obtained instruction data to the ID stage to support instruction decoding and execution in the next stage.

  • fs_allowin signal generation logic
assign fs_allowin = !fs_valid || (fs_ready_go && ds_allowin);

!fs_valid: if the IF stage does not currently hold a valid instruction (fs_valid is false), fs_allowin is true. This is typically the case right after the CPU starts or is reset: the IF stage is empty, so it can accept a new instruction from the pre-IF logic.

(fs_ready_go && ds_allowin): if the instruction in the IF stage has finished its work (fs_ready_go is true) and the ID stage can accept it (ds_allowin is true), fs_allowin is also true. The current instruction moves on to the ID stage in this cycle, so the IF stage is free to take a new one.
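
The next-address calculation mentioned for the IF stage can be sketched roughly as follows. This is only a minimal sketch: the names br_taken and br_target are assumptions (here they stand for fields unpacked from the branch information sent by the ID stage), and the module wrapper exists only to make the fragment self-contained.

```verilog
// Minimal sketch of next-PC selection in the pre-IF logic.
// br_taken / br_target are illustrative names, not taken from the original code.
module nextpc_sketch (
    input  wire [31:0] fs_pc,      // PC of the instruction currently in IF
    input  wire        br_taken,   // branch or jump is taken (decided in ID)
    input  wire [31:0] br_target,  // branch or jump target (computed in ID)
    output wire [31:0] nextpc      // address of the next instruction to fetch
);
    // Sequential address: one 32-bit instruction after the current PC.
    wire [31:0] seq_pc = fs_pc + 32'h4;

    // A taken branch overrides the sequential address.
    assign nextpc = br_taken ? br_target : seq_pc;
endmodule
```

With fs_pc reset to 32'hbfbffffc, the sequential address is 32'hbfc00000, which is why that reset value makes the first fetch start at 0xbfc00000.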

ID-stage

The ID stage module performs instruction decoding, data path control, data dependency detection, and pipeline control, so that instructions execute correctly and move smoothly to the next stage. This module also handles branches, register file access, destination register selection, and data hazards.

EXE-stage

The execute stage module performs the arithmetic and logical operations of instructions, including ALU calculations and data memory access, and also involves data path control and pipeline stage control. It prepares the data to be passed to the memory access stage. It also deals with the destination register, data dependencies, immediate values, and operand selection.

MEM-stage

The memory access stage module is mainly responsible for data memory accesses, including reading data from the data memory and preparing the data to be passed to the next stage. It also carries the destination register and selects the data to pass on from either the execute stage result or the data memory result. Pipeline stage control ensures that the data is passed to the next stage at the right time.

WB-stage

The write-back stage module controls writing data back to the register file. This includes determining when to perform a write-back and what data to write. Control and data transfer are clocked, ensuring that the correct data is written back.

Pipeline: one of the secrets of making it flow
  • if-stage
always @(posedge clk) begin
    if (reset) begin
        fs_valid <= 1'b0;
    end
    else if (fs_allowin) begin
        fs_valid <= to_fs_valid;
    end

    if (reset) begin
        fs_pc <= 32'hbfbffffc; //trick: to make nextpc be 0xbfc00000 during reset
    end
    else if (to_fs_valid && fs_allowin) begin
        fs_pc <= nextpc;
    end
end
  • id-stage
always @(posedge clk ) begin
    if (reset) begin
        ds_valid <= 1'b0;
    end
    else if (ds_allowin) begin
        ds_valid <= fs_to_ds_valid;
    end
end
always @(posedge clk) begin
    if (fs_to_ds_valid && ds_allowin) begin
        fs_to_ds_bus_r <= fs_to_ds_bus;
    end
end
  • exe-stage
always @(posedge clk) begin
    if (reset) begin
        es_valid <= 1'b0;
    end
    else if (es_allowin) begin
        es_valid <= ds_to_es_valid;
    end

    if (ds_to_es_valid && es_allowin) begin
        ds_to_es_bus_r <= ds_to_es_bus;
    end
end
  • mem-stage
always @(posedge clk) begin
    if (reset) begin
        ms_valid <= 1'b0;
    end
    else if (ms_allowin) begin
        ms_valid <= es_to_ms_valid;
    end

    if (es_to_ms_valid && ms_allowin) begin
        es_to_ms_bus_r <= es_to_ms_bus; // non-blocking, like the other pipeline registers
    end
end
  • wb-stage
always @(posedge clk) begin
    if (reset) begin
        ws_valid <= 1'b0;
    end
    else if (ws_allowin) begin
        ws_valid <= ms_to_ws_valid;
    end

    if (ms_to_ws_valid && ws_allowin) begin
        ms_to_ws_bus_r <= ms_to_ws_bus;
    end
end
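
All five stages above follow the same pattern, which can be summarized in one generic stage. The sketch below is an illustration under assumptions, not the original code: the names prefixed with this_, prev_, and next_ and the BUS_WD parameter are hypothetical, and the outgoing valid signal is assumed to be valid && ready_go.

```verilog
// Generic sketch of one pipeline stage's handshake and pipeline register.
// All port names and BUS_WD are hypothetical; they mirror the fs_/ds_/es_/ms_/ws_ pattern.
module pipe_stage_sketch #(
    parameter BUS_WD = 64
) (
    input  wire              clk,
    input  wire              reset,
    input  wire              prev_to_this_valid, // previous stage offers a valid instruction
    input  wire [BUS_WD-1:0] prev_to_this_bus,   // data from the previous stage
    input  wire              this_ready_go,      // this stage has finished its work
    input  wire              next_allowin,       // next stage can accept an instruction
    output wire              this_allowin,       // this stage can accept an instruction
    output wire              this_to_next_valid, // this stage offers a valid instruction
    output reg  [BUS_WD-1:0] bus_r               // pipeline register contents
);
    reg this_valid;

    // Accept when empty, or when the current instruction leaves this cycle.
    assign this_allowin       = !this_valid || (this_ready_go && next_allowin);
    // Assumption: an instruction is offered only when it is valid and finished.
    assign this_to_next_valid = this_valid && this_ready_go;

    always @(posedge clk) begin
        if (reset) begin
            this_valid <= 1'b0;
        end
        else if (this_allowin) begin
            this_valid <= prev_to_this_valid;
        end

        // Capture data only when a valid instruction is actually accepted.
        if (prev_to_this_valid && this_allowin) begin
            bus_r <= prev_to_this_bus;
        end
    end
endmodule
```

The valid bit is the only state that needs a reset; the data register can hold stale contents because it is ignored whenever the valid bit is 0.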

Handling pipeline hazards caused by register read-after-write

A register read-after-write hazard arises in the following scenario: the instruction that produces a result has not yet written it back to the register file, while the instruction that needs the result is already in the decode stage. At that moment, the only value that can be read from the general register file is the old value, not the new one.
An intuitive solution is to make the instruction that needs the result wait in the decode stage until the instruction that produces the result has written it back to the general register file, and only then let it enter the execute stage.

  • Design ideas
    The key is generating the condition that decides whether the instruction in the decode stage advances or stalls, that is, determining whether instructions in different pipeline stages have a “read-after-write” dependency that causes a hazard. Specifically: if the instruction in the decode stage has a source operand that comes from a non-zero register, and the register number of any such source operand equals the destination register number of the instruction currently in the execute, memory access, or write-back stage, then there is a “read-after-write” dependency between the decode-stage instruction and one of the three instructions ahead of it. One implicit detail that beginners easily overlook is that both register numbers being compared must be valid. Three things need checking:
  1. Does the instruction taking part in the comparison actually have a register source operand or a register destination operand? ADDIU has only rs as a register source operand, JAL has no register source operand, and BNE, BEQ, and JR have no register destination operand.
  2. If the instruction does have a register source or destination operand but the register number is 0, no comparison is needed. The value of register 0 under the MIPS architecture is always 0.
  3. Is there actually an instruction in the pipeline stage being compared against? If not, the register number, instruction type, and other information held in that stage are invalid.
    How is the instruction blocked in the decode stage once the condition is true? Only the ready_go signal of the decode stage needs to be adjusted.
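
The stall condition described above can be sketched as follows. This is a hedged illustration: the port names (src1_is_reg, es_gr_we, es_dest, and so on) are assumptions chosen to match the naming style of the pipeline code, not signals confirmed by the original design.

```verilog
// Sketch of read-after-write detection that drives ds_ready_go (stall-only version).
// All port names are illustrative assumptions.
module raw_stall_sketch (
    input  wire       src1_is_reg,  // decode instruction really reads register source 1
    input  wire       src2_is_reg,  // decode instruction really reads register source 2
    input  wire [4:0] rf_raddr1,    // register number of source operand 1
    input  wire [4:0] rf_raddr2,    // register number of source operand 2
    input  wire       es_valid,     // execute stage holds an instruction
    input  wire       es_gr_we,     // ...and that instruction writes a register
    input  wire [4:0] es_dest,
    input  wire       ms_valid,
    input  wire       ms_gr_we,
    input  wire [4:0] ms_dest,
    input  wire       ws_valid,
    input  wire       ws_gr_we,
    input  wire [4:0] ws_dest,
    output wire       ds_ready_go   // 0 blocks the instruction in the decode stage
);
    // Checks 1 and 2: the operand exists and is not register 0.
    wire src1_valid = src1_is_reg && (rf_raddr1 != 5'd0);
    wire src2_valid = src2_is_reg && (rf_raddr2 != 5'd0);

    // Checks 1 to 3 for the producer side: stage occupied, writes a register, not register 0.
    wire es_dest_valid = es_valid && es_gr_we && (es_dest != 5'd0);
    wire ms_dest_valid = ms_valid && ms_gr_we && (ms_dest != 5'd0);
    wire ws_dest_valid = ws_valid && ws_gr_we && (ws_dest != 5'd0);

    wire es_hit = es_dest_valid && ((src1_valid && (rf_raddr1 == es_dest)) ||
                                    (src2_valid && (rf_raddr2 == es_dest)));
    wire ms_hit = ms_dest_valid && ((src1_valid && (rf_raddr1 == ms_dest)) ||
                                    (src2_valid && (rf_raddr2 == ms_dest)));
    wire ws_hit = ws_dest_valid && ((src1_valid && (rf_raddr1 == ws_dest)) ||
                                    (src2_valid && (rf_raddr2 == ws_dest)));

    // Any hit keeps the instruction in the decode stage.
    assign ds_ready_go = !(es_hit || ms_hit || ws_hit);
endmodule
```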
  • Special case
    Load-to-Branch means that instruction i is a Load and instruction i + 1 is a transfer (branch or jump) instruction, where at least one source register of the transfer instruction is the same as the destination register of the Load, that is, there is a read-after-write dependency. In this case, the transfer calculation cannot be completed yet.
    At this point, the nextPC generated from the transfer information (br_bus) sent from the decode stage to the fetch stage is incorrect, so a control signal br_stall should be added to br_bus to indicate that the transfer calculation is not complete. In addition, a ready_go signal must be added for pre-IF. When br_stall is 1, this combinational ready_go signal is set to 0, which makes to_fs_valid 0; when the IF stage sees to_fs_valid at 0 while its allowin is 1, the sequential fs_valid is set to 0. Furthermore, nextPC is sent to the address port of the instruction RAM, so while the transfer calculation is incomplete, the read enable of the instruction RAM should be held at 0.
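
The br_stall handling described above might look roughly like this. The sketch is based on assumptions: is_branch, load_raw_hit, pre_if_ready_go, and inst_sram_en are illustrative names, and gating to_fs_valid with reset is assumed rather than taken from the original code.

```verilog
// Sketch of br_stall and the pre-IF ready_go it controls.
// is_branch, load_raw_hit, pre_if_ready_go and inst_sram_en are hypothetical names.
module br_stall_sketch (
    input  wire reset,
    input  wire ds_valid,         // decode stage holds an instruction
    input  wire is_branch,        // ...and it is a branch or jump
    input  wire load_raw_hit,     // one of its sources is the destination of a Load in EXE
    input  wire fs_allowin,       // IF stage can accept a new instruction
    output wire br_stall,         // transfer calculation not complete (sent inside br_bus)
    output wire pre_if_ready_go,
    output wire to_fs_valid,
    output wire inst_sram_en      // read enable of the instruction RAM
);
    assign br_stall        = ds_valid && is_branch && load_raw_hit;

    // While the transfer target is unknown, pre-IF has nothing valid to hand over.
    assign pre_if_ready_go = !br_stall;
    assign to_fs_valid     = !reset && pre_if_ready_go;

    // nextPC is wrong during br_stall, so do not read the instruction RAM.
    assign inst_sram_en    = to_fs_valid && fs_allowin;
endmodule
```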

Forwarding technique

The picture posted here is not an ideal fit, but it should convey the idea.
When designing forwarding paths, the ADDU instruction stands for all instructions that produce their result in the execute stage. The LW instruction does not belong to this category: it cannot produce its result until the memory access stage. On closer analysis, however, LW can fully reuse the forwarding paths added at the memory access and write-back stages for forwarding ADDU results. A forwarding path only delivers a result to its destination, so the register number and the value are all it needs, regardless of what the instruction that produced the result does.
There are two possible design options:

  1. The path starts at the ALU result output in the execute stage and ends at the logic that produces the register file read result in the decode stage.
  2. The path starts at the Q output of the flip-flop that holds the ALU result in the memory access stage pipeline register and ends at the logic that produces the ALU input data in the execute stage.
    Option 1 is better here. For the memory access stage, the forwarding path starts at the output of the selector that picks between the data RAM return value and the ALU result saved in the memory access stage pipeline register. There is a further reason to end the paths at the decode stage: to resolve control hazards, all transfer instructions are handled in the decode stage. Among the transfer instructions implemented so far, BEQ, BNE, and JR have register source operands. If one of them has a read-after-write dependency on an earlier instruction, it must stall until the value is written to the register file, unless the forwarding path ends at the decode stage.
  • Adjust the logic the decode stage uses to produce register read results. Structurally, the value of register source operand 1 of the decode-stage instruction can come from read port 1 of the general register file, or from the results forwarded from the execute, memory access, and write-back stages. Likewise, the value of register source operand 2 can come from read port 2 of the general register file, or from the results forwarded from those three stages. So two four-to-one selectors need to be added, and they must be selectors with priority. Which forwarded result is chosen depends not only on whether the register number of the source operand matches the register number of a forwarded result, but also on the priority among the pipeline stages.
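
One of the two priority selectors can be sketched as below; the second one is identical apart from using read port 2. This is a minimal sketch under assumptions: the port names are hypothetical, and each *_fwd_valid input is assumed to already mean "this stage holds a valid instruction that writes a register and its result is available".

```verilog
// Sketch of a four-to-one priority selector for one register source operand.
// Port names are illustrative; *_fwd_valid is assumed to combine valid and write-enable.
module forward_mux_sketch (
    input  wire [4:0]  rf_raddr,      // register number of the source operand
    input  wire [31:0] rf_rdata,      // value read from the general register file
    input  wire        es_fwd_valid,
    input  wire [4:0]  es_dest,
    input  wire [31:0] es_result,     // ALU result from the execute stage
    input  wire        ms_fwd_valid,
    input  wire [4:0]  ms_dest,
    input  wire [31:0] ms_result,     // selected data RAM or ALU result from the memory access stage
    input  wire        ws_fwd_valid,
    input  wire [4:0]  ws_dest,
    input  wire [31:0] ws_result,     // value about to be written back
    output wire [31:0] src_value      // operand value used by the decode stage
);
    // Register 0 never needs forwarding.
    wire src_nonzero = (rf_raddr != 5'd0);

    wire es_hit = es_fwd_valid && src_nonzero && (rf_raddr == es_dest);
    wire ms_hit = ms_fwd_valid && src_nonzero && (rf_raddr == ms_dest);
    wire ws_hit = ws_fwd_valid && src_nonzero && (rf_raddr == ws_dest);

    // The execute stage holds the youngest producer, so it has the highest priority.
    assign src_value = es_hit ? es_result :
                       ms_hit ? ms_result :
                       ws_hit ? ws_result :
                                rf_rdata;
endmodule
```

The priority order matters when several in-flight instructions write the same register: the value closest to the decode stage in program order is the newest one. A Load in the execute stage has no result yet, so that case still has to be handled by stalling.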
Structure overview: 19 MIPS instructions + stalling + forwarding (partially incomplete)

SoC structure

Control peripherals

LoongArch accesses peripherals through MMIO: the peripheral registers are mapped directly into the address space, and the CPU operates on them with ld/st instructions. The data path for the CPU to access peripherals is:

  • The pipeline initiates a memory access request to the data SRAM bus (CPU internal logic)
  • When the memory access request arrives at the 1x2 bridge, the bridge arbitrates and sends the request to the bus connected to confreg
  • confreg receives the request, processes it internally, and changes the level of its output ports (confreg.v)
  • The output ports are bound to chip pins through the constraint file to control specific peripherals.
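
The routing step at the bridge in the data path above can be sketched as a simple address decode. This is only an illustration: the confreg address window (base and mask) and all port names here are hypothetical placeholders, so check the actual bridge source for the real values.

```verilog
// Sketch of the request routing done by a 1x2 bridge.
// CONF_ADDR_BASE / CONF_ADDR_MASK and the port names are hypothetical placeholders.
module bridge_select_sketch (
    input  wire        cpu_data_en,    // CPU issues a data access
    input  wire [31:0] cpu_data_addr,  // address of the access
    output wire        data_sram_en,   // request goes to the data RAM
    output wire        conf_en         // request goes to confreg
);
    localparam [31:0] CONF_ADDR_BASE = 32'h1faf_0000;
    localparam [31:0] CONF_ADDR_MASK = 32'hffff_0000;

    // The access targets confreg when the upper address bits match its window.
    wire sel_conf = ((cpu_data_addr & CONF_ADDR_MASK) == CONF_ADDR_BASE);

    assign conf_en      = cpu_data_en &&  sel_conf;
    assign data_sram_en = cpu_data_en && !sel_conf;
endmodule
```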
    Now for the software part. Since peripherals are operated through ld/st instructions, we need to know the address being accessed and the meaning of the data being read and written.