FPGA finds the maximum value and the second maximum value

Click the blue words to follow us

Follow and star the public account, and exciting content will be delivered every day
Source: Internet material

Implement a module on FPGA to find the maximum value and second maximum value among 32 inputs. 32 inputs are given by one clock cycle. (The questions are from forums and interview questions. If you feel they are inappropriate, please leave a message and delete them)

From my personal point of view, this is a good interview question:

First, this is probably a simplification of the problems encountered in the implementation of certain machine learning algorithms, and it is a very meaningful question;

The second is that this question not only requires FPGA code capabilities, but also has many possibilities that can be optimized algorithmically;

Of course, the input bit width may affect the final problem-solving ideas and final implementation possibilities. But within a certain range of bit width, such as 8 or 32, the solution should be the same, but it will affect the final frequency. The following article will provide a detailed analysis of this topic. (The question does not explain how to deal with repeated elements. It is considered that the maximum value and the second largest value can be the same, that is, the repeated elements are calculated)

1. Solution

From the perspective of the algorithm itself, the process of finding the maximum value and the second-largest value is very simple; it passes through two traversals: the first time to find the maximum value, and the second time to find the second-largest value; the algorithm complexity is O(2n). It is obviously impossible for FPGA to complete such complex operations in one cycle, and generally requires pipeline design. Under this method, the entire structure is like this

Through comparison, the maximum value is obtained, and the comparison between two is achieved through the pipeline. 32-16-8-4-2-1 can obtain the maximum value through a delay of 5 clk;

Since the second largest value needs to be found, the position of the maximum value needs to be determined, and the coordinates of the maximum value need to be maintained during the process of finding the maximum value;

Clear the value at the maximum coordinate (set to minimum)

The comparison between the two is realized through the pipeline, 32-16-8-4-2-1, and the second largest value can be obtained after a delay of 5 clk;

This solution has several disadvantages, including: Delaying the maximum value and the second maximum value requires 5clk delay respectively, and the total delay will exceed 10 cycles; resource usage is high, and maintaining the maximum value coordinates and clearing operations consume more resources , and at the same time, in order to calculate the second largest value, the input needs to be registered for several cycles, which consumes more registers.

Another idea is to consider finding the maximum value and the second maximum value at the same time. Since this logic is relatively complex, it can be streamlined, as shown in the figure below. (Take 8 inputs as an example, 32 inputs need to add two levels)

fb11b43db1f6e9a864abf17ce8968b67.png

The sort module completes the function of sorting 4 inputs and obtaining the maximum value and the second largest value output. The sorting of 4 numbers is more complicated, and this process takes about 2-3 cycles to complete. For 32 inputs, the input data is output through 32-16-8-4-2 to obtain the result, and the delay is about 10 cycles.

2. Divide and Conquer

If you need to implement a specific algorithm on FPGA, then just find a suitable method to implement it; but if you want to implement a specific function, then you need to find an excellent method that is suitable for FPGA implementation.

Finding the maximum value and the second largest value is a very incomplete sorting. The complexity of simple search is O(2n), and it is not conducive to hardware implementation. For sorting, both quick sort and merge sort use the idea of divide and conquer. If we try to use the idea of divide and conquer to solve this problem. Consider that when there are only 2 inputs, the output can be obtained through a comparison. At this time, what is obtained is an ordered array of length 2. If there are two ordered arrays, then the maximum value and the second largest value can be obtained through two comparisons. Using the idea of merging sort, the complexity of finding the maximum value and the second largest value is O(1.5n) (that is, n/2 + n/2 + n/4…, I don’t know if there is a miscalculation). Using the idea of merging sort is more efficient from the perspective of algorithm time complexity.

So is this solution suitable for FPGA implementation? The answer is yes. The locality of divide and conquer is suitable for pipeline implementation of FPGA. The block diagram is as follows. (Take 8 inputs as an example, 32 inputs need to add two levels)

6eaaf98b8a688cbad18e3f0a47a4c4fc.png

There are two levels of comparators inside the meg module. Generally speaking, it can be completed in 1clk. The input data goes through 32-32-16-8-4-2 to get the result, and the delay is 5 clock cycles. The implementation code is as follows

Code

 
module test#(

parameterDW=8

)

(

input clk,

input [32*DW-1 :0] din,

output [DW-1:0] max1,

output [DW-1:0] max2

);

 

wire[DW-1:0] d[31:0];

generate

genvar i;

for(i=0;i<32;i=i + 1)

begin:loop_assign

assign d[i] = din[DW*i + DW-1:DW*i];

end

endgenerate

 

// stage 1,comp

reg[DW-1:0] s1_max[15:0];

reg[DW-1:0] s1_min[15:0];

generate

for(i=0;i<16;i=i + 1)

begin:loop_comp

always@(posedge clk)

if(d[2*i]>d[2*i + 1])begin

s1_max[i] <= d[2*i];

s1_min[i] <= d[2*i + 1];

end

else begin

s1_max[i] <= d[2*i + 1];

s1_min[i] <= d[2*i];

end

end

endgenerate

 

// stage 2,

wire[DW-1:0] s2_max[7:0];

wire[DW-1:0] s2_min[7:0];

generate

for(i=0;i<8;i=i + 1)

begin:loop_megs2

meg u_s2meg(

.clk(clk),

.g1_max(s1_max[2*i]),

.g1_min(s1_min[2*i]),

.g2_max(s1_max[2*i + 1]),

.g2_min(s1_min[2*i + 1]),

.max1(s2_max[i]),

.max2(s2_min[i])

);

end

endgenerate

// stage 3,

wire[DW-1:0] s3_max[3:0];

wire[DW-1:0] s3_min[3:0];

generate

for(i=0;i<4;i=i + 1)

begin:loop_megs3

meg u_s3meg(

.clk(clk),

.g1_max(s2_max[2*i]),

.g1_min(s2_min[2*i]),

.g2_max(s2_max[2*i + 1]),

.g2_min(s2_min[2*i + 1]),

.max1(s3_max[i]),

.max2(s3_min[i])

);

end

endgenerate

 

// stage 4,

wire[DW-1:0] s4_max[1:0];

wire[DW-1:0] s4_min[1:0];

generate

for(i=0;i<2;i=i + 1)

begin:loop_megs4

meg u_s4meg(

.clk(clk),

.g1_max(s3_max[2*i]),

.g1_min(s3_min[2*i]),

.g2_max(s3_max[2*i + 1]),

.g2_min(s3_min[2*i + 1]),

.max1(s4_max[i]),

.max2(s4_min[i])

);

end

endgenerate

 

// stage 5,

meg u_s5meg(

.clk(clk),

.g1_max(s4_max[0]),

.g1_min(s4_min[0]),

.g2_max(s4_max[1]),

.g2_min(s4_min[1]),

.max1(max1),

.max2(max2)

);

endmodule

 

module meg#(

parameterDW=8

)

(

input clk,

input [DW-1 :0] g1_max,

input [DW-1 :0] g1_min,

input [DW-1 :0] g2_max,

input [DW-1 :0] g2_min,

output reg [DW-1:0] max1,

output reg [DW-1:0] max2

);

always@(posedge clk)

begin

if(g1_max>g2_max) begin

max1 <= g1_max;

if(g2_max>g1_min)

max2 <= g2_max;

else

max2 <= g1_min;

end

else begin

max1 <= g2_max;

if(g1_max>g2_min)

max2 <= g1_max;

else

max2 <= g2_min;

end

end

endmodule

3. Others

I simply tested the above code. On the previous generation device (20nm FPGA), the 8-bit data input module can be synthesized to a very high frequency. The number of logic levels is about 5. For the entire project, the bottleneck will basically not appear here. part. Because the data bit width is too large, the frequency of 32-bit data input will not be too high. However, by using the meg module as a first-level pipeline, it will hardly become a bottleneck for the entire system.

In the case of 32bit32 input, the data input bit width is 1024 (not IO input, but internal signal). Data with such a large bit width may not have been used in communications/digital signal processing before, but for FPGA applications in the AI field, thousands of bits of input should be common, which will indeed affect the final effect achieved on the FPGA. In order for the machine learning algorithm to run better on FPGA, the algorithm and FPGA need to work together.

89541d8054423cb7f3a8380a3f8a2b94.jpeg

Want to learn about FPGAs? Here are examples to share, ZYNQ design, follow our official account and explore