Concepts of signed and unsigned numbers and various truncation methods

Contents

1.0 Why two's complement is needed

2.0 The nature of two's complement

3.0 Signed and unsigned operations and truncation methods

3.1 Signed and unsigned numbers

3.2 Sign extension for signed integers

3.3 Signed decimals

3.4 Sum of two signed numbers

3.5 Product of two signed numbers

3.6 Rounding (round) truncation

3.7 Saturation truncation

4.0 Project example

4.1 Floor truncation

4.2 Ceil truncation

4.3 Round to nearest integer (round) truncation

4.4 Rounding towards zero (fix) truncation

4.5 Saturation truncation

5.0 Supplement

References

Important

General

1.0 Why two's complement is needed

In a computer, integers have a fixed length. On a machine with a 32-bit word, an integer occupies 32 binary bits, including the sign bit (0 means positive, 1 means negative). For convenience, the examples below assume an 8-bit word.

For example, the decimal integer 23 is 10111 in binary, and its sign-magnitude representation (also called the original code) is 0001 0111.

The decimal integer -23 has the binary true value -10111, and its sign-magnitude representation is 1001 0111.

In short, in sign-magnitude form the highest bit is the sign bit and the remaining bits hold the absolute value of the number.

If a computer stored numbers in sign-magnitude form (the original code), every addition or subtraction would have to be reduced to adding or subtracting two absolute values, with the signs handled separately. The arithmetic unit would then need both an adder and a subtractor, which places extra demands on the circuit design. Two's complement (the complement code) removes this problem: it lets addition and subtraction be carried out uniformly by a single adder.

Note: In SystemVerilog, Verilog, and VHDL alike, a signed binary value is stored in two's complement form. For example, in SystemVerilog:

    wire signed [5:0] math = 6'b111000;

This value is -8. The same number written in sign-magnitude form would be 6'b101000.
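The difference between the two encodings can be made concrete with a small sketch. Python is used here only as a reference model rather than HDL, and the helper names `sign_magnitude` and `twos_complement` are illustrative assumptions, not part of any library.

```python
def sign_magnitude(value: int, width: int) -> str:
    """Sign bit followed by the absolute value (the 'original code')."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{width - 1}b")

def twos_complement(value: int, width: int) -> str:
    """Two's complement bit string of the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")

print(sign_magnitude(-8, 6), twos_complement(-8, 6))    # 101000 111000
print(sign_magnitude(-23, 8), twos_complement(-23, 8))  # 10010111 11101001
```

Only the two's complement strings are what a `signed` declaration in an HDL actually holds.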

2.0 The nature of two's complement

Two's complement can be understood through the following calculation.

For example, the negative number whose sign-magnitude form is 4'b1010 (that is, -2) has the two's complement form 4'b1110. Dropping the sign bit from each leaves 010 and 110, and 010 + 110 = 1000: the magnitude and its two's complement encoding add up to a power of two, which is the sense in which they "complement" each other.

To convert a positive number into a corresponding negative number, in fact, just subtract the number from 0. For example, -8 is actually 0-8.

Knowing that the binary value of 8 is 00001000, -8 can be obtained by the following formula:

$00000000 - 00001000$

Because 00000000 (the minuend) is smaller than 00001000 (the subtrahend), the subtraction cannot be done directly. Recall primary-school arithmetic: when a digit of the minuend is smaller than the corresponding digit of the subtrahend, we borrow 1 from the next higher digit. Borrowing a 1 from an imaginary ninth bit gives:

$100000000 - 00001000 = 11111000$

Decomposing further, 100000000 = 11111111 + 1, so the formula above can be split into two steps:

$11111111 - 00001000 = 11110111$

$11110111 + 00000001 = 11111000$

This is where the two conversion steps of two's complement (invert, then add 1) come from. For -8, take the magnitude bits 000_1000, invert them to get 111_0111, add 1 to get 111_1000, and finally prepend the sign bit 1 to obtain 1111_1000, the two's complement encoding of -8.
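The derivation can be checked numerically. The following is a minimal Python sketch, assuming the 8-bit word length used in the text; the name `negate` is a hypothetical helper chosen for illustration.

```python
WIDTH = 8                     # assumed word length, as in the text
MASK = (1 << WIDTH) - 1       # 0b1111_1111

def negate(x: int) -> int:
    """Two's complement negation: invert every bit, then add 1."""
    return ((x ^ MASK) + 1) & MASK

eight = 0b0000_1000
minus_eight = negate(eight)

# Same result as subtracting from 2**WIDTH (the "borrowed" 1_0000_0000).
assert minus_eight == (1 << WIDTH) - eight

# Subtraction becomes addition: 23 - 8 is computed as 23 + (-8) modulo 2**8.
diff = (23 + minus_eight) & MASK

print(format(minus_eight, "08b"))  # 11111000
print(diff)                        # 15
```

The last two lines show the point of section 1.0: once negative numbers are stored this way, an adder alone is enough to subtract.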

3.0 Signed and unsigned operations and truncation methods

3.1 Signed and unsigned numbers

As the name implies, a signed number carries a sign bit in its highest bit (0 means a positive number, 1 means a negative number); an unsigned number has no sign bit.

Consider the 4-bit integer 4'b1011. As an unsigned value, it represents:

$1\times2^3 + 0\times2^2 + 1\times2^1 + 1\times2^0 = 11$

If it is a signed number, then the value it represents is:

$1\times(-2^3) + 0\times2^2 + 1\times2^1 + 1\times2^0 = -5$

Therefore, the same bit pattern may represent different values depending on whether it is declared signed or unsigned. The only difference between the two when converting to decimal is the weight of the highest bit: in the example above, the highest bit of the unsigned number has weight $2^3$, while the highest bit of the signed number has weight $-2^3$.

Because the weight of the highest bit differs, the representable ranges differ as well. A 4-bit unsigned integer covers 0~15, corresponding to binary 4'b0000~4'b1111, and a 4-bit signed integer covers -8~7, corresponding to binary 4'b1000~4'b0111.

In general, an unsigned integer with a bit width of m has the range $0 \sim \left ( 2^m - 1 \right )$, and a signed integer with a bit width of m has the range $-2^{m-1} \sim \left ( 2^{m-1} - 1 \right )$.
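As a quick check of the two interpretations and their ranges, here is a short Python sketch. The function names `as_unsigned`, `as_signed` and `ranges` are assumptions made for this example.

```python
def as_unsigned(bits: str) -> int:
    """Value of a bit string when every bit has a positive weight."""
    return int(bits, 2)

def as_signed(bits: str) -> int:
    """Two's complement value: the top bit has weight -2**(m-1)."""
    m = len(bits)
    value = int(bits, 2)
    return value - (1 << m) if bits[0] == "1" else value

def ranges(m: int):
    """((unsigned_min, unsigned_max), (signed_min, signed_max)) for m bits."""
    return (0, 2**m - 1), (-(2**(m - 1)), 2**(m - 1) - 1)

print(as_unsigned("1011"), as_signed("1011"))  # 11 -5
print(ranges(4))                               # ((0, 15), (-8, 7))
```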

3.2 Sign extension for signed integers

Question: how do we extend a 4-bit signed integer to a 6-bit signed integer?

Suppose the 4-bit signed integer is 4'b0101. Its highest bit is 0, so it is positive. To extend it to 6 bits, simply add two 0s at the front, giving 6'b000101.

Now suppose the 4-bit signed integer is 4'b1011. Its highest bit is 1, so it is negative. To extend it to 6 bits, add two 1s at the front rather than two 0s, giving 6'b111011. A simple check confirms that the value is unchanged by the extension:

$4'b1011 = 1 \times (-2^3) + 0\times2^2 + 1\times2^1 + 1\times2^0 = -8 + 0 + 2 + 1 = -5$

$6'b111011 = 1 \times (-2^5) + 1 \times2^4 + 1 \times2^3 + 0\times2^2 + 1\times2^1 + 1\times2^0 = -32 + 16 + 8 + 0 + 2 + 1 = -5$

Obviously, the data size has not changed after bit expansion.

To sum up: when widening a signed integer, the added high-order bits must be copies of the sign bit so that the value does not change.
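The rule is easy to model. The sketch below is an illustrative Python check, with `sign_extend` and `as_signed` as assumed helper names; it widens both examples from the text and confirms that the values are preserved.

```python
def sign_extend(bits: str, width: int) -> str:
    """Widen a two's complement bit string by repeating its sign bit."""
    if width < len(bits):
        raise ValueError("target width is narrower than the input")
    return bits[0] * (width - len(bits)) + bits

def as_signed(bits: str) -> int:
    """Two's complement value of a bit string."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value

for word in ("0101", "1011"):
    wide = sign_extend(word, 6)
    print(word, "->", wide, as_signed(word), as_signed(wide))
# 0101 -> 000101 5 5
# 1011 -> 111011 -5 -5
```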

3.3 Signed decimals

Building on the previous two sections, we now look at signed decimals (fixed-point fractional numbers). A format written mQn denotes a signed value with a total width of m bits, of which n are fractional bits.

Suppose a signed decimal is 4'b1011 in 4Q2 format, that is, it has 2 fractional bits. The decimal value it represents is:

$4'b10.11 = 1 \times (-2^1) + 0\times2^0 + 1\times2^{-1} + 1\times2^{-2} = -2 + 0 + 0.5 + 0.25 = -1.25$

Obviously, the calculation method of decimals is actually the same as that of integers, except that we need to determine the corresponding weight according to the position of the decimal point.

Next, look at the data range of signed decimals. For 4Q2 data, the range is $-2 \sim \left ( 2 - \frac{1}{2^2} \right )$, corresponding to binary 4'b1000~4'b0111. In general, the range of mQn data is $-2^{m-n-1} \sim \left ( 2^{m-n-1} - \frac{1}{2^n} \right )$.

Finally, consider extending a signed decimal. Assume the value 4'b1011 in 4Q2 format must be stored in 6Q3 format. The integer part and the fractional part each need one more bit. The integer part is sign-extended as described in the previous section, and a 0 is appended to the end of the fractional part. To verify:

$4'b10.11 = 1 \times (-2^1) + 0\times2^0 + 1\times2^{-1} + 1\times2^{-2} = -2 + 0 + 0.5 + 0.25 = -1.25$

$6'b110.110 = 1 \times (-2^2) + 1 \times 2^1 + 0\times2^0 + 1\times2^{-1} + 1\times2^{-2} + 0 \times 2^{-3} = -4 + 2 + 0 + 0.5 + 0.25 + 0 = -1.25$

Obviously, the data size has not changed after bit expansion.

Summary: When a signed decimal is extended, the integer part is sign-extended and the fractional part is padded with 0s at the end.
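A minimal Python sketch of the same idea follows. It assumes the mQn convention used in this section (m total bits, n fractional bits); `q_value` and `q_extend` are hypothetical helper names.

```python
def q_value(bits: str, n: int) -> float:
    """Value of a signed fixed-point bit string with n fractional bits."""
    raw = int(bits, 2)
    if bits[0] == "1":
        raw -= 1 << len(bits)
    return raw / (1 << n)          # exact: division by a power of two

def q_extend(bits: str, n: int, new_m: int, new_n: int) -> str:
    """Re-express an mQn word as new_m Q new_n without changing its value."""
    frac_pad = new_n - n
    int_pad = (new_m - new_n) - (len(bits) - n)
    if frac_pad < 0 or int_pad < 0:
        raise ValueError("extension cannot drop bits")
    # Sign bits in front, zeros behind.
    return bits[0] * int_pad + bits + "0" * frac_pad

wide = q_extend("1011", 2, 6, 3)
print(wide)                                   # 110110
print(q_value("1011", 2), q_value(wide, 3))   # -1.25 -1.25
```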

3.4 Sum of two signed numbers

When adding two signed numbers, first extend them so that their binary points are aligned, then sign-extend the aligned operands by one more bit. The sum is then guaranteed not to overflow.

Example: add the 5Q2 value 5'b100.01 and the 4Q3 value 4'b1.011.

Step 1: The 5Q2 value has only 2 fractional bits while the 4Q3 value has 3, so first extend 5'b100.01 (5Q2) to 6'b100.010 (6Q3) to align its binary point with that of the 4Q3 value.

Step 2: With the binary points aligned, sign-extend the 4Q3 value 4'b1.011 to the 6Q3 value 6'b111.011.

Step 3: Add the two 6Q3 values. To guarantee that the sum does not overflow, it must be stored as 7Q3, so sign-extend both 6Q3 values to 7Q3 first and then add them. Here 7'b1100.010 + 7'b1111.011 = 7'b1011.101, that is, -3.75 + (-0.625) = -4.375, which is exactly correct.

These are the transformations needed when adding two signed values. Now consider why the sum of two 6Q3 values must be stored as 7Q3. The range of 6Q3 data is $-4 \sim \left ( 4 - \frac{1}{2^3} \right )$, so the sum of two 6Q3 values lies in $-8 \sim \left ( 8 - \frac{1}{2^2} \right )$. If the sum were still stored in 6Q3 it could overflow, whereas the range of 7Q3 data is $-8 \sim \left ( 8 - \frac{1}{2^3} \right )$, so a 7Q3 result can never overflow.

Conclusion: When adding in Verilog, align the binary points of the two addends and sign-extend their highest bits before adding. If the sum is one bit wider than the wider of the two aligned operands, it is guaranteed not to overflow.
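The three steps can be sketched as a small Python model. This is an illustrative reference model under the mQn convention of this article, not a Verilog implementation; `to_raw` and `q_add` are assumed names.

```python
def to_raw(bits: str) -> int:
    """Signed integer held by a two's complement bit string."""
    raw = int(bits, 2)
    return raw - (1 << len(bits)) if bits[0] == "1" else raw

def q_add(a_bits: str, a_n: int, b_bits: str, b_n: int):
    """Add two signed fixed-point words; returns (sum_bits, frac_bits)."""
    n = max(a_n, b_n)                          # common fractional width
    a_raw = to_raw(a_bits) << (n - a_n)        # pad the fraction with zeros
    b_raw = to_raw(b_bits) << (n - b_n)
    int_bits = max(len(a_bits) - a_n, len(b_bits) - b_n)
    width = int_bits + n + 1                   # one extra bit for the carry
    total = a_raw + b_raw
    return format(total & ((1 << width) - 1), f"0{width}b"), n

bits, n = q_add("10001", 2, "1011", 3)   # 5'b100.01 (5Q2) + 4'b1.011 (4Q3)
print(bits, to_raw(bits) / 2**n)         # 1011101 -4.375
```

The result -4.375 lies outside the 6Q3 range, which is why the extra bit of 7Q3 is needed in this example.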

3.5 Product of two signed numbers

When multiplying two signed numbers, the product is guaranteed not to overflow if its total bit width is the sum of the total bit widths of the two operands and its fractional bit width is the sum of their fractional bit widths. For example, the product of two 4Q2 values should be stored in 8Q4 format. The range of 4Q2 data is $-2 \sim \left ( 2 - \frac{1}{2^2} \right )$, so the product of two 4Q2 values lies in $\left ( -4 + \frac{1}{2^1} \right ) \sim 4$, and the range of 8Q4 data is $-8 \sim \left ( 8 - \frac{1}{2^4} \right )$, which can store any such product exactly.

Conclusion: When multiplying mQn and aQb data, the product should be stored in (m + a)Q(n + b) format data, so as to ensure that the product will not overflow.
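To see the (m + a)Q(n + b) rule at work, the following Python sketch multiplies the two extreme 4Q2 cases mentioned above. The helper names `to_raw` and `q_mul` are assumptions for this example.

```python
def to_raw(bits: str) -> int:
    """Signed integer held by a two's complement bit string."""
    raw = int(bits, 2)
    return raw - (1 << len(bits)) if bits[0] == "1" else raw

def q_mul(a_bits: str, a_n: int, b_bits: str, b_n: int):
    """Multiply two signed fixed-point words; returns (product_bits, frac_bits)."""
    width = len(a_bits) + len(b_bits)      # total widths add
    n = a_n + b_n                          # fractional widths add
    product = to_raw(a_bits) * to_raw(b_bits)
    return format(product & ((1 << width) - 1), f"0{width}b"), n

bits, n = q_mul("1000", 2, "1000", 2)      # (-2) * (-2): largest product
print(bits, to_raw(bits) / 2**n)           # 01000000 4.0

bits, n = q_mul("1000", 2, "0111", 2)      # (-2) * 1.75: most negative product
print(bits, to_raw(bits) / 2**n)           # 11001000 -3.5
```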

3.6 Rounding (round) truncation

The sections above dealt with extending data. This section describes how to round when truncating data, so that the truncated result is as accurate as possible.

Suppose a 9Q6 value is 9'b011.101101 and only 3 fractional bits are to be kept, so the last three fractional bits must be dropped. Simply cutting the value to 6'b011.101 loses precision and is generally not acceptable in engineering practice. The correct approach is to check the sign first: the highest bit of 9'b011.101101 is 0, so the number is positive. Then look at the highest bit of the truncated part (here the trailing 101). For a positive number, if that bit is 1 a carry is generated, so 9'b011.101101 is rounded to 6'b011.110.

Negative numbers are handled differently. Suppose a 9Q6 value is 9'b100.101101. Its highest bit is 1, so it is negative, and we check both the highest bit of the truncated part and whether any of its remaining bits is 1. Here the truncated part is the trailing 101: its highest bit is 1 and one of the remaining bits is also 1, so a carry is needed and 9'b100.101101 is rounded to 6'b100.110. If the truncated part were exactly 100, no carry would be generated. Because the highest bit of a negative number has the weight $-2^2$ while every fractional bit has a positive weight, adding 1 moves a negative number toward zero, so an exact tie is left as it is.

Note: For a positive number this is ordinary rounding, where a truncated part of one half or more produces a carry. For a negative number a truncated part of exactly one half is dropped and only a larger one produces a carry. In both cases the result is the nearest representable value, with ties rounded away from zero.

Assume that a is a value in 9Q6 format that must be truncated to 3 fractional bits. Here is the Verilog code:

    // carry is needed if: positive and a[2] is 1, or negative and a[2:0] is above one half
    assign carry_bit = a[8] ? (a[2] & (|a[1:0])) : a[2];
    // prepend one sign bit so that the 7-bit sum cannot overflow
    assign a_round   = {a[8], a[8:3]} + carry_bit;

The first line decides whether a carry is needed from the sign bit a[8] and the truncated bits a[2:0]. If a[8] is 0 (a is positive), carry_bit equals a[2], the highest truncated bit. If a[8] is 1 (a is negative), carry_bit is 1 only when a[2] is 1 and at least one of a[1:0] is also 1. The second line extends the kept bits by one sign bit before adding, so that the result cannot overflow after the carry.
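The behaviour of these two assign statements can be checked with a bit-level Python model. This is only a sketch: `round_9q6_to_7q3` and `signed` are assumed names, and the input is the raw 9-bit pattern of a.

```python
def signed(raw: int, width: int) -> int:
    """Interpret a raw bit pattern as a two's complement integer."""
    return raw - (1 << width) if raw >> (width - 1) else raw

def round_9q6_to_7q3(a: int) -> int:
    """Model of the two assign statements; a is the raw 9-bit word (0..511)."""
    sign = (a >> 8) & 1
    a2 = (a >> 2) & 1                   # a[2]: highest truncated bit
    low_any = int((a & 0b11) != 0)      # |a[1:0]
    carry_bit = (a2 & low_any) if sign else a2
    kept = a >> 3                       # a[8:3], 6 bits
    extended = (sign << 6) | kept       # {a[8], a[8:3]}, 7 bits
    return (extended + carry_bit) & 0x7F

for word in (0b011101101, 0b100101101):
    r = round_9q6_to_7q3(word)
    print(format(r, "07b"), signed(word, 9) / 64, "->", signed(r, 7) / 8)
# 0011110 3.703125 -> 3.75
# 1100110 -3.296875 -> -3.25
```

Both outputs match the worked examples in the text, with one extra sign bit in front because the result is 7 bits wide.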

3.7 Saturation truncation

Saturation means that if a result exceeds the maximum value the target format can hold, the maximum value is stored instead; if it is below the minimum value the target format can hold, the minimum value is stored instead.

Example 1: A 6Q3 value 6'b011.111 must be stored in 4Q2 format. Its decimal value is:

$6'b011.111 = 0 \times (-2^2) + 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} = 3.875$

However, the largest value representable in 4Q2 is 4'b01.11, which is 1.75 in decimal, so 4Q2 cannot hold 3.875. This is a saturation case: every value above 1.75 is represented by 1.75. If the 6Q3 value 6'b011.111 must be stored in 4Q2 with saturation, the stored result is 4'b01.11.

Example 2: A 6Q3 value 6'b100.111 must be stored in 4Q2 format. Its decimal value is:

$6'b100.111 = 1 \times (-2^2) + 0 \times 2^1 + 0 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} = -3.125$

The smallest value representable in 4Q2 is 4'b10.00, which is -2 in decimal, so 4Q2 cannot hold -3.125. This is the other saturation case: every value below -2 is represented by -2. Storing the 6Q3 value 6'b100.111 in 4Q2 with saturation therefore gives 4'b10.00.
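Both examples can be reproduced with a short Python sketch. The text does not say how the extra fractional bit is removed when going from Q3 to Q2, so this model simply drops it (floor), which is an assumption; `saturate_6q3_to_4q2` and `signed` are illustrative names.

```python
def signed(raw: int, width: int) -> int:
    """Interpret a raw bit pattern as a two's complement integer."""
    return raw - (1 << width) if raw >> (width - 1) else raw

def saturate_6q3_to_4q2(a: int) -> int:
    """a is a raw 6-bit 6Q3 word; returns a raw 4-bit 4Q2 word."""
    value = signed(a, 6) >> 1        # drop one fractional bit (assumed: floor)
    value = max(-8, min(7, value))   # clamp to the 4-bit signed range
    return value & 0xF

for word in (0b011111, 0b100111):
    s = saturate_6q3_to_4q2(word)
    print(format(s, "04b"), signed(word, 6) / 8, "->", signed(s, 4) / 4)
# 0111 3.875 -> 1.75
# 1000 -3.125 -> -2.0
```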

4.0 Project Example

4.1 Floor truncation

Define a as a signed number with a bit width of n bits, a[n-1:0]. Dropping the low m bits and rounding down gives floor(a/2^m), and the Verilog implementation is as follows:

    // floor(a/2**m)
    assign b[n-m-1:0] = a[n-1:m];

4.2 Ceil truncation

Define a as a signed number with a bit width of n bits, a[n-1:0]. Dropping the low m bits and rounding up gives ceil(a/2^m), and the Verilog implementation is as follows:

    // ceil(a/2**m): add 1 whenever any dropped bit is set
    assign b[n-m-1:0] = a[n-1:m] + |a[m-1:0];
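Floor and ceil truncation can be compared side by side with a small Python model of the two expressions. The names `floor_trunc`, `ceil_trunc` and `signed` are assumptions, and a is passed as a raw n-bit pattern.

```python
def signed(raw: int, width: int) -> int:
    """Interpret a raw bit pattern as a two's complement integer."""
    return raw - (1 << width) if raw >> (width - 1) else raw

def floor_trunc(a: int, n: int, m: int) -> int:
    """Model of a[n-1:m]: the raw (n-m)-bit result."""
    return a >> m

def ceil_trunc(a: int, n: int, m: int) -> int:
    """Model of a[n-1:m] + |a[m-1:0], wrapped to n-m bits like the assign."""
    dropped_any = int((a & ((1 << m) - 1)) != 0)
    return ((a >> m) + dropped_any) & ((1 << (n - m)) - 1)

a = 0b110101                                   # -11 as a 6-bit value, -2.75 after /4
print(signed(floor_trunc(a, 6, 2), 4))         # -3
print(signed(ceil_trunc(a, 6, 2), 4))          # -2
```

Because the ceil result is wrapped to n-m bits, a positive input near the top of the range can overflow, for example 6'b011101 (7.25) rounds up to 8, which does not fit in 4 signed bits.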

4.3 Round to nearest integer (round) truncation

Define a as a signed number with a bit width of n bits, a[n-1:0]. Dropping the low m bits and rounding to the nearest integer gives round(a/2^m), and the Verilog implementation is as follows:

    // round(a/2**m), requires m >= 2
    always @(*) begin
        if (a[n-1] == 1'b0) begin       // positive number
            b[n-m-1:0] = a[n-1:m] + a[m-1];
        end
        else begin                      // negative number
            b[n-m-1:0] = a[n-1:m] + (a[m-1] && (|a[m-2:0]));
        end
    end

Note that overflow of the rounded result is not considered here.
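The overflow case is worth seeing once. The Python sketch below models the rounding rule for general n and m (with m >= 2) and optionally keeps one extra result bit; the `extra_bit` option and the function names are assumptions added for illustration, not part of the Verilog above.

```python
def signed(raw: int, width: int) -> int:
    """Interpret a raw bit pattern as a two's complement integer."""
    return raw - (1 << width) if raw >> (width - 1) else raw

def round_trunc(a: int, n: int, m: int, extra_bit: bool = False):
    """Round a raw n-bit word after dropping m bits; returns (raw, width)."""
    negative = a >> (n - 1)
    msb = (a >> (m - 1)) & 1                           # a[m-1]
    rest_any = int((a & ((1 << (m - 1)) - 1)) != 0)    # |a[m-2:0]
    carry = (msb & rest_any) if negative else msb
    out_width = n - m + (1 if extra_bit else 0)
    kept = signed(a, n) >> m          # arithmetic shift: sign-extended a[n-1:m]
    return (kept + carry) & ((1 << out_width) - 1), out_width

a = 0b011111                                   # 7.75 with n = 6, m = 2
raw4, w4 = round_trunc(a, 6, 2)
raw5, w5 = round_trunc(a, 6, 2, extra_bit=True)
print(signed(raw4, w4), signed(raw5, w5))      # -8 8
```

With only n-m result bits, 7.75 rounds to 8 and wraps around to -8; keeping one more bit, or saturating afterwards, avoids this.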

4.4 Rounding towards zero (fix) truncation

Define a as a signed number with a bit width of n bits, a[n-1:0]. Dropping the low m bits and rounding toward 0 gives fix(a/2^m), and the Verilog implementation is as follows:

    // fix(a/2**m)
    always @(*) begin
        if (a[n-1] == 1'b0) begin       // positive number: round down
            b[n-m-1:0] = a[n-1:m];
        end
        else begin                      // negative number: round up
            b[n-m-1:0] = a[n-1:m] + |a[m-1:0];
        end
    end

In effect, positive numbers are rounded down and negative numbers are rounded up, so the result always moves toward zero.
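A quick Python model confirms that this combination of floor and ceil matches truncation toward zero. `fix_trunc` and `signed` are assumed helper names, and a is the raw n-bit pattern.

```python
def signed(raw: int, width: int) -> int:
    """Interpret a raw bit pattern as a two's complement integer."""
    return raw - (1 << width) if raw >> (width - 1) else raw

def fix_trunc(a: int, n: int, m: int) -> int:
    """Floor for positive inputs, ceil for negative inputs."""
    negative = a >> (n - 1)
    dropped_any = int((a & ((1 << m) - 1)) != 0)       # |a[m-1:0]
    bump = dropped_any if negative else 0
    return ((a >> m) + bump) & ((1 << (n - m)) - 1)

for word in (0b001011, 0b110101):               # 2.75 and -2.75 after /4
    print(signed(word, 6) / 4, "->", signed(fix_trunc(word, 6, 2), 4))
# 2.75 -> 2
# -2.75 -> -2
```

Unlike ceil and round, this mode cannot overflow, because the magnitude of the result never exceeds the magnitude of the input.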

4.5 Saturation truncation

Define a as a signed number with a bit width of n bits, a[n-1:0], to be saturated to m bits. The value fits in m bits only when bits a[n-2:m-1] all equal the sign bit, so those are the bits that are tested. The Verilog implementation is as follows:

    // saturation to m bits
    always @(*) begin
        if (a[n-1] == 1'b0) begin                 // positive number
            if (|a[n-2:m-1] == 1'b1) begin        // above the m-bit maximum
                b[m-1:0] = {1'b0, {(m-1){1'b1}}};
            end
            else begin
                b[m-1:0] = {1'b0, a[m-2:0]};
            end
        end
        else begin                                // negative number
            if (&a[n-2:m-1] == 1'b0) begin        // below the m-bit minimum
                b[m-1:0] = {1'b1, {(m-1){1'b0}}};
            end
            else begin
                b[m-1:0] = {1'b1, a[m-2:0]};
            end
        end
    end
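As a cross-check of the bit-level test, the Python sketch below saturates by inspecting bits n-2 down to m-1 (the bits that must all equal the sign bit for the value to fit in m bits) and compares the result with a plain numeric clamp. The function names are assumptions for this example.

```python
def signed(raw: int, width: int) -> int:
    """Interpret a raw bit pattern as a two's complement integer."""
    return raw - (1 << width) if raw >> (width - 1) else raw

def saturate(a: int, n: int, m: int) -> int:
    """Saturate a raw n-bit two's complement word to m bits (m < n)."""
    sign = a >> (n - 1)
    all_ones = (1 << (n - m)) - 1
    upper = (a >> (m - 1)) & all_ones          # bits n-2 .. m-1
    low = a & ((1 << (m - 1)) - 1)             # bits m-2 .. 0
    if sign == 0:
        if upper != 0:                         # too large for m bits
            return (1 << (m - 1)) - 1          # 0111...1
        return low
    if upper != all_ones:                      # too small for m bits
        return 1 << (m - 1)                    # 1000...0
    return (1 << (m - 1)) | low

for word in (0b001000, 0b110111, 0b111011):    # 8, -9, -5 as 6-bit values
    print(signed(word, 6), "->", signed(saturate(word, 6, 4), 4))
# 8 -> 7
# -9 -> -8
# -5 -> -5
```

The inputs 8 and -9 sit just outside the 4-bit range, so they exercise the boundary bit m-1 that decides whether saturation is needed.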

5.0 Supplement

Pay attention to the use of $signed(). If two operands were not declared as signed when they were defined but a signed calculation is needed, this system function can be used to treat the operands as signed numbers.

Signed numbers written in code are always in two's complement form. For example:

    logic signed [3:0] a = 4'b1011;

Reading 4'b1011 as sign-magnitude, as we would on paper, suggests -3. But once the variable is declared signed, the tool treats the bit pattern as two's complement, so the value is actually -5, not -3.

When writing logic code there is no need to overthink this: simply use + and - as usual.

References

Important:

https://www.cnblogs.com/liujinggang/p/10549095.html

Why does the computer use complement code to perform operations_ImportNewXXT0101’s Blog-CSDN Blog

General:

Matlab and Verilog truncation, rounding and saturation processing_verilog ceil_re_call’s blog-CSDN Blog

Verilog implements floor, round rounding and saturation operations – FPGA Forum – The most resourceful FPGA/CPLD learning forum – 21ic Electronic Technology Development Forum

float in FPGA design