SSE Image Algorithm Optimization Series Thirty-One: Instruction Set Optimization of RGB2HSL/RGB2HSV and HSL2RGB/HSV2RGB – Part 1.

The mutual conversion between RGB and HSL/HSV color space is widely used in our image processing. Whether it is image adjustment or some skin color algorithms, HSL/HSV color space is very useful. It provides RGB color Space does not have some unique characteristics, but due to the complexity of the HSL/HSV color space, the conversion efficiency between them has not been very high. There are some attempts based on fixed-point algorithms, which can improve the speed to a certain extent, but one The improvement is not particularly obvious, and the other is that it has a certain impact on the accuracy of the result.

For the instruction set optimization of these two algorithms, there is no information on the Internet at all, and no one has tried it. I also had an idea to toss him, but the preliminary judgment feels that there are too many branches in it, and I should use After learning the instruction set, there will not be much difference in speed, so I have not done it.

But recently, a friend’s potential demand, and then I had some expectations for this algorithm, and picked up the conversion process again, and the result was still rewarding, and the speed was improved by 3 to 4 times. ,

Let’s talk about the conversion optimization from RGB to HSL or HSV color space first

There are a lot of them on the Internet, so I don’t waste time reorganizing them. Let me just share a piece of code and URL:

Reference URL: http://www.xbeat.net/vbspeed

This article gives the VB6 code, you can refer to it below.

We agree that the RGB data source is of unsigned char type, and the effective range is [0,255], while HSL/HSV are both floating-point types, where the effective range of H is [0,6], and the effective range of S is [0,1] , the effective range of L/V is also [0,1].

After my personal arrangement and slight optimization, a simple RGB2HSV code is as follows:

void IM_RGB2HSV_PureC(unsigned char Blue, unsigned char Green, unsigned char Red, float & amp;Hue, float & amp;Sat, float & amp;Val)
{
    int Min = IM_Min(Red, IM_Min(Green, Blue));
    int Max = IM_Max(Red, IM_Max(Green, Blue));
    if (Max == Min)
    {
        Hue = 0;
        Sat = 0;
    }
    else
    {
        int Delta = Max - Min;
        if (Max == Red)
            Hue = (float)(Green - Blue) / Delta;
        else if (Max == Green)
            Hue = 2.0f + (float)(Blue - Red) / Delta;
        else
            Hue = 4.0f + (float)(Red - Green) / Delta;

        // In fact, only when Max==Red, it is possible for Hue < 0 (corresponding to Green < Blue),
        // So some codes make judgments within Max == Red. For C codes, this should be more efficient
        if (Hue < 0) Hue + = 6;
        Sat = (float)Delta / Max;
    }
    Val = Max * IM_Inv255;
}

The code of RGB2HSL is as follows:

void IM_RGB2HSL_PureC(unsigned char Blue, unsigned char Green, unsigned char Red, float & amp;Hue, float & amp;Sat, float & amp;Val)
{
    int Min = IM_Min(Red, IM_Min(Green, Blue));
    int Max = IM_Max(Red, IM_Max(Green, Blue));
    int Sum = Max + Min;
    if (Max == Min)
    {
        Hue = 0;
        Sat = 0;
    }
    else
    {
        int Delta = Max - Min;
        if (Max == Red)
            Hue = (float)(Green - Blue) / Delta;
        else if (Max == Green)
            Hue = 2.0f + (float)(Blue - Red) / Delta;
        else
            Hue = 4.0f + (float)(Red - Green) / Delta;

        if (Hue < 0) Hue + = 6;

        if (Sum <= 255)
            Sat = (float)Delta / Sum;
        else
            Sat = (float)Delta / (510 - Sum);
    }
    Val = Sum * IM_Inv510;
}

Comparing the codes of two different models, it can be found that they have the same definition for the H component. For the V/L component, one uses the maximum value, and the other uses the average value of the maximum and minimum values. For the S component, everyone Considering the difference between the maximum value and the minimum value, only one is compared with the maximum value, and the other is compared with the sum of the maximum value and the minimum value. Overall, the RGB2HSV model is relatively simple and requires less calculation.

It can be seen that whether it is RGB2HSL or RGB2HSV, there are a lot of judgments and branch statements in the process of finding H, and there are some other special judgments considering the division by zero error (Max == Min) as a whole. As mentioned the second time, there is no such thing as branch jump in the instruction set, and these jumps are very unfavorable to instruction set optimization. There are only two ways to implement such a thing in the instruction set:

1. Find a way to combine all the branches and jumps together with some clever tricks, and use one sentence to express it.

2. Calculate the results of all branch statements, and then use the relevant Blend to perform conditional merging.

Carefully analyzing the above C code, I didn’t think of any particularly good technique to combine the three branches of the hue part into one statement. Based on personal feeling, only the second method can be used.

For the convenience of description, I first post the results of a relatively simple SIMD instruction set optimization of the RGB2HSV algorithm:

 1 void IM_RGB2HSV_SSE_Old(__m128i Blue, __m128i Green, __m128i Red, __m128 & amp; Hue, __m128 & amp; Sat, __m128 & amp; Val)
 2 {
 3 __m128i Max = _mm_max_epi32(Red, _mm_max_epi32(Green, Blue)); // The maximum value of R/G/B Max
 4 __m128i Min = _mm_min_epi32(Red, _mm_min_epi32(Green, Blue)); // The minimum value of R/G/B Min
 5 __m128i Delta = _mm_sub_epi32(Max, Min); // Difference between maximum and minimum values Delta = Max - Min
 6
 7 __m128 MaxS = _mm_cvtepi32_ps(Max);
 8 __m128 DeltaS = _mm_cvtepi32_ps(Delta);
 9 
10 Sat = _mm_divz_ps(DeltaS, MaxS); // S = Delta / Max, pay attention to the exception handling of division by zero, and if Max == Min, Delta is 0, and S also returns 0, which is correct
11 Val = _mm_mul_ps(MaxS, _mm_set1_ps(IM_Inv255)); // V = Max / 255;
12
13 __m128 Inv = _mm_divz_ps(_mm_set1_ps(1), DeltaS);
14
15 //if (Max == Red)
16 // Hue = (float)(Green - Blue) / Delta;
17
18 __m128 HueR = _mm_mul_ps(_mm_cvtepi32_ps(_mm_sub_epi32(Green, Blue)), Inv);
19
20 //else if (Max == Green)
21 // Hue = 2.0f + (float)(Blue - Red) / Delta;
twenty two 
23 __m128i Mask = _mm_cmpeq_epi32(Max, Green);
24 __m128 HueG = _mm_add_ps(_mm_set1_ps(2.0f), _mm_mul_ps(_mm_cvtepi32_ps(_mm_sub_epi32(Blue, Red)), Inv));
25
26 Hue = _mm_castsi128_ps(_mm_blendv_epi8(_mm_castps_si128(HueR), _mm_castps_si128(HueG), Mask));
27
28 //else
29 // Hue = 4.0f + (float)(Red - Green) / Delta;
30 Mask = _mm_cmpeq_epi32(Max, Blue);
31 __m128 HueB = _mm_add_ps(_mm_set1_ps(4.0f), _mm_mul_ps(_mm_cvtepi32_ps(_mm_sub_epi32(Red, Green)), Inv));
32
33 Hue = _mm_castsi128_ps(_mm_blendv_epi8(_mm_castps_si128(Hue), _mm_castps_si128(HueB), Mask));
34
35 // if (H < 0) H + = 6; In fact, this is mainly for the situation where Max == R will appear
36 Hue = _mm_blendv_ps(Hue, _mm_add_ps(Hue, _mm_set1_ps(6)), _mm_cmplt_ps(Hue, _mm_setzero_ps()));
37
38 }

Explanation: The three __m128i variables of Blue, Green and Red in the IM_RGB2HSV_SSE function store four 32-bit color components instead of 16 colors.

There is nothing difficult to understand in the process of finding Max\Min\Delta in the third, fourth, and fifth lines. The seventh and eighth lines just convert the integer into a floating point type (note that the SSE instruction is also strongly typed, you must manually convert the type yourself).

The tenth and eleventh lines directly calculate the Sat and Val components. Val is not difficult to understand. In the corresponding C code, Sat is divided into two situations: Max == Min and Max != Min. When Max == Min , it is 0, otherwise, use (Max – Min) / Max. In fact, there is no need to make a judgment here, just use (Max – Min) / Max uniformly, because when Max == Min, Max – Min is also 0, but the only It should be noted that if Max = Min = 0, Max is also 0, 0 / 0 is not allowed in mathematics, and there will be overflow errors in calculations, so a custom _mm_divz_ps function is used here to realize when When the divisor is 0, returns a result of 0. In this way, the branch statement can be stripped away.

Complicated is the calculation of the Hue component, from the thirteenth line to the end is about its optimization.

On line 13, we first calculate 1.0f / Delta. Note that the _mm_divz_ps function is also used here.

In line 16, we first calculate the Hue result when Max == Red according to the formula.

In line 23, we compare whether Max and Green are equal. Note that the comparison of 32-bit int type is also used here.

Line 24 calculates the result of the Hue component when Max == Green according to the formula.

Line 25 mixes these two results. There are many coding skills in the mixing here, because the HueR and HueG we calculated twice are both of type __m128, and our comparison is the comparison of integers, and the returned It is the data of __m128i type, and the comparison result of __m128 is required for the mixing of _mm_blendv_ps, but if the Mask is directly cast to a floating point type as a parameter of _mm_blendv_ps, an incorrect result will be produced. Then there are 2 solutions:

1. Use floating-point type comparison, that is, first convert the Blue\Green component to __m128 type, and then use _mm_cmpeq_ps for comparison, thus adding several type conversion functions.

2. Use the code of this example, use _mm_blendv_epi8 + _mm_castps_si128 to mix, on the surface, it seems that the process of casting 3 times is more time-consuming, but in fact, the cast series of statements are just syntactic sugar, after compiling, he does not Generate any assembly instructions. He just makes the compiler think that it is another type of data type, so that it can be compiled. In fact, __m128 and __m128i are stored in XMM registers on the hardware, and the register itself does not distinguish between data types.

Lines 30 and 31 are also similar, mixing the results of those Max == Blue components.

Line 36 handles the special case of Hue < 0. Nothing too complicated either.

We tested a 24-bit image (filled with random data) of size 5000*5000. The time-consuming of ordinary C language is about 114ms, and the time-consuming of the above-mentioned SIMD optimization is about 49ms, and the speed-up ratio is close to 2.2 times.

In fact, the code optimized by the above SIMD instructions still has some room for optimization. We noticed that in order to calculate HueR\HueG\HueB, we performed 3 times of multiplication and addition of the floating-point version. But if we separate out the multiplication and addition part and mix it accordingly every time, then we only need to do the multiplication and addition one last time, which increases the number of mixing times but reduces the number of calculations. Mixed instructions are actually implemented through bit operations, which are relatively fast. The specific code is as follows:

 1 void IM_RGB2HSV_SSE(__m128i Blue, __m128i Green, __m128i Red, __m128 & amp; Hue, __m128 & amp; Sat, __m128 & amp; Val)
 2 {
 3 __m128i Max = _mm_max_epi32(Red, _mm_max_epi32(Green, Blue)); // The maximum value of R/G/B Max
 4 __m128i Min = _mm_min_epi32(Red, _mm_min_epi32(Green, Blue)); // The minimum value of R/G/B Min
 5 __m128i Delta = _mm_sub_epi32(Max, Min); // Difference between maximum and minimum values Delta = Max - Min
 6
 7 __m128 MaxS = _mm_cvtepi32_ps(Max);
 8 __m128 DeltaS = _mm_cvtepi32_ps(Delta);
 9 
10 Sat = _mm_divz_ps(DeltaS, MaxS); // S = Delta / Max, pay attention to the exception handling of division by zero, and if Max == Min, Delta is 0, and S also returns 0, which is correct
11 Val = _mm_mul_ps(MaxS, _mm_set1_ps(IM_Inv255)); // V = Max / 255;
12
13 // SIMD does not have jump instructions, and can only use Blend plus conditional judgment to implement multi-conditional statements. Note that the three judgments can be regarded as a Base (0/120/240) plus different Diff multiplied by Inv.
14 //Based on Max == B, the advantage of doing this is: when Max == Min, H will return 0, but if it follows the mixed order of C language, it will be true when Max == B is finally judged , then H returns 4, then in order to return the correct result
15 // There is one more _mm_blendv_epi8 statement. Note that a fact hidden here is that when Max == Min, G - B, B - R, R - G are actually 0, so something like this (float) (G - B) / Delta * 60 must also be 0.
16
17 // if (Max == bB)
18 // H = 4.0f + (float)(R - G) / Delta;
19
20 __m128i Base = _mm_set1_epi32(4);
21 __m128i Diff = _mm_sub_epi32(Red, Green);
twenty two 
23 //if (Max == G)
24 // H = 2.0f + (float)(B - R) / Delta;
25
26 __m128i Mask = _mm_cmpeq_epi32(Max, Green);
27 Base = _mm_blendv_epi8(Base, _mm_set1_epi32(2), Mask); // When Mask is true, _mm_blendv_epi8 returns the value of the second parameter, otherwise returns the value of the first parameter
28 Diff = _mm_blendv_epi8(Diff, _mm_sub_epi32(Blue, Red), Mask);
29
30 // if (Max == R)
31 // H = (float)(G - B) / Delta;
32 Mask = _mm_cmpeq_epi32(Max, Red);
33 Base = _mm_blendv_epi8(Base, _mm_setzero_si128(), Mask);
34 Diff = _mm_blendv_epi8(Diff, _mm_sub_epi32(Green, Blue), Mask);
35
36 __m128 Inv = _mm_divz_ps(_mm_set1_ps(1), DeltaS); // 1 / Delta, pay attention to the exception handling of division by zero
37
38 // H = Base + Diff * Inv
39 Hue = _mm_add_ps(_mm_cvtepi32_ps(Base), _mm_mul_ps(_mm_cvtepi32_ps(Diff), Inv));
40
41 // if (H < 0) H + = 6; In fact, this is mainly for the situation where Max == R will appear
42 Hue = _mm_blendv_ps(Hue, _mm_add_ps(Hue, _mm_set1_ps(6)), _mm_cmplt_ps(Hue, _mm_setzero_ps()));
43
44 }

By optimizing in this way, you can probably get a performance improvement of 15-25%.

Of course, there may be some space to consider here, that is, we use 32-bit int type comparison, which can only compare 4 numbers at a time. In addition, calculations such as _mm_max_epi32 can be used for original image data. It is done by epi8, so that the information of 16 pixels can be obtained at one time instead of 8 bits, but the problem faced in this way is that multiple data type conversions need to be done later. The time-consuming of these conversions and the time-consuming comparison have not yet been concluded. Interested readers can test it by themselves.

If you understand the SSE code of RGB2HSV, do you think it will be difficult for RGB2HSL? I hope readers can code it by themselves.

The next article will focus on the optimization of HSL2RGB and HSV2RGB spaces. The difficulty of optimization and the speed-up ratio of optimization are relatively more complex and effective than RGB2HSL and RGB2HSL.

The test code of this article can be obtained from the following link: https://files.cnblogs.com/files/Imageshop/RGB2HSV.rar?t=1689216617 & amp;download=true

If you want to keep an eye on my latest articles, you can also follow the official account or add my WeChat: laviewpbt