Filter 5: Ultra-fast exponential blur algorithm

https://www.cnblogs.com/Imageshop/p/6849611.html

1. Algorithm core and principle

Deriche filter (tsdconseil.fr)

The core code of both links is:

*zR += (alpha * ((R << zprec) - *zR)) >> aprec;

It can be seen that this line implements the formula described in the Deriche filter article: a first-order IIR filter whose equation is y[n] = a*x[n] + (1-a)*y[n-1].

The code used to compute the first-order alpha parameter is:

int alpha = (int)((1 << aprec) * (1.0f - std::exp(-2.3f / (radius + 1.f))));

And this coefficient is exactly the thing I had always had doubts about and never knew how to set.

Summary: exponential blur is essentially the Deriche-style recursive version of iterated Gaussian filtering. It implements the first-order recursion with a specific alpha coefficient, uses fixed-point arithmetic, and controls precision through the zprec and aprec parameters.

2. Implementation

/*
Here pixel is the pixel being processed; zR, zG, zB and zA are accumulator values passed in from outside; and alpha, aprec and zprec are fixed coefficients derived from the blur radius.

Like Gaussian blur or StackBlur, the algorithm is separable into rows and columns. In the row direction it first sweeps left to right using the formula above, then right to left; the column direction is then processed top to bottom, then bottom to top. The order of rows and columns can of course be swapped. The important point is that each sweep works on the data already processed by the previous one.
*/
static inline void _blurinner(guchar* pixel, gint* zR, gint* zG, gint* zB, gint* zA, gint alpha, gint aprec, gint zprec)
{
gint R, G, B, A;
R = *pixel;
G = *(pixel + 1);
B = *(pixel + 2);
A = *(pixel + 3);

*zR += (alpha * ((R << zprec) - *zR)) >> aprec;
*zG += (alpha * ((G << zprec) - *zG)) >> aprec;
*zB += (alpha * ((B << zprec) - *zB)) >> aprec;
*zA += (alpha * ((A << zprec) - *zA)) >> aprec;

*pixel = *zR >> zprec;
*(pixel + 1) = *zG >> zprec;
*(pixel + 2) = *zB >> zprec;
*(pixel + 3) = *zA >> zprec;
}
  
  
static inline void _blurrow(guchar* pixels,
                             gint width,
                             gint /* height */, // TODO: This seems very strange. Why is height not used as it is in _blurcol() ?
                             gint channels,
                             gint line,
                             gint alpha,
                             gint aprec,
                             gint zprec)
{
gint zR;
gint zG;
gint zB;
gint zA;
gint index;
guchar* scanline;

scanline = &(pixels[line * width * channels]);

zR = *scanline << zprec;
zG = *(scanline + 1) << zprec;
zB = *(scanline + 2) << zprec;
zA = *(scanline + 3) << zprec;

for (index = 0; index < width; index++)
_blurinner(&scanline[index * channels], &zR, &zG, &zB, &zA, alpha, aprec,
zprec);

for (index = width - 2; index >= 0; index--)
_blurinner(&scanline[index * channels], &zR, &zG, &zB, &zA, alpha, aprec,
zprec);
}

static inline void _blurcol(guchar* pixels,
                               gint width,
                               gint height,
                               gint channels,
                               gint x,
                               gint alpha,
                               gint aprec,
                               gint zprec)
{
gint zR;
gint zG;
gint zB;
gint zA;
gint index;
guchar* ptr;

ptr = pixels;

ptr += x * channels;

zR = *ptr << zprec;
zG = *(ptr + 1) << zprec;
zB = *(ptr + 2) << zprec;
zA = *(ptr + 3) << zprec;

for (index = width; index < (height - 1) * width; index += width)
_blurinner(&ptr[index * channels], &zR, &zG, &zB, &zA, alpha,
aprec, zprec);

for (index = (height - 2) * width; index >= 0; index -= width)
_blurinner(&ptr[index * channels], &zR, &zG, &zB, &zA, alpha,
aprec, zprec);
}

// pixels image-data
// width image-width
// height image-height
// channels image-channels
// in-place blur of image 'img' with kernel of approximate radius 'radius'
// blurs with two sided exponential impulse response
// aprec = precision of alpha parameter in fixed-point format 0.aprec
// zprec = precision of state parameters zR, zG, zB and zA in fixed-point format 8.zprec
void _expblur(guchar* pixels,
                 gint width,
                 gint height,
                 gint channels,
                 gint radius,
                 gint aprec,
                 gint zprec)
{
gint alpha;
gint row = 0;
gint col = 0;

if (radius < 1)
return;

// calculate the alpha such that 90% of
// the kernel is within the radius.
// (Kernel extends to infinity)
alpha = (gint) ((1 << aprec) * (1.0f - expf(-2.3f / (radius + 1.f))));

for (; row < height; row++)
_blurrow(pixels, width, height, channels, row, alpha, aprec, zprec);

for (; col < width; col++)
_blurcol(pixels, width, height, channels, col, alpha, aprec, zprec);

return;
}

As a typical application, and to minimize the number of parameters, the commonly used values are aprec = 16 and zprec = 7.

Looking back at the code: apart from the calculation of the alpha parameter, which involves floating point, everything else is integer multiplication and shifting, so the speed should be good and the algorithm is well suited to mobile processors. Note also that the per-row loop in _blurrow and the per-column loop in _blurcol are clearly independent across rows and columns respectively, so they could be parallelized with multiple threads. This code, however, focuses on expressing the algorithm rather than squeezing out maximum efficiency.

Another point: the running time of the algorithm obviously has nothing to do with the radius parameter, which means this is also an O(1) algorithm (constant cost per pixel, independent of radius).

3. Grayscale image processing

3.1 Horizontal direction

for (int Y = 0; Y < Height; Y++)
{
    byte *LinePS = Src + Y * Stride;
    byte *LinePD = Dest + Y * Stride;
    int Sum = LinePS[0] << Zprec;
    for (int X = 0; X < Width; X++)   // From left to right
    {
        Sum += (Alpha * ((LinePS[X] << Zprec) - Sum)) >> Aprec;
        LinePD[X] = Sum >> Zprec;
    }
    for (int X = Width - 1; X >= 0; X--)  // From right to left
    {
        Sum += (Alpha * ((LinePD[X] << Zprec) - Sum)) >> Aprec;
        LinePD[X] = Sum >> Zprec;
    }
}

3.2 Vertical direction

In Sharing the Comprehensive Optimization Process of the Gaussian Blur Algorithm (1), we discussed that it is generally unwise to write the vertical pass directly column by column; instead it should use a temporary row buffer. The column-direction code for a grayscale image then looks like the following:

int *Buffer = (int *)malloc(Width * sizeof(int));
for (int X = 0; X < Width; X++) Buffer[X] = Src[X] << Zprec;
for (int Y = 0; Y < Height; Y++)
{
    byte *LinePS = Src + Y * Stride;
    byte *LinePD = Dest + Y * Stride;
    for (int X = 0; X < Width; X++)    // From top to bottom
    {
        Buffer[X] += (Alpha * ((LinePS[X] << Zprec) - Buffer[X])) >> Aprec;
        LinePD[X] = Buffer[X] >> Zprec;
    }
}
for (int Y = Height - 1; Y >= 0; Y--)   // From bottom to top
{
    byte *LinePD = Dest + Y * Stride;
    for (int X = 0; X < Width; X++)
    {
        Buffer[X] += (Alpha * ((LinePD[X] << Zprec) - Buffer[X])) >> Aprec;
        LinePD[X] = Buffer[X] >> Zprec;
    }
}
free(Buffer);

After this modification, a 3000×2000 8-bit grayscale image takes about 52 ms (without multi-threading), roughly the same as a plain C implementation of BoxBlur.

Besides multi-threading, is there any room left to improve this time? Let's look first at optimizing the vertical direction.

The loop "for (int X = 0; X < Width; X++)" in the vertical pass is well suited to SIMD optimization. However, the loop body mixes unsigned char pixel data with int accumulators: to use SIMD instructions the bytes must first be widened to int, and the results must be narrowed back to unsigned char when stored. This back-and-forth conversion is itself time-consuming, so can the speedup of the remaining arithmetic still produce a net gain? We wrote the code to find out. The scalar loop:

for (int X = 0; X < Width; X++)    // From top to bottom
{
    Buffer[X] += (Alpha * ((LinePS[X] << Zprec) - Buffer[X])) >> Aprec;
    LinePD[X] = Buffer[X] >> Zprec;
}

This code can be replaced with the following SIMD instructions:

int X = 0;
__m128i Alpha128 = _mm_set1_epi32(Alpha);    // Alpha broadcast to all 4 lanes
for (X = 0; X < Width - 8; X += 8)
{
    // Load 8 source bytes and widen them into 2 XMM registers of 4 ints each.
    // Option 1: use _mm_cvtepu8_epi32, new in SSE4.1; the advantage is that
    // the two lines are independent of each other.
    __m128i Dst1 = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int *)(LinePS + X + 0)));
    __m128i Dst2 = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int *)(LinePS + X + 4)));
    __m128i Buf1 = _mm_loadu_si128((__m128i *)(Buffer + X + 0));
    __m128i Buf2 = _mm_loadu_si128((__m128i *)(Buffer + X + 4));
    // Buffer[X] += (Alpha * ((LinePS[X] << Zprec) - Buffer[X])) >> Aprec;
    Buf1 = _mm_add_epi32(_mm_srai_epi32(_mm_mullo_epi32(_mm_sub_epi32(_mm_slli_epi32(Dst1, Zprec), Buf1), Alpha128), Aprec), Buf1);
    Buf2 = _mm_add_epi32(_mm_srai_epi32(_mm_mullo_epi32(_mm_sub_epi32(_mm_slli_epi32(Dst2, Zprec), Buf2), Alpha128), Aprec), Buf2);
    _mm_storeu_si128((__m128i *)(Buffer + X + 0), Buf1);
    _mm_storeu_si128((__m128i *)(Buffer + X + 4), Buf2);
    // Shift back by Zprec, pack the 8 ints down to 8 bytes and store them.
    _mm_storel_epi64((__m128i *)(LinePD + X),
        _mm_packus_epi16(_mm_packus_epi32(_mm_srai_epi32(Buf1, Zprec),
                                          _mm_srai_epi32(Buf2, Zprec)),
                         _mm_setzero_si128()));
}
for (; X < Width; X++)
{
    Buffer[X] += (Alpha * ((LinePS[X] << Zprec) - Buffer[X])) >> Aprec;
    LinePD[X] = Buffer[X] >> Zprec;
}

The original three or four lines of code have suddenly become dozens of lines. Will it actually be slower? No need to worry: SIMD really is that powerful. In testing, the time for a 3000×2000 image dropped to about 42 ms, and the vertical pass's share of the total time fell from about 60% to about 35%. The bottleneck is now the horizontal pass.

For images that are not in grayscale mode, the vertical pass is handled essentially the same way as for grayscale: since the pass walks along each row anyway, only the loop length changes (Width * channels elements per row instead of Width).

Reference

https://www.cnblogs.com/Imageshop/p/6849611.html