Categories:
challenge

**Contents:**

- Reducing the number of branches
- Replacing branches with arithmetics.
- Substitute division with multiplication

The function `non_max_supp`

was the most interesting one, at least according to me. Click here to see the full source code of the critical loop.

The loop iterates over pixels row-wise. Inside the loop, for each pixel, the code performs the same operations. But the input values for the operation depend on the complex derivative values calculated in the previous step. Depending on the value of the derivatives, the code will perform the action on one of the eight neighboring pixels. Here is the full source code of the critical loop:

```
for(rowcount=1,magrowptr=mag+ncols+1,gxrowptr=gradx+ncols+1,
gyrowptr=grady+ncols+1,resultrowptr=result+ncols+1;
rowcount<nrows-2;
rowcount++,magrowptr+=ncols,gyrowptr+=ncols,gxrowptr+=ncols,
resultrowptr+=ncols){
for(colcount=1,magptr=magrowptr,gxptr=gxrowptr,gyptr=gyrowptr,
resultptr=resultrowptr;colcount<ncols-2;
colcount++,magptr++,gxptr++,gyptr++,resultptr++){
m00 = *magptr;
if(m00 == 0){
*resultptr = (unsigned char) NOEDGE;
}
else{
xperp = -(gx = *gxptr)/((float)m00);
yperp = (gy = *gyptr)/((float)m00);
}
if(gx >= 0) {
if(gy >= 0){
if (gx >= gy){
// calculate mag1 and mag2
} else {
// calculate mag1 and mag2
}
} else {
if (gx >= -gy){
// calculate mag1 and mag2
} else {
// calculate mag1 and mag2
}
}
} else {
if ((gy = *gyptr) >= 0) {
if (-gx >= gy) {
// calculate mag1 and mag2
} else {
// calculate mag1 and mag2
}
} else {
if (-gx > -gy) {
// calculate mag1 and mag2
} else {
// calculate mag1 and mag2
}
}
}
/* Now determine if the current point is a maximum point */
if ((mag1 > 0.0) || (mag2 > 0.0))
{
*resultptr = (unsigned char) NOEDGE;
}
else
{
if (mag2 == 0.0)
*resultptr = (unsigned char) NOEDGE;
else
*resultptr = (unsigned char) POSSIBLE_EDGE;
}
}
}
```

The above code is quite verbose but can be basically divided into three parts. The first part loads the common values `m0`

, `gx`

, `gy`

, `xperp`

, and `yperp`

. The second part on lines 736-871 performs the calculation depending on the values of `gx`

and `gy`

. The third part writes into the resulting array depending on the calculated values `mag1`

and `mag2`

.

This nested loop suffers from a large number of branch mispredictions, which can be confirmed with `perf`

. Any attempt in reducing the number of branches will result in better performance. Branches in this case prevent auto-vectorization since compilers do not know how to deal with them.

One of the way to reduce number of branches in case of nested branches is to replace nested branches with conditionally joined expression. E.g:

```
if (gx > 0) {
if (gy > 0) {
if (gx >= 0) {
...
}
}
}
```

Can be replaced with:

```
if (gx > 0 & gy > 0 & gx >= 0) {
...
}
```

The above three branches are not equivalent. The first example, condition `gy > 0`

will be evaluated only if `gx > 0`

. In the optimized version, all three conditions are evaluated before the branch is evaluated as a whole. When condition evaluation is cheap (which is normally the case with simple arithmetic conditions), the second solution will be faster. Please note that we are joining the condition with arithmetic `&`

, not `&&`

, since only when they are joined with `&`

will they be evaluated altogether.

The above approach will decrease the number of branches, but not eliminate them.

We can go completely branchless by replacing them with arithmetics. Consider this part of the code:

```
#define NOEDGE 255
#define POSSIBLE_EDGE 128
if ((mag1 > 0.0) || (mag2 > 0.0))
{
*resultptr = (unsigned char) NOEDGE;
}
else
{
if (mag2 == 0.0)
*resultptr = (unsigned char) NOEDGE;
else
*resultptr = (unsigned char) POSSIBLE_EDGE;
}
```

We conditionally store to `*resultptr`

depending on the set of conditions. We store `NOEDGE`

if condition `((mag > 0.0) || (mag2 >= 0.0))`

, otherwise we store `POSSIBLE_EDGE`

. Or to say it the other way around, we store `POSSIBLE_EDGE`

if condition `((mag <= 0.0) & (mag2 < 0.0))`

is 1. So we can rewrite it with:

```
*resultptr = ((mag <= 0.0) & (mag2 < 0.0)) ? POSSIBLE_EDGE : NOEDGE;
```

If we analyze the large set of nested `if`

s (link to the source), we notice a pattern. The computations are always the same, there is only a difference in which of the neighboring pixels we are processing.

The processing in each of the ifs body looks something like this:

```
z1 = *(magptr - A);
z2 = *(magptr - B);
mag1 = (X1 * m00 + Y1 * z1 + Z1 * z2)*xperp + (X2 * m00 + Y2 * z1 + Z2 * z2)*yperp;
/* Right point */
z1 = *(magptr + A);
z2 = *(magptr + B);
mag1 = (X1 * m00 + Y1 * z1 + Z1 * z2)*xperp + (X2 * m00 + Y2 * z1 + Z2 * z2)*yperp;
```

Where constants `A`

, `B`

, `X1`

, `X2`

, `Y1`

, `Y2`

, `Z1`

and `Z2`

depend on conditions `gx > 0`

, `gy > 0`

and `gx > gy`

.

One of the ways to go branchless is to select values for the constants using a lookup table that indexes with conditions `gx > 0`

, `gy > 0`

or `gx > gy`

. The example solutions can be found here or here.

The last optimization in this function concerns the strength reduction. Here is an example:

```
xperp = -(gx = *gxptr)/((float)m00);
yperp = (gy = *gyptr)/((float)m00);
...
mag1 = (m00 - z1)*xperp + (z2 - z1)*yperp;
...
mag2 = (m00 - z1)*xperp + (z2 - z1)*yperp;
...
bool condition = ((mag1 > 0.0) | (mag2 > 0.0));
```

We first divide `gx`

and `gy`

by `m00`

to get `xperp`

and `yperp`

and then we calculate `mag1`

and `mag2`

by using `xperp`

and `yperp`

. Finally, we only check the sign of `mag1`

and `mag2`

, we are not interested in its actual value.

We can substitute the division with multiplication like this:

```
mag1 = (m00 - z1)*xperp + (z2 - z1)*yperp;
// Multiply the whole expression by m00
mag1 * m00 = (m00 - z1) * xperp * m00 + (z2 - z1) * yperp * m00
mag1 * m00 = (z1 - m00) * gx + (z2 - z1) * gy
```

From this point, our condition becomes a bit more complex:

```
bool condition = (m00 < 0) ^ ((mag1 > 0.0) | (mag2 > 0.0))
```

Nevertheless, an evaluation of the condition and exclusive xor is much cheaper than the division operation.

comments powered by Disqus

All content on Easyperf blog is licensed under a Creative Commons Attribution 4.0 International License