Denis Bakhvalov

Summary for performance challenge #4. Part 1.

Categories: challenge

27 Mar 2021 by Ivica Bogosavljevic from johnysswlab.com

Contents:

The function gaussian_smooth consists of two loop nests (code).

Multiversioning of loop nest 1.

Here is the source code of the first loop nest:

for(r=0;r<rows;r++){
  for(c=0;c<cols;c++){
      dot = 0.0;
      sum = 0.0;
      for(cc=(-center);cc<=center;cc++){
        if(((c+cc) >= 0) && ((c+cc) < cols)){
            dot += (float)image[r*cols+(c+cc)] * kernel[center+cc];
            sum += kernel[center+cc];
        }
      }
      tempim[r*cols+c] = dot/sum;
  }
}

The access pattern for the arrays in the innermost loop is good, it is always the sequential pattern. Every time the loop counter cc in the innermost loop increases by one, the index into the image and kernel increases by one.

The problem is the trip count of the innermost loop: it is low. Vectorization doesn’t pay off in that case.

One of the proposed solution was to multiversion a loop. In our case, center had value 2. You don’t get this information from the profiler, but simply by printing the value of the center on the screen. We can create two versions of the loop nest, one where the value of center is known at compile time, and another one where it isn’t. Keep in mind that it is not alway beneficial to do. If for some workload, center is never equal to 2, we end up pessimizing the performance of the program. Once we do such multiversioning, we expose a loop with compile-time known trip count, which compilers can easily unroll. Here is the solution:

if (center == 2) {
  for(r=0;r<rows;r++){
    for(c=0;c<cols;c++){
        dot = 0.0;
        sum = 0.0;
        for(cc=(-2);cc<=2;cc++){
          if(((c+cc) >= 0) && ((c+cc) < cols)){
              dot += (float)image[r*cols+(c+cc)] * kernel[2+cc];
              sum += kernel[2+cc];
          }
        }
        tempim[r*cols+c] = dot/sum;
    }
  }
} else {
  for(r=0;r<rows;r++){
    for(c=0;c<cols;c++){
        dot = 0.0;
        sum = 0.0;
        for(cc=(-center);cc<=center;cc++){
          if(((c+cc) >= 0) && ((c+cc) < cols)){
              dot += (float)image[r*cols+(c+cc)] * kernel[center+cc];
              sum += kernel[center+cc];
          }
        }
        tempim[r*cols+c] = dot/sum;
    }
  }
}

The above approach is quite crude, we could have used C macros to achieve similar things with less copying. But the basic idea is there.

Interchange in loop nest 1.

Another approach is to do the loop interchange. If we could exchange the loop over cc and loop over c, we would get the innermost loop with a high trip count. Loop interchange is possible if two loops are perfectly nested, which is not our case. However, with a trick, they can become perfectly nested.

for(r=0;r<rows;r++){
  for(c=0;c<cols;c++){
      dot[c] = 0.0;
      sum[c] = 0.0;
      for(cc=(-2);cc<=2;cc++){
        if(((c+cc) >= 0) && ((c+cc) < cols)){
            dot[c] += (float)image[r*cols+(c+cc)] * kernel[center+cc];
            sum[c] += kernel[center+cc];
        }
      }
      tempim[r*cols+c] = dot[c]/sum[c];
  }
}

We converted temporary values dot and sum into arrays. By doing this, their values are preserved accross loop iterations. This allows us to break the loop over c into two loops:

for(r=0;r<rows;r++){
  for(c=0;c<cols;c++){
      dot[c] = 0.0;
      sum[c] = 0.0;
  }
  for(c=0;c<cols;c++){
      for(cc=(-2);cc<=2;cc++){
        if(((c+cc) >= 0) && ((c+cc) < cols)){
            dot[c] += (float)image[r*cols+(c+cc)] * kernel[center+cc];
            sum[c] += kernel[center+cc];
        }
      }
  }
  for(c=0;c<cols;c++){
      tempim[r*cols+c] = dot[c]/sum[c];
  }
}

Loops over c and cc are now perfectly nested and they can be interchanged.

Interchange in loop nest 2.

The second loop nest looks similar to the first one:

for(c=0;c<cols;c++){
  for(r=0;r<rows;r++){
      sum = 0.0;
      dot = 0.0;
      for(rr=(-center);rr<=center;rr++){
        if(((r+rr) >= 0) && ((r+rr) < rows)){
            dot += tempim[(r+rr)*cols+c] * kernel[center+rr];
            sum += kernel[center+rr];
        }
      }
      (*smoothedim)[r*cols+c] = (short int)(dot*BOOSTBLURFACTOR/sum + 0.5);
  }
}

The access pattern in the innermost loop is bad this time. Access to the array tempim is with a stride cols. Every time the value of rr increases by one we are accessing the elements which are cols places away from the previous element. This access pattern is bad from the performance point of view. To get rid of it, we would need to do a complex loop interchange to move the loop over c to the innermost position.

To do this, we first interchange the loop over c and the loop over r. Then we interchange the loop over c and loop over rr. The final solution looks like this:

for (r = 0; r<rows; r++){
  for(c=0;c<cols;c++){
      dot[c] = 0.0;
      sum[c] = 0.0;
  }
  for (rr = (-center); rr <= center; rr++){
    if (((r + rr) >= 0) && ((r + rr) < rows)){
      for (c = 0; c<cols; c++){
        dot[c] += tempim[(r + rr) * cols + c] * kernel[center + rr];
        sum[c] += kernel[center + rr];
      }
    }
  }
  for (c = 0; c < cols; c++) {
    (*smoothedim)[r*cols + c] = (short int)(dot[c]*BOOSTBLURFACTOR / sum[c] + 0.5);
  }
}

One of the contestants also noted that the sum[c] has a fixed value 1 for all the pixels except the pixels at the edge of the image. This would allow us to get rid of the expensive division when sum[c] is 1 in the calculation of values for smoothedim.

And finally, note that we are allocating a temporary image tempim for our calculation. One of the contestants managed to merge the above two loops nests into one loop nest, thus avoiding the need to allocate a huge buffer.


comments powered by Disqus

Subscribe to get more updates from me:

If you like this blog, support me on Patreon or by PayPal donation.

All content on Easyperf blog is licensed under a Creative Commons Attribution 4.0 International License