Summary for performance challenge #4. Part 2.

Categories: challenge

27 Mar 2021 by Ivica Bogosavljevic from johnysswlab.com

Contents:

Loop interchange for achieving sequential memory accesses.

Function derivative_x_y also consists of two loops that are processing the image. The first loop nest processes the image row-wise and the second column-wise.

As far as performance is concerned, the first loop nest has a sequential memory access pattern and from the performance point of view, there is nothing that needs to be done there. CLANG auto-vectorized the loop nest, and from the performance point of view, nothing needs to be done there. You can click here to see the source code of the first loop nest.

Loop interchange for achieving sequential memory accesses.

The second loop nest is more problematic. We are giving its source code here:

for(c=0;c<cols;c++){
  pos = c;
  (*delta_y)[pos] = smoothedim[pos+cols] - smoothedim[pos];
  pos += cols;
  for(r=1;r<(rows-1);r++,pos+=cols){
      (*delta_y)[pos] = smoothedim[pos+cols] - smoothedim[pos-cols];
  }
  (*delta_y)[pos] = smoothedim[pos] - smoothedim[pos-cols];
}

In the innermost loop (over r) you can see the expression (*delta_y)[pos] = smoothedim[pos+cols] - smoothedim[pos-cols]. The variable pos is increased by cols, which represents inefficient strided memory access.

We could try loop interchange here is well, by interchanging the loop over c and loop over r. But to do that, first we need to get rid of induction variable pos. We want to be able to calculate index into arrays for each iteration that depends only on c and r. Here is how the rewrite goes:

int index;
for(c=0;c<cols;c++){
  (*delta_y)[c] = smoothedim[c+cols] - smoothedim[c];
  for(r=1;r<(rows-1);r++,pos+=cols){
      index = c + r * cols;
      (*delta_y)[index] = smoothedim[index+cols] - smoothedim[index-cols];
  }
  index = c + (rows - 1) * cols
  (*delta_y)[index] = smoothedim[index] - smoothedim[index];
}

To make the loop over c and loop over r perfectly nested, we have get rid of the lines that are preventing the nesting. In this case, all the iterations of the loop are independent, and this can be achieved easily. We move the epilogue and prologue of the loop ver r into separate loops. The transformed code looks like this:

int index;
for(c=0;c<cols;c++){
  (*delta_y)[c] = smoothedim[c+cols] - smoothedim[c];
  index = c + (rows - 1) * cols
  (*delta_y)[index] = smoothedim[index] - smoothedim[index];
}
for(c=0;c<cols;c++){
  for(r=1;r<(rows-1);r++,pos+=cols){
      index = c + r * cols;
      (*delta_y)[index] = smoothedim[index+cols] - smoothedim[index-cols];
  }
}

The last step is to interchange the loops over c and loops over r and with this, the accesses to arrays delta_y and smoothedim become sequential.

One contestant noticed that it is convenient to keep the arrays delta_x and delta_y as positive values, since this simplifies the processing in the function non_max_supp. A simple abs(smoothedim[index+cols] - smoothedim[index-cols]) while processing the data in the loop does the trick.

Denis Bakhvalov

Summary for performance challenge #4. Part 2.

27 Mar 2021 by Ivica Bogosavljevic from johnysswlab.com

Loop interchange for achieving sequential memory accesses.

Subscribe to get more updates from me: