Denis Bakhvalov

Summary for performance challenge #4. Part 4.

Categories: challenge

27 Mar 2021 by Ivica Bogosavljevic from


Function apply_hysteresis consists of several loops, out of which only two have a measurable impact on performance and we will deal with them.

Replacing branches with arithmetics.

The first loop belongs to the class of histogram computation loops and looks like this:

    if(edge[pos] == POSSIBLE_EDGE) hist[mag[pos]]++;

The above code will suffer from both a large data cache miss rate (because of the random access to array hist) and a large number of branch mispredictions. There isn’t much we can do related to the data cache miss rate, but we can get rid of the branch misprediction very simply, by doing:

hist[mag[pos]]+= (edge[pos] == POSSIBLE_EDGE);

The above code is equivalent to if(edge[pos] == POSSIBLE_EDGE) hist[mag[pos]]++ but 1.5 times faster.

Optimizing recursive function calls.

A second loop with a performance impact is following:

for (r = 0, pos = 0; r<rows; r++){
  for (c = 0; c<cols; c++, pos++){
    if ((edge[pos] == POSSIBLE_EDGE) && (mag[pos] >= highthreshold)){
      edge[pos] = EDGE;
      follow_edges((edge + pos), (mag + pos), lowthreshold, cols);

The loop itself doesn’t say a lot, so we have to look at the function follow_edges:

void follow_edges(unsigned char *edgemapptr, short *edgemagptr, short lowval,
    int cols)
    short *tempmagptr;
    unsigned char *tempmapptr;
    int i;
    float thethresh;
    int x[8] = { 1, 1, 0, -1, -1, -1, 0, 1 },
        y[8] = { 0, 1, 1, 1, 0, -1, -1, -1 };

    for (i = 0; i<8; i++){
        tempmapptr = edgemapptr - y[i] * cols + x[i];
        tempmagptr = edgemagptr - y[i] * cols + x[i];

        if ((*tempmapptr == POSSIBLE_EDGE) && (*tempmagptr > lowval)){
            *tempmapptr = (unsigned char)EDGE;
            follow_edges(tempmapptr, tempmagptr, lowval, cols);

The function is recursive. There are several optimizations that can be done on recursive functions:

  • Decrease the number of parameters passed to the function
  • Move read-only stack-allocated parameters to global memory
  • Make the function tail-recursive (not possible in this case)

This function allocates arrays x and y in each invocation, but they are read-only. Changing the type from int to static int will decrease the size of the function’s stack frame and save a few instructions.

In this code, we are changing all pixels with POSSIBLE_EDGE to EDGE if certain conditions are fulfilled. In the later loop (not shown here), all the remaining pixels which are POSSIBLE_EDGE are converted into NOEDGE. To decrease the amount of work in the critical loop, we can first convert all pixels with mag[pos] <= lowthreshold to NOEDGE. With this preprocessing we’ve decreased the number of pixels with POSSIBLE_EDGE and our ciritcal and complex loop has less work to do. Here how it looks in code:

      edge[pos] = (mag[pos] <= lowthreshold) & (edge[POS] == POSSIBLE_EDGE) ? NOEDGE : edge[pos];

        if(edge[pos] == POSSIBLE_EDGE){
          edge[pos] = EDGE;
          follow_edges((edge+pos), cols);

Another idea is to convert follow_edges to a non-recursive function, but in this case, our contestants noticed that this didn’t have any performance impact.

A third possibility would be to inline follow_edge in itself once. Nobody did it so we didn’t follow this through.

For this test, we used the CLANG compiler, but Intel’s compiler has a nice feature where it can vectorize a function using OpenMP SIMD directive #pragma omp declare simd. It would be interesting to measure the performance impact of vectorization of this loop, but unfortunately, we didn’t use it in our tests.

comments powered by Disqus

Subscribe to get more updates from me:

If you like this blog, support me on Patreon, Github, or by PayPal donation.

All content on Easyperf blog is licensed under a Creative Commons Attribution 4.0 International License