In this post we will dig deeper into a different kind of multiversioning: this time we will look at creating multiple versions of the same loop for different trip counts. If you haven't read part 4: vectorization width yet, I encourage you to do that, because we will rely on knowledge from that post a lot.
In the post mentioned above I showed how it can be beneficial to optimize a function when you know the data you are working with. Specifically, in the section "Why we care about vectorization width?" I left off at the case where there are two different trip counts and you can't just optimize for one of them. Let's get back to this case.
I decided to write a benchmark to show how we can use the knowledge of our data to improve performance. Here is my baseline code:
void add_arrays_scalar(unsigned char* a, unsigned char* b, unsigned n)
{
    unsigned chunks = 32 / n;
    for (unsigned k = 0; k < chunks; ++k)
    {
        for (unsigned i = 0; i < n; ++i)
            a[i] += b[i];
        a += n;
        b += n;
    }
}
This code processes two 32-byte arrays in chunks of n bytes. So, for n = 4 there will be 8 chunks of 4 bytes, i.e. 8 outer loop iterations.
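For reference, here is how the chunking works out for the trip counts that will matter later in this post (a simple breakdown of the arithmetic above):

// chunks = 32 / n:
//   n = 4  ->  8 chunks of  4 bytes (8 outer loop iterations)
//   n = 8  ->  4 chunks of  8 bytes (4 outer loop iterations)
//   n = 16 ->  2 chunks of 16 bytes (2 outer loop iterations)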
The example is obviously contrived, but the pattern is quite common in image processing, where pixels are processed in chunks with some stride. I have seen real-world code where this pattern shows up and where the improvement we will see in this post is valid.
Let's say we know all possible values of n: they are 4, 8 and 16. Invocation of this function looks like this:
// 1. All trip counts (tcXX variables) are read from a file,
//    so the compiler doesn't know them at compile time.
// 2. a and b contain random numbers and values can wrap around;
//    we don't care about that here.
add_arrays_scalar(a, b, tc4);  // tc4 = 4   (inner loop executed 8 times)
add_arrays_scalar(a, b, tc8);  // tc8 = 8   (inner loop executed 4 times)
add_arrays_scalar(a, b, tc16); // tc16 = 16 (inner loop executed 2 times)
// In the end I ensure the results are not optimized away.
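The actual harness is in the repository linked below. Purely as an illustration, a minimal sketch of how such a benchmark could be set up with Google Benchmark might look like this (the function BM_Scalar, the data initialization and the byte accounting are my assumptions, not the original code):

#include <benchmark/benchmark.h>
#include <cstring>

void add_arrays_scalar(unsigned char* a, unsigned char* b, unsigned n); // defined above

static void BM_Scalar(benchmark::State& state)
{
    unsigned char a[32], b[32];
    std::memset(a, 1, sizeof(a));
    std::memset(b, 2, sizeof(b));
    // In the real benchmark the trip counts are read from a file so that the
    // compiler cannot see them; here they are plain variables for brevity.
    unsigned tc4 = 4, tc8 = 8, tc16 = 16;
    for (auto _ : state)
    {
        add_arrays_scalar(a, b, tc4);
        add_arrays_scalar(a, b, tc8);
        add_arrays_scalar(a, b, tc16);
        benchmark::DoNotOptimize(a[0]); // keep the results alive
        benchmark::ClobberMemory();
    }
    // Rough throughput accounting; the exact numbers in the original may differ.
    state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * 32 * 3);
}
BENCHMARK(BM_Scalar);
BENCHMARK_MAIN();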
Compiler options are: -O3 -march=core-avx2.
By default clang 5.0 autovectorizes the inner loop with vectorization width 32 and interleaves it by a factor of 4, so it processes 128 bytes in one iteration. It also does multiversioning by DD, i.e. it creates two scalar versions of the loop with run-time trip count dispatching: one unrolled by a factor of 8 and one not unrolled, processing one byte at a time (a sort of fallback option).
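To make that description more concrete, here is a hand-written approximation of the shape of the code the compiler produces for the inner loop. This is only my sketch of the structure, not actual compiler output, and the real dispatch conditions may differ:

void add_arrays_inner_approx(unsigned char* a, unsigned char* b, unsigned n)
{
    unsigned i = 0;
    // Autovectorized path: width 32, interleaved by 4 => 128 bytes per iteration.
    for (; i + 128 <= n; i += 128)
        for (unsigned j = 0; j < 128; ++j)  // in reality four 32-byte vector adds
            a[i + j] += b[i + j];
    // Scalar version unrolled by a factor of 8.
    for (; i + 8 <= n; i += 8)
        for (unsigned j = 0; j < 8; ++j)    // in reality eight straight-line scalar adds
            a[i + j] += b[i + j];
    // Scalar fallback, one byte at a time.
    for (; i < n; ++i)
        a[i] += b[i];
}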
All code for the benchmark can be found here.
If we profile the baseline case we will notice that the vector code is completely cold. That's not surprising, because all our possible trip counts are smaller than 128 (vectorization width * interleave factor). For n = 8 and n = 16 the scalar unrolled version is used, and for n = 4 the basic scalar version is used.
Another thing worth mentioning is that those function invocations have different weights: for n = 4 the inner loop is executed 8 times, while for n = 16 it is executed only 2 times. That's why the fact that our calls with n = 4 go through the simple scalar loop (processing one element at a time) hurts performance more than missing the unrolled loop would for n = 16. The reasoning behind this is that for larger trip counts fewer branch instructions are executed. We will see proof of that when we try to optimize our function for different trip counts.
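As a rough back-of-the-envelope illustration, assuming one loop-control branch per iteration of each loop (a simplification that ignores everything else the loops do):

n = 4 :  8 outer iterations + 8 * 4  inner iterations = 40 loop-control branches
n = 16:  2 outer iterations + 2 * 16 inner iterations = 34 loop-control branches

And when the n = 16 call additionally goes through the loop unrolled by a factor of 8, the inner count drops to about 2 * 2 iterations, leaving roughly 6 loop-control branches.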
So, I tried four more ways to optimize the function:
The first three versions try to optimize for a specific trip count by adding pragmas to the inner loop:
#pragma clang loop vectorize(enable) vectorize_width(X)
#pragma clang loop interleave(disable) unroll(disable)
where X is the vectorization width. As you can see, I disabled unrolling and interleaving, because otherwise clang would do that and the vector version of the loop would never be hit.
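For example, the width-8 variant could look roughly like this (reconstructed from the description above; the exact code and the function name in the linked repository may differ):

void add_arrays_width8(unsigned char* a, unsigned char* b, unsigned n)
{
    unsigned chunks = 32 / n;
    for (unsigned k = 0; k < chunks; ++k)
    {
#pragma clang loop vectorize(enable) vectorize_width(8)
#pragma clang loop interleave(disable) unroll(disable)
        for (unsigned i = 0; i < n; ++i)
            a[i] += b[i];
        a += n;
        b += n;
    }
}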
The last version (multiversioning by trip counts) looks like this:
void add_arrays_multiver_by_trip_counts(
    unsigned char* a,
    unsigned char* b,
    unsigned n)
{
    unsigned chunks = 32 / n;
    for (unsigned k = 0; k < chunks; ++k)
    {
        if (n == 4)
        {
#pragma clang loop vectorize(enable) vectorize_width(4)
#pragma clang loop interleave(disable) unroll(disable)
            for (unsigned i = 0; i < 4; ++i)
                a[i] += b[i];
        }
        else if (n == 8)
        {
#pragma clang loop vectorize(enable) vectorize_width(8)
#pragma clang loop interleave(disable) unroll(disable)
            for (unsigned i = 0; i < 8; ++i)
                a[i] += b[i];
        }
        else if (n == 16)
        {
#pragma clang loop vectorize(enable) vectorize_width(16)
#pragma clang loop interleave(disable) unroll(disable)
            for (unsigned i = 0; i < 16; ++i)
                a[i] += b[i];
        }
        else
        {
            for (unsigned i = 0; i < n; ++i)
                a[i] += b[i];
        }
        a += n;
        b += n;
    }
}
I ran the benchmark on an Intel Core i7-6700 processor:
---------------------------------------------------------------
Benchmark Time CPU Iterations Throughput
---------------------------------------------------------------
Scalar 72 ns 72 ns 96790551 370.835MB/s (baseline)
Vectorization_width_4 52 ns 52 ns 133665303 510.321MB/s (+37.6%)
Vectorization_width_8 48 ns 48 ns 146755270 559.533MB/s (+50.9%)
Vectorization_width_16 55 ns 55 ns 128159636 488.847MB/s (+31.8%)
Multiver_by_trip_counts 44 ns 44 ns 157490466 601.103MB/s (+62.1%)
Our final version is 62% faster than the baseline. Let's figure out why. I profiled each version to understand what is going on:
Vectorization_width_4: all calls now go through the vector version of the loop, but we only use 1/4 of the xmm register capacity. We can do better for trip counts 8, 16.
Vectorization_width_8: the vector loop handles the calls with n = 8 and n = 16, while the call with n = 4 still goes through scalar code. However, it was a surprise to me that this version outperforms the version optimized for n = 4.
Vectorization_width_16: only the call with n = 16 reaches the vector loop; the other calls fall back to scalar code.
Multiver_by_trip_counts: every call goes through a vector loop matching its trip count, which is why this version wins.
This post is not meant to encourage you to optimize every routine like this - please don't do that. Do it only if your measurements show a significant benefit from such a change. If your compiler is not doing the best job for your hot function today, it probably will someday. It's not always beneficial to vectorize loops with a small trip count; sometimes it's better to do full unrolling. The compiler will be many times better than you at figuring out such things.
Also, this optimization doesn't make much sense when you have big arrays: the compiler has already prepared code for that case (remember the autovectorized version that processes 128 bytes at a time, mentioned at the beginning of the article).
Another thing worth mentioning is that in the fastest attempt there are at least 6 versions of the same loop (3 handwritten + 1 autovectorized + 1 unrolled + 1 fallback). This increases code size significantly!
In this case enabling LTO doesn’t make much of a difference (results are mostly the same).
However, if you replace the trip count arguments with constant values (4, 8, 16) and enable LTO (pass the -flto flag), the compiler will propagate these constants into the function and the scalar version will beat all the others! I profiled this case and noticed that the compiler recognized my dirty trick with processing the array in chunks and realized that, in a nutshell, there is no difference between those 3 function calls - they all do the same thing.
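In other words, here is my illustration of what the compiler effectively figures out (not its actual output): with the constants visible, each of the three calls touches all 32 bytes and boils down to the same thing:

// add_arrays_scalar(a, b, 4), (a, b, 8) and (a, b, 16) each add all 32 bytes
// of b into a, so every call is roughly equivalent to:
for (unsigned i = 0; i < 32; ++i)
    a[i] += b[i];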
I want to share a great talk from CppCon 2014 by Mike Acton: Data-Oriented Design and C++.
Want performance - know your data.
UPD: one more article related to the topic of multiversioning: Function multi-versioning in GCC 6.