Vectorization part3. Compiler report.

Categories: vectorization

30 Oct 2017

This post will be short, but compiler optimization reports are important to know about because they can save you a lot of time. Sometimes you want to know whether your loop was vectorized or not, unrolled or not. If it was unrolled, what is the unroll factor? Was your function inlined? There is a hard way to find out: looking at the assembly. This can be really hard if the function is big, or if it has many loops that were also vectorized, or if the compiler created multiple versions of the same loop, OMG.

There is a more convenient way to find that out: checking the compiler report. For example, for the following code (link to godbolt):

#include <cstddef>

void add_arrays(float* a, float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}

To emit an opt report in clang, you need to pass the -Rpass* flags:

$ clang -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c a.cpp
a.cpp:5:5: remark: vectorized loop
  (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize]
    for (std::size_t i = 0; i < n; ++i)
    ^

Great, so at least now we know that our loop was vectorized with a vectorization width of 4 (see the next posts for what that means) and that the vectorized loop iterations were interleaved with a count of 2. You may still want to check the assembly, as it might surprise you in some cases, but the report gives a good starting point and a quick way to check things. However, it takes a little experience to understand what those parameters mean and to fully leverage compiler opt reports.

Compiler Explorer has a cool opt report viewer. Just hover your mouse over a line of code and you will see all the high-level optimizations that were performed on that loop.
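
To get a rough feeling for those two numbers, here is a minimal scalar sketch of the shape of a loop vectorized with width 4 and interleave count 2. It is only a mental model, not the code clang emits: the function name add_arrays_w4_i2 and the inner "lane" loops are made up for illustration, and each inner loop stands in for a single vector instruction working on 4 floats.

```cpp
#include <cstddef>

// Conceptual model of "vectorization width: 4, interleaved count: 2".
// Hypothetical helper for illustration; not actual compiler output.
void add_arrays_w4_i2(float* a, const float* b, std::size_t n)
{
    constexpr std::size_t width = 4;       // elements per vector operation
    constexpr std::size_t interleave = 2;  // vector operations per iteration
    constexpr std::size_t step = width * interleave;

    std::size_t i = 0;
    // Main loop: 8 elements per iteration, as two independent groups of 4.
    for (; i + step <= n; i += step)
    {
        for (std::size_t lane = 0; lane < width; ++lane)  // first "vector"
            a[i + lane] += b[i + lane];
        for (std::size_t lane = 0; lane < width; ++lane)  // second "vector"
            a[i + width + lane] += b[i + width + lane];
    }
    // Scalar remainder loop for the elements that do not fill a full step.
    for (; i < n; ++i)
        a[i] += b[i];
}
```

With n = 11, for example, the main loop runs once and covers elements 0..7, and the remainder loop handles elements 8..10.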

Sometimes, vectorization fails. For example:

#include <cstddef>

void add_arrays(float* a, float* b, std::size_t n)
{
    float agg = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        a[i] += b[i];
        agg += b[i];
        if (agg > 100)
            break;
    }
}

Opt report:

a.cpp:6:5: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
    for (std::size_t i = 0; i < n; ++i)
    ^
a.cpp:6:5: remark: loop not vectorized: could not determine number of 
    loop iterations [-Rpass-analysis=loop-vectorize]
a.cpp:6:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
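
The "could not determine number of loop iterations" remark is easier to appreciate once you see that the trip count here depends on the data in b, not only on n. The sketch below is the same loop with one addition that is not part of the original example: a hypothetical counter (processed) that is returned so the effect can be observed.

```cpp
#include <cstddef>

// Same early-exit loop, but it also reports how many elements were updated.
// The counter and the return value are additions for illustration only.
std::size_t add_arrays_count(float* a, const float* b, std::size_t n)
{
    float agg = 0.0f;
    std::size_t processed = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        a[i] += b[i];
        agg += b[i];
        ++processed;
        if (agg > 100)  // exit depends on the running sum of b[0..i]
            break;
    }
    return processed;
}
```

If every element of b is 30 and n is 10, the running sum exceeds 100 on the fourth element, so only 4 elements of a are updated. If every element of b is 1, all 10 are processed. The compiler cannot know which case it is dealing with before the loop starts.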

Sometimes you will see reports about missed vectorization opportunities because the compiler decided it was not beneficial to vectorize the loop, for example because there were not enough iterations. The vectorizer has an internal cost model, which the compiler uses to decide whether to vectorize a particular loop.

The situation gets a little more complicated when you are using LTO. When you build with LTO, clang does not produce native object files but bitcode (intermediate representation), which is combined into an executable at the linking stage. So the final decision about whether it’s beneficial to vectorize the loop may now happen at the LTO stage. For example, the compiler may have inlined the function call there, so it now knows all possible trip counts of the loop. That is why passing -Rpass* along with -flto won’t print anything. To see opt reports in this case, first add debug information (-g) to the compilation of the file you are interested in; without debug info, the report will have no filenames and line numbers. After that, you need to pass additional options to the linking stage:

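Gold plugin - pass -Wl,-plugin-opt,-pass-remarks=loop-vectorize -Wl,-plugin-opt,-pass-remarks-missed=. etc.
LLD linker - pass -Wl,-mllvm -Wl,-pass-remarks=loop-vectorize -Wl,-mllvm -Wl,-pass-remarks-missed=. etc.
The value after = is a regular expression matched against pass names, so a single . selects every pass.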

Other compilers

For gcc you need to pass -ftree-vectorize -ftree-vectorizer-verbose=X, where X is the verbosity level. Newer gcc versions deprecate -ftree-vectorizer-verbose in favor of the -fopt-info family, for example -fopt-info-vec and -fopt-info-vec-missed. More about this here.

I find the opt reports from the Intel Compiler (icc) the most usable. They show whether the loop was multiversioned, they can be filtered by line of code, etc. Also, opt reports with LTO (which is an issue in clang) work with no additional steps from the user: icc remembers that the user requested an opt report at the compilation stage and generates the output in a text file at the linking stage (in icc it is called IPO - Inter Procedural Optimization). More links for icc here and here.

All posts from this series:

Vectorization intro.
Vectorization warmup.
Checking compiler vectorization report (this article).
Vectorization width.
Multiversioning by data dependency.
Multiversioning by trip counts.
Tips for writing vectorizable code.

Denis Bakhvalov
