Vectorization part 7. Tips for writing vectorizable code.

Categories: vectorization

10 Nov 2017


This post wraps up the series. So far we have looked at very simple examples where vectorization either happens or does not. Real code is usually more complicated. What should you do then, and how do you make use of the vectorization capabilities of your CPU?

To answer this question, I want to highlight the typical reasons why code does not get vectorized and give some guidelines for writing vectorizable code.

Typical reasons for a loop not being vectorized.

Low trip count
Not Inner Loop
Existence of vector dependence
Vectorization possible but seems inefficient
Condition may protect exception
Data type unsupported
Subscript too complex
Unsupported loop structure
Statement inside the loop unsuited for vectorization
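
To make "existence of vector dependence" more concrete, here is a minimal sketch contrasting a loop with a loop-carried dependence against one without. The function names `prefix_sum_in_place` and `add_constant` are hypothetical and chosen only for illustration. Whether a given compiler vectorizes either loop depends on the compiler, its version and the flags used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop-carried (read-after-write) dependence: iteration i reads the value
// that iteration i-1 has just written, so the iterations cannot simply be
// executed side by side in vector lanes.
void prefix_sum_in_place(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] += a[i - 1];
    }
}

// No dependence between iterations: out[i] depends only on in[i] and c,
// which makes this loop a good candidate for vectorization.
void add_constant(const std::vector<int>& in, std::vector<int>& out, int c) {
    assert(in.size() == out.size());  // illustrative precondition
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] + c;
    }
}
```

To check what your compiler decided for loops like these, Clang accepts `-Rpass=loop-vectorize` and `-Rpass-missed=loop-vectorize`, and GCC accepts `-fopt-info-vec-missed`. The exact wording of the remarks differs between compilers and versions.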

General tips for writing vectorizable code.

Favor simple for loops
Write straight line code. Avoid:
- Function calls
- Branches that cannot be treated as masked assignments
Avoid dependencies between loop iterations
- Avoid read-after-write dependencies
Prefer array notation to the use of pointers
- Or provide help for compiler to understand
- Try to use the loop index directly in array subscripts, instead of incrementing a separate counter for use as an array address
Use efficient memory addresses
- Favor inner loops with unit stride
- Minimize indirect addressing
Align your data where possible to some boundary (32 bytes in case of AVX)
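
A few of the tips above can be sketched in code. This is only an illustration under stated assumptions: the function and type names are hypothetical, and `__restrict` is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++. Modern compilers often handle the pointer-walking version as well, so treat the difference as something to verify in the optimization report rather than as a guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Pointer-walking style: separate pointers are incremented by hand.
// The compiler has to prove on its own that this is a plain unit-stride
// access and that x and y do not overlap.
void saxpy_pointer_walk(float a, const float* x, float* y, std::size_t n) {
    const float* px = x;
    float* py = y;
    for (std::size_t k = 0; k < n; ++k) {
        *py = a * (*px) + *py;
        ++px;
        ++py;
    }
}

// Array-notation style: the loop index is used directly in the subscripts
// (unit stride), and __restrict is a promise from the caller that x and y
// do not overlap. Breaking that promise is undefined behavior.
void saxpy_indexed(float a, const float* __restrict x,
                   float* __restrict y, std::size_t n) {
    assert(x != y || n == 0);  // partial debug check of the no-overlap promise
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// 32-byte aligned storage, matching the AVX boundary mentioned above.
struct alignas(32) AlignedBlock {
    float values[8];
};
static_assert(alignof(AlignedBlock) == 32, "block should be 32-byte aligned");
```

Both functions compute the same result. The second one just hands the compiler more of the information it needs up front.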

However, the main advice is: check the compiler optimization reports to understand what the compiler did for you. If you measured and your code stil

Other resources

Some items from the two checklists above were taken from the Intel Compiler Autovectorization Guide. I really recommend it, even though it is slightly outdated.

Specifically, I want to point out that compilers can perform all sorts of loop transformations to make vectorization possible. I recommend at least familiarizing yourself with the basic loop transformations. For example, a compiler may apply one of them when it helps eliminate a loop dependency, which in turn enables vectorization.
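
As one example of such a transformation, here is a hand-written sketch of loop distribution (also called loop fission). It is not taken from any particular compiler; the names `fused` and `distributed` are hypothetical, and all three vectors are assumed to have the same size. The original loop mixes a statement with a loop-carried dependence and an independent statement. Splitting it isolates the dependence in one loop and leaves a second loop that is free of it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original form: the update of a[i] depends on a[i - 1], which holds back
// the whole loop body, including the independent update of c[i].
void fused(std::vector<int>& a, const std::vector<int>& b,
           std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];  // loop-carried dependence
        c[i] = b[i] * 2;         // independent of other iterations
    }
}

// After loop distribution: the dependent statement stays in a scalar loop,
// while the second loop has no dependence and can be vectorized.
void distributed(std::vector<int>& a, const std::vector<int>& b,
                 std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
    }
    for (std::size_t i = 1; i < a.size(); ++i) {
        c[i] = b[i] * 2;
    }
}
```

The two versions produce identical results because the statement writing `c` never reads `a`. That is the property a compiler has to prove before it can split the loop on its own.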

This is a really nice article with lots of examples: Crunching numbers with AVX and AVX2. It is a good guide if you want to try writing vector intrinsics, and it has nice pictures of how particular hardware instructions work.

Vectorization codebook gives a rather high-level view of the topic, with links to more detailed documents.

All posts from this series:

Vectorization intro.
Vectorization warmup.
Checking compiler vectorization report.
Vectorization width.
Multiversioning by data dependency.
Multiversioning by trip counts.
Tips for writing vectorizable code (this article).

Denis Bakhvalov