# Vectorization part2. Warmup.

Categories: vectorization

## 27 Oct 2017

Subscribe to my newsletter, support me on Patreon or by PayPal donation.

I want to do this warmup post, just to be sure that we are on the same page. It will also give you the taste of how the vector assembly looks like and how it works. I remember when I was learning vectorization stuff, I really missed the assembly that was generated by the compiler. I was like: “Show me assembly, I want to see how it’s done.” Even if there was assembly I didn’t understand how this hairy scary lines of code do the job.

I find the example in this post somewhat special. It might be not the most efficient assembly code, but in my opinion it nicely shows capabilities of the vectorization. So, in this post we will go through an assembly code line by line, visualizing the state of the registers after some portion of assembly code.

Here is the example: godbolt.

In general, compiler has quite hard time dealing with various control flow statements inside the loop when trying to vectorize it. Vectorization gets much harder when there is if statement inside the loop, and even infeasible when there is a break or throwing exception. However, in our example there is if statement and the loop was successfully vectorized by gcc.

``````int foo( short a[16], unsigned short b[16], unsigned short bias[16] )
{
int agg = 0;
for( int i = 0; i < 16; i++ )
{
if( a[i] > 0 )
a[i] = (bias[i] + a[i]) * b[i] >> 16;
else
a[i] = - ((bias[i] - a[i]) * b[i] >> 16);
agg += a[i];
}
return agg;
}
``````

Let’s look at first 4 instructions. Description of all instructions can be found here.

`````` vmovdqu ymm4,YMMWORD PTR [rdi] # a[]
vmovdqu ymm2,YMMWORD PTR [rdx] # bias[]
vmovdqu ymm5,YMMWORD PTR [rsi] # b[]
vmovdqa ymm6,YMMWORD PTR [rip+0x5c6] # 400b00
``````

They load our input arrays into registers. By x86 calling conventions for 64 bit architecture our first three arguments are stored in `rdi`, `rsi` and `rdx` respectively.

Here is what values are in registers after this step (I removed registers `ymm10-ymm15` because they are not used in this example):

### Extracting

Because our computations can overflow the size of short, we need to use 32 bit integers for computation. That’s why we first “unpack” all the 16-bit values and make them 32-bit.

`````` vpmovzxwd ymm1,xmm2
vextracti128 xmm8,ymm4,0x1
vextracti128 xmm2,ymm2,0x1
``````

`vpmovzxwd ymm1, xmm2` instruction zero extend 8 16-bit integers in `xmm2` to 8 32-bit integers and puts them into `ymm1`. `vextracti128 xmm8,ymm4,0x1` extract 128 most significant bits (3rd parameter `0x1` tells to do so) of integer data from `ymm4` and store results in `xmm8`.

Next stage is mostly similar to the previous one, so I will just skip it and only show the result. You can trace it by your own.

`````` vpmovzxwd ymm9,xmm5
vpmovzxwd ymm2,xmm2
vpmovsxwd ymm0,xmm4
vpmovsxwd ymm8,xmm8
vextracti128 xmm5,ymm5,0x1
``````

### Computation

Up to this moment we prepared all data into separate parts. Next assembly blob is actually doing computation inside the loop.

`````` vpaddd ymm3,ymm1,ymm0
vpmovzxwd ymm5,xmm5
vpsubd ymm0,ymm1,ymm0
vpsubd ymm1,ymm2,ymm8
vpmulld ymm0,ymm0,ymm9
vpmulld ymm1,ymm1,ymm5
vpmulld ymm3,ymm3,ymm9
vpmulld ymm7,ymm7,ymm5
``````

So, `vpaddd`, `vpsubd` and `vpmuld` do arithmetic operations on dwords and I believe you can easily extract what each instruction does from its name. I will mark temporary results with the signs of completed arithmetic operations. For example, `bias[i] + a[i]` as `result of +`, `(bias[i] - a[i]) * b[i]` as `result of - *`.

Updated state of the registers looks like this:

Finally, we do the right shifts.

`````` vpsrad ymm0,ymm0,0x10
vpand ymm1,ymm6,ymm1
vpand ymm0,ymm6,ymm0
``````

By doing `vpand ymm1,ymm6,ymm1` we just zeroing most significant half of the bits of each value in `ymm1` (recall that `ymm6` has values `0xFFFF` in it). We need to do that because `vpsrad` is doing sign extension after shifting.

### Packing

`````` vpackusdw ymm0,ymm0,ymm1
vpxor xmm1,xmm1,xmm1
``````

`vpackusdw ymm0, ymm0, ymm1` converts 8 16-bit signed integers from `ymm0` and 8 16-bit signed integers from `ymm1` and store them into 16 16-bit unsigned integers in `ymm0`.

### Handling if statement

`````` vpcmpgtw ymm4,ymm4,ymm1
vpand ymm3,ymm6,ymm3
vpand ymm7,ymm6,ymm7
vpackusdw ymm3,ymm3,ymm7
``````

This is the most interesting part of this example. Instruction `vpcmpgtw ymm4,ymm4,ymm1` compares 16-bit signed integers for greater than. Effectively this instruction does all 16 comparisons just in one instruction. Amazing! And for each pair of 16-bit ints it stores either `0xFFFF` or `0` in the destination vector, depending on the result of the comparison.

### Writing back

Note, that at this stage we have both branches of if statement computed (in `ymm0` and `ymm3`) and we know all outcomes of the if conditions (in `ymm4`). So, we need to select proper values from correct branch results.

`````` vpermq ymm0,ymm0,0xd8
vpermq ymm3,ymm3,0xd8
vpsubw ymm0,ymm1,ymm0
vpblendvb ymm0,ymm0,ymm3,ymm4
vpmovsxwd ymm1,xmm0
vmovdqu YMMWORD PTR [rdi],ymm0
``````

`vperfmq ymm0,ymm3,0xd8` permute qwords in ymm3 using indexes in the third operand(`0xd8`) and store the result in ymm0. We need to permute values in `ymm0` and `ymm3` because of the nature of `vpackusdw` (see above). When `vpackusdw` combines two arrays together it does interleaving. When combining `a[0..7]` and `a[8-15]` this instruction makes the output of `a[0..3] a[8-11] a[4-7] a[12-15]`. So, to receive correct result we need to swap elements in the middle. And this is exactly what `vpermq` instruction with the mask `0xd8` is doing.

`vpsubw ymm0,ymm1,ymm0` simply negates all the values in `ymm0`, completing final steps of computation.

`vpblendvb ymm0,ymm0,ymm3,ymm4` selects values from `ymm0` and `ymm3` according to the control in `ymm4` and stores them into `ymm0`.

And finally, `vmovdqu YMMWORD PTR [rdi],ymm0` stores values from `ymm0` back to memory (in `a[]`). What is left is calculate the return value (`agg`), bit I will leave this exercise for the reader.

### Conclusions

I want to point out a couple of things we learned from this example.

1. Note, that the loop was fully vectorized, meaning that there are no backward jump. It is simple for compiler to do because the trip count (number of loop iterations) is known.
2. Still there is lots of scalar code. I will explain why we need it in the next posts. Stay tuned.
3. Surprisingly, gcc is not using horizontal add for calculating `agg`. Clang does, but it is only able to vectorize the loop when using GVN-hoisting (-mllvm -enable-gvn-hoist). link to godbolt.
4. There is not so much register pressure in this function. Registers `ymm10-ymm15` are free for compiler to do something else. So, still there is some free space for the vectorizer to handle even more complex code.

As a final note I want to put a links to the two great presentations from CppCon2017: