In this short post I want to show an example of how denormal values, which you might use (unintentionally) in your calculations, can slow down your code, and especially how to detect this using performance counters. If you’re not familiar with denormal floats, now is a good time to read up on them.
Disclaimer: In this post I don’t touch the topic of how to disable denormal floats at the code/compiler level. There is lots of information about that on the web.
I put a division of two floats in a tight loop:
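// volatile forces x and y to be reloaded from memory on every iteration.
int bench(volatile float x, volatile float y)
{
    float sum = 0.0f;
    for (int i = 0; i < 100000000; i++)
    {
        sum = x / y;
        DoNotOptimize(sum); // keeps the division from being removed as dead code
        sum = 0.0f;
    }
    return (int)sum;
}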
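The DoNotOptimize helper is not shown in the snippet above. Here is a minimal sketch of how such a helper is commonly written for GCC and Clang, assuming inline-asm support. This is an assumption for illustration, not necessarily the version used for these measurements:

```cpp
// Hypothetical helper: an empty inline-asm statement that takes `value` as an
// input operand. The compiler must materialize the value (in a register or in
// memory) before the statement, so the computation that produced it cannot be
// discarded. The "memory" clobber also stops reordering of memory accesses
// across this point. GCC/Clang specific.
template <class T>
inline void DoNotOptimize(T const& value)
{
    asm volatile("" : : "r,m"(value) : "memory");
}
```

The helper emits no instructions of its own; it only constrains what the optimizer is allowed to remove.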
In the first case I pass two normal floats as arguments, and in the second case two denormal floats. Examples of denormal floats are the values whose bit patterns are 0xF and 0x7. Examples of normal floats are 0.1f and 0.2f. You can check their binary representations and compare.
Full code can be found on my GitHub.
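To make the difference concrete, here is a small sketch of how a float can be built from a raw bit pattern and classified. The names float_from_bits and is_denormal are hypothetical helpers introduced for illustration, and the sketch assumes a 32-bit IEEE-754 float and default floating-point compiler flags (no -ffast-math):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a 32-bit pattern as a single-precision float.
// memcpy is used instead of pointer casts to avoid undefined behaviour.
float float_from_bits(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: true when the value is subnormal (denormal), i.e. the
// exponent field is all zeros and the mantissa is non-zero.
bool is_denormal(float f)
{
    return std::fpclassify(f) == FP_SUBNORMAL;
}
```

With these helpers, is_denormal(float_from_bits(0xF)) and is_denormal(float_from_bits(0x7)) are true, while is_denormal(0.1f) and is_denormal(0.2f) are false.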
I built everything with gcc -O1 and checked that we have a loop with the division inside:
4009df: c5 fa 10 4c 24 fc vmovss xmm1,DWORD PTR [rsp-0x4]
4009e5: c5 fa 10 44 24 f8 vmovss xmm0,DWORD PTR [rsp-0x8]
4009eb: c5 f2 5e d0 vdivss xmm2,xmm1,xmm0
4009ef: c5 f9 7e d2 vmovd edx,xmm2
4009f3: 83 e8 01 sub eax,0x1
4009f6: 75 e7 jne 4009df <_Z5benchff+0x11>
Normal floats:
$ perf stat -e cycles,cpu/event=0xc2,umask=0x2,name=UOPS_RETIRED.RETIRE_SLOTS/,cpu/event=0xca,umask=0x1e,cmask=0x1,name=FP_ASSIST.ANY/,cpu/event=0x79,umask=0x30,name=IDQ.MS_UOPS/ ./a.out norm
x isnormal: yes
y isnormal: yes
Performance counter stats for './a.out norm':
303078534 cycles
502937703 UOPS_RETIRED.RETIRE_SLOTS
0 FP_ASSIST.ANY
808676 IDQ.MS_UOPS
0,081426690 seconds time elapsed
Denormal floats:
$ perf stat -e cycles,cpu/event=0xc2,umask=0x2,name=UOPS_RETIRED.RETIRE_SLOTS/,cpu/event=0xca,umask=0x1e,cmask=0x1,name=FP_ASSIST.ANY/,cpu/event=0x79,umask=0x30,name=IDQ.MS_UOPS/ ./a.out denorm
x isnormal: no
y isnormal: no
Performance counter stats for './a.out denorm':
15720344436 cycles
4721230495 UOPS_RETIRED.RETIRE_SLOTS
100000000 FP_ASSIST.ANY
4307771477 IDQ.MS_UOPS
4,154192419 seconds time elapsed
The first observation is that division on denormal values is about 50 times slower. No surprise, but let’s understand why that happens.
Whenever the CPU sees that it is processing a denormal value, it asks for a microcode assist. The Microcode Sequencer (MS) then feeds the pipeline with lots of UOPs to handle that scenario. We can see that in the slow case we have exactly 100000000 FP assists, one per loop iteration, while in the normal case there are zero. We can also see that in the slow case the major part of the UOPs comes from the MS.
So, here are your tools for detecting situations when your program starts doing calculations with denormal values.
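The arithmetic behind these observations can be spelled out with the counter values from the two perf runs. The sketch below is only a worked calculation; the struct and function names are hypothetical, and the MS share is an approximation, since IDQ.MS_UOPS counts UOPs delivered by the MS while UOPS_RETIRED.RETIRE_SLOTS counts retirement slots:

```cpp
#include <cstdint>

// Counter values copied from the perf output in this post.
struct PerfRun {
    std::uint64_t cycles;
    std::uint64_t retire_slots; // UOPS_RETIRED.RETIRE_SLOTS
    std::uint64_t fp_assists;   // FP_ASSIST.ANY
    std::uint64_t ms_uops;      // IDQ.MS_UOPS
};

constexpr std::uint64_t kIterations = 100000000ULL; // trip count of the loop in bench()

constexpr PerfRun kNormal{303078534ULL, 502937703ULL, 0ULL, 808676ULL};
constexpr PerfRun kDenormal{15720344436ULL, 4721230495ULL, 100000000ULL, 4307771477ULL};

// Average counter value per loop iteration.
constexpr double per_iteration(std::uint64_t count)
{
    return static_cast<double>(count) / static_cast<double>(kIterations);
}

// Slowdown of one run relative to another, measured in cycles.
constexpr double slowdown(const PerfRun& slow, const PerfRun& fast)
{
    return static_cast<double>(slow.cycles) / static_cast<double>(fast.cycles);
}

// Rough share of UOPs that came from the Microcode Sequencer.
constexpr double ms_share(const PerfRun& run)
{
    return static_cast<double>(run.ms_uops) / static_cast<double>(run.retire_slots);
}
```

Plugging in the numbers gives roughly 3 cycles per iteration for the normal run against roughly 157 for the denormal run, exactly one FP assist per iteration, about 43 MS UOPs per iteration, and an MS share of about 91% in the slow case.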