Using denormal values is slow. How to detect it?

Categories: tuning

08 Nov 2018

In this short post I want to show an example of how denormal values, which you might use unintentionally in your calculations, can slow down your code, and especially how to detect that using performance counters. If you're not familiar with denormal floats, now is a good time to read about them.

Disclaimer: In this post I don't touch on how to disable denormal floats at the code or compiler level. There is plenty of information about that on the web.

Example

I put a division of two floats in a tight loop:

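``````
// x and y are volatile, so they are reloaded from memory on every iteration.
int bench(volatile float x, volatile float y)
{
  float sum = 0.0f;
  for (int i = 0; i < 100000000; i++)
  {
    sum = x / y;         // the division being measured
    DoNotOptimize(sum);  // keeps the compiler from discarding the result
    sum = 0.0f;
  }
  return (int)sum;
}
``````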
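`DoNotOptimize` is not defined in the snippet above. Below is a minimal sketch of how such a helper is commonly written for GCC and Clang: an empty inline-asm statement that takes the value as an input. This is an assumption for illustration and not necessarily the definition used in the full code. The function `divide_repeatedly` is a hypothetical variant of `bench` with a configurable iteration count.

```cpp
#include <cassert>

// Assumed helper (GCC/Clang only): the empty asm statement claims to read
// `value` and to touch memory, so the compiler has to compute the value.
template <class T>
inline void DoNotOptimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Hypothetical variant of bench() with the iteration count as a parameter.
float divide_repeatedly(float x_in, float y_in, int iterations) {
    assert(iterations > 0);
    // Volatile copies force a reload of both operands on every iteration.
    volatile float x = x_in;
    volatile float y = y_in;
    float quotient = 0.0f;
    for (int i = 0; i < iterations; i++) {
        quotient = x / y;
        DoNotOptimize(quotient);
    }
    return quotient;
}
```

Calling `divide_repeatedly(1.0f, 2.0f, 10)` performs ten divisions and returns the last quotient.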

In the first case I pass two normal floats as arguments, and in the second case two denormal floats. Examples of denormal floats are the values whose bit patterns are 0xF and 0x7. Examples of normal floats are 0.1f and 0.2f. You can check their binary representations and compare.
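To check the representations yourself, here is a small sketch that builds a float from a raw 32-bit pattern and classifies it. It assumes a 32-bit IEEE-754 `float` and a build without `-ffast-math`. The helper names `float_from_bits`, `is_denormal` and `check_examples` are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float. memcpy avoids aliasing problems.
float float_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// A denormal (subnormal) float has a zero exponent field and a nonzero mantissa.
bool is_denormal(float value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}

// Checks the example inputs mentioned in the text.
void check_examples() {
    assert(is_denormal(float_from_bits(0xF)));
    assert(is_denormal(float_from_bits(0x7)));
    assert(std::isnormal(0.1f));
    assert(std::isnormal(0.2f));
}
```

Call `check_examples()` from `main` to run the checks. The patterns 0xF and 0x7 only set low mantissa bits and leave the exponent field at zero, which is what makes them denormal.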

The full code can be found on my GitHub.

I built everything with `gcc -O1` and checked that we have a loop with division inside:

``````
  4009df:	c5 fa 10 4c 24 fc    	vmovss xmm1,DWORD PTR [rsp-0x4]
  4009e5:	c5 fa 10 44 24 f8    	vmovss xmm0,DWORD PTR [rsp-0x8]
  4009eb:	c5 f2 5e d0          	vdivss xmm2,xmm1,xmm0
  4009ef:	c5 f9 7e d2          	vmovd  edx,xmm2
  4009f3:	83 e8 01             	sub    eax,0x1
  4009f6:	75 e7                	jne    4009df <_Z5benchff+0x11>
``````

Measurements

Normal floats:

``````
$ perf stat -e cycles,cpu/event=0xc2,umask=0x2,name=UOPS_RETIRED.RETIRE_SLOTS/,cpu/event=0xca,umask=0x1e,cmask=0x1,name=FP_ASSIST.ANY/,cpu/event=0x79,umask=0x30,name=IDQ.MS_UOPS/ ./a.out norm

x isnormal: yes
y isnormal: yes

Performance counter stats for './a.out norm':

    303078534      cycles
    502937703      UOPS_RETIRED.RETIRE_SLOTS
            0      FP_ASSIST.ANY
       808676      IDQ.MS_UOPS

0,081426690 seconds time elapsed
``````

Denormal floats:

``````
$ perf stat -e cycles,cpu/event=0xc2,umask=0x2,name=UOPS_RETIRED.RETIRE_SLOTS/,cpu/event=0xca,umask=0x1e,cmask=0x1,name=FP_ASSIST.ANY/,cpu/event=0x79,umask=0x30,name=IDQ.MS_UOPS/ ./a.out denorm

x isnormal: no
y isnormal: no

Performance counter stats for './a.out denorm':

  15720344436      cycles
   4721230495      UOPS_RETIRED.RETIRE_SLOTS
    100000000      FP_ASSIST.ANY
   4307771477      IDQ.MS_UOPS

4,154192419 seconds time elapsed
``````

Explanation

The first observation is that division on denormal values is about `50` times slower (roughly 15.7 billion cycles versus 0.3 billion). No surprise, but let's understand why that happens.

Whenever the CPU sees that it is processing a denormal value, it asks for a microcode assist. The Microcode Sequencer (MS) then feeds the pipeline with lots of UOPs to handle that scenario. We can see that in the slow case there are exactly `100000000` FP assists, one per loop iteration, while in the normal case there are none. We can also see that in the slow case the major part of the UOPs comes from the MS: `IDQ.MS_UOPS` is about 4.3 billion against about 4.7 billion retire slots.
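To put the counters into per-iteration terms, here is a small sketch that only does arithmetic on the numbers measured above. The struct and function names are made up for this illustration. Note that `IDQ.MS_UOPS` also counts MS UOPs that are unrelated to assists (there are about 0.8 million of them in the normal run), so the per-assist figure is an approximation.

```cpp
#include <cassert>
#include <cstdio>

// Counter values copied from the two perf stat runs above.
struct Counters {
    double cycles;
    double retire_slots;
    double fp_assists;
    double ms_uops;
};

constexpr double kIterations = 100000000.0;
constexpr Counters kNormal{303078534.0, 502937703.0, 0.0, 808676.0};
constexpr Counters kDenormal{15720344436.0, 4721230495.0, 100000000.0, 4307771477.0};

double cycles_per_iteration(const Counters& c) { return c.cycles / kIterations; }

double slowdown(const Counters& slow, const Counters& fast) {
    return slow.cycles / fast.cycles;
}

// Approximate number of MS UOPs issued for each FP assist.
double ms_uops_per_assist(const Counters& c) {
    assert(c.fp_assists > 0.0);
    return c.ms_uops / c.fp_assists;
}

void print_summary() {
    std::printf("normal:   %.2f cycles/iteration\n", cycles_per_iteration(kNormal));
    std::printf("denormal: %.2f cycles/iteration\n", cycles_per_iteration(kDenormal));
    std::printf("slowdown: %.1fx\n", slowdown(kDenormal, kNormal));
    std::printf("MS UOPs per assist: %.1f\n", ms_uops_per_assist(kDenormal));
}
```

With these numbers, the normal loop takes about 3 cycles per iteration and the denormal loop about 157, and each assist corresponds to roughly 43 UOPs from the MS.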

So, here are your tools for detecting situations where your program starts doing calculations with denormal values.