For a beginner it can be very hard to look at a profile generated by a tool like perf or Intel VTune Amplifier. It immediately throws lots of strange terms at you, which you might not know. Whenever I’m presenting to an audience or speaking with somebody who is not very involved in performance analysis activities, they ask the same basic questions, like: “What is an instruction retired?” or “What are reference cycles?”. So, I decided to write an article describing some of the less obvious terms connected with performance analysis.
Modern processors execute many more instructions than the program flow needs. This is called speculative execution. Instructions that were “proven” to be indeed needed by the program execution flow are “retired”.
Source: https://software.intel.com/en-us/vtune-amplifier-help-instructions-retired-event.
So, an instruction processed by the CPU can be executed but not necessarily retired. And a retired instruction has usually been executed, except in those cases when it does not require an execution unit. An example of this is mov elimination (see my post What optimizations you can expect from CPU?). Taking this into account, we can usually expect the number of executed instructions to be higher than the number of retired instructions.
There is a fixed performance counter (PMC) that collects this metric. See one of my previous articles for more information on this topic: PMU counters and profiling basics.
To collect this basic metric, you can use perf:
$ perf stat -e instructions ./a.out
or just simply
$ perf stat ./a.out
From Agner Fog’s microarchitecture manual, chapter 2.1 “Instructions are split into µops”:
The microprocessors with out-of-order execution are translating all instructions into microoperations - abbreviated µops or uops. A simple instruction such as
ADD EAX,EBX
generates only one µop, while an instruction like
ADD EAX,[MEM1]
may generate two: one for reading from memory into a temporary (unnamed) register, and one for adding the contents of the temporary register to EAX. The instruction
ADD [MEM1],EAX
may generate three µops: one for reading from memory, one for adding, and one for writing the result back to memory. The advantage of this is that the µops can be executed out of order.
In the chapter about micro-ops Agner has more examples, so you may want to read them as well.
Modern Intel architectures are capable of counting the number of issued, executed, and retired uops. The difference between executed and retired uops is mostly the same as for instructions.
$ perf stat -e cpu/event=0xe,umask=0x1,name=UOPS_ISSUED.ANY/,cpu/event=0xb1,umask=0x1,name=UOPS_EXECUTED.THREAD/,cpu/event=0xc2,umask=0x1,name=UOPS_RETIRED.ALL/ ls
Performance counter stats for 'ls':
2856278 UOPS_ISSUED.ANY
2720241 UOPS_EXECUTED.THREAD
2557884 UOPS_RETIRED.ALL
Uops can also be macro-fused and micro-fused.
The majority of modern CPUs, including Intel’s and AMD’s, don’t have a fixed frequency at which they operate. Instead, they have dynamic frequency scaling. In Intel CPUs this technology is called Turbo Boost; in AMD processors it’s called Turbo Core. There is a nice explanation of the term “reference cycles” in this Stack Overflow thread:
Having a snippet A to run in 100 core clocks and a snippet B in 200 core clocks means that B is slower in general (it takes double the work), but not necessarily that B took more time than A since the units are different. That’s where the reference clock comes into play - it is uniform. If snippet A runs in 100 ref clocks and snippet B runs in 200 ref clocks then B really took more time than A.
$ perf stat -e cycles,ref-cycles ./a.out
Performance counter stats for './a.out':
43340884632 cycles # 3.97 GHz
37028245322 ref-cycles # 3.39 GHz
10,899462364 seconds time elapsed
I did this experiment on a Skylake i7-6000 processor, whose base frequency is 3.4 GHz. So, the ref-cycles event counts cycles as if there were no frequency scaling. This also matches the clock multiplier for that processor, which you can find in the specs (it’s equal to 34). Usually the system clock has a frequency of 100 MHz, and if we multiply it by the clock multiplier we get the base frequency of the processor. You might also be interested in reading about overclocking.
One interesting experiment, which I suggest you do on your own, looks like this: open 3 terminals and run the corresponding commands:
1. perf stat -e cycles -a -I 1000
2. perf stat -e ref-cycles -a -I 1000
3. perf stat -e bus-cycles -a -I 1000
Place them in such a way that all 3 are visible. Then open another terminal and start executing some workload in it. You will notice how the collected values increase and decrease over time. This experiment will also give you an idea of how the state of the CPU changes.
For advanced information about reference cycles please check this thread on Intel forum.
Modern CPUs try to predict the outcome of a branch instruction (taken or not taken). For example, when the processor sees code like this:
dec eax
jz .zero
# eax is not 0
...
.zero:
# eax is 0
The jz instruction is a branch instruction, and in order to increase performance, modern CPU architectures try to predict the result of such branches. This is also called speculative execution. The processor will speculate that, for example, the branch will not be taken, and will execute the code that corresponds to the situation when eax is not 0. However, if the guess was wrong, we get a “branch misprediction”, and the CPU is required to undo all the speculative work it has done lately. This typically costs somewhere between 10 and 20 clock cycles.
You can check how many branch mispredictions occurred in the workload by using perf:
$ perf stat -e branches,branch-misses ls
Performance counter stats for 'ls':
358209 branches
14026 branch-misses # 3,92% of all branches
0,009579852 seconds time elapsed
or simply
$ perf stat ls
More information (history, possible and real-world implementations, and more) can be found on Wikipedia and in Agner Fog’s microarchitecture manual, chapter 3 “Branch prediction”.
Those two are derivative metrics that stand for:
- CPI - Cycles Per Instruction (how many cycles it took to execute one instruction, on average)
- IPC - Instructions Per Cycle (how many instructions were retired per cycle, on average)
There are lots of other analyses that can be done based on these metrics. But in a nutshell, you want a low CPI and a high IPC.
Formulas:
IPC = INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD
CPI = 1 / IPC
Let’s look at the example:
$ perf stat -e cycles,instructions ls
Performance counter stats for 'ls':
2369632 cycles
1725916 instructions # 0,73 insn per cycle
0,001014339 seconds time elapsed
Notice that the perf tool automatically calculates the IPC metric for us.