Subscribe to my newsletter, support me on Patreon or by PayPal donation.
There are many ways to analyze performance of an application running on a modern CPU, in this post I offer you one more way to look at it. The concept I write about here is important to understand and to have a high-level understanding of the limitations that the current CPU computing industry is facing. Also, the mental model I present will help you to better understand performance of your code. In the second part of the article, I expand on each of the 4 categories, discuss SW and HW solutions, provide links to the latest research, and say a few words about future directions.
Let’s jump straight into it. At the CPU level, performance of any application is limited by 4 categories:
Think about it like this. First, modern CPUs always try to predict what code will be executed next (“predictability of code”). In that sense, a CPU core always looks ahead of execution, into the future. Correct predictions greatly improve execution as it allows a CPU to make forward progress without having results of previous instructions available. However, bad speculation often incurs costly performance penalties.
Second, a processor fetches instructions from memory and moves them through a very sophisticated execution pipeline (“execution throughput”). How many independent instructions a CPU can execute simultaneously determines the execution throughput of a machine. In this category, I include all the stalls that occur as a result of a lack of execution resources, for example, saturated instruction queues, lack of free entries in the reservation station, not enough multiplication or divider units, etc.
Third (“predictability of data”), some instructions access memory, which is one of the biggest bottlenecks nowadays as the gap in performance between CPU and memory continues to grow. Hopefully, everyone by now knows that accessing data in the cache is fast while accessing DRAM could be 100 times slower. That’s the motivation for the existence of HW prefetchers, which try to predict what data a program will access in the nearest future and pull that data ahead of time so that by the time a program demands it to make forward progress, the values are already in caches. This category includes issues with having good spatial and temporal locality, as caches try to address both. Also, I include all the TLB-related issues under this category.
Fourth (“execution latency”), the vast majority of applications have many dependency chains when you first need to perform action
A before you can start to execute action
B. So, this last category represents how well a CPU can execute a sequence of dependent instructions. Just for the record, some applications are massively parallel, which have small sequential portions followed by large parallel ones – such tasks are limited by execution throughput, not latency.
A careful reader can draw parallels with a Top-Down performance analysis methodology, which is true. Below I provide a corresponding TMA metric for each of our 4 categories.
Let’s expand on each of those categories.
Current: Stalls frequently occur as a lack of some execution resources. It is very hard to predict which resource in a CPU pipeline will be a bottleneck for a certain application. Collecting CPU counters will give you some insight, but a more general recommendation would be to use Top-Down Analysis. OOO machines are very good at extracting parallelism, although there are blind spots. They can easily catch “local” parallelism within a mid-size loop or function, but can’t find global parallelism, for example:
foo(); // large non-inlined function bar(); // bar does not depend on foo
If a processor would overlap the execution of
bar, that would significantly speed up the program, but currently, CPUs execute such code sequentially1.
Modern architectures range from 4-wide to 8-wide. Increasing the width of a machine is very costly as you need to simultaneously widen all the elements of the CPU pipeline to keep the machine balanced. The complexity of making wider designs grows rapidly, which makes it impossible to build such a system in practice.
bar, one would need to manually unroll and interleave both functions, but then you could run out of available registers, which will generate memory spills and refills (hurts performance but could be still worthwhile).
Future: Cache sizes will likely continue to grow, although writing SW cache-friendly algorithms usually makes much more impact.
There is a revolutionary idea called PIM (Processing-In-Memory), which moves some computations (for example, memset, memcpy) closer to where the data is stored. Ideas range from specialized accelerators to direct in-DRAM computing. The main roadblock here is that it requires a new programming model for such devices thus rewriting the application’s code. [PIM]
Also, there are a few ideas for more intelligent HW prefetching. For example: [Pythia] uses reinforcement learning to find patterns in past memory request addresses to generate prefetch requests, [Hermes] tries to predict which memory accesses will miss in caches and initiates DRAM access early.
Every now and then HW architects have an increase in the number of transistors they can use in the chip design. These new transistors can be used to increase the size of caches, improve branch prediction accuracy or make the CPU pipeline wider by proportionally enlargening all the internal data structures and buffers, adding more execution units, etc.
Data dependency chains are the hardest bottleneck to overcome. From an architectural standpoint, there is not much modern CPUs can do about it. Out-of-Order engines, which are employed by the majority of modern processors, are useless in the presence of a long dependency chain, for example traversing a linked list aka pointer chasing. While you can do something to influence other categories (improve branch prediction, more intelligent HW prefetchers, make your pipeline wider), so far modern architectures don’t have a good response to handling Read-After-Write (aka “true”) data dependencies.
I hope that this post gave you a useful mental model, which will help you better understand performance of modern CPU cores. To fine-tune performance of an application means that you tailor the code to the HW, that runs that code. So, it’s not only useful to know the limitations of HW, but it is also critical to understand to which category your application belongs. Such data will help you focus on the performance problem that really matters.
Top-Down microarchitecture analysis and Roofline performance analysis should usually be a good way to start. Most of the time you’ll see a mix of problems, so you have to analyze hotspots case by case. Figuring out predictability of code or data is relatively easy (you check Top-Down metrics) while distinguishing if your code is limited by throughput or latency is not. I will try to cover it in my future posts.
I expect that this post could spawn a lot of comments from people pointing out my mistakes and inaccuracies. Also, let me know if I missed any important ideas and papers, I would be happy to add them to the article.
There could be some execution overlap when
foo ends and
bar starts. To be fair, it is possible to parallelize the execution of
bar if, say, we spawn a new thread for