Subscribe to my newsletter, support me on Patreon or by PayPal donation.
I would love to hear your feedback!
I wrote this blog series for the second edition of my book titled “Performance Analysis and Tuning on Modern CPUs”. It is open-sourced on Github: perf-book. The book primarily targets mainstream C and C++ developers who want to learn low-level performance engineering, but devs in other languages may also find some useful information.
After you read this write-up, let me know which parts you find useful, boring, or complicated, and which parts need a better explanation. Also, send me suggestions about the tools I use, and let me know if you know better ones.
Tell me what you think in the comments or send me an email, which you can find here. You're also welcome to leave your suggestions on GitHub; here is the corresponding pull request.
Please keep in mind that it is an excerpt from the book, so some phrases may sound too formal.
P.S. If you’d rather read this in the form of a PDF document, you can download it here.
In this series of blog posts, you will learn how to collect high-level information about a program’s interaction with memory. This process is usually called memory profiling. Memory profiling helps you understand how an application uses memory over time and helps you build the right mental model of a program’s behavior. Here are some questions it can answer:
When developers talk about memory consumption, they implicitly mean heap usage. Heap is, in fact, the biggest memory consumer in most applications as it accommodates all dynamically allocated objects. But heap is not the only memory consumer. For completeness, let’s mention others:
Next, we will introduce the terms memory usage and memory footprint and see how to profile both.
Memory usage is frequently described by Virtual Memory Size (VSZ) and Resident Set Size (RSS). VSZ includes all memory that a process can access, e.g., stack, heap, the memory used to encode the instructions of an executable, and instructions from linked shared libraries, including memory that is swapped out to disk. On the other hand, RSS measures how much of the memory allocated to a process resides in RAM. Thus, RSS does not include memory that is swapped out or that the process has not yet touched. Also, RSS does not include memory from shared libraries that were not loaded into memory.
Consider an example. Process A has 200K of stack and heap allocations, of which 100K resides in main memory; the rest is swapped out or unused. It has a 500K binary, of which only 400K was touched. Process A is linked against 2500K of shared libraries and has only loaded 1000K into main memory.
VSZ: 200K + 500K + 2500K = 3200K
RSS: 100K + 400K + 1000K = 1500K
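If you want to check these numbers for a live process on Linux, you can query them directly; a quick illustration (substitute the PID of the process you are interested in):

$ ps -o vsz,rss -p <pid>
$ grep -E 'VmSize|VmRSS' /proc/<pid>/status

Both report values in kilobytes; VmSize corresponds to VSZ and VmRSS to RSS.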
An example of visualizing the memory usage and footprint of a hypothetical program is shown in Figure 1. The intention here is not to examine statistics of a particular program, but rather to set the framework for analyzing memory profiles. Later in this chapter, we will examine a few tools that let us collect such information.
Let's first look at the memory usage (upper two lines). As we would expect, the RSS is always less than or equal to the VSZ. Looking at the chart, we can spot four phases in the program. Phase 1 is the ramp-up of the program, during which it allocates its memory. Phase 2 is when the algorithm starts using this memory; notice that the memory usage stays constant. During phase 3, the program deallocates part of the memory and then allocates a slightly larger amount of memory. Phase 4 is a lot more chaotic than phase 2, with many objects allocated and deallocated. Notice that the spikes in VSZ are not necessarily followed by corresponding spikes in RSS. That can happen when memory is reserved by an object but never used.
Figure 1. Example of the memory usage and footprint (hypothetical scenario).
Now let's switch to memory footprint. It defines how much memory a process touches during a period, e.g., in MB per second. In our hypothetical scenario, visualized in Figure 1, we track the memory footprint in intervals of 100 milliseconds (10 times per second). The solid line tracks the number of bytes accessed during each 100 ms interval. Here, we don't count how many times a certain memory location was accessed; that is, if a memory location was loaded twice during a 100 ms interval, we count the touched memory only once. For the same reason, we cannot aggregate time intervals. For example, we know that during phase 2 the program was touching roughly 10MB every 100 ms. However, we cannot aggregate ten consecutive 100 ms intervals and say that the memory footprint was 100 MB per second, because the same memory location could be loaded in adjacent 100 ms intervals. That would be true only if the program never repeated memory accesses within each 1-second interval.
The dashed line tracks the size of the unique data accessed since the start of the program. Here, we count the number of bytes accessed during each 100 ms interval that have never been touched before by the program. For the first second of the program’s lifetime, most of the accesses are unique, as we would expect. In the second phase, the algorithm starts using the allocated buffer. During the time interval from 1.3s to 1.8s, the program accesses most of the locations in the buffer, e.g., it was the first iteration of a loop in the algorithm. That’s why we see a big spike in the newly seen memory locations from 1.3s to 1.8s, but we don’t see many unique accesses after that. From the timestamp 2s up until 5s, the algorithm mostly utilizes an already-seen memory buffer and doesn’t access any new data. However, the behavior of phase 4 is different. First, during phase 4, the algorithm is more memory intensive than in phase 2 as the total memory footprint (solid line) is roughly 15 MB per 100 ms. Second, the algorithm accesses new data (dashed line) in relatively large bursts. Such bursts may be related to the allocation of new memory regions, working on them, and then deallocating them.
We will show how to obtain such charts in the following two case studies, but for now, you may wonder how this data can be used. Well, first, if we sum up the unique bytes (dashed line) accessed during every interval, we get the total memory footprint of a program. Also, by looking at the chart, you can observe phases and correlate them with the code that is running. Ask yourself: "Does it match your expectations, or is the workload doing something sneaky?" You may encounter unexpected spikes in memory footprint. The memory profiling techniques that we will discuss in this series of posts do not necessarily point you to problematic places the way regular hotspot profiling does, but they certainly help you better understand the behavior of a workload. On many occasions, memory profiling has helped identify a problem or served as an additional data point to support the conclusions made during regular profiling.
In some scenarios, memory footprint helps us estimate the pressure on the memory subsystem. For instance, if the memory footprint is small, say, 1 MB/s, and the RSS fits into the L3 cache, we might suspect that the pressure on the memory subsystem is low; remember that available memory bandwidth in modern processors is in GB/s and is getting close to 1 TB/s. On the other hand, when the memory footprint is rather large, e.g., 10 GB/s and the RSS is much bigger than the size of the L3 cache, then the workload might put significant pressure on the memory subsystem.
Part 2
Now, let's take a look at how to profile the memory usage of a real-world application. We will use heaptrack, an open-source heap memory profiler for Linux developed by KDE. Ubuntu users can install it very easily with apt install heaptrack heaptrack-gui. Among many other things, Heaptrack can find the places in the code where the largest and most frequent allocations happen. On Windows, you can use MTuner, which has similar1 capabilities to Heaptrack.
As an example, we took Stockfish's built-in benchmark. We compiled it with the Clang 15 compiler using the -O3 -mavx2 options. We collected the Heaptrack memory profile of a single-threaded Stockfish built-in benchmark on an Intel Alder Lake i7-1260P processor using the following command:
$ heaptrack ./stockfish bench 128 1 24 default depth
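When the benchmark finishes, Heaptrack writes its data to a compressed file in the current directory; the exact file name (shown here just for illustration) includes the process ID and depends on the Heaptrack version. You can then open it in the GUI or print a textual summary:

$ heaptrack_gui heaptrack.stockfish.12345.gz
$ heaptrack_print heaptrack.stockfish.12345.gz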
Figure 2 shows us a summary view of the Stockfish memory profile. Here are some interesting facts we can learn from it:

- Stockfish::std_aligned_alloc is responsible for the largest portion of the allocated heap space (182 MB). But it is not among the most frequent allocation spots (middle table), so it is likely allocated once and stays alive until the end of the program.
- The most frequent allocations are made through operator new, and they are all temporary allocations. Can we get rid of those temporary allocations?

Figure 2. Stockfish memory profile with Heaptrack, summary view.
Notice that there are many tabs at the top of the image; next, we will explore some of them. Figure 3 shows the memory usage of the Stockfish built-in benchmark. The memory usage stays constant at 200 MB throughout the entire run of the program. Total consumed memory is broken into slices, e.g., regions 1 and 2 on the image. Each slice corresponds to a particular allocation. Interestingly, it was not a single big 182 MB allocation that was made through Stockfish::std_aligned_alloc as we thought earlier. Instead, there are two: slice 1 of 134.2 MB and slice 2 of 48.4 MB. However, both allocations stay alive until the very end of the benchmark.

Figure 3. Stockfish memory profile with Heaptrack, memory usage over time stays constant.
Does it mean that there are no memory allocations after the startup phase? Let's find out. Figure 4 shows the accumulated number of allocations over time. Similar to the consumed memory chart (Figure 3), allocations are sliced according to the accumulated number of memory allocations attributed to each function. As we can see, new allocations keep coming not just from a single place, but from many. The most frequent allocations are made through operator new, which corresponds to region 1 on the image.

Notice that new allocations happen at a steady pace throughout the life of the program. However, as we just saw, memory consumption doesn't change; how is that possible? Well, it is possible if we deallocate previously allocated buffers and allocate new ones of the same size (also known as temporary allocations).

Figure 4. Stockfish memory profile with Heaptrack, the number of allocations keeps growing.
Since the number of allocations is growing while the total consumed memory doesn't change, we are dealing with temporary allocations. Let's find out where in the code they come from. It is easy to do with the help of the flame graph shown in Figure 5. There are 4800 temporary allocations in total, with 90.8% of them coming from operator new. Thanks to the flame graph, we know the entire call stack that leads to 4360 temporary allocations. Interestingly, those temporary allocations are initiated by std::stable_sort, which allocates a temporary buffer to do the sorting. One way to get rid of those temporary allocations would be to use an in-place stable sorting algorithm. However, by doing so we observed an 8% drop in performance, so we discarded this change.
Figure 5. Stockfish memory profile with Heaptrack, temporary allocations flame graph.
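If the term "temporary allocation" sounds abstract, here is a minimal sketch (not Stockfish code) of the pattern the profiler is flagging: a buffer of the same size is allocated and freed over and over, so the allocation count grows while memory consumption stays flat.

#include <vector>

void process(int iterations) {
  for (int i = 0; i < iterations; i++) {
    std::vector<int> scratch(1024); // a fresh buffer is allocated here ...
    // ... fill and use scratch ...
  }                                 // ... and freed at the end of every iteration
}
// A heap profiler reports `iterations` allocations, yet peak memory usage never grows.
// Hoisting scratch out of the loop (or reusing a preallocated buffer) removes them.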
Similar to temporary allocations, you can also find the paths that lead to the largest allocations in a program. In the dropdown menu at the top, you would need to select the “Consumed” flame graph. We encourage readers to explore other tabs as well.
Part 3
I would be glad if someone can confirm that MTuner has similar features as heaptrack. ↩
Now let’s take a look at how we can estimate the memory footprint. In part 3, we will warm up by measuring the memory footprint of a simple program. In part 4, we will examine the memory footprint of four production workloads.
Consider the simple, naive matrix multiplication code presented in the listing below (on the left). The code multiplies two square 4Kx4K matrices a and b and writes the result into the square 4Kx4K matrix c. Recall that to calculate one element of the result matrix c, we need to calculate the dot product of the corresponding row in matrix a and column in matrix b; this is what the innermost loop over k is doing.
Listing: Applying loop interchange to naive matrix multiplication code.
constexpr int N = 1024*4; // 4K
std::array<std::array<float, N>, N> a, b, c; // 4K x 4K matrices
// init a, b, c
for (int i = 0; i < N; i++) { for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) { => for (int k = 0; k < N; k++) {
for (int k = 0; k < N; k++) => for (int j = 0; j < N; j++) {
c[i][j] += a[i][k] * b[k][j]; c[i][j] += a[i][k] * b[k][j];
} }
} }
} }
To demonstrate the memory footprint reduction, we applied a simple loop interchange transformation that swaps the loops over j and k (lines marked with =>). Once we measure the memory footprint and compare it between the two versions, it becomes easy to see the difference. The visual result of the change in the memory access pattern is shown in Figure 6. We went from calculating each element of matrix c one by one to calculating partial results while maintaining row-major traversal in all three matrices.
In the original code (on the left), matrix b is accessed in a column-major way, which is not cache-friendly. Look at the picture and observe the memory regions that are touched after the first N iterations of the inner loop. We calculate the dot product of row 0 in a and column 0 in b, and save it into the first element of matrix c. During the next N iterations of the inner loop, we access the same row 0 in a and column 1 in b to get the second result in matrix c.
In the transformed code on the right, the inner loop accesses just a single element of matrix a. We multiply it by all the elements in the corresponding row of b and accumulate the products into the corresponding row of c. Thus, the first N iterations of the inner loop calculate the products of element 0 in a and row 0 in b and accumulate the partial results into row 0 of c. The next N iterations multiply element 1 in a by row 1 in b and, again, accumulate the partial results into row 0 of c.
Figure 6. Memory access pattern and cache lines touched after the first N and 2N iterations of the inner loop (images not to scale).
Let's confirm it with Intel SDE, the Software Development Emulator tool for x86-based platforms. SDE is built upon a dynamic binary instrumentation mechanism, which enables intercepting every single instruction. This comes at a huge cost: for the experiment we ran, a slowdown of 100x is common.
To prevent compiler interference in our experiment, we disabled vectorization and loop unrolling, so that each version has only one hot loop with exactly 7 assembly instructions. This lets us compare the two versions uniformly: instead of time intervals, we use intervals measured in machine instructions. The command line we used to collect the memory footprint with SDE, along with part of its output, is shown below. Notice that we use the -fp_icount 28K option, which measures the memory footprint for each interval of 28K instructions. This value is chosen deliberately because it matches one complete execution of the inner loop (i.e., one iteration of the middle loop) in both the "before" and "after" cases: 4K inner loop iterations * 7 instructions = 28K.
By default, SDE measures the footprint in cache lines (64 bytes), but it can also measure it in memory pages (4KB on x86). We combined the output and put the two versions side by side. Also, a few non-relevant columns were removed from the output. The first column PERIOD marks the start of a new interval of 28K instructions; the difference between consecutive periods is 28K instructions. The column LOAD tells how many cache lines were accessed by load instructions. Recall from the previous discussion that the same cache line accessed twice counts only once. Similarly, the column STORE tells how many cache lines were stored. The column CODE counts accessed cache lines that contain instructions executed during that period. Finally, NEW counts cache lines touched during a period that had not been seen before by the program.
An important note before we proceed: the memory footprint reported by SDE does not equal utilized memory bandwidth, because it doesn't account for whether a memory operation was served from cache or from memory.
Listing: Memory footprint of naive Matmul (left) and with loop interchange (right)
$ sde64 -footprint -fp_icount 28K -- ./matrix_multiply.exe
============================= CACHE LINES =============================
PERIOD LOAD STORE CODE NEW | PERIOD LOAD STORE CODE NEW
-----------------------------------------------------------------------
... ...
2982388 4351 1 2 4345 | 2982404 258 256 2 511
3011063 4351 1 2 0 | 3011081 258 256 2 256
3039738 4351 1 2 0 | 3039758 258 256 2 256
3068413 4351 1 2 0 | 3068435 258 256 2 256
3097088 4351 1 2 0 | 3097112 258 256 2 256
3125763 4351 1 2 0 | 3125789 258 256 2 256
3154438 4351 1 2 0 | 3154466 257 256 2 255
3183120 4352 1 2 0 | 3183150 257 256 2 256
3211802 4352 1 2 0 | 3211834 257 256 2 256
3240484 4352 1 2 0 | 3240518 257 256 2 256
3269166 4352 1 2 0 | 3269202 257 256 2 256
3297848 4352 1 2 0 | 3297886 257 256 2 256
3326530 4352 1 2 0 | 3326570 257 256 2 256
3355212 4352 1 2 0 | 3355254 257 256 2 256
3383894 4352 1 2 0 | 3383938 257 256 2 256
3412576 4352 1 2 0 | 3412622 257 256 2 256
3441258 4352 1 2 4097 | 3441306 257 256 2 257
3469940 4352 1 2 0 | 3469990 257 256 2 256
3498622 4352 1 2 0 | 3498674 257 256 2 256
...
Let's discuss the numbers in the output above. Look at the period that starts at instruction 2982388 on the left. That period corresponds to the first 4096 iterations of the inner loop in the original Matmul program. SDE reports that the algorithm loaded 4351 cache lines during that period. Let's do the math and see if we get the same number. The original inner loop accesses row 0 in matrix a. Remember that the size of a float is 4 bytes and the size of a cache line is 64 bytes. So, for matrix a, the algorithm loads (4096 * 4 bytes) / 64 bytes = 256 cache lines. For matrix b, the algorithm accesses column 0. Every element resides in its own cache line, so for matrix b it loads 4096 cache lines. For matrix c, we accumulate all products into a single element, so 1 cache line is stored in matrix c. We calculated 4096 + 256 = 4352 cache lines loaded and 1 cache line stored. The difference of one cache line may be because SDE does not start counting the 28K-instruction interval exactly at the first iteration of the inner loop. We also see that two cache lines containing instructions (CODE) were accessed during that period. The seven instructions of the inner loop reside in a single cache line, but the 28K interval may also capture the middle loop, making it two cache lines in total. Lastly, since none of the data we access has been seen before, all the cache lines are NEW.
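If you want to double-check this arithmetic in code, here is a tiny sketch (assuming 4-byte floats and 64-byte cache lines, as above) that reproduces the expected numbers for the first 28K-instruction period of the original version:

#include <cstdio>

int main() {
  constexpr int N = 4096;                   // matrix dimension
  constexpr int lineSize = 64;              // cache line size in bytes
  constexpr int elemSize = 4;               // sizeof(float)
  int linesRowA = N * elemSize / lineSize;  // row 0 of a: 256 cache lines
  int linesColB = N;                        // column 0 of b: one line per element
  int linesC    = 1;                        // all products go into one element of c
  std::printf("loaded: %d, stored: %d\n", linesRowA + linesColB, linesC); // 4352, 1
  return 0;
}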
Now let's switch to the next 28K-instruction period (3011063), which corresponds to the second set of 4096 iterations of the inner loop in the original Matmul program. We have the same number of LOAD, STORE, and CODE cache lines as in the previous period, which is expected. However, there are no NEW cache lines touched. Let's understand why that happens. Look again at Figure 6. The second set of 4096 iterations of the inner loop accesses row 0 in matrix a again. But it also accesses column 1 in matrix b, which is new; however, these elements reside on the same set of cache lines as column 0, so we have already touched them in the previous 28K period. The pattern repeats through 14 subsequent periods. Each cache line contains 64 bytes / 4 bytes (size of float) = 16 elements, which explains the pattern: we fetch a new set of cache lines in matrix b every 16 iterations of the middle loop (i.e., every 16 columns of b). The last remaining question is why we have 4097 NEW lines after the first 16 periods. The answer is simple: the algorithm keeps accessing row 0 in matrix a, so all those new cache lines come from matrix b.
For the transformed version, the memory footprint looks much more consistent, with all periods having very similar numbers except the first. In the first period, we access 1 cache line in matrix a; (4096 * 4 bytes) / 64 bytes = 256 cache lines in b; and (4096 * 4 bytes) / 64 bytes = 256 cache lines are stored into c, a total of 513 lines. Again, the small difference in the reported results is because SDE does not start counting the 28K-instruction interval exactly at the first iteration of the inner loop. In the second period (3011081), we access the same cache line from matrix a, a new set of 256 cache lines from matrix b, and the same set of cache lines from matrix c. Only the lines from matrix b have not been seen before; that is why the second period has 256 NEW cache lines. The period that starts at instruction 3441306 has 257 NEW lines accessed. The one additional cache line comes from accessing element a[0][17] in matrix a, as it hasn't been accessed before.
In the two scenarios we explored, the SDE output confirmed our understanding of the algorithm. But be aware that you cannot tell whether an algorithm is cache-friendly just by looking at the output of the SDE footprint tool. In our case, we simply looked at the code and explained the numbers fairly easily. But without knowing what the algorithm is doing, it's impossible to make the right call. Here's why. The L1 cache in modern x86 processors can only accommodate up to ~1000 cache lines. When you look at an algorithm that accesses, say, 500 lines per 1M instructions, it may be tempting to conclude that the code must be cache-friendly, because 500 lines can easily fit into the L1 cache. But we know nothing about the nature of those accesses. If those accesses are made randomly, such code is far from being "friendly". The output of the SDE footprint tool merely tells us how much memory was accessed; we don't know whether those accesses hit the caches or not.
Part 4
In this case study, we will use the Intel SDE tool to analyze the memory footprint of four production workloads: Blender ray tracing, the Stockfish chess engine, Clang++ compilation, and AI_bench PSPNet segmentation. We hope this study will give you an intuition of what you can expect to see in real-world applications. In part 3, we collected the memory footprint per interval of 28K instructions, which is too small for applications running hundreds of billions of instructions. So, here we will measure the footprint per one billion instructions.
Figure 7 shows the memory footprint of four selected workloads. You can see they all have very different behavior. Clang compilation has very high memory activity at the beginning, sometimes spiking to 100MB per 1B instructions, but after that, it decreases to about 15MB per 1B instructions. Any of the spikes on the chart may be concerning to a Clang developer: are they expected? Could they be related to some memory-hungry optimization pass? Can the accessed memory locations be compacted?
Figure 7. A case study of memory footprints of four workloads. MEM - total memory accessed during a 1B-instruction interval. NEW - accessed memory that has not been seen before.
The Blender benchmark is very stable; we can clearly see the start and the end of each rendered frame. This enables us to focus on just a single frame, without looking at the entire 1000+ frames. The Stockfish benchmark is a lot more chaotic, probably because the chess engine crunches different positions which require different amounts of resources. Finally, the AI_bench memory footprint is very interesting because we can spot repetitive patterns. After the initial startup, there are five or six sine waves from 40B to 95B, then three regions that end with a sharp spike to 200MB, and then again three mostly flat regions hovering around 25MB per 1B instructions. All of this is actionable information that can be used to optimize the application.
There could still be some confusion about using instructions as a measure of time, so let us address that. You can approximately convert the timeline from instructions to seconds if you know the IPC of the workload and the frequency at which the processor was running. For instance, at IPC=1 and a processor frequency of 4GHz, 1B instructions run in 250 milliseconds; at IPC=2, 1B instructions run in 125 ms, and so on. This way, you can convert the X-axis of a memory footprint chart from instructions to seconds. But keep in mind that this is accurate only if the workload has a steady IPC and the CPU frequency doesn't change while the workload is running.
Part 5
As you have seen from the previous case studies, there is a lot of information you can extract using modern memory profiling tools. Still, there are limitations which we will discuss next.
Consider the memory footprint charts shown in Figure 7 (in part 4). Such charts tell us how many bytes were accessed during periods of 1B instructions. However, looking at any of these charts, we cannot tell whether a memory location was accessed once, twice, or a hundred times during a given period. Each recorded memory access simply contributes to the total memory footprint of an interval and is counted once per interval. Knowing how many times per interval each byte was touched would give us some intuition about the memory access patterns of a program. For example, we could estimate the size of the hot memory region and see whether it fits into the L3 cache.
However, even this information is not enough to fully assess the temporal locality of the memory accesses. Imagine a scenario where we have an interval of 1B instructions during which all memory locations were accessed twice. Is that good or bad? Well, we don't know, because what matters is the distance between the first (use) and the second access (reuse) to each of those locations. If the distance is small, e.g., less than the number of cache lines the L1 cache can keep (which is roughly 1000 today), then there is a high chance the data will be reused efficiently. Otherwise, the cache line with the required data may have already been evicted in the meantime.
Also, none of the memory profiling methods we discussed so far gave us insights into the spatial locality of a program. Memory usage and memory footprint only tell us how much memory was accessed, but we don’t know whether those accesses were sequential, strided, or completely random. We need a better approach.
The topic of temporal and spatial locality of applications has been researched for a long time; unfortunately, as of early 2024, there are no production-quality tools available that provide such information. The central metric in measuring the data locality of a program is reuse distance, which is the number of unique memory locations accessed between two consecutive accesses to a particular memory location. Reuse distance shows the likelihood of a cache hit for a memory access in a typical least-recently-used (LRU) cache. If the reuse distance of a memory access is larger than the cache size, then the latter access (reuse) is likely to cause a cache miss.
Since the unit of memory access in a modern processor is a cache line, we define two additional terms: temporal reuse happens when both the use and the reuse access exactly the same address, and spatial reuse occurs when the use and the reuse access different addresses located in the same cache line. Consider the sequence of memory accesses shown in Figure 8: a1, b1, e1, b2, c1, d1, a2, where locations a, b, and c occupy cache line N, and locations d and e reside on the subsequent cache line N+1. In this example, the temporal reuse distance of access a2 is four, because there are four unique locations accessed between the two consecutive accesses to a, namely b, c, d, and e. Access d1 is not a temporal reuse; however, it is a spatial reuse, since we previously accessed location e, which resides on the same cache line as d. The spatial reuse distance of access d1 is two.
Figure 8. Example of temporal and spatial reuse.
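To make the definition more tangible, here is a small, deliberately naive sketch (O(n^2), an illustration rather than a profiling tool) that computes temporal reuse distances for an address trace such as the one in Figure 8:

#include <cstddef>
#include <cstdio>
#include <set>
#include <vector>

int main() {
  // The trace from Figure 8, with arbitrary ids: a=0, b=1, c=2, d=3, e=4.
  std::vector<int> trace = {0, 1, 4, 1, 2, 3, 0}; // a1 b1 e1 b2 c1 d1 a2
  for (std::size_t i = 0; i < trace.size(); i++) {
    for (std::size_t j = i; j-- > 0;) {           // find the previous access to the same location
      if (trace[j] == trace[i]) {
        std::set<int> unique(trace.begin() + j + 1, trace.begin() + i);
        std::printf("access #%zu: temporal reuse distance = %zu\n", i, unique.size());
        break;
      }
    }
  }
  return 0;
}

For access a2 (the last element of the trace), the unique locations between the two accesses to a are {b, e, c, d}, so the program prints a reuse distance of four, matching the example above.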
Figure 9 provides an example of a reuse distance histogram for a hypothetical program. Its X-axis is clustered into log2 bins, each scaled by 1000. The Y-axis shows the rate of occurrence, i.e., how frequently we observed a certain reuse distance. Ideally, we would like to see all of the accesses in the first bin [0;1000], for both temporal and spatial reuse. For instance, for sequential accesses to a large array, we would see a big temporal reuse distance (bad), but a small spatial reuse distance (good). For a program that traverses a binary tree of 1000 elements (fits in the L1 cache) many times, we would see a relatively small temporal reuse distance (good), but a big spatial reuse distance (bad). Random accesses to a large buffer represent both bad temporal and bad spatial locality. As a general rule, if a memory access has either a small temporal or a small spatial reuse distance, then it is likely to hit CPU caches. Consequently, if an access has both a big temporal and a big spatial reuse distance, then it is likely to miss CPU caches.
Figure 9. Example of a reuse distance histogram. The X-axis is the reuse distance, the Y-axis is the rate of occurrence.
Several tools have been developed over the years that attempt to analyze the temporal and spatial locality of programs. Here are the three most recent ones, along with a short description and their current state:

- loca, a reuse distance analysis tool.
- RDX: compared to loca, it incurs an order of magnitude smaller overhead while maintaining 90% accuracy. The tool is no longer maintained and there is almost no documentation on how to use it. [RDXpaper]
- ReuseTracker: built on top of RDX, it extends it by taking cache-coherence and cache-line invalidation effects into account. Using this tool we were able to produce meaningful results on a small program; however, it is not production quality yet and is not easy to use. Github repo, [ReuseTrackerPaper]

Aggregating reuse distances for all memory accesses in a program may be useful in some cases, but future profiling tools should also be able to provide reuse distance histograms for individual loads. Luckily, not every load/store assembly instruction has to be thoroughly analyzed. A performance engineer should first find a problematic load or store instruction using a traditional sampling approach. After that, they should be able to request temporal and spatial reuse distance histograms for that particular operation. Perhaps it should be a separate collection run, since it may involve relatively large overhead.
Temporal and spatial locality analysis provides unique insights that can be used for guiding performance optimizations. However, careful implementation is not straightforward and may become tricky once we start accounting for various cache-coherence effects. Also, a large overhead may become an obstacle to integrating this feature into production profilers.
There are many ways to analyze the performance of an application running on a modern CPU; in this post I offer you one more way to look at it. The concept I write about here is important for understanding, at a high level, the limitations that the current CPU computing industry is facing. Also, the mental model I present will help you better understand the performance of your code. In the second part of the article, I expand on each of the 4 categories, discuss SW and HW solutions, provide links to the latest research, and say a few words about future directions.
Let's jump straight into it. At the CPU level, the performance of any application is limited by 4 categories:

- Predictability of code
- Execution throughput
- Predictability of data
- Execution latency
Think about it like this. First, modern CPUs always try to predict what code will be executed next ("predictability of code"). In that sense, a CPU core always looks ahead of execution, into the future. Correct predictions greatly improve execution as they allow a CPU to make forward progress without having the results of previous instructions available. However, bad speculation often incurs costly performance penalties.
Second, a processor fetches instructions from memory and moves them through a very sophisticated execution pipeline (“execution throughput”). How many independent instructions a CPU can execute simultaneously determines the execution throughput of a machine. In this category, I include all the stalls that occur as a result of a lack of execution resources, for example, saturated instruction queues, lack of free entries in the reservation station, not enough multiplication or divider units, etc.
Third (“predictability of data”), some instructions access memory, which is one of the biggest bottlenecks nowadays as the gap in performance between CPU and memory continues to grow. Hopefully, everyone by now knows that accessing data in the cache is fast while accessing DRAM could be 100 times slower. That’s the motivation for the existence of HW prefetchers, which try to predict what data a program will access in the nearest future and pull that data ahead of time so that by the time a program demands it to make forward progress, the values are already in caches. This category includes issues with having good spatial and temporal locality, as caches try to address both. Also, I include all the TLB-related issues under this category.
Fourth ("execution latency"), the vast majority of applications have many dependency chains, where you first need to perform action A before you can start executing action B. So, this last category represents how well a CPU can execute a sequence of dependent instructions. Just for the record, some applications are massively parallel, with small sequential portions followed by large parallel ones – such tasks are limited by execution throughput, not latency.
A careful reader can draw parallels with the Top-Down performance analysis methodology, and that parallel is indeed valid. Below I provide the corresponding TMA metric for each of our 4 categories.
Let’s expand on each of those categories.
Current: Stalls frequently occur due to a lack of some execution resources. It is very hard to predict which resource in a CPU pipeline will be the bottleneck for a certain application. Collecting CPU counters will give you some insight, but a more general recommendation is to use Top-Down Analysis. OOO machines are very good at extracting parallelism, although there are blind spots. They can easily catch "local" parallelism within a mid-size loop or function, but can't find global parallelism, for example:
foo(); // large non-inlined function
bar(); // bar does not depend on foo
If a processor could overlap the execution of foo and bar, that would significantly speed up the program, but currently, CPUs execute such code sequentially1.
Modern architectures range from 4-wide to 8-wide. Increasing the width of a machine is very costly as you need to simultaneously widen all the elements of the CPU pipeline to keep the machine balanced. The complexity of making wider designs grows rapidly, which makes it impossible to build such a system in practice.
To overlap the execution of foo and bar, one would need to manually unroll and interleave both functions, but then you could run out of available registers, which would generate memory spills and refills (this hurts performance but could still be worthwhile).

Future: Cache sizes will likely continue to grow, although writing cache-friendly algorithms in SW usually makes a much bigger impact.
There is a revolutionary idea called PIM (Processing-In-Memory), which moves some computations (for example, memset, memcpy) closer to where the data is stored. Ideas range from specialized accelerators to direct in-DRAM computing. The main roadblock here is that it requires a new programming model for such devices and thus rewriting the application's code. [PIM]
Also, there are a few ideas for more intelligent HW prefetching. For example: [Pythia] uses reinforcement learning to find patterns in past memory request addresses to generate prefetch requests, [Hermes] tries to predict which memory accesses will miss in caches and initiates DRAM access early.
Every now and then, HW architects get an increase in the number of transistors they can use in a chip design. These new transistors can be used to increase the size of caches, improve branch prediction accuracy, or make the CPU pipeline wider by proportionally enlarging all the internal data structures and buffers, adding more execution units, etc.
Data dependency chains are the hardest bottleneck to overcome. From an architectural standpoint, there is not much modern CPUs can do about it. Out-of-Order engines, which are employed by the majority of modern processors, are useless in the presence of a long dependency chain, for example, traversing a linked list (aka pointer chasing). While you can do something to influence the other categories (improve branch prediction, build more intelligent HW prefetchers, make the pipeline wider), so far modern architectures don't have a good answer to Read-After-Write (aka "true") data dependencies.
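To see why such chains are latency-bound, consider a small illustrative sketch: summing an array lets the CPU overlap many independent loads, while walking a linked list forces every load to wait for the previous one to finish.

// Throughput-bound: the loads of a[i] are independent and can execute in parallel.
long sum_array(const long* a, int n) {
  long sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}

// Latency-bound: each load of node->next depends on the result of the previous load.
struct Node { long value; Node* next; };
long sum_list(const Node* node) {
  long sum = 0;
  for (; node != nullptr; node = node->next)
    sum += node->value;
  return sum;
}

With the same number of elements, the second function is typically limited by the load-to-use latency of the cache or memory, not by how many loads the core can issue per cycle.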
I hope this post gave you a useful mental model that will help you better understand the performance of modern CPU cores. Fine-tuning the performance of an application means tailoring the code to the HW that runs it. So, it's not only useful to know the limitations of the HW; it is also critical to understand which category your application belongs to. Such data will help you focus on the performance problems that really matter.
Top-Down microarchitecture analysis and Roofline performance analysis are usually a good way to start. Most of the time you'll see a mix of problems, so you have to analyze hotspots case by case. Figuring out the predictability of code or data is relatively easy (you check the Top-Down metrics), while distinguishing whether your code is limited by throughput or latency is not. I will try to cover that in future posts.
I expect that this post could spawn a lot of comments from people pointing out my mistakes and inaccuracies. Also, let me know if I missed any important ideas and papers, I would be happy to add them to the article.
There could be some execution overlap when foo ends and bar starts. To be fair, it is possible to parallelize the execution of foo and bar if, say, we spawn a new thread for bar. ↩
Many people know about the performance benefits of using Huge Pages for data, but not many know that Huge Pages can be used for code as well. In this article, I show how to speed up source code compilation with the clang compiler by 5% by placing its code section on Huge Pages. If that seems too small to justify the effort, consider that all major cloud service providers care about every 1% they can optimize, since it translates into immense cost savings. And hey, why leave performance on the table?
But before we jump into the topic, I feel that I need to give a brief recap for readers to refresh their knowledge about Memory Pages. Feel free to skip this introduction if you are familiar with all of this or just skim through it if it’s too boring for you. I’ll try to make it brief, I promise.
Virtual Addresses. Applications operate on virtual addresses, which serve 2 purposes: 1) protection and 2) effective physical memory management. Memory is split into pages; the default page size on x86 is 4KB, and on ARM it's 16KB. So, every address can be split into the address of the page plus the offset within that page. In the case of a 4KB page, the upper 52 bits form the page address and the lowest 12 bits form the offset within that page. You only need to translate the page address (the upper 52 bits) since the page offset doesn't change.
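As a quick illustration (assuming 4KB pages), splitting a virtual address into a page number and an in-page offset is just a shift and a mask:

#include <cstdint>

// For 4KB (2^12-byte) pages: the upper 52 bits select the page, the lower 12 bits the offset.
std::uint64_t page_number(std::uint64_t vaddr) { return vaddr >> 12; }
std::uint64_t page_offset(std::uint64_t vaddr) { return vaddr & 0xFFF; }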
Translations. Since user-space applications don't know physical addresses, HW needs to translate virtual addresses to physical addresses to access data in memory. The kernel maintains a page table, which holds address mappings from virtual pages to physical pages. Without HW support for such translations, every load or store would have to 1) interrupt the process, 2) let the kernel handle that interrupt, and 3) have the interrupt service routine "walk" the page table and retrieve the translation. At that point, we are looking at 10000+ cycles, which would be unbearable, of course.
TLBs. The good thing is that changes to that table are relatively infrequent, so it's a good idea to cache the most frequently used translations in HW. Such a cache exists in probably every modern CPU and is called the TLB (Translation Lookaside Buffer), which keeps the most recent translations. There is a hierarchy of L1 ITLB (instructions) and L1 DTLB (data), followed by the L2 STLB (shared - instructions and data). On modern laptop processors, L1 can hold up to a few hundred recent translations1, and L2 can hold a few thousand. With the default x86 page size of 4KB, every such entry in the TLB is a mapping for a 4KB page. Given that, the L1 TLB can cover up to 1MB of memory, and the L2 up to 10 MB.
Page Walks. 10MB of memory space covered by the L2 STLB sounds like it should be enough for many applications, right? But consider what happens when you miss in the TLB. OK, you don't have to interrupt the process; that would be terrible for performance. You see, it's such an important issue that HW has your back covered in this case as well. The thing is that the format of the page table is dictated by the HW, with which OSes have to comply. Because HW knows the format of the page table, it can search for translations by itself, without waking up the kernel. This mechanism is called the "HW page walker", which, in the event of a TLB miss, does the entire page table traversal internally. That is, it issues all the necessary memory accesses to find the required address translation. This is much faster, but a page walk is still very expensive.
As I said in the introduction paragraph, people typically use large pages for data. Any algorithm that does random accesses into a large memory region will likely suffer from TLB misses. Examples are plenty: binary search in a big array, large hash tables, histogram-like algorithms, etc. The reason is simple: because the size of a page is relatively small (4KB), there is a high chance that the page you will access next is not in the TLB cache.
How do you solve it? Using Huge Pages, of course; what else would you expect? On x86, besides the default 4KB pages, you also have the option of allocating 2MB and 1GB pages. Let me do the math for you: with just one 2MB page you can cover the same amount of memory as with 512 default 4KB pages. And guess what, you need fewer translation entries in the TLB caches. It does not eliminate TLB misses completely, but it greatly increases the chance of a TLB hit. I know you're probably interested in how to start using Huge Pages, but I do not want to repeat others, especially since there are so many good articles out there. This and that are among my favorites. You can also find an example in one of our recent Performance Tuning Challenges (Youtube) (Slides - slide 19).
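For completeness, here is a minimal sketch (error handling omitted) of the two common ways to get Huge Pages for data on Linux: explicit huge pages via mmap with MAP_HUGETLB, and transparent huge pages via madvise.

#include <sys/mman.h>

void example() {
  const size_t size = 2 * 1024 * 1024; // one 2MB huge page

  // Explicit huge pages: the pool must be reserved beforehand
  // (e.g., via /proc/sys/vm/nr_hugepages); mmap fails if none are available.
  void* ehp = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

  // Transparent huge pages: madvise is only a hint; the kernel may back the
  // region with 2MB pages if THP is enabled ("always" or "madvise" mode).
  void* thp = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  madvise(thp, size, MADV_HUGEPAGE);

  (void)ehp; (void)thp; // a real program would check for MAP_FAILED and use the buffers
}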
If you're still with me, today I want to talk about mapping the code section of your application onto Huge Pages. I see many people talk about Huge Pages for use with malloc, mmap, and so on. But the same problem exists not just for data but for instructions as well, and yet it doesn't get much attention. Let's fix that!
The example I use for demonstration purposes is the well-known Clang compiler, which has a very flat performance profile, i.e., there are no clear hotspots in it. Instead, many functions take 1-2% of the total execution time. A statically built clang binary on Linux has a code section of ~60MB, which does not fit into the L2 STLB. What likely happens in this case is that multiple hot functions are scattered all around that 60MB memory space, and they very rarely share the same 4KB memory page. When they begin frequently calling each other, they start competing for translation entries in the ITLB. Since the space in the L2 STLB is limited, this may become a serious bottleneck, as I will show next.
To show the benefits of using Huge Pages for the code section, I have built the latest Clang compiler from sources using instructions from here. I limited it only to building the compiler itself for the x86 target:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_TARGETS_TO_BUILD=X86 ../llvm
To demonstrate the performance problem, I compiled one of the LLVM source files with my newly built compiler and collected ITLB misses:
$ perf stat -e iTLB-loads,iTLB-load-misses ../llvm-project/build/bin/clang++ -c -O3 <other options> ../llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
180,759,568 iTLB-loads
12,068,728 iTLB-load-misses # 6.68% of all iTLB cache accesses
15.416281961 seconds time elapsed
Also, there is a nice little script built on top of Linux perf to estimate the fraction of cycles the CPU was stalled due to instruction TLB misses. It can give you an intuition of how much potential speedup you can achieve by tackling this issue.
$ git clone https://github.com/intel/iodlr.git
$ export PATH=`pwd`/iodlr/tools/:$PATH
$ measure-perf-metric.sh -m itlb_stalls -e ../llvm-project/build/bin/clang++ -c -O3 <other options> ../llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
=================================================
Final itlb_stalls metric
--------------------------------------------------
FORMULA: metric_ITLB_Misses(%) = 100*(a/b)
where, a=icache_64b.iftag_stall
b=cycles
=================================================
metric_ITLB_Misses(%)=7.0113070100
The previous output tells us that 7% of the cycles Clang spent compiling LoopVectorize.cpp were wasted doing page walks and populating TLB entries. 7% is a significant number, so there is something to improve. You can also continue the analysis by adding the -r option to measure-perf-metric.sh. This will sample on the icache_64b.iftag_stall event to locate the places the TLB stalls are coming from. Now let's consider what we can do about those ITLB misses.
Our first option is to tell the loader/kernel to place the code section of an application onto preallocated (explicit) Huge Pages. The key requirement here is that the code section must be aligned at the Huge Page boundary, in this case, 2MB. As you may have guessed, it requires that you relink your binary.
Here is the full command line to rebuild the clang compiler with the code section aligned at the 2MB boundary; notice I only added two -z linker options:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_TARGETS_TO_BUILD=X86 ../llvm -DCMAKE_CXX_LINK_FLAGS="-Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152"
$ ninja -j `nproc` clang
Now that I’ve done that I can see the difference in the generated ELF binary:
Baseline:
Aligned at 2MB boundary:
In the modified case, the PROGBITS section starts from offset 0xe00000, which equals 14MB, so it is a multiple of 2MB. In the baseline case, the .init section starts right after the .rela.plt section, with only a few padding bytes added. Notice that the size of the .text section didn't change – it's the same functional code, only its offset has changed.
The obvious downside is that the binary gets larger. In my case, the size of the clang-16 executable increased from 111MB to 114MB. Since the problem with ITLB misses usually arises in large applications, an additional 2-4MB of padding might not make a huge difference, but it is something to keep in mind.
Now that we’ve relinked our binary, we need to reconfigure the machine which will use our “improved” clang compiler. Here are the steps that you should do:
$ sudo apt install libhugetlbfs-bin
$ sudo hugeadm --create-global-mounts
$ sudo hugeadm --pool-pages-min 2M:128
Notice that I use libhugetlbfs, which unfortunately is no longer actively maintained. The good thing is that you can do most of these steps manually, e.g., the sudo hugeadm --pool-pages-min 2M:128 command likely does echo 128 | sudo tee /proc/sys/vm/nr_hugepages under the hood. The same goes for the --create-global-mounts command. Check out this Linux kernel documentation page for more details.
"Why 128 huge pages?" you may ask. Well, the size of the code section of the clang executable is 0x3b6b7a4 (see above), which is roughly 60MB, so about 30 huge pages would suffice. I could have allocated fewer, I agree, but since I have 16GB of RAM on my system… Once I reserved 128 explicit huge pages, memory usage jumped from 0.9GB to 1.15GB. That space became unavailable to other applications not utilizing huge pages. You can also check the effect of allocating explicit huge pages if you run:
$ watch -n1 "cat /proc/meminfo | grep huge -i"
AnonHugePages: 2048 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 128 <== 128 huge pages allocated
HugePages_Free: 128
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 262144 kB <== 256MB of space occupied
OK, so we allocated huge pages, let’s run the binary:
$ hugectl --text ../llvm-project/build_huge/bin/clang++ -c -O3 <other options> ../llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Notice that I prepended my command with hugectl --text, which, according to the description, "requests remapping of the program text". I do not claim that I fully understand all the mechanics associated with it. Again, I would be happy if someone could explain all the interactions between the binary, the loader, and the kernel.
The need to use hugectl can be eliminated if we set a special bit in the ELF header. This bit determines whether the text segment is backed by huge pages by default. A similar bit exists for the data segment as well. Here is how you can flip that switch:
$ hugeedit --text ../llvm-project/build_huge/bin/clang-16
Segment 2 0x0 - 0xc55758 () default is BASE pages
Segment 3 0xe00000 - 0x496cae1 (TEXT) default is HUGE pages <==
Segment 4 0x4a00000 - 0x59502b0 () default is BASE pages
Segment 5 0x5c04960 - 0x6095ef8 (DATA) default is BASE pages
Now I can run ../llvm-project/build_huge/bin/clang-16 without controlling it at runtime with hugectl --text. This step can also be done at the time of building the compiler.
I know you were waiting to see the results. Here we go…
$ perf stat -e iTLB-loads,iTLB-load-misses ../llvm-project/build_huge/bin/clang++ -c -O3 <other options> ../llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
46,838,084 iTLB-loads
1,627,038 iTLB-load-misses # 3.47% of all iTLB cache accesses
14.701666694 seconds time elapsed
Compared with the baseline, we have 7 times fewer iTLB misses (12M -> 1.6M), which resulted in a 5% faster compile time (15.4s -> 14.7s). You can see that we didn't fully get rid of iTLB misses; there are still 1.6M of them, which account for 4.1% of all cycles stalled (down from 7% in the baseline).
A good indicator that the code of your application is backed by Huge Pages is to watch /proc/meminfo. In my case, I observed 30 huge pages used by the process, which is enough to back the entire code section of the clang compiler.
# while clang compilation is running...
$ watch -n1 "cat /proc/meminfo | grep huge -i"
HugePages_Total: 128
HugePages_Free: 98 <== 30 huge pages are in use
I recently discovered another way of doing the same thing but without the need to recompile your application. Sounds interesting? I was very surprised when I experimented with it.
There is the iodlr library, which can automatically remap the code from the default pages onto huge pages. The easiest way to use it is to build the liblppreload.so library and preload it when running your application. For example:
$ cd iodlr/large_page-c
$ make -f Makefile.preload
$ sudo cp liblppreload.so /usr/lib64/
$ LD_PRELOAD=/usr/lib64/liblppreload.so ../llvm-project/build/bin/clang++ -c -O3 <other options> ../llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
This gives results indistinguishable from the previous method: roughly the same number of iTLB misses and the same running time. Pretty cool! The good thing about it is that you can speed up existing applications even when you don't have access to the source code. I haven't measured the overhead of the remapping2, but I tend to think it's not big. I haven't dug too deep into the liblppreload.so implementation, but if you're interested, you can take a look at the code; it's not that big.
The liblppreload.so library works with both explicit (EHP) and transparent huge pages (THP). By default it uses THP, so they must be enabled (/sys/kernel/mm/transparent_hugepage/enabled should be always or madvise). You can check that THP are being used by, again, watching /proc/meminfo:
# while clang compilation is running...
$ watch -n1 "cat /proc/meminfo | grep huge -i"
AnonHugePages: 61440 kB <== 30 transparent huge pages are in use
HugePages_Total: 128
HugePages_Free: 128 <== explicit huge pages are not used
If you want to use EHP instead, then you need to prepend the command line with IODLR_USE_EXPLICIT_HP=1:
$ IODLR_USE_EXPLICIT_HP=1 LD_PRELOAD=/usr/lib64/liblppreload.so clang++ <...>
Regardless of whether you use explicit or transparent Huge Pages, this method doesn’t require you to recompile the application. This is especially useful when you don’t have access to the source code.
Finally, if you want to spare the users of your application the need to preload liblppreload.so, you can also integrate it into the code of your application and automatically remap the code section right at startup (full example):
#include "large_page.h"
int main() {
map_status status;
bool is_enabled;
status = IsLargePagesEnabled(&is_enabled);
if (status == map_ok && is_enabled) {
status = MapStaticCodeToLargePages();
if (status == map_ok) {
// code section remapped
}
}
// ...
}
For peace of mind, I did some benchmarking of both options (aligning the code at 2MB and the iodlr library) against the baseline. As a benchmark, I built three different clang compilers and used each of them to build clang from sources (aka a self-build). I measured the total time it takes to compile the entire codebase as well as the iTLB misses. Here is my benchmarking command line3:
$ perf stat -e iTLB-loads,iTLB-load-misses -- ninja -j `nproc` clang
So yeah, as you can see, using Huge Pages gives roughly 5% faster compile times for the recent Clang compiler. Option 2 (the iodlr version) is faster than option 1, and I should say that I don't have a good explanation for that; it could be just measurement error.
Hey, sorry for the long article, but there was a lot to cover. I hope that it sparked your interest in using Huge Pages for code especially if you’re maintaining a large codebase. For further reading, I would recommend the paper “Runtime Performance Optimization Blueprint: Intel® Architecture Optimization with Large Code Pages”, which was instrumental to me.
Speaking of downsides, I should point out that remapping .text onto Huge Pages is not free and takes additional execution time. This may hurt rather than help short-running programs. Remember the mantra? Always measure. Another consideration concerns Transparent Huge Pages, which suffer from non-deterministic allocation latency and memory fragmentation. To satisfy a Huge Page allocation request at runtime, the Linux kernel needs to find a contiguous chunk of 2MB; if it is unable to find one, it needs to reorganize the pages, resulting in significantly longer allocation latency. In contrast, Explicit Huge Pages are allocated in advance and are not prone to such problems.
BTW, another way to attack iTLB misses is to BOLT your application. Among other things, it will likely group all the hot functions together, which should also drastically reduce the iTLB bottleneck. However, I think Huge Pages are a more fundamental solution to the problem, since they don't adapt to the particular behavior of the program. As a cross-check, you can try using Huge Pages after bolting your application to see if there are any additional gains to be made.
I did all the experiments on Linux, but what about Windows4 or Mac? The answer is I don’t know if the same is possible and I haven’t tried. I was able to find a few articles about using it for data but not for code. I would appreciate it if someone could provide more insights.
Last words before you close the huge page of this article… If you've been reading it and thinking that the default 4KB is an unreasonably small page size in the year 2022, hey, let's not open a whole can of worms. Let me just refer you to this article by one of the industry experts, who explores the topic in more detail.
Footnotes:
This is just to give you intuition, not to provide exact numbers – you can google them relatively easily. One thing I need to mention is that usually L1 ITLB and L1 DTLB have a different number of entries. For example, the latest Intel Golden Cove core has 256 entries in ITLB and 96 entries in DTLB. ↩
I’m not sure it’s visible when running with LD_DEBUG=statistics. Let me know in the comments if you know how to measure the overhead. ↩
Ideally I would like to run it at least 3 times, but man… on my machine compilation of the entire clang codebase takes about an hour, so, maybe next time. ↩
To utilize huge pages on Windows, one needs to enable the SeLockMemoryPrivilege security policy and use the VirtualAlloc API. ↩
Welcome to the 6th edition of our performance analysis and tuning challenge. If you haven’t participated in our challenges before, I highly encourage you to read the introductory post first.
The benchmark for the 6th edition is wordcount, which was suggested to me by Karthik Tadinada. The task is very simple: you need to split a text into words and count each word’s frequency, then print the list sorted by frequency. It doesn’t sound hard, probably an “Easy” on Leetcode, but you can be creative with implementing it. And guess what? The performance can be way better than what you would typically write on Leetcode.
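To make the task concrete, here is a deliberately naive sketch of a straightforward solution. This is my own illustration of the problem statement, not the challenge’s actual baseline:

```cpp
// Naive wordcount: read words, count frequencies, print sorted by frequency.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  std::ifstream in(argv[1]);
  std::unordered_map<std::string, std::size_t> freq;
  std::string word;
  while (in >> word)           // split the text into whitespace-separated words
    ++freq[word];

  std::vector<std::pair<std::string, std::size_t>> sorted(freq.begin(), freq.end());
  std::sort(sorted.begin(), sorted.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });

  for (const auto& [w, count] : sorted)
    std::cout << w << ' ' << count << '\n';
  return 0;
}
```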
This challenge was inspired by the following repo on Github. If you take a look at the tables on the front page of that repo, you will see that the C++ solution is in 4th place, while Java is the leader. I think C++ can do much better and can take the lead back. Just to be fair to other languages, I’m sure solutions in other languages, like Rust, can be improved as well.
The previous challenge ended with one of the participants submitting our best patch to the Kaldi repo, which provided a 13.5% speedup. Let’s keep the good performance engineering going and showcase its real power!
Without further ado, here is the link to the challenge: https://github.com/dendibakh/perf-challenge6. There you can find more instructions on how to set up the environment and build and benchmark your solution.
Yay! I have prizes!
Note that to receive your prize you must have a PayPal account. A crypto transaction of the same $ equivalent is also an option. For those who are interested, I fund it from the money I get from my Patreon and Github sponsors.
Also, to everyone in the US who breaks the “100 seconds” mark, I will send a signed copy of the book “Performance Analysis and Tuning on Modern CPUs” that I’ve written. Currently, the baseline code runs in 164 seconds on the Linux+Intel target machine and in 141 seconds1 on Windows+AMD.
Target configurations for this challenge are:
Intel + Linux:
AMD + Win:
Thanks to Mansur for providing an AMD machine for the challenge.
You don’t have to have those exact configurations. Feel free to use whatever machine you have access to. It’s fine if you solve the challenge on another Intel, AMD, or even ARM CPU. You can do your experiments on a Mac as well. The reason I define target configurations is to have a unified way to assess all the submissions.
I’m waiting for your submissions until June 30th 2022.
The best score will be determined as the lowest average time measured on the two target machines.
The general rules and guidelines for submissions are described here. Send your patch(es) via email to me with the subject “PerfChallenge6-Solution”. Do not open pull requests as other participants will see your solution.
In the end, I will write a post summarizing all the submissions. By submitting your code you’re giving your consent for me to share it if needed. I also ask you to provide a textual description of all the changes you’ve made if they are non-trivial. It will be much easier for me to analyze your code.
If you feel you’re stuck, don’t hesitate to ask questions or look for support on our discord channel. Other participants and I will do our best to answer your questions. Also, it’s a good idea to subscribe to my mailing list (if you haven’t already) to get updates about the challenge and more; I also use it to send a monthly newsletter about SW and HW performance.
If you know someone who might be interested in participating in this challenge, please spread the word about it. Good luck and have fun!
I’m open to your comments and suggestions. Especially if you have a proposal for a benchmark for the next edition of the challenge, please let us know.
Finally, if you like such puzzles, you can also check out my free online course “Performance Ninja” here. We have many small lab assignments dedicated to specific low-level performance optimizations.
Thanks to everyone who participated in this challenge and sent me their solutions! And of course, congratulations to the winners! I determined the best solutions based on the maximum (Linux speedup + Windows speedup).
| Name | Linux time (sec) | Linux speedup | Windows time (sec) | Windows speedup |
|------|------------------|---------------|--------------------|-----------------|
| Robert Burke | 13 | 12.5x | 16 | 13.7x |
| Stellaris62 | 25 | 6.5x | 27 | 8.1x |
| Andrey Evstyukhin | 47 | 3.5x | 55 | 4x |
| Jakub Gałecki | 52 | 3.1x | 55 | 4x |
| Mansur Mavliutov | 52 | 3.1x | 59 | 3.7x |
| Adam Richardson | 44 | 3.7x | 77 | 2.8x |
| Franek Korta | 64 | 2.5x | 83 | 2.6x |
| Alexey Shmelev | 82 | 2x | 96 | 2.3x |
| Adam Folwarczny | 94 | 1.7x | 85 | 2.6x |
| Ole Schulz-Trieglaff | 103 | 1.6x | 210 | 1.1x |
I’m very impressed with the solutions I’ve received! For me, it’s just another data point against the popular belief that “most software is optimal by default”. I think most existing SW is far from optimal, and sometimes the headroom is huge, as we showed in this challenge (>10x).
Here is the recording of our summary Zoom call, where we explored all the optimizations that were found during this challenge. (Youtube) (Slides)
Some of the techniques that we showcased:
- mmap the input file into the address space of the process (see the sketch below)

Even if you haven’t participated, I still encourage you to watch the recording as we tried to make it useful even for people not familiar with the challenge.
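As an illustration of the mmap technique mentioned above, here is a rough sketch (mine, not a participant’s solution) of mapping the input file on Linux, so that the word-counting scan runs directly over the mapped bytes with no read() copies:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) return 1;
  struct stat st;
  fstat(fd, &st);
  // Map the whole file read-only; the kernel pages it in on demand.
  void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (p == MAP_FAILED) return 1;
  const char* data = static_cast<const char*>(p);
  // ... scan [data, data + st.st_size) and count words ...
  munmap(p, st.st_size);
  close(fd);
  return 0;
}
```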
With this, I officially declare Performance Challenge #6 closed. :)
If you have ideas for future performance tuning challenges, please share them with me.
This measurement was done on Linux (the AMD machine is dual-boot). There must be some issue with running it on Windows, which shows 220 seconds. We are looking into this. ↩
Due to the nature of my work as a performance engineer, I analyze CPU performance of various general-purpose applications: data compression, audio/video codecs, office productivity tools, etc. There is one common thing I found in all of them: they all have performance-critical dependency chains. I frequently spend a significant amount of time trying to find the critical data dependency chain that limits the performance of an application.
In this post I will:
Let’s go!
It is always a good idea to start fairly high level, so let me give you an example. If you consider ANY general-purpose program, you will find that many operations depend on previous operations. In fact, it forms a dependency graph. Here is what such a graph may look like for an automated system handling CI/CD jobs:
As you can see, many dependency chains branch off from every operation. For a program to finish, all the operations need to be executed, but not all of them strictly impact performance. For instance, we can update a database and upload artifacts in parallel, but the latter operation takes more time to finish.
Now, even if we somehow improved the speed of updating the database, it would not improve the overall performance, since the program only ends when uploading artifacts is finished. Another interesting observation is that we could probably run a lot more operations in parallel with uploading artifacts, and it would not degrade the overall performance. The latter is only true to the extent that our HW is capable of providing the required execution throughput.
To some degree, execution throughput is not as real a limitation as data dependency. Theoretically, we can always add more execution power (CPU cores, memory channels, etc.) to satisfy the hungriest number-crunching algorithm. Modern systems are becoming much more capable of doing work in parallel – the number of CPU cores in laptops and servers is growing at a steady pace. I’m not saying execution throughput is never a real performance bottleneck. There are many scientific and other massively parallel applications that stress the machine’s throughput and require significant execution power. It’s just that data dependency is more fundamental.
Given that, for a general-purpose application, it is only the longest dependency chain (the trunk of the tree) that defines the performance of our program. All the secondary computations (branches of the tree) become less relevant. Probably nothing new for you. But I told you it was a high-level introduction, right? Now let’s go several levels down the SW stack… directly to the assembly level.
As the number of CPU cores grows, each individual core also gains more parallelism. Another way of saying this is that cores are becoming “wider”1. It means that an individual core is designed to do more operations in a single CPU cycle. Again, the idea is the same: we want to have enough execution throughput in every core so that the secondary computations do not interfere with the operations on the critical path. How can they interfere? Well, they can occupy execution ports and prevent instructions on the critical path from getting access to the resources on time (example follows). Remember: a delay on a secondary dependency chain does not necessarily affect performance, but a delay on the critical path ALWAYS prolongs the entire program.
Consider a hypothetical algorithm whose dependency graph is shown below. Take your time to look at it. I know this can be confusing, for example: “Why aren’t STOREs on the critical path?”. The backward jump (CMP+JMP) is also not on the critical path if it can be predicted well. If you struggle to understand it, I strongly recommend you read this awesome blog post by Fabian Giesen. Study it carefully, and you will become much better at identifying critical paths through your algorithms.
Consider the instructions SUB and SHR. They are executed in parallel, but one is on the critical path, while the other is not. If you had only one execution port for SUB or SHR, which would you schedule first? Obviously SUB, since it’s on the critical path. It may not be so obvious for a CPU, though. If we have multiple operations competing for the same execution resources, not choosing critical instructions first may lead to performance penalties. Luckily, cores are getting wider, so this becomes an issue less frequently.
Also, I left placeholders with three dots (…) just to show that there could be other operations that branch off and form new dependency graphs. But then again, having a very wide machine makes them essentially free. As a CPU design goal, on one hand, a narrow machine will let secondary operations create a roadblock for the critical path. On the other hand, a 100-wide machine probably doesn’t make sense, as there are no general-purpose algorithms with that much instruction-level parallelism (ILP).
Intermediate summary: because modern CPUs are becoming wider and wider, computations on secondary dependency chains become less relevant and are rarely the source of performance bottlenecks. For me as a performance engineer, it’s crucial to find the critical path, since everything else doesn’t matter much. Knowing which sequence of operations constitutes the critical path is super important: you know what should really be optimized and how much headroom you have.
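To make this concrete, here is a toy example of mine (not taken from any of the diagrams in this post): a single loop-carried dependency chain limits a floating-point reduction to one addition per add latency, while splitting it into several independent accumulators lets a wide out-of-order core keep multiple additions in flight:

```cpp
#include <cstddef>

// One accumulator: every addition depends on the previous one, so the loop
// cannot go faster than one floating-point add latency per element.
double sum_serial(const double* a, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    s += a[i];
  return s;
}

// Four independent accumulators: four shorter dependency chains proceed in
// parallel, so execution throughput (not latency) becomes the limit.
// Note: reassociating FP additions slightly changes rounding.
double sum_unrolled(const double* a, std::size_t n) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += a[i];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
  }
  for (; i < n; ++i)  // handle the remainder
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}
```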
I hope that by now I proved to you the importance of identifying the critical path in your program. Let’s see how we can find it.
Improving the speed of performance-critical pieces of code sometimes requires looking at CPU pipeline diagrams. This is what extreme tuning means! If you’re seeing one for the first time, this Wikipedia page might be a good starting point. It can be hard at first, but after some time you’ll be able to see patterns, I promise 😊.
UICA is one of the leading open-source tools for visualizing CPU pipeline diagrams. Shoutout to the authors, they’ve done a great job2! To illustrate my idea, I took the stock UICA code sample, which looks like this (don’t try to extract semantic meaning from this code, there is none – it’s synthetic):
loop:
  add  rax, [rsi]
  adc  rax, [rsi+rbx]
  shld rcx, rcx, 1
  shld rcx, rdx, 2
  dec  r15
  jnz  loop
Exercise: try to find a critical path through this piece of code before reading the rest of the article.
The CPU pipeline diagram below shows the execution of the first three iterations of the loop on the still very widespread Intel Skylake core. It shows how instructions progress over time. Modern superscalar CPUs can operate on different parts of different instructions in parallel, and that is what the diagram shows – several instructions can be in the execute stage at the same time. Notice that instructions from different iterations can also be executed in parallel: the CPU effectively unrolls the execution of the loop internally.
You can generate a similar trace if you use this link, or go to UICA and plug in the code above. I will not spend much time talking about this particular diagram – I’m only showing it in case you’ve never seen one before. And even if you have experience analyzing those, I recommend you skip the first 10-20 iterations to the place where execution reaches a steady state. Let’s take a look at iterations 14 and 15 below:
This is where things start to crystallize. Some instructions are executed much earlier than others; those are the instructions on the secondary dependency chain (ADD -> ADC). Remember, the front-end (fetching instructions) and retirement (committing results) stages are done “in order”, while execution can be done “out of order”. Notice that instructions on the secondary dependency chain sit inside the machine with their results ready to be committed, yet they must wait until the preceding instructions retire.
On the other hand, the instructions SHLD -> SHLD form the critical dependency chain, which has a latency of 4 cycles per iteration. The distance between the execution (E) and retirement (R) stages for the critical dependency chain is very small3 compared to the secondary dependency chain. We will use this later.
Question: is it possible that ADD -> ADC would become the critical dependency chain? Under which circumstances?
After looking at dozens of such diagrams, I began thinking: “Is it possible for a performance analysis tool to highlight critical dependency chains?” Yes, you can plug a hot loop into a tool like UICA, simulate it, and see where your critical path is. However, I don’t think this is a good solution, for two primary reasons: 1) it takes developer time, and 2) UICA is still an oversimplified model of a real CPU design.
So, can we teach your favorite performance profiler to highlight critical dependency chains? It is not a simple task, but I think it may be possible. Next, I share some initial thoughts on how to approach this problem.
So far, we have figured out that an instruction on the critical path must 1) have a low E->R distance, and 2) have a high execution count.
I think that HW support is required to achieve this goal. The CPU needs to keep track of the distance between the execution (E) and retirement (R) stages for a number of hot instructions. Here is an example of an internal table that a CPU could maintain:
| Instruction address | Average E->R distance | Execution count |
|---------------------|-----------------------|-----------------|
| 0xfffffffffffff1ee | 1.05 | 125 |
| 0xfffffffffffff1e2 | 1.07 | 125 |
| 0xfffffffffffff1e8 | 10.13 | 125 |
| … | … | … |
A sampling profiler, e.g., Linux perf, can later dump that table from the CPU with every collected sample. After profiling finishes, some postprocessing can be done to aggregate the collected data. I don’t have an exact postprocessing algorithm in mind though. Some thoughts:
- flag frequently executed instructions with a low E->R distance.

Some interesting cases:
1) This information may also be useful if we are bound by execution throughput.
In this example, there are no data dependencies. Instead, we are bound by the execution throughput of ports 0 and 6. Both the SHR and DEC + JNZ instructions occupy those two ports. The approach I described above will highlight those two instructions since they have the smallest E->R distance.
2) When we are equally bound by a dependency chain and execution throughput:
In this example, we have 4 instructions that compete for ports 0 and 6, which limits the throughput to one iteration every two cycles. Also, we have a data dependency ADD -> SUB, which likewise has a latency of 2 cycles. Luckily, the Skylake core is wide enough to handle both in 2 cycles per iteration.
For such a scenario, it would be good to highlight those two groups of instructions using different colors. The SW post-processing algorithm would then need to find independent instructions and group them together (they stress execution throughput). What remains is the critical dependency chain.
3) Capturing very long dependency chains may be problematic. Since the capacity of the HW internal table is finite, there will be cases when the entire dependency chain does not fit into that table. For now, the idea probably only makes sense for small and medium loops (up to a hundred instructions).
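To make the post-processing idea slightly more concrete, here is a rough sketch. It is entirely hypothetical: no CPU exposes such a table today, the sample format and thresholds are made up, and a real tool would need something smarter:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// One entry dumped by the (hypothetical) HW table with each profiler sample.
struct Sample {
  std::uint64_t address;      // instruction address
  double        er_distance;  // average E->R distance observed in this sample
  std::uint64_t exec_count;   // how many times the instruction executed
};

int main() {
  // Made-up samples, mirroring the table shown earlier.
  std::vector<Sample> samples = {
      {0xfffffffffffff1ee, 1.05, 125},
      {0xfffffffffffff1e2, 1.07, 125},
      {0xfffffffffffff1e8, 10.13, 125},
  };

  // Aggregate per instruction address across all collected samples.
  struct Agg { double dist_sum = 0.0; std::uint64_t count = 0; std::uint64_t n = 0; };
  std::map<std::uint64_t, Agg> agg;
  for (const auto& s : samples) {
    auto& a = agg[s.address];
    a.dist_sum += s.er_distance;
    a.count    += s.exec_count;
    a.n        += 1;
  }

  // Heuristic: critical-path candidates have a low average E->R distance
  // and a high execution count. Thresholds are arbitrary.
  const double        kMaxDistance = 2.0;
  const std::uint64_t kMinCount    = 100;
  for (const auto& [addr, a] : agg) {
    double avg_dist = a.dist_sum / a.n;
    if (avg_dist < kMaxDistance && a.count >= kMinCount)
      printf("critical-path candidate: %#llx (avg E->R distance = %.2f)\n",
             static_cast<unsigned long long>(addr), avg_dist);
  }
  return 0;
}
```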
All in all, I didn’t spend too much time digging into that idea. Prototyping it with UICA seems like a reasonable way to proceed. I would love to hear your feedback!
I think that if you adopt thinking in terms of critical dependency chains, you will be able to reason much better about the performance of a given piece of code.
Footnotes:
For example, the width of the still very widespread Skylake core is 4. Does that always mean a wider core is more performant? No. But making a core wider requires a lot of silicon and complicates the design. You see, you cannot just add more execution ports; you need to keep the machine balanced. ↩
Modern performance simulators that mimic the operation of a REAL processor are, of course, much more involved. They have more pipeline stages than UICA shows and many more events that can explain, for example, why a certain instruction was stalled at a particular cycle. Of course, those simulators are proprietary since they are critical intellectual property: each is an SW model of an entire CPU microarchitecture. ↩
People sometimes say that instructions on a critical path are “pushing the retirement”, which is true if you look at the diagram. ↩
In this short blog post, I decided to capture the most important highlights (for me) from all the Twitter Spaces conversations that I had during the year 2021. Some of those are not exact quotes, but rather my interpretation of my guests’ thoughts (I hope they call me out if I screwed it up). Everything in this post is in chronological order as the episodes came out. Recordings of all the episodes are available on my YouTube channel (sorry about the mediocre audio quality).
Daniel Lemire:
Nadav Rotem:
Mark Dawson:
Ivica Bogosavljevic:
Thomas Dullien:
Tomer Morad:
James Reinders:
Arnaldo Carvalho De Melo:
Bryan Cantrill:
Andrey Akinshin:
Some of the projects that we were talking about:
If you were my guest and you don’t see yourself on this list, I’m sorry; I haven’t saved the notes for every episode, so it’s hard for me to recover everything.