This blog is an excerpt from the book. More details in the introduction.
In the case study, we have analyzed several throughput-oriented applications with varying thread count scaling characteristics. Here is a quick summary of our findings:
To confirm that suboptimal scaling is a common case, rather than an exception, let’s look at the SPEC CPU 2017 suite of benchmarks. In the rate part of the suite, each hardware thread runs its own single-threaded workload, so there are no slowdowns caused by thread synchronization. According to one of the MICRO 2023 keynotes [1], benchmarks that have integer code (regular general-purpose programs) have a thread count scaling in the range of 40% to 70%, while benchmarks that have floating-point code (scientific, media, and engineering programs) have a scaling in the range of 20% to 65%. Those numbers represent inefficiencies caused just by the hardware platform. Inefficiencies caused by thread synchronization in multithreaded programs further degrade performance scaling.
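To make the scaling percentages concrete, here is a minimal sketch of one common way to compute such a number. It assumes that "thread count scaling" means the measured aggregate throughput divided by the ideal linear throughput (single-copy throughput multiplied by the number of copies). That definition, the function name, and the sample numbers are illustrative assumptions, not taken from the keynote or from SPEC.

```python
def scaling_efficiency(single_copy_throughput, n_copies, total_throughput):
    """Fraction of ideal linear scaling achieved by n_copies running at once.

    Assumed definition: measured aggregate throughput divided by
    (throughput of one copy running alone * number of copies).
    """
    ideal_throughput = single_copy_throughput * n_copies
    return total_throughput / ideal_throughput


# Hypothetical measurements: one copy alone completes 10 units of work
# per second; 16 concurrent copies together complete 96 units per second.
efficiency = scaling_efficiency(10.0, 16, 96.0)
print(f"scaling: {efficiency:.0%}")  # prints "scaling: 60%"
```

Under this definition, a value of 60% means that 16 copies deliver the throughput of roughly 9.6 perfectly scaled copies. Because the copies do not communicate, the shortfall comes from the hardware platform rather than from synchronization.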
In a latency-oriented application, you typically have a few performance-critical threads and the rest do background work that doesn’t necessarily have to be fast. Many issues that we’ve discussed apply to latency-oriented applications as well. We covered some low-latency tuning techniques in Section 12.2.
[1] Debbie Marr, “Architecting for Power-Efficiency in General-Purpose Computing”, https://youtu.be/IktNjMxJYPE?t=2599