This blog post is an excerpt from the book. More details in the introduction.
Next on our list is the Zstandard compression algorithm, or Zstd for short. When compressing data, Zstd divides the input into blocks, and each block can be compressed independently. This means that multiple threads can work on compressing different blocks simultaneously. Although it seems that Zstd should scale well with the number of threads, it doesn’t. Performance scaling stops at around 5 threads, sooner than in the previous two benchmarks. As you will see, the dynamic interaction between Zstd worker threads is quite complicated.
First of all, the performance of Zstd depends on the compression level: lower levels compress faster, while higher levels yield better compression ratios. In our case study, we used compression level 3 (the default), since it provides a good trade-off between speed and compression ratio.
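To make the setup concrete, here is a minimal sketch of driving multithreaded Zstd compression at level 3 through the public streaming API. The I/O handling is our own simplification (error checks omitted for brevity), and `ZSTD_c_nbWorkers` only takes effect when libzstd is built with multithreading support:

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

// Compress stdin to stdout with compression level 3 and 8 worker threads.
int main(void) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3); // default level
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 8);        // multithreaded mode

    size_t const inCap  = ZSTD_CStreamInSize();   // recommended buffer sizes
    size_t const outCap = ZSTD_CStreamOutSize();
    void* inBuf  = malloc(inCap);
    void* outBuf = malloc(outCap);

    size_t readBytes;
    while ((readBytes = fread(inBuf, 1, inCap, stdin)) > 0) {
        ZSTD_inBuffer input = { inBuf, readBytes, 0 };
        while (input.pos < input.size) {          // flush as data becomes ready
            ZSTD_outBuffer output = { outBuf, outCap, 0 };
            ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_continue);
            fwrite(outBuf, 1, output.pos, stdout);
        }
    }

    size_t remaining;
    do {                                          // finish the frame and drain
        ZSTD_outBuffer output = { outBuf, outCap, 0 };
        ZSTD_inBuffer input = { NULL, 0, 0 };
        remaining = ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_end);
        fwrite(outBuf, 1, output.pos, stdout);
    } while (remaining != 0);

    free(inBuf); free(outBuf);
    ZSTD_freeCCtx(cctx);
    return 0;
}
```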
Here is the high-level algorithm of Zstd compression:[^1]

1. The input data is divided into blocks, and a separate compression job is created for each block.
2. The main thread prepares new jobs: it acquires an input buffer and fills it with data to be compressed.
3. Prepared jobs are posted to the job queue, from which worker threads pick them up and compress them.
4. As jobs produce portions of compressed data, the main thread flushes (writes out) those portions in order.
To visualize the work of the Zstd algorithm on a timeline, we instrumented the Zstd source code with VTune's ITT markers.[^2] The timeline of compressing the Silesia corpus using 8 threads is shown in Figure 4. Using 8 worker threads is enough to observe thread interaction in Zstd while keeping the image less noisy than when all 16 threads are active. The second half of the timeline was cut to make the image fit on the page.
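Concretely, the instrumentation amounts to wrapping each region of interest in an ITT task. Here is a minimal sketch of that pattern; the domain and task names are our own placeholders, not necessarily the ones used in the experiment (build by linking against `libittnotify`):

```c
#include <ittnotify.h>

// Created once at startup; VTune groups all tasks under this domain.
static __itt_domain* domain;
static __itt_string_handle* job_marker;

void init_markers(void) {
    domain     = __itt_domain_create("zstd.compress");
    job_marker = __itt_string_handle_create("job");
}

// Everything between task_begin and task_end appears as a named bar
// on the thread's timeline in VTune.
void compress_one_block(void) {
    __itt_task_begin(domain, __itt_null, __itt_null, job_marker);
    /* ... compress the block ... */
    __itt_task_end(domain);
}
```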
Figure 4. Timeline view of compressing Silesia corpus with Zstandard using 8 threads.
On the image, we have the main thread at the bottom (TID 913273), and eight worker threads at the top. The worker threads are created at the beginning of the compression process and are reused for multiple compressing jobs.
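This "create once, reuse for many jobs" structure is a classic persistent thread pool. The self-contained sketch below shows the pattern in plain pthreads; it is our own simplification for illustration, not Zstd's actual code:

```c
#include <pthread.h>
#include <stdio.h>

#define QUEUE_CAP 16  // this demo never pushes more than QUEUE_CAP jobs

typedef struct {
    int jobs[QUEUE_CAP];        // job ids; a real queue would hold job structs
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} job_queue;

static job_queue q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                       .nonempty = PTHREAD_COND_INITIALIZER };

static void push_job(int id) {
    pthread_mutex_lock(&q.lock);
    q.jobs[q.tail] = id;
    q.tail = (q.tail + 1) % QUEUE_CAP;
    q.count++;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.lock);
}

static int pop_job(void) {
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)                       // the 'ww' (worker wait) period
        pthread_cond_wait(&q.nonempty, &q.lock);
    int id = q.jobs[q.head];
    q.head = (q.head + 1) % QUEUE_CAP;
    q.count--;
    pthread_mutex_unlock(&q.lock);
    return id;
}

static void* worker_main(void* arg) {
    (void)arg;
    for (;;) {
        int id = pop_job();
        if (id < 0) break;                     // negative id = shutdown signal
        printf("compressing block %d\n", id);  // stand-in for real compression
    }
    return NULL;
}

int main(void) {
    pthread_t tids[8];
    for (int i = 0; i < 8; i++)                // workers created once, up front
        pthread_create(&tids[i], NULL, worker_main, NULL);
    for (int id = 0; id < 8; id++)             // main thread posts jobs
        push_job(id);
    for (int i = 0; i < 8; i++)                // one shutdown token per worker
        push_job(-1);
    for (int i = 0; i < 8; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```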
On the worker thread timelines (top 8 rows) we have the following markers:

- `jobN` bars: periods when a worker thread is busy compressing block `N`. Small notches below a job bar mark the moments when the job delivers a portion of compressed data.
- `ww` (worker wait): periods when a worker thread sits idle, waiting for a new job.

On the main thread timeline (row 9, TID 913273) we have the following markers:

- `pN`: periods when the main thread prepares job `N`, i.e., fills the input buffer with new data.
- `fw` (flush wait): periods when the main thread waits for more compressed data to arrive so that it can flush it.
With a quick glance at the image, we can tell that there are many `ww` periods when worker threads are waiting. This negatively affects the performance of Zstandard compression. Let's progress through the timeline and try to understand what's going on.
1. At the start, the main thread quickly prepared the first eight jobs (`p0`-`p7`), which were picked up by worker threads immediately. Notice that the main thread also prepared `job8` (`p8`), but it hasn't posted it in the worker queue yet. This is because all workers are still busy.
2. After the main thread finished preparing `job8`, it flushed the data already produced by `job0`. Notice that by this time, `job0` has already delivered five portions of compressed data (the first five notches below the `job0` bar). Now, the main thread enters its first `fw` period and starts to wait for more data from `job0`.
3. At around 45ms, one more chunk of compressed data is produced by `job0`, and the main thread briefly wakes up to flush it, see (1). After that, it goes to sleep again.
4. `Job3` is the first to finish, but there is a couple-of-milliseconds delay before TID 913309 picks up a new job, see (2). This happens because `job8` was not yet posted in the queue by the main thread. Luckily, a new portion of compressed data comes from `job0`, so the main thread wakes up, flushes it, and notices that there are idle worker threads. So, it posts `job8` to the worker queue and starts preparing the next job (`p9`).
5. Further down the timeline, `job10` could have been picked up by either TID 913314 or TID 913312, since they were both idle at the time `job10` was pushed to the job queue.
6. We would expect the main thread to start preparing `job11` immediately after `job10` was posted in the queue, as it did before. But it didn't. This happens because there are no available input buffers; we will discuss this in more detail shortly.
7. Only when `job0` finished was the main thread able to acquire a new input buffer and start preparing `job11`, see (5).

As we just said, the reason for the 20-40ms delays between jobs is the lack of input buffers, which are required to start preparing a new job. Zstd maintains a single memory pool that allocates space for both input and output buffers. This memory pool is prone to fragmentation, as it has to provide contiguous blocks of memory. When a worker finishes a job, its output buffer waits to be flushed, but it still occupies memory. And to start working on another job, the worker requires another pair of buffers (an input and an output buffer).
Limiting the capacity of the memory pool is a design decision to reduce memory consumption. In the worst case, there could be many "run-away" buffers left by workers that completed their jobs very fast and moved on to the next one, while the flush queue is still blocked by one slow job. In such a scenario, memory consumption would be very high, which is undesirable. The downside, however, is increased wait time between jobs.
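To make this dynamic concrete, below is a minimal sketch of a fixed-capacity buffer pool where the producer blocks until a buffer is returned. It is our own simplification: it only counts buffers and ignores the sizing and fragmentation issues of the real pool:

```c
#include <pthread.h>

#define POOL_CAPACITY 8  // hypothetical limit on in-flight buffers

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  available;
    int free_count;      // buffers not currently owned by a job
} buffer_pool;

void pool_init(buffer_pool* p) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->available, NULL);
    p->free_count = POOL_CAPACITY;
}

// Called by the main thread when preparing a new job. Blocks when the
// pool is exhausted -- the delay we saw before job11 was prepared.
void pool_acquire(buffer_pool* p) {
    pthread_mutex_lock(&p->lock);
    while (p->free_count == 0)
        pthread_cond_wait(&p->available, &p->lock);
    p->free_count--;
    pthread_mutex_unlock(&p->lock);
}

// Called after a job's output has been flushed; wakes up the main thread.
void pool_release(buffer_pool* p) {
    pthread_mutex_lock(&p->lock);
    p->free_count++;
    pthread_cond_signal(&p->available);
    pthread_mutex_unlock(&p->lock);
}
```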
The Zstd compression algorithm is a good example of a complex interaction between threads. It is a useful reminder that even if you have a parallelizable workload, the performance of your application can be limited by thread synchronization and resource availability.
-> part 4
[^1]: Huge thanks to Yann Collet, the author of Zstandard, for providing us with the information about the internal workings of Zstandard.
[^2]: VTune ITT instrumentation - https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-1/instrumenting-your-application.html