Performance analysis and tuning challenge #6.

Categories: challenge

28 May 2022

Contents:

Prizes
Target configuration
Submissions
Where to get help?
P.S. Spread the word
UPD: July 20th, 2022: results

Welcome to the 6th edition of our performance analysis and tuning challenge. If you haven’t participated in our challenges before, I highly encourage you to read the introductory post first.

The benchmark for the 6th edition is wordcount, which was suggested to me by Karthik Tadinada. The task is very simple: you need to split a text and count each word’s frequency, then print the list sorted by the frequency of each word. It doesn’t sound hard, probably an “Easy” on Leetcode, but you can be creative with implementing it. And guess what? Performance could be way better than what you would typically do at Leetcode.

This challenge was inspired by the following repo on Github. If you take a look at the tables on the front page of that repo, you will see that solution in C++ is in the 4th place, while Java is a leader. I think that C++ can do much better and can gain the lead back. Just to be fair to other languages, I’m sure solutions in other languages like Rust can be improved as well.

The previous challenge ended in one of the participants submitting our best patch to the Kaldi repo, which provided 13.5% speedup. Let’s keep the good performance engineering going and showcase its real power!

Without further ado, here is the link to the challenge: https://github.com/dendibakh/perf-challenge6. There you can find more instructions on how to set up the environment and build and benchmark your solution.

Prizes

Yay! I have prizes!

1st place - $50
2nd place - $30
3rd place - $20

Note that to receive your prize you must have a Paypal account. Crypto transaction in the same $ equivalent is also an option. For those who are interested, I fund it from the money I get from my Patreon and Github sponsors.

Also, to everyone in the US who would be able to break the “100 seconds” mark, I will send a signed book “Performance Analysis and Tuning on Modern CPUs” that I’ve written. Currently, the baseline code runs at 164 seconds on the Linux+Intel target machine and at 141 seconds¹ on Windows+AMD.

Target configuration

Target configurations for this challenge are:

Intel + Linux:

Intel Core i5-8259U CPU @ 2.30GHz Base (3.80GHz Turbo), 6MB L3-cache
16GB RAM
256GB NVME INTEL SSDPEKKW256G8
64-bit Ubuntu 20.04, kernel version 5.13.0-27
Clang C++ compiler version 14.0.

AMD + Win:

AMD Ryzen 7 3700X 8Core @ 3.60GHz Base (4.40GHz Turbo), 32MB L3-cache
64GB RAM
ADATA XPG SX8200 Pro 1TB 3D NAND NVMe SSD
Windows 11 Version 21H2
Clang C++ compiler version 14.0.

Thanks to Mansur for providing AMD machine for the challenge.

You don’t have to have those exact configurations. Feel free to use whatever machine you have access to. It’s fine if you solve the challenge on another Intel, AMD, or even ARM CPUs. You can do your experiments on Mac as well. The reason why I define the target configuration is to have a unified way to assess all the submissions.

Submissions

I’m waiting for your submissions until June 30th 2022.

The best score will be determined as the lowest average time measured on the two target machines.

The general rules and guidelines for submissions are described here. Send your patch(es) via email to me with the subject “PerfChallenge6-Solution”. Do not open pull requests as other participants will see your solution.

In the end, I will write a post summarizing all the submissions. By submitting your code you’re giving your consent for me to share it if needed. I also ask you to provide a textual description of all the changes you’ve made if they are non-trivial. It will be much easier for me to analyze your code.

Where to get help?

If you feel you’re stuck, don’t hesitate to ask questions or look for support on our discord channel. I and other participants will do our best to answer your questions. Also, it’s a good idea to subscribe to my mailing list (if you haven’t already) to get updates about the challenge and more. I also run a monthly newsletter about SW and HW performance through my mailing list.

P.S. Spread the word

If you know someone who might be interested in participating in this challenge, please spread the word about it. Good luck and have fun!

I’m open to your comments and suggestions. Especially if you have a proposal for a benchmark for the next edition of the challenge, please let us know.

Finally, if you like such puzzles you can also check out my free online course “Performance Ninja” here. We have many small lab assignments dedicated to certain low-level performance optimization.

UPD: July 20th, 2022: results

Thanks to everyone who participated in this challenge and sent me their solutions! And of course, congratulations to the winners! I determined the best solutions based on the maximum (Linux speedup + Windows speedup).

|     Name                    |  Solution Linux (sec) |     Linux speedup    | Solution Windows (sec)  |     Windows speedup    |
|-----------------------------|-----------------------|----------------------|-------------------------|------------------------|
|     Robert Burke            |     13                |     12.5x            |     16                  |     13.7x              |
|     Stellaris62             |     25                |     6.5x             |     27                  |     8.1x               |
|     Andrey Evstyukhin       |     47                |     3.5x             |     55                  |     4x                 |
|     Jakub Gałecki           |     52                |     3.1x             |     55                  |     4x                 |
|     Mansur Mavliutov        |     52                |     3.1x             |     59                  |     3.7x               |
|     Adam Richardson         |     44                |     3.7x             |     77                  |     2.8x               |
|     Franek Korta            |     64                |     2.5x             |     83                  |     2.6x               |
|     Alexey Shmelev          |     82                |     2x               |     96                  |     2.3x               |
|     Adam Folwarczny         |     94                |     1.7              |     85                  |     2.6x               |
|     Ole Schulz-Trieglaff    |     103               |     1.6x             |     210                 |     1.1x               |

I’m very impressed with the solutions I’ve received! For me, it’s just another confirmation against a popular belief that “most software is optimal by default”. I think that most existing SW is far from optimal and sometimes the headroom is huge, as we showed in this challenge (>10x).

Here is the recording of our summary Zoom call, where we explored all the optimizations that were found during this challenge. (Youtube) (Slides)

Some of the techniques that we showcased:

mmap file into the address space of the process
SIMD-based word parsing
Use large pages for mitigating DTLB misses when accessing hash table entries
Prefetching hash table entries
and many others

Even if you haven’t participated, I still encourage you to watch the recording as we tried to make it useful even for people not familiar with the challenge.

With this, I officially declare Performance Challenge #6 closed. :)

If you have ideas for future performance tuning challenges, please share them with me.

This measurement was done on Linux (AMD machine is in dual boot). There must be some issue with running it on Windows, which shows 220 seconds. We are looking into this. ↩

Denis Bakhvalov