Denis Bakhvalov

Performance analysis and tuning challenge #5.

Categories: challenge

16 Jul 2021 by Ivica Bogosavljevic from Johny’s Software Lab


Welcome to the 5th edition of our performance analysis and tuning challenge. If you haven’t participated in our challenges before, we highly encourage you to read the introductory post first.

The fifth edition of the contest will be run by Ivica Bogosavljevic from Johny’s Software Lab blog. Ivica also writes about software performance, so feel free to go and check out his blog, there is a ton of useful content there.

The benchmark for the 5th edition is Kaldi. Kaldi is an open-source toolkit for speech recognition written in C++, intended to be used by researchers. In essence, Kaldi takes an input model and recorded speech and converts the speech to a textual representation. Speech recognition is a complex topic, but as always, we focus on hardware efficiency and general data-processing efficiency, not on the speech recognition algorithm itself.

Ivica recently had a conversation with the developers of the Kaldi project and they showed a huge interest in improving its performance. So, your participation in this challenge can have a big impact.


On Linux

Here are instructions on how to build Kaldi on Linux. To download and build Kaldi for the first time, do the following:

$ sudo apt install automake autoconf sox libtool subversion gfortran python
$ git clone
$ cd kaldi
$ git checkout ca6d133262aa183b23f6daba48995bd7576fb572
$ cd tools
$ extras/
$ make -j8
$ extras/
$ cd OpenBLAS
$ make PREFIX=`pwd` install
$ cd ../../src/
$ ./configure --shared --openblas-root=../tools/OpenBLAS/

The configure script generates a build configuration file. Open the file, find the line that starts with CXXFLAGS, and replace the default optimization level -O1 with -O3.

$ make clean -j8
$ make depend -j8
$ make -j8

The first time you build Kaldi, compiling everything can take a long time, so it is a good idea to start the build and go do something else. Later, when doing incremental builds, just run make -j8. If you added or removed headers (i.e., changed dependencies), you should also run make depend -j8 before make -j8.

On Windows

Cygwin must be installed before we can use the adapted Linux instructions. To install Cygwin with all the necessary packages, do:

$ setup-x86_64.exe -P clang -P autoconf -P automake -P make -P patch -P unzip -P wget -P sox -P libtool -P subversion -P zlib -P zlib-devel -P gcc-fortran -P cmake -P liblapack0

Start c:\cygwin64\Cygwin.bat and cd to the working directory. Clone and check out the Kaldi repository as usual. Manually append the following lines to the .gitattributes file:

*.sh eol=lf
configure eol=lf

Then execute the following commands to normalize the line endings:

$ git add .gitattributes
$ git commit -m "EOL"
$ git config --local core.autocrlf true
$ git rm --cached -r . 
$ git reset --hard

After that, follow the Linux instructions. However, for the configure step, static linking is mandatory:

$ ./configure --shared --openblas-root=../tools/OpenBLAS/ --static --static-fst

The autogenerated file src/base/version.h can have a problem: a carriage return in #define KALDI_VERSION "5.5\r". Repair the literal and disable the generation of version.h in the script under src/base/

Downloading and running the benchmark

Assuming you are in the kaldi/src directory, download the test as follows:

$ wget
$ unzip

To run the benchmark, execute:

$ cd test-speed
$ ../online2bin/online2-wav-nnet3-latgen-faster --word-symbol-table=graph/words.txt --config=conf/model.conf am/final.mdl graph/HCLG.fst ark:test.utt2spk scp:test.scp ark,t:output-baseline.txt

This creates the file output-baseline.txt. You will later compare your results against this file to make sure there are no functional regressions.

Rules of the game

When you profile the example, you will notice that a lot of time is spent in the sgemm_kernel and sgemm_copy functions. Unfortunately, these functions do not belong to Kaldi; they belong to the OpenBLAS library and are not the target of optimization in this contest.

To limit the scope of the changes, you are allowed to:

  • Modify only the files: and lattice-faster-decoder.h, which are part of Kaldi
  • You may introduce new .h files, but not new .cc files (you will need this if you want to introduce a custom allocator or custom data structures)
  • You may not spawn an additional thread to offload the work: no OpenMP, no POSIX threads, no std::thread. All the work must be done in a single thread with no outside help.
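Since the rules allow a new header for a custom allocator, here is a hedged sketch of what such a header could contain. The class name and design below are purely illustrative assumptions of mine, not part of Kaldi: a minimal bump-pointer arena that trades per-object deallocation for a single cheap Reset().

```cpp
// arena.h -- hypothetical example, not part of Kaldi.
// A bump-pointer arena: allocations are a pointer increment;
// everything is freed at once with Reset(), which suits
// short-lived per-frame objects such as decoder tokens.
#include <cassert>
#include <cstddef>
#include <vector>

class Arena {
 public:
  explicit Arena(std::size_t block_size = 1 << 16)
      : block_size_(block_size), offset_(block_size) {}
  ~Arena() { for (char *b : blocks_) delete[] b; }

  // Allocate n bytes, rounded up to max alignment; O(1) in the common case.
  void *Allocate(std::size_t n) {
    n = (n + alignof(std::max_align_t) - 1) &
        ~(alignof(std::max_align_t) - 1);
    assert(n <= block_size_);  // sketch: oversized requests unsupported
    if (offset_ + n > block_size_) {      // current block exhausted
      blocks_.push_back(new char[block_size_]);
      offset_ = 0;
    }
    void *p = blocks_.back() + offset_;
    offset_ += n;
    return p;
  }

  // Free everything at once -- no per-object delete calls.
  void Reset() {
    for (char *b : blocks_) delete[] b;
    blocks_.clear();
    offset_ = block_size_;
  }

 private:
  std::size_t block_size_;
  std::size_t offset_;              // position inside the current block
  std::vector<char *> blocks_;      // owned memory blocks
};
```

Whether this helps depends on what the profile shows; it only pays off if allocation churn is actually hot.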

You are also not allowed to modify the compilation flags, configuration files, etc. Note that this rule is not set in stone: if you believe that good performance requires changing other files as well, let us know and we can agree to amend the rule.

This task is tough, and you will want to cooperate with other participants. We created a Discord channel to facilitate cooperation, and we will be answering questions there. Join the channel using this link.

The target configuration for this challenge is an Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz with 6MB L3 cache + 8GB RAM + 64-bit Ubuntu 20.04 + Clang C++ compiler version 12.0. However, you are free to use whatever environment you have access to: it's fine to solve the challenge on another Intel, AMD, or ARM CPU, and you can run your experiments on Windows or Mac. The reason we define a target configuration is to have a unified way to assess all submissions. In the end, it is not about getting the best score, but about practicing performance optimization.

General Recommendations

I also have a few general hints:

  • Do not try to understand the whole algorithm. For some people, it's crucial to understand how every piece of code works; for the purpose of optimization, that is wasted effort. There are CPU benchmarks with thousands of LOC (like SPEC2017) that are absolutely impossible to understand in a reasonable time. What you need to familiarize yourself with are the hotspots. That's it. You will most likely need to understand one function or loop of no more than 100 LOC.
  • You have a specific workload for which you optimize the benchmark; you don't need to optimize it for any other input/workload. The main principle behind data-oriented design is that you know the data of your application.
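As a sketch of that data-oriented principle, suppose profiling reveals a loop that scans only one field of every token each frame. The Token layout below is a hypothetical example of mine, not Kaldi's actual data structure; the point is that a struct-of-arrays layout keeps the hot field dense in cache instead of striding over interleaved hot and cold fields.

```cpp
// Hypothetical illustration of a struct-of-arrays (SoA) layout.
#include <cstddef>
#include <vector>

struct TokensSoA {
  std::vector<float> cost;   // hot: scanned every frame
  std::vector<int>   state;  // cold: touched only on the best path
};

// Scanning only `cost` now reads one contiguous array, so each
// cache line delivers 16 useful floats instead of a hot/cold mix.
float MinCost(const TokensSoA &t) {
  float best = t.cost.empty() ? 0.0f : t.cost[0];
  for (std::size_t i = 1; i < t.cost.size(); ++i)
    if (t.cost[i] < best) best = t.cost[i];
  return best;
}
```

This kind of transformation only makes sense once the profile confirms which fields the hot loop actually touches for this specific workload.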

If you feel you're stuck, don't hesitate to ask questions or look for support on our Discord channel. We and other participants will do our best to answer your questions. We will also be giving out hints every week or two, as we did for the previous challenge.


Validation

The results must match the baseline, and there shouldn't be any memory leaks (check with gperftools). To compare against the baseline, run:

$ ../online2bin/online2-wav-nnet3-latgen-faster --word-symbol-table=graph/words.txt --config=conf/model.conf am/final.mdl graph/HCLG.fst ark:test.utt2spk scp:test.scp ark,t:output.txt
$ diff output-baseline.txt output.txt

If you didn't make any mistakes in your modifications, the files output-baseline.txt and output.txt will be identical and diff will print nothing. If they differ, there is a bug.


Submissions

We will not use submissions for any commercial purposes. However, a good and maintainable solution may be merged back into the Kaldi source tree.

All submissions will be measured on the target configuration specified above.

We conduct performance challenges via Denis' mailing list, so it's a good idea to subscribe (if you haven't already) to get updates about the challenge and to submit your solution. Send your patch(es) via email to both Ivica and Denis. The general rules and guidelines for submissions are described here. We also ask you to provide a textual description of all the transformations you made; this will make it much easier for us to analyze your submission.

We are collecting submissions until 29th August 2021.

P.S. Spread the word

If you know someone who might be interested in participating in this challenge, please spread the word about it. Good luck and have fun!

We are open to your comments and suggestions. In particular, if you have a proposal for a benchmark for the next edition of the challenge, please let us know. Finding a good benchmark isn't easy.



All content on Easyperf blog is licensed under a Creative Commons Attribution 4.0 International License