Welcome to the 5th edition of our performance analysis and tuning challenge. If you haven’t participated in our challenges before, we highly encourage you to read the introductory post first.
The fifth edition of the contest will be run by Ivica Bogosavljevic from Johny’s Software Lab blog. Ivica also writes about software performance, so feel free to check out his blog; there is a ton of useful content there.
The benchmark for the 5th edition is KALDI. Kaldi is an open-source toolkit for speech recognition written in C++ intended to be used by researchers. In essence, Kaldi takes an input model and recorded speech and converts the speech to a textual representation. Speech recognition is a complex topic, but as always, we focus on hardware efficiency and general data processing efficiency, not on the speech recognition algorithm itself.
Ivica recently had a conversation with the developers of the Kaldi project and they showed a huge interest in improving its performance. So, your participation in this challenge can have a big impact.
Here are instructions on how to build Kaldi on Linux. To download and build kaldi for the first time do the following:
$ sudo apt install automake autoconf sox libtool subversion gfortran python
$ git clone https://github.com/kaldi-asr/kaldi.git
$ cd kaldi
$ git checkout ca6d133262aa183b23f6daba48995bd7576fb572
$ cd tools
$ extras/check_dependencies.sh
$ make -j8
$ extras/install_openblas.sh
$ cd OpenBLAS
$ make PREFIX=`pwd` install
$ cd ../../src/
$ ./configure --shared --openblas-root=../tools/OpenBLAS/
The configure script generates the file called kaldi.mk. Open the file, find the line that starts with CXXFLAGS, and replace default optimization level
$ make clean -j8
$ make depend -j8
$ make -j8
The first time you build Kaldi, compiling everything can take a long time, so it is a good idea to let it run and go do something else. Later, when doing incremental builds, just run make -j8. If you added or removed headers (changed dependencies), you should also run make depend -j8 before make -j8.
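The incremental build steps above can be wrapped in a small shell function. This is only a sketch: the function name kaldi_rebuild and the MAKE and JOBS variables are hypothetical conveniences, not part of Kaldi, and the function is assumed to be called from the kaldi/src directory.

```shell
# Hypothetical wrapper around the incremental build steps.
# Pass "deps" as the first argument after adding or removing headers.
# MAKE and JOBS are assumed overrides; they default to make and 8.
kaldi_rebuild() {
    make_cmd="${MAKE:-make}"
    jobs="${JOBS:-8}"
    if [ "${1:-}" = "deps" ]; then
        # Regenerate dependency information first; stop if that fails.
        "$make_cmd" depend "-j$jobs" || return 1
    fi
    "$make_cmd" "-j$jobs"
}
```

Calling kaldi_rebuild runs a plain incremental build, while kaldi_rebuild deps refreshes the dependencies first.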
Cygwin must be installed before the instructions adapted from Linux can be used. To install Cygwin with all the necessary packages do:
$ setup-x86_64.exe -P clang -P autoconf -P automake -P make -P patch -P unzip -P wget -P sox -P libtool -P subversion -P zlib -P zlib-devel -P gcc-fortran -P cmake -P liblapack0
Run c:\cygwin64\Cygwin.bat and cd to the working directory. Clone and check out the kaldi repository as usual. Manually append the following lines to the .gitattributes file:
*.sh eol=lf
configure eol=lf
Then execute the following commands to normalize line endings:
$ git add .gitattributes
$ git commit -m "EOL"
$ git config --local core.autocrlf true
$ git rm --cached -r .
$ git reset --hard
After that, follow the Linux instructions. However, for the configure step, static linking is mandatory:
$ ./configure --shared --openblas-root=../tools/OpenBLAS/ --static --static-fst
src/base/version.h can have a problem: a carriage return in the version literal, as in #define KALDI_VERSION "5.5\r". Repair the literal and disable generation of version.h in script
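One possible way to detect and strip stray carriage returns from a file such as src/base/version.h is sketched below. The helper names has_cr and strip_cr are hypothetical, and the sketch assumes the problem is a raw carriage-return byte in the file rather than the two characters backslash and r.

```shell
# Hypothetical helper: succeed (exit status 0) if the file contains a CR byte.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Hypothetical helper: remove every CR byte from a file in place.
strip_cr() {
    file="$1"
    tmp="$file.tmp.$$"
    # tr deletes all CR bytes; the original is replaced only if tr succeeds.
    tr -d '\r' < "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For example, has_cr src/base/version.h && strip_cr src/base/version.h cleans the file only when it is affected. Note that this does not stop the file from being regenerated with the same problem on the next build.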
Assuming you are in the kaldi/src directory, download the test as follows:
$ wget https://johnysswlab.com/downloads/test-speed.zip
$ unzip test-speed.zip
To run the benchmark, execute:
$ cd test-speed
$ ../online2bin/online2-wav-nnet3-latgen-faster --word-symbol-table=graph/words.txt --config=conf/model.conf am/final.mdl graph/HCLG.fst ark:test.utt2spk scp:test.scp ark,t:output-baseline.txt
This creates the file called output-baseline.txt. You will later use this file to compare your changes against it to make sure there are no functional regressions.
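Since the challenge is about runtime, it can help to note how long the benchmark takes before changing anything. The function below is a rough sketch, not an official measurement method: the name time_runs is hypothetical, and it assumes a date command that supports the %s format (as GNU and BSD date do), so it only resolves whole seconds.

```shell
# Hypothetical helper: run a command several times and print the
# wall-clock duration of each run in whole seconds.
# Usage: time_runs <number-of-runs> <command> [args...]
time_runs() {
    runs="$1"
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        # Discard the command's own output; only the timing is reported.
        if ! "$@" > /dev/null 2>&1; then
            echo "run $i: command failed" >&2
            return 1
        fi
        end=$(date +%s)
        echo "run $i: $((end - start)) s"
        i=$((i + 1))
    done
}
```

To use it, pass the number of runs followed by the full benchmark command shown above, for example time_runs 3 followed by the online2-wav-nnet3-latgen-faster invocation. Several runs give a feel for the run-to-run variation, which matters when judging small improvements.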
When you profile the example, you will notice that a lot of time is spent in sgemm_copy functions. Unfortunately, these functions do not belong to Kaldi; they belong to the OpenBLAS library, and they are not the goal of optimization in this contest.
To limit the scope of the changes, you are allowed to:
lattice-faster-decoder.h, files which are part of kaldi
.h files, but not new .cc files (you will need this if you want to introduce a custom allocator or custom data structures)
std::thread. All the work needs to be done in a single thread with no help from the outside.
You are not allowed to modify the compilation flags, modify configuration files, etc. Please note that this rule is not set in stone; if you believe that good performance requires changing other files as well, let us know and we can agree to change this rule.
This task is tough and you will want to cooperate with other participants. We created a discord channel to facilitate cooperation, and we will be answering all questions there. Join the channel using this link.
The target configuration for this challenge is Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz, 6MB L3-cache + 8GB RAM + 64-bit Ubuntu 20.04 + Clang C++ compiler version 12.0. However, you are free to use whatever environment you have access to. It’s fine if you solve the challenge on another Intel, AMD, or ARM CPU. Also, you can do your experiments on Windows or Mac. The reason we define the target configuration is to have a unified way to assess all the submissions. In the end, it is not about getting the best score, but about practicing performance optimizations.
I also have a few general hints:
If you feel you’re stuck, don’t hesitate to ask questions or look for support on our discord channel. We and other participants will do our best to answer your questions. Also, we will be giving hints to you every week (or two weeks) as we did for the previous challenge.
The results must match the baseline. There shouldn’t be any memory leaks (check with gperftools). To compare against the baseline, run:
$ ../online2bin/online2-wav-nnet3-latgen-faster --word-symbol-table=graph/words.txt --config=conf/model.conf am/final.mdl graph/HCLG.fst ark:test.utt2spk scp:test.scp ark,t:output.txt
$ diff output-baseline.txt output.txt
If you didn’t make any mistakes in your modifications, the files output-baseline.txt and output.txt should be identical and diff won’t print anything. If they are different, then there is a bug.
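The comparison step can be scripted so that it reports an explicit verdict instead of relying on empty diff output. This is a minimal sketch: the function name check_against_baseline is hypothetical, and it only wraps the standard cmp and diff tools.

```shell
# Hypothetical helper: compare a new output file against the saved baseline.
# Returns 0 when the files are byte-identical, 1 otherwise.
check_against_baseline() {
    baseline="$1"
    candidate="$2"
    if cmp -s "$baseline" "$candidate"; then
        echo "OK: $candidate matches $baseline"
        return 0
    fi
    echo "MISMATCH: $candidate differs from $baseline"
    # Show only the beginning of the diff to keep the report short.
    diff "$baseline" "$candidate" | head -n 20
    return 1
}
```

After a benchmark run, check_against_baseline output-baseline.txt output.txt prints OK or the first differing lines, and its exit status can be used to stop a longer script as soon as a functional regression appears.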
We will not use submissions for any commercial purposes. However, a good and maintainable solution can be merged back into the Kaldi source tree.
The baseline we will be measuring against is Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz, 6MB L3-cache + 8GB RAM + 64-bit Ubuntu 20.04 + Clang C++ compiler version 12.0.
We conduct performance challenges via Denis’ mailing list, so it’s a good idea to subscribe (if you haven’t already) to get updates about the challenge and submit your solution. Send your patch(es) via email both to Ivica and Denis. The general rules and guidelines for submissions are described here. We also ask you to provide a textual description of all the transformations you have made. It will be much easier for us to analyze your submission.
We are collecting submissions until 8th August 2021.
If you know someone who might be interested in participating in this challenge, please spread the word about it. Good luck and have fun!
We are open to your comments and suggestions. Especially if you have a proposal of a benchmark for the next edition of the challenge, please let us know. Finding a good benchmark isn’t easy.