How to Build a Speech Recognition System for the Sinhala language with Kaldi (Part 2)

Hirunika Karunathilaka
Apr 5, 2020

Hello! This tutorial series is a walkthrough of how to develop a speech recognition system for the Sinhala language. I will try to address every issue I came across and include the references I followed.

This is the second tutorial of the series, and I hope you have followed the first one.

In this tutorial, we will prepare the language and acoustic data for the folders that we created in the first tutorial.

Special thanks go to the University of Colombo School of Computing (UCSC) LTRL for providing the speech corpus, language resources, and computational resources.

Prepare Language Data

Inside the local folder, create a corpus.txt file.

corpus.txt: This file contains every single utterance transcription that can occur in the ASR system. Ideally, it would include every possible utterance one can speak in Sinhala, but that is really, really hard :-P. In practice, it contains the transcriptions of the utterances in your speech corpus.

Part of corpus.txt
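For illustration, a few hypothetical lines in this format (one utterance transcription per line, in Sinhala script; these example sentences are mine, not lines from the actual corpus):

ආයුබෝවන්
මම ගෙදර යනවා
අද කාලගුණය හොඳයි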

Inside the local folder, create another subfolder named dict, where we will store the dictionary information of our Sinhala speech data.

cd SinhalaASR/data/local
mkdir dict

Inside the dict folder, we’ll create the following .txt files:

  1. nonsilence_phones.txt
  2. silence_phones.txt
  3. lexicon.txt

Other than the above, you can create extra_questions.txt and optional_silence.txt files if you want.
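For reference, in standard Kaldi recipes optional_silence.txt usually contains just a single line naming the optional silence phone:

SIL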

I suggest going through the Kaldi for Dummies tutorial (https://kaldi-asr.org/doc/kaldi_for_dummies.html) to get a thorough understanding of what these files are for and why they are needed.

nonsilence_phones.txt: This file lists the phones of the Sinhala language, one per line.

Pattern: <phone>

Part of nonsilence_phones.txt
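The exact phone set depends on the lexicon you use; as a hypothetical sketch with romanized Sinhala phone symbols, the file could look like this:

a
aa
k
g
m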

silence_phones.txt: This file lists the silence phones. Usually it includes just one phone, SIL.

Pattern: <phone>

silence_phones.txt
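In the simplest case, the whole file is this single line:

SIL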

lexicon.txt: This file contains every word from your dictionary with its phone transcription. It also includes entries for the SIL (silence) and UNK (unknown) phones.

Pattern: <word> <phone 1> <phone 2> …

Part of the lexicon.txt
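A minimal sketch of the format (the first two entries follow the common Kaldi convention for silence and unknown words; the Sinhala words and their romanized phone symbols are hypothetical examples):

!SIL SIL
<UNK> SIL
අම්මා a m m aa
ගෙදර g e d a r a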

Now that we have finished preparing the required language resources, we have to prepare our acoustic data.

Prepare Acoustic Data

So, earlier you created the train, dev and test folders inside the data folder. Inside each of those three folders, create the following files:

  1. wav.scp
  2. text
  3. utt2spk
  4. spk2gender

wav.scp: This file connects every utterance (a sentence said by one person during a particular recording session) with the audio file related to that utterance.

Pattern: <utteranceID> <full_path_to_audio_file>

Part of wav.scp file for the train set
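A sketch of what a couple of lines might look like, with hypothetical utterance IDs and audio file paths (yours will depend on where you store the corpus):

F070_001 /home/user/SinhalaASR/audio/train/F070_001.wav
F070_002 /home/user/SinhalaASR/audio/train/F070_002.wav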

text: This file contains every utterance matched with its text transcription.

Pattern: <utteranceID> <text_transcription>

Part of text file for the train set
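A hypothetical sketch (the transcriptions here are example sentences, not lines from the actual corpus):

F070_001 ආයුබෝවන්
F070_002 මම ගෙදර යනවා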

utt2spk: This file tells the ASR system which utterance belongs to which speaker.

Pattern: <utteranceID> <speakerID>

Part of utt2spk file for the train set (*Utterance with id F070_001 has the speaker ID F070)
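For example, a few hypothetical lines following that ID scheme, where utterance F070_001 belongs to speaker F070:

F070_001 F070
F070_002 F070
F071_001 F071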

spk2gender: This file tells the ASR system each speaker’s gender.

Pattern: <speakerID> <gender>

Part of spk2gender file for the train set (*speaker with id F146 is a female)
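A hypothetical sketch, assuming the usual Kaldi f/m labels:

F070 f
F146 f
M023 m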

So, as above, you have to create a total of 12 files across the train, dev and test folders. Usually, the speech/acoustic data is divided 80%/10%/10% between the train, dev and test datasets respectively.
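One thing to keep in mind: Kaldi expects all of these files to be sorted, and small inconsistencies between them are a very common source of errors later. Assuming you have the standard utils/ directory from a Kaldi recipe (for example egs/wsj/s5/utils) linked into your project folder, you can sanity-check each data folder like this:

utils/fix_data_dir.sh data/train
utils/validate_data_dir.sh --no-feats data/train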

Good job :-D. That’s all for this tutorial, folks. In the third tutorial, we’ll create our first run script and build a simple ASR system.

References

Kaldi for Dummies tutorial: https://kaldi-asr.org/doc/kaldi_for_dummies.html