Languages of the Indian subcontinent have millions of speakers, but are relatively poor in resources and data. Linguistic phenomena that occur in Indic languages include schwa deletion in Indo-Aryan languages, voicing rules in Tamil, and characteristic stress patterns. We have a common Indic front end for building voices in these languages, with support for many of these phenomena.
Currently, we have explicit support for Hindi, Bengali, Kannada, Tamil, Telugu, and Gujarati. Rajasthani and Assamese are also included; they reuse the rules for Hindi and Bengali respectively, although this may not be completely accurate.
This chapter describes how to build an Indic voice for the languages that are supported explicitly in Festvox and also how to add support for a new Indic language. We will use Hindi as the example language for this tutorial.
Part 1: Building voices that have explicit support (Hindi, Bengali, Kannada, Tamil, Telugu, Gujarati, Rajasthani, Assamese)
$FESTVOXDIR/src/clustergen/setup_cg cmu indic ss
This will set up a base Indic voice. The Indic-voice-specific file is festvox/indic_lexicon.scm in the voice directory. The default language in the Indic lexicon is Hindi:
(defvar lex:language 'Hindi)
You should change this to the language that you are working with so that the right language-specific rules are used.
We assume that you have already prepared a list of utterances and recorded them. See the Chapter called Corpus development for instructions on designing a corpus. We also assume that you have a prompt file in the txt.done.data format (with your Indic prompts encoded as unicode).
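For reference, a txt.done.data file is a Scheme-style list with one utterance per line: an utterance identifier followed by the prompt text in double quotes. A minimal Hindi sketch (the identifiers and sentences here are made up for illustration):

```
( indic_0001 "यह एक उदाहरण वाक्य है" )
( indic_0002 "नमस्ते आप कैसे हैं" )
```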
Since the recordings might not be as good as they could be, you can power normalize them.
cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/
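A minimal sketch of the power normalization step, assuming the stock clustergen get_wavs script is present in the voice's bin/ directory (it normalizes the recordings and writes the results into wav/; verify the script name against your Festvox installation):

```shell
./bin/get_wavs recording/*.wav
```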
Also, synthesis builds (especially labeling) work best if there is only a limited amount of leading and trailing silence, so the next preprocessing step prunes it.
The pruning uses F0 extraction to estimate where the speech starts and ends. This typically works well, but you should listen to the results to ensure it does the right thing. The original unpruned files are saved in unpruned/. A third preprocessing option is to shorten intrasentence silences; this is only desirable for non-ideal recorded data, but it can help labeling.
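Assuming the standard clustergen helper scripts, these two pruning steps might look like the following (prune_silence trims leading and trailing silence, prune_middle_silences shortens intrasentence pauses; check these script names against your installation):

```shell
./bin/prune_silence wav/*.wav
./bin/prune_middle_silences wav/*.wav
```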
Note that if you do not require these three stages, you can put your wavefiles directly into wav/.
Then build the prompts, label the data, and build the utterance structures
./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
Build the models
./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train
Then generate some test examples: the first command below gives MCD and F0D objective measures, and the second generates standard TTS output.
./bin/do_clustergen cg_test resynth cgp etc/txt.done.data.test
./bin/do_clustergen cg_test tts tts etc/txt.done.data.test
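As background, the MCD (mel cepstral distortion) figure reported by the resynthesis test is conventionally computed per frame from the reference and synthesized mel cepstral coefficients, roughly as follows (the exact coefficient range used can vary between setups):

```latex
\mathrm{MCD} = \frac{10}{\ln 10}\,
  \sqrt{2 \sum_{d=1}^{D} \left( mc_d^{\mathrm{ref}} - mc_d^{\mathrm{syn}} \right)^2 }
```

Lower MCD values indicate synthesized speech closer to the held-out natural recordings.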