In this chapter we work through a full example of creating a voice, given that most of the basic construction work (model building) has been done. In particular this discusses the scheme files, the conventions for keeping a voice together, and how you can go about packaging it for general use.
Ultimately a voice in Festival will consist of a diphone database, a lexicon (and lts rules) and a number of scheme files that offer the complete voice. When people other than the developer of a voice wish to use your newly developed voice, it is only that small set of files that is required and needs to be distributed (freely or otherwise). By convention we have distributed diphone group files (a single file holding the index and the diphone data itself) and a set of scheme files that describe the voice (and its necessary models).
Basic skeleton files are included in the festvox distribution. If you are unsure how to go about building the basic files it is recommended you follow this schema and modify these to your particular needs.
By convention a voice name consists of three parts. First is an institution name (like cmu, cstr, etc.); if you don't have an institution just use net. Second, you need to identify the language; there is an ISO two-letter standard for this, but it fails to distinguish dialects (such as US and UK English) so it need not be strictly followed. However, a short identifier for the language is probably preferred. Third, you identify the speaker; we have typically used the three-letter initials of the speaker, but any name is reasonable. If you are going to build a US or UK English voice you should look at section 5.10 US/UK English Walkthrough.
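As an illustration of the naming convention, the following shell sketch assembles a voice name from its three parts. The variable names (INST, LANG_ID, SPEAKER, VOICE) are chosen here for illustration and are not used by the festvox scripts; the `_diphone' suffix follows the directory name used later in this chapter.

```shell
# Hypothetical variable names, for illustration only.
INST=cmu        # institution; use "net" if you have none
LANG_ID=ja      # short language identifier
SPEAKER=awb     # speaker identifier, typically initials

# Diphone voices in this chapter are named institution_language_speaker_diphone.
VOICE="${INST}_${LANG_ID}_${SPEAKER}_diphone"
echo "$VOICE"
```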
The basic processes you will need to address
As with all parts of `festvox', you must set the following environment variables to where you have installed versions of the Edinburgh Speech Tools and the festvox distribution
export ESTDIR=/home/awb/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
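Before continuing it can save time to confirm that both variables point at real directories. The following is a minimal sketch; the helper name check_dir_var is made up here and is not part of the festvox distribution.

```shell
# check_dir_var NAME VALUE: succeed only if VALUE is a non-empty path to a directory.
# (Hypothetical helper, not part of the festvox distribution.)
check_dir_var() {
    if [ -z "$2" ]; then
        echo "$1 is not set" >&2
        return 1
    fi
    if [ ! -d "$2" ]; then
        echo "$1 does not point to a directory: $2" >&2
        return 1
    fi
    return 0
}

check_dir_var ESTDIR "${ESTDIR:-}" || echo "set ESTDIR before continuing"
check_dir_var FESTVOXDIR "${FESTVOXDIR:-}" || echo "set FESTVOXDIR before continuing"
```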
In this example we will build a Japanese voice based on awb (a gaijin). First create a directory to hold the voice.
mkdir ~/data/cmu_ja_awb_diphone
cd ~/data/cmu_ja_awb_diphone
You will need in the region of 500M of space to build a voice. Actually for Japanese it's probably considerably less, but you must be aware that voice building does require disk space.
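A quick way to see whether the current filesystem has that much room is sketched below. It assumes a df that supports the POSIX -P and -k options; the helper name free_kb and the variable NEEDED_KB are made up for illustration.

```shell
# free_kb DIR: kilobytes available on the filesystem holding DIR.
# Uses the POSIX output format of df (-P) so the columns are predictable.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

NEEDED_KB=512000   # roughly 500M, the figure quoted above
if [ "$(free_kb .)" -lt "$NEEDED_KB" ]; then
    echo "warning: less than about 500M free in $(pwd)"
fi
```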
Construct the basic directory structure and skeleton files with the command
$FESTVOXDIR/src/diphones/setup_diphone cmu ja awb
The three arguments are institution, language and speaker name.
The next stage is to define the phoneset in `festvox/cmu_ja_phones.scm'. In many cases the phoneset for a language has already been defined, and it is wise to follow convention when it exists. Note that the default phonetic features in the skeleton file may need to be modified for other languages. For Japanese, there are standards, and here we use a set similar to the ATR phoneset used by many in Japan for speech processing. (This file is included, but not automatically installed, in `$FESTVOXDIR/src/vox_diphone/japanese'.)
Now you must write the code that generates the diphone schema file.
You can look at the examples in `festvox/src/diphones/*_schema.scm'.
This stage is actually the first difficult part, and getting this right can be tricky. Finding all possible phone-phone pairs in a language isn't as easy as it seems (especially as many possible ones don't actually exist). The file `bin/ja_schema.scm' is created providing the function diphone-gen-list, which returns a list of nonsense words, each of which is a list holding a list of diphones and a list of the phones in the nonsense word. For example
festival> (diphone-gen-list)
((("k-a" "a-k") (pau t a k a k a pau))
 (("g-a" "a-g") (pau t a g a g a pau))
 (("h-a" "a-h") (pau t a h a h a pau))
 (("p-a" "a-p") (pau t a p a p a pau))
 (("b-a" "a-b") (pau t a b a b a pau))
 (("m-a" "a-m") (pau t a m a m a pau))
 (("n-a" "a-n") (pau t a n a n a pau))
 ...)
In addition to generating the diphone schema, `ja_schema.scm' should also provide the functions Diphone_Prompt_Setup, which is called before generating the prompts, and Diphone_Prompt_Word, which is called before waveform synthesis of each nonsense word.
Diphone_Prompt_Setup should be used to select a speaker to generate the prompts. Note that even though you may not use the prompts when recording, they are necessary for labelling the spoken speech, so you still need to generate them. If you already have a synthesizer in the language, use it to generate the prompts (assuming you can get it to generate from phone lists and also generate label files). Often the MBROLA project already has a waveform synthesizer for the language, so you can use that. In this case we are going to use a US English voice (kal_diphone) to generate the prompts. For Japanese that's probably ok as the Japanese phoneset is (mostly) a subset of the English phoneset, though using the generated prompts to prompt the user is probably not a good idea.
The second function, Diphone_Prompt_Word, is used to map the Japanese phone set to the US English phone set so that waveform synthesis will work. In this case a simple map of each Japanese phone to one or more English phones is given, and the code simply changes the phone name in the segment relation (and adds a new segment in the multi-phone case).
Now we can generate the diphone schema list.
festival -b bin/diphlist.scm bin/ja_schema.scm \
    "(diphone-gen-schema \"awb\" \"etc/awbdiph.list\")"
It is worth checking `etc/awbdiph.list' by hand so you are sure it contains all the diphones you wish to use.
The diphone schema file, in this case `etc/awbdiph.list', is a fundamentally key file for almost all the following scripts. Even if you generate the diphone list by some method other than that described above, you should generate a schema list in exactly this format so that everything else will work; modifying the other scripts for some other format is almost certainly a waste of your time.
The schema file has the following format
( awb_0001 ("k-a" "a-k") (pau t a k a k a pau) )
( awb_0002 ("g-a" "a-g") (pau t a g a g a pau) )
( awb_0003 ("h-a" "a-h") (pau t a h a h a pau) )
( awb_0004 ("p-a" "a-p") (pau t a p a p a pau) )
( awb_0005 ("b-a" "a-b") (pau t a b a b a pau) )
( awb_0006 ("m-a" "a-m") (pau t a m a m a pau) )
( awb_0007 ("n-a" "a-n") (pau t a n a n a pau) )
( awb_0008 ("r-a" "a-r") (pau t a r a r a pau) )
( awb_0009 ("t-a" "a-t") (pau t a t a t a pau) )
...
In this case it has 297 nonsense words.
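Part of the hand check of the schema file can be scripted. The sketch below counts the nonsense-word identifiers (such as awb_0001) and lists any quoted diphone name that appears more than once; a repeat is not necessarily an error but is worth a look. The function names are made up for illustration, and grep -o is assumed to be available (it is in GNU and BSD grep, though not required by POSIX).

```shell
# count_prompts FILE: number of nonsense-word identifiers like awb_0001.
count_prompts() {
    grep -o '[a-z][a-z]*_[0-9][0-9][0-9][0-9]' "$1" | wc -l | tr -d ' '
}

# repeated_diphones FILE: quoted diphone names such as "k-a" that occur more than once.
repeated_diphones() {
    grep -o '"[^" ]*-[^" ]*"' "$1" | sort | uniq -d
}

# Only run against the real file when it is present.
if [ -f etc/awbdiph.list ]; then
    echo "nonsense words: $(count_prompts etc/awbdiph.list)"
    repeated_diphones etc/awbdiph.list
fi
```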
Next we can generate the prompts and their label files. To synthesize the prompts, use the following command
festival -b bin/diphlist.scm bin/ja_schema.scm \
    "(diphone-gen-waves \"prompt-wav\" \"prompt-lab\" \"etc/awbdiph.list\")"
Occasionally when you are building the prompts some diphones requested in the prompt voice don't actually exist (especially when you are doing cross-language prompting). In that case the generated prompt has some default diphone added (typically silence-silence). This is mostly ok, as long as it's not happening multiple times in the same nonsense word. The speaker should just be aware that some prompts aren't actually correct (which of course is going to be true for all prompts in the cross-language prompting case).
The next stage is to record the prompts. See section 11.3 Recording under Unix for details on how to do this under Unix (and in fact other techniques too). This can be done with the command
Depending on whether you want the prompts actually to be played or not, you can edit `bin/prompt_them' to comment out the playing of the prompts.
Note that an additional argument can be given to state which nonsense word to begin prompting from. Thus if you have already recorded the first 100 you can continue with
bin/prompt_them etc/awbdiph.list 101
The recorded prompts can then be labelled by
And the diphone index may be built by
bin/make_diph_index etc/awbdiph.list dic/awbdiph.est
If no EGG signal has been collected you can extract the pitchmarks by
Then build the pitch synchronous LPC coefficients
This should get you to the stage where you can test the basic waveform synthesizer. There is still much to do but initial tests (and correction of labelling errors etc) can start now. Start festival as
festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
and then enter a string of phones
festival> (SayPhones '(pau k o N n i ch i w a pau))
In addition to the waveform generation part you must also provide text analysis for your language. Here, for the sake of simplicity, we assume that the Japanese is provided in romanized form with spaces between each word. This is of course not the case for normal Japanese (and we are working on a proper Japanese front end), but for the present it shows the general idea. Thus we edit `festvox/cmu_ja_token.scm' and add (simple) support for numbers.
As the relationship between romaji (romanized Japanese) and phones is almost trivial, we write by hand a set of letter-to-sound rules that expand words into their phones. These are added to `festvox/cmu_ja_lex.scm'.
For the time being we just use the default intonation model, though simple rule-driven improvements are possible. See `festvox/cmu_ja_awb_int.scm'. For duration, we add a mean value for each phone in the phoneset to `festvox/cmu_ja_awb_dur.scm'.
These three Japanese-specific files are included in the distribution in `festvox/src/vox_diphone/japanese/'.
Now we have a basic synthesizer; although there is still much to do, we can now type (romanized) text to it.
festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
...
festival> (SayText "boku wa gaijin da yo.")
The next part is to test and improve these various initial subsystems (lexicons, text analysis, prosody) and to correct waveform synthesis problems. This is an endless task, but you should spend significantly more time on it than we have done for this example.
Once you are happy with the completed voice you can package it for distribution. The first stage is to generate a group file for the diphone database. This extracts the subparts of the nonsense words and puts them into a single file offering something smaller and quicker to access. The groupfile can be built as follows.
festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
...
festival> (us_make_group_file "group/awblpc.group" nil)
...
Note that the us_ in the function names stands for UniSyn (the unit concatenation subsystem in Festival) and has nothing to do with US English.
To test this edit `festvox/cmu_ja_awb_diphone.scm' and change the choice of databases used from separate to grouped. This is done by commenting out the line (around line 81)
(set! cmu_ja_awb_db_name (us_diphone_init cmu_ja_awb_lpc_sep))
and uncommenting the line (around line 84)
(set! cmu_ja_awb_db_name (us_diphone_init cmu_ja_awb_lpc_group))
The next stage is to integrate this new voice so that festival may find it automatically. To do this you should add a symbolic link from Festival's voice directory for the language (here Japanese) to the directory containing the new voice. First cd to festival's voice directory (this will vary depending on where your version of festival is installed)
creating the language directory if it does not already exist. Add a symbolic link back to where your voice was built
ln -s /home/awb/data/cmu_ja_awb_diphone
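The directory creation and linking steps can be combined as in the following sketch. The function name link_voice is made up for illustration, and the paths in the example call are the ones used in this chapter's example, so adjust them to your own installation.

```shell
# link_voice VOICES_DIR LANGUAGE VOICE_BUILD_DIR
# Create VOICES_DIR/LANGUAGE if needed, then link the built voice into it
# under the same name as the build directory.
# (Hypothetical helper, not part of festival or festvox.)
link_voice() {
    mkdir -p "$1/$2" || return 1
    ln -s "$3" "$1/$2/$(basename "$3")"
}

# Example call, left commented out because the paths are installation specific:
# link_voice /home/awb/projects/1.4.1/festival/lib/voices japanese \
#     /home/awb/data/cmu_ja_awb_diphone
```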
Now this new voice will be available to anyone running that version of festival, started from any directory, without the need for any explicit arguments
festival
...
festival> (voice_cmu_ja_awb_diphone)
...
festival> (SayText "ohayo gozaimasu.")
...
The final stage is to generate a distribution file so the voice may be installed on others' festival installations. Before you do this you must add a file `COPYING' to the directory you built the diphone database in. This should state the terms and conditions under which people may use, distribute and modify the voice.
Generate the distribution tarfile in the directory above the festival installation (the one where the `festival/' and `speech_tools/' directories are).
cd /home/awb/projects/1.4.1/
tar zcvf festvox_cmu_ja_awb_lpc.tar.gz \
    festival/lib/voices/japanese/cmu_ja_awb_diphone/festvox/*.scm \
    festival/lib/voices/japanese/cmu_ja_awb_diphone/COPYING \
    festival/lib/voices/japanese/cmu_ja_awb_diphone/group/awblpc.group
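Since tar will fail or silently produce an incomplete archive if one of these files is absent, it may help to check for them first. The following is a sketch only; the function name check_dist_files is made up here, and it simply looks for the three kinds of file named in the tar command above.

```shell
# check_dist_files VOICE_DIR: report anything the tar command expects but cannot find.
# Returns 0 when COPYING, at least one group file and at least one scheme file exist.
# (Hypothetical helper, not part of festvox.)
check_dist_files() {
    missing=0
    [ -f "$1/COPYING" ] || { echo "missing: $1/COPYING" >&2; missing=1; }
    ls "$1"/group/*.group >/dev/null 2>&1 || { echo "missing: group file in $1/group" >&2; missing=1; }
    ls "$1"/festvox/*.scm >/dev/null 2>&1 || { echo "missing: scheme files in $1/festvox" >&2; missing=1; }
    return "$missing"
}

# Example call, left commented out because the path is installation specific:
# check_dist_files festival/lib/voices/japanese/cmu_ja_awb_diphone
```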
The completed files from building this crude Japanese example are available at http://www.festvox.org/examples/cmu_ja_awb_diphone/.