This method, inspired by the work of Keiichi Tokuda and NITECH's HMM Speech Synthesis Toolkit, builds statistical parametric synthesizers from databases of natural speech. Although the result is still not as crisp as a well-done unit selection voice, with this method it is much easier to get a clear synthetic voice that models the original speaker well.
Although this method is partially "tagged on to" the clunits method, it is actually quite independent. The tasks are as follows.
Read and understand all the issues regarding the following steps
Set up the directory structure
Record or import the prompts and prompt list
Label the data with the HMM-state sized segments
Build utterance structures for recorded utterances
Extract F0, voicing and mcep coefficients.
Build a CLUSTERGEN voice
Build an HMM-state duration model
If you already have an existing voice set up, running setup_cg will only copy in the necessary files for clustergen; however, I recommend starting from scratch, as I don't know when you created your previous voice and I'm not sure of its exact state.
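If you are starting fresh, first create and enter a new voice directory (cmu_us_awb_arctic is just the example name used throughout):

mkdir cmu_us_awb_arctic
cd cmu_us_awb_arctic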
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic
Now you need to get your waveform files and prompt file. Put your waveform files in wav/ and your prompt file in etc/txt.done.data. Note you should probably use bin/get_wavs to copy the waveform files so that they get power normalized and converted to a reasonable format (16kHz, 16bit, RIFF format), for example:
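./bin/get_wavs recording/*.wav

(Here recording/ is just an illustrative name for wherever your original recordings live.)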
If you are going to record the prompts yourself, you can use the standard do_build script that setup_cg installs (I believe the relevant target is build_prompts_waves; check bin/do_build if your version differs)

./bin/do_build build_prompts_waves

first to generate example waveforms, then use

./bin/prompt_them etc/txt.done.data 1

to prompt you and record the prompts. You must check that the recording actually works. It should generate recordings in wav/. You can use $ESTDIR/bin/na_play to play the waveform files. prompt_them can be stopped with ctrl-c and restarted at the line number given as the second argument.

If you already have recordings, you still need the prompt utterances; running

./bin/do_build build_prompts

will generate the prompt utterances (which are used to find the expected phones), but not the prompt waveforms.
The next stage is to label the data. Unless you are very knowledgeable about labeling in clustergen, you should use the EHMM labeler. EHMM constructs the labels in the right format for segments and HMM states, and matches them properly with what the synthesizer generates for the prompts. Using other labels is likely to cause more problems. Even if you already have other labels, use EHMM first.
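Labeling is again run through do_build (label is my assumption for the target name; check your script):

./bin/do_build label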
The EHMM labeler has been shown to be very reliable, and deals nicely with silence insertion. It isn't very fast, though, and will take several hours. You can check the file ehmm/mod/log100.txt to see the Baum-Welch iterations; there will probably be 20-30 of them. The ARCTIC a-set takes about 3-4 hours to label.
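For example, you can follow the labeling progress with

tail -f ehmm/mod/log100.txt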
Parametric synthesis requires a reversible parameterization; the setup here uses a form of mel cepstrum, the same version that is used by NITECH's basic HTS build. The parameter build is in two parts: building the F0, and building the mceps themselves. These are then combined into a single parameter file for each utterance in the database.
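These steps are driven by the do_clustergen script that setup_cg installs; the target names below are my reading of that script, so check yours if they differ:

./bin/do_clustergen f0
./bin/do_clustergen mcep
./bin/do_clustergen combine_coeffs_v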
The mcep part takes the longest. Note that the F0 part now tries to estimate the F0 range of the speaker and modifies the parameters of the F0 extraction program accordingly. (The F0 params are saved in etc/f0.params.)
If you want to have a test set of utterances, you can separate out some of your prompt list. The test set should be put in the file etc/txt.done.data.test. The following commands will make a training and test set (every 10th prompt in the test set, the other 9 in the training set).
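Assuming the standard festvox traintest script, which writes etc/txt.done.data.train and etc/txt.done.data.test:

./bin/traintest etc/txt.done.data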
cat etc/txt.done.data.train >etc/txt.done.data
The next stage is to build the parametric model. Three parts are required for this. The first is very quick and simply puts the state (and phone) names into their respective files; it assumes a file etc/statenames, which is generated by EHMM. The second stage builds the parametric models themselves. The last builds a duration model for the state names.
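Again assuming the standard do_clustergen targets (generate_statenames, cluster and dur are my reading of the script):

./bin/do_clustergen generate_statenames
./bin/do_clustergen cluster
./bin/do_clustergen dur

You can then load and test the voice in Festival; with the default naming from setup_cg this should look something like

festival festvox/cmu_us_awb_arctic_cg.scm
festival> (voice_cmu_us_awb_arctic_cg)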
festival> (SayText "This is a little example.")
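To package the voice for distribution, do_clustergen should provide a festvox_dist target (check your script if the name differs):

./bin/do_clustergen festvox_dist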
This will generate festvox_cmu_us_awb_arctic_cg.tar.gz, which will be quite small compared to a clunit voice made from the same database. Because only the parameters are kept (in fact only means and standard deviations of clusters of parameters), which do not include residual or excitation information, the result is orders of magnitude smaller than a full unit selection voice.
There are two other options in the clustergen voice build. These involve modeling trajectories rather than individual vectors. They give objectively better results (though only marginally better subjective results for the voices we have tested). The target names below are again taken from the standard do_clustergen script. Instead of the line

./bin/do_clustergen cluster

you can run

./bin/do_clustergen trajectory

or the slightly better

./bin/do_clustergen trajectory_ola

These two options may be run after the simple version of the voice.
If you built a test set, you can resynthesize it with

$FESTVOXDIR/src/clustergen/cg_test resynth cgp

NOTE: this no longer works automatically, as you need static mceps and ccoefs for it to work. It will create parameter files (and waveform files) in test/cgp. The output of cg_test also includes four measures: the mean difference for all features in the parameter vector, for F0 alone, for all but F0, and MCD (mel cepstral distortion).