It is very easy to build a voice, get it to say a few phrases, and think that the job is done. As you build the voice it is worth testing each part as you build it to ensure it basically performs as expected. But once it is all together, more general tests are needed. Before you submit it to any formal tests that you will use for benchmarking and grading progress in the voice, more basic tests should be carried out.
In fact it is worth stating such initial tests more concretely. Every voice we have ever built has had a number of mistakes in it that could be trivially fixed, such as the MFCCs not being regenerated after the pitchmarks were fixed. Therefore you should go through each stage of the build procedure and ensure it really did do what you thought it should do, especially if you are totally convinced that section worked perfectly.
Try to find around 100-500 sentences to play through it. It is amazing
how many general problems are thrown up when you extend your test
set. The next stage is to play some real
text. That may be news text from the web, output from your speech
translation system, or some email. Initially it is worth just
synthesizing the whole set without even listening to it. Problems in
analysis, missing diphones etc. may show up just in the
processing of the text. Then you want to listen to the output and
identify problems. This may take some amount of investigation. What
you want to do is identify where the problem is:
is it bad text analysis, a bad lexical entry, a prosody problem, or a
waveform synthesis problem? You may need to synthesize parts of the
text in isolation (e.g. using the Festival function
SayText) and look at the structure of the utterance
generated (e.g. using the function
utt.features). For example, to see what words have
been identified from the text analysis:
(utt.features utt1 'Word '(name))
Or to see the phones generated:
(utt.features utt1 'Segment '(name))
Thus you can view selected parts of an utterance and find out if it is being created as you intended. For some things a graphical display of the utterance may help.
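The advice above to synthesize the whole test set without listening to it could be sketched in Festival Scheme roughly as follows. This is a minimal sketch, not part of the standard build scripts: the function name test_synth_all and the file name etc/test.data are hypothetical, and it assumes the test sentences are stored one per entry in the form ( name "text" ). It only catches utterances where synthesis signals an error; warnings (for example about missing diphones) would still have to be read from Festival's own output.

```scheme
;; Sketch only: synthesize every prompt without playing it and
;; report the ones that raise an error.
;; Assumes PROMPTFILE holds entries of the form ( name "text" ).
(define (test_synth_all promptfile)
  (let ((failed nil))
    (mapcar
     (lambda (p)
       (unwind-protect
        ;; utt.synth builds the utterance but does not play it
        (utt.synth (eval (list 'Utterance 'Text (cadr p))))
        ;; reached only if synthesis signalled an error
        (begin
          (format t "failed: %s\n" (car p))
          (set! failed (cons (car p) failed)))))
     ;; second argument t: read the entries without evaluating them
     (load promptfile t))
    (reverse failed)))

;; Hypothetical usage: the file name is an assumption
(test_synth_all "etc/test.data")
```

The returned list names the prompts that failed, which can then be synthesized in isolation with SayText and inspected with utt.features.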
Once you identify where the problem is you need to decide how to fix it (or if it is worth fixing). The problem may be in one of a number of different places:
Phonetic error: the acoustics of a unit don't match the label. This may be because the speaker said the wrong word/phoneme or the labeller got the label wrong. Or possibly it is some other acoustic variant that has not been considered.
Lexical error: the word is pronounced with the wrong string of phonemes/stress/tone. Either the lexical entry is wrong or the letter to sound rules are not doing the right thing. Or there are multiple valid pronunciations for that word (homographs) and the wrong one is selected because the homograph disambiguation is wrong, or there is no disambiguator.
Text error: the text analysis doesn't deal properly with the word. It may be that a punctuation symbol is spoken when it should not be (or not spoken when it should be), or that titles, symbols, compounds etc. aren't dealt with properly.
Some other error: some error that is not one of the above. As you progress in correction and tuning, errors in this category will grow and you must find some way to avoid such errors.
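As a rough illustration of how the categories above might be told apart on a short problem phrase, the following Festival commands look at the same utterance at the token, word, lexicon and segment levels. This is a sketch under assumptions: the phrase and the suspect word "record" are made up for the example, and the feature names used (name, end) are those found in standard Festival voices and may differ in a voice of your own.

```scheme
;; Sketch: narrowing down an error on a short problem phrase.
;; The phrase and the word "record" are illustrative only.
(set! utt1 (SayText "Dr. Smith will record the record."))

;; Text error?  Compare the raw tokens with the words they expanded to.
(utt.features utt1 'Token '(name))
(utt.features utt1 'Word '(name))

;; Lexical error?  Look at what the lexicon (or letter to sound rules)
;; returns for a suspect word; nil means no part of speech is given.
(lex.lookup "record" nil)

;; Phonetic error?  List the segments with their end times so the
;; suspect unit can be located in the waveform.
(utt.features utt1 'Segment '(name end))
```

If the tokens and words look right and the lexical entry is correct, the remaining candidates are the labelling of the units themselves or one of the less easily classified errors.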
Before rushing out and getting one hundred people to listen to your new synthetic voice, it is worth doing significant informal internal testing and evaluation. Remember the purpose of evaluation in this case is to find errors and fix them. We are not, at least not at this stage, evaluating the voices on an abstract scale, where unseen test data and blind testing are important.