The goal of building such a limited domain synthesizer is not just to show off good synthesis: we followed this route because we see it as a very practical method for building speech output systems.
For practical reasons, the default configuration includes the possibility of
a back-up voice that will be called to do synthesis if the limited
domain synthesizer fails, which for this default setup means the phrase
includes an out-of-vocabulary word. It would perhaps be more useful if
the fall-back position only required synthesis of the out-of-vocabulary word
itself rather than the whole phrase, but that is not as trivial as it
might seem. The limited domain synthesizer does no prosodic modification of
the selected units, apart from pitch smoothing at the joins, so slotting in a
single diphone-synthesized word would sound very bad. At present each limited domain
synthesizer has an explicitly defined
closest_voice. This voice
is used when limited domain synthesis fails, and also when generating
the prompts, which can be looked upon as the absolute minimal case, where the
synthesizer has no data to synthesize from.
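In a voice's Scheme definition file this can be set with an ordinary variable assignment; a minimal sketch, where the INST_ldom prefix and the choice of voice_kal_diphone as the fall-back voice are illustrative assumptions rather than names from any particular build:

```scheme
;; Illustrative fragment of a limited domain voice definition file.
;; INST_ldom is a placeholder voice name; voice_kal_diphone stands in
;; for whatever diphone voice is closest to the recorded speaker.
(set! INST_ldom::closest_voice 'voice_kal_diphone)
```

When selection fails for a phrase, the whole phrase is handed to this voice; it is also the voice used to speak the prompts during recording.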
There are also issues of speed here, which we are still trying to improve. This technique should in fact be fast, but it is still slower than our diphone synthesizer. One significant reason is the cost of finding the optimal join point in the selected units. This synthesis technique also requires more memory than diphones, as the cepstral parameters for the whole database are needed at run time, in addition to the full waveforms. We feel these issues can and should be addressed: these techniques are not fundamentally computationally expensive, so we intend to work on these aspects in later releases.
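To make the memory cost concrete, here is a back-of-envelope sketch in Scheme; all of the figures (16 kHz 16-bit audio, one order-12 cepstral vector plus energy every 5 ms, stored as 4-byte floats) are illustrative assumptions, not measurements of any particular database:

```scheme
;; Approximate resident size (in megabytes) of a limited domain
;; database: the waveforms plus the cepstral parameters kept at run
;; time for join costs.  All constants here are assumed, not measured.
(define (ldom-db-memory-mb secs)
  (let ((wave-bytes (* secs 16000 2))       ; 16 kHz, 16-bit samples
        (cep-bytes (* secs 200 13 4)))      ; 200 frames/s, 13 coeffs, floats
    (/ (+ wave-bytes cep-bytes) 1048576.0)))

;; Under these assumptions the cepstra add roughly a third on top of
;; the waveforms themselves, e.g. for 20 minutes of recordings:
(ldom-db-memory-mb (* 20 60))
```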