10 Limited domain synthesis

This chapter discusses, and gives examples of, building synthesis systems for limited domains. By limited domain, we mean applications where the speech output is constrained. Such domains may still be infinite, but they are restricted to a specific vocabulary and set of phrases. With today's speech systems such limited domain applications are in fact the most common. Typical examples are telling the time and reading telephone numbers. However, from experience we can see that this technique can be extended to more general information-giving systems and dialog systems, such as reading the weather, or even the DARPA Communicator domain (a flight information dialog system).

Limited domains are discussed here because it should be easier to build unit selection synthesizers for domains with a much smaller and more controlled number of units. The second reason is that general TTS systems (e.g. diphone systems) still sound synthetic. General unit selection, when it is good, offers near human quality, but when it is bad it is usually much worse than a diphone synthesizer. Hybrid systems look interesting, but as we cannot yet automatically detect when general unit selection goes bad, it is not clear when a diphone system should be swapped in. Because unit selection offers so much promise, it is hoped that in a limited domain we can get its good quality and avoid the bad quality. Finally, although full TTS may be our ultimate goal, for many existing systems a limited domain synthesizer is in fact adequate.

There is a stage beyond limited domain, but falling short of general synthesis, where the most common phrases are synthesized best and the quality gracefully degrades as the phrases become less common. Some hybrid recorded-prompt/unit-selection/diphone systems have been proposed and should be able to deliver this, but we will not deal directly with those here.

However, one point you quickly discover is that although most speech dialog systems are very constrained in their vocabulary, many require the hardest class of words: proper names.

Continuing in the tutorial mode of this document, this chapter first gives a complete walkthrough of a talking clock. This is a small example which will probably work. Following through this example will give you a good idea of what is involved in building a limited domain synthesizer. In the following sections, problems and modifications can then be discussed with respect to this complete example.

10.1 Designing the prompts

To build a good limited domain synthesizer, it is important to understand what is going on, so that you can tailor your application to take advantage of what works well and avoid the limitations these methods impose.

Note that you may wish to change your application to take better advantage of this, by making its output forms more regular or at least using a smaller vocabulary. As a first approximation, the techniques here require that the training data contain all the words which are actually to be synthesized: if a word does not appear in the training data it cannot be synthesized. We do provide a fall-back position, using a diphone voice, but that will always sound worse than the more natural unit selection synthesis. Often this isn't much of a restriction, or you can tailor your application to avoid a large vocabulary. For example, if you are going to build a system for reading weather reports, you can make the reports not actually name the city/town they refer to, and just use phrases like "This city ...", depending on context for the user to know which city is actually being talked about.

Of course, many speech applications have limited vocabularies in all but a very few places. Proper names such as places, people, movie titles, etc. are in general completely open classes. Building a speech application around those aspects isn't easy, and may make a limited domain synthesizer impractical. But it should be noted that those open classes are also the classes on which more general synthesizers will often fail too. Some hybrid system may solve that better, but we do not deal with it here.

For almost-closed classes, recording and modifying the data may be a solution. We do not yet have enough experience to comment on this, but we feel it may be a reasonable compromise.

The most difficult part of building a limited domain synthesizer is designing the database to record that best covers what you wish the synthesizer to say. Sometimes this is fairly easy, in that you wish the synthesizer simply to read utterances in a very standard form where slots will be filled with varying values (such as dates, numbers, etc.). For example

The area code you require is NUMBER NUMBER NUMBER.

The prompts can be devised to fill in values for each of the NUMBER variables.

More complex utterances can still be viewed in this way:

The weather at TIME on DATE: outlook OUTLOOK, NUMBER degrees.

But once we move into more general dialog, it initially appears harder to find the utterances that properly cover the domain.

The first important observation to make is that in systems where limited domain synthesis is practical, the phrases to be spoken are almost certainly generated by a computer. That is, there exists an explicit function which defines the language generated by the application. In some cases this will take the form of an explicit grammar, and we can then use that grammar to generate phrases in the language and select from them utterances that adequately cover the domain. However, even when there is an explicit grammar, it usually does not explicitly encode the frequency of each generated utterance. As we wish to ensure that the most common phrases are synthesized best, we ideally need to know which utterances will be synthesized most often in order to properly select which utterances to record.

Where a system is already running with a standard synthesizer, it is possible to log what is currently being said and how often. We can then use such logs of current system usage to select which utterances should be in our set of prompts to record.

In practice you will probably be designing the system's output at the same time as the limited domain synthesizer, so some combination of estimation and guesswork about frequency and coverage will be necessary.

In general you should design your databases to have at least 2 (and probably 5) examples of each word in your vocabulary. Secondly, you should select utterances that maximize bigram coverage; that is, try to ensure as many different word-word pairings as possible over your corpus. We have used techniques based on these recommendations to greedily select utterances from larger corpora for recording.
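
As a concrete sketch of the kind of greedy selection we mean, the following illustrative Festival Scheme (not part of the festvox scripts; all the function names here are ours) repeatedly picks the candidate utterance that adds the most unseen word pairs. Candidates are represented simply as lists of words.

;; A sketch of greedy selection by bigram coverage.  Candidates are
;; plain lists of words, e.g. ("the" "time" "is" "now").
(define (bigrams words)
"Return the list of adjacent word pairs in WORDS."
  (if (or (null words) (null (cdr words)))
      nil
      (cons (list (car words) (cadr words)) (bigrams (cdr words)))))

(define (pair_member b lst)
"Non-nil if bigram B is in the list of bigrams LST."
  (cond ((null lst) nil)
        ((and (string-equal (car b) (car (car lst)))
              (string-equal (cadr b) (cadr (car lst)))) t)
        (t (pair_member b (cdr lst)))))

(define (gain utt seen)
"The number of bigrams in UTT not already in SEEN."
  (let ((c 0))
    (mapcar
     (lambda (b) (if (not (pair_member b seen)) (set! c (+ 1 c))))
     (bigrams utt))
    c))

(define (remove_item x lst)
"LST without the element X."
  (cond ((null lst) nil)
        ((eq x (car lst)) (cdr lst))
        (t (cons (car lst) (remove_item x (cdr lst))))))

(define (greedy_select cands seen n)
"Pick N utterances from CANDS, each time taking the one that adds
the most bigrams not already in SEEN."
  (if (or (null cands) (< n 1))
      nil
      (let ((best (car cands)))
        (mapcar
         (lambda (u) (if (> (gain u seen) (gain best seen)) (set! best u)))
         cands)
        (cons best
              (greedy_select (remove_item best cands)
                             (append (bigrams best) seen)
                             (- n 1))))))

Called as something like (greedy_select candidates nil 24), this returns a list of utterances to record; in practice you would also check that every vocabulary word appears at least twice in the result.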

10.2 Customizing the synthesizer front end

Once you have decided on a set of utterances that appropriately cover the domain, you also need to consider how those particular text strings are synthesized. For example, if the data contains flight numbers, dates, times, etc., you must ensure that Festival renders those properly. As we are discussing a limited domain, the distribution of token types will be different from standard text, but it is also more constrained, so simple changes to the lexicon, token-to-word rules, etc. will allow these utterances to be synthesized properly.
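
For example, pronunciation fixes can be made with Festival's lex.add.entry function. The entry below is a sketch, assuming a CMU/radio-style phone set (adjust the phones and syllabification to your voice's lexicon); it makes the token "hwy", which we will meet again below, be pronounced as "highway":

;; Add "hwy" to the lexicon addenda so the front end matches what a
;; speaker will naturally say (the phone set here is an assumption).
(lex.add.entry
 '("hwy" n (((hh ay) 1) ((w ey) 0))))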

One particular area of customization we have found worthwhile is phrasing. It seems important to explicitly mark phrasing in the prompts, and to have the speaker follow such phrasing, as it allows for much better joins in unit selection as well as more consistent prosody from the speaker. Thus, in the default code provided below, the normal phrasing module in Festival is replaced with one that treats punctuation as phrase markers.
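
One way to express this in Festival is to select the CART phrasing method and supply a tree keyed on the tokens' punctuation. The sketch below follows the pattern of Festival's simple_phrase_cart_tree; the exact punctuation sets and the tree variable name are up to you:

;; Big break after major punctuation, smaller break after commas and
;; the like, and a big break at the end of the utterance.
(set! ldom_phrase_cart_tree
      '((R:Token.parent.punc in ("?" "." ":"))
        ((BB))
        ((R:Token.parent.punc in ("'" "\"" "," ";"))
         ((B))
         ((n.name is 0)          ;; no next word: utterance-final
          ((BB))
          ((NB))))))

(Parameter.set 'Phrase_Method 'cart_tree)
(set! phrase_cart_tree ldom_phrase_cart_tree)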

10.3 Autolabelling issues

We currently autolabel the recorded prompts using the alignment technique based on malfrere97 that we discussed above for diphone labelling. Although this technique works, it is not as robust for general speech as it is for carefully articulated nonsense words. Specifically, this technique does not allow for alternative pronunciations, which are common.

Ideally we should use a more general speech recognition system. In fact, for labelling a unit selection database, the best results would come from training a recognizer using Baum-Welch on the database until convergence, which would give the most consistent labelling. We have started initial experiments using the open source CMU Sphinx recognition system and are likely to provide scripts for this in later releases.

In the meantime, the greatest problem is that the predicted phone list must be the same as (or very similar to) what was actually spoken. This can be achieved by a knowledgeable speaker, and by customizing the front end of the synthesizer so it produces more natural segments.

Two observations are worth mentioning here. First, if the synthesizer makes a mistake in pronunciation and the human speaker does not, we have found that things may work out anyway. For example, we found the word "hwy" appeared in the prompts, which the synthesizer pronounced as "H W Y" while the speaker said "highway". The forced alignment, however, caused the phones in the pronunciation of the letters "H W Y" to be aligned with the phones in "highway", and thus appropriate, though misnamed, phones were selected. However, you should not depend on such compounding errors.

The second observation is that Festival includes prediction of vowel reduction. We are beginning to feel that such prediction is unnecessary in limited domain synthesizers, or in unit selection in general: the vowel variation can be captured by the clustering techniques themselves, which allows a more reasonable back-off selection strategy than making hard decisions about vowel type early on.

10.4 Unit size and type

The basic cluster unit selection code available in Festival uses segments as the size of unit. However, the acoustic distance measure used in clustering includes significant portions of the previous segment, so the cluster unit selection effectively selects diphones from the database.

By default, the type of unit the cluster selection code uses is based on the segment name. In the case of limited domain synthesis we have found that constraining this further gives both better and faster synthesis. Thus we allow the unit type to be defined by an arbitrary feature. In the default limited domain setup we use

SEGMENT_WORD

That is, the segment plus the word the segment comes from. Note this does not mean we are doing word concatenation in our synthesizer. We are still selecting phone units, but these phones are differentiated depending on the word they come from; thus a /t/ from the word "unit" cannot be used to synthesize a /t/ in "table". The primary reason for doing this was to cut down the search, though it notably improves synthesis quality too. As we have constructed the database to have good coverage, this is a practical thing to do.

The feature function clunit_name constructs the unit type for a particular segment item. We have provided the above default (segment name plus (downcased) word name), but it is easy to extend this.
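
In Festival Scheme the default behaviour amounts to something like the following sketch (the real festvox templates also include checks that, for example, ignore silence segments):

(define (ldom::clunit_name i)
"Return the unit type for segment item i: the segment name plus
the (downcased) name of the word it comes from, e.g. t_unit."
  (string-append
   (item.name i)                   ;; segment name, e.g. "t"
   "_"
   (downcase                       ;; word name, via the syllable
    (item.feat i "R:SylStructure.parent.parent.name"))))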

In one domain we have worked in, we wished to differentiate between words in different prosodic contexts. In particular, we wished to mark words as "questionable", so that we could ask users for confirmation. To do this we marked the "questionable" words in the prompts with a question mark prefix, recorded them with appropriate intonation, and then defined our clunit_name function to include "C_" in the unit type if the word was prefixed by a question mark. For example, the following two prompts will be read in a different manner:

theater is Squirrel Hill Theater
theater is ?Squirrel ?Hill ?Theater

Likewise, in unit selection the units in the word "Squirrel" will not be used to synthesize the word "?Squirrel" and vice versa. Although crude, this does give simple control over prosodic variation, though it can require the vocabulary of unit types to grow to the point where the technique ceases to be practical.
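
As a sketch of how clunit_name can be extended this way (it assumes "?" has been added to the voice's token.prepunctuation so that the marker survives text analysis as the token's prepunctuation feature; confirm this for your own front end):

;; Prefix the unit type with "C_" when the word's source token was
;; marked with a leading question mark.
(define (ldom::clunit_name_q i)
  (let ((word (downcase (item.feat i "R:SylStructure.parent.parent.name")))
        (pre (item.feat i "R:SylStructure.parent.parent.R:Token.parent.prepunctuation")))
    (string-append
     (if (string-equal pre "?") "C_" "")
     (item.name i)
     "_"
     word)))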

It would be good if this technique had a back-off strategy where, if no unit can be found for a particular word, other words would be allowed to contribute candidates. This is ultimately what general unit selection is. We do consider this our goal for unit types, but in the interest of building quick and reliable limited domain synthesizers we do not yet do it, and treat it as an area we will experiment with. One specific case that only partially crosses this line is the synthesis of numbers: it seems very reasonable to allow selection of units from related simple numbers (e.g. "seven" and "seventy"), but we have not experimented with that yet.

One further important point should be highlighted about this method of defining unit types. Although including the word name in the unit name does greatly encourage whole words to be selected, it does not mean that joins in the synthesized utterances only occur at word boundaries. It is common for contiguous units to be selected from different occurrences of the same word; mid-word joins at stable places (e.g. within vowels, or at stops) are common. The optimal coupling technique selects the best place within a word for the cross-over between two different parts of the database.

10.5 Using limited domain synthesizers

The goal of building such limited domain synthesizers is not just to show off good synthesis; we followed this route because we see it as a very practical method for building speech output systems.

For practical reasons, the default configuration includes the possibility of a back-up voice that will be called to do synthesis if the limited domain synthesizer fails, which for this default setup means the phrase includes an out-of-vocabulary word. It would perhaps be more useful if the fall-back position only required synthesis of the out-of-vocabulary word itself rather than the whole phrase, but that isn't as trivial as it might seem: the limited domain synthesis does no prosodic modification of the selections, except for pitch smoothing at joins, so slotting in a single diphone word would sound very bad. At present each limited domain synthesizer has an explicitly defined closest_voice. This voice is used when the limited domain synthesis fails, and also when generating the prompts, which can be looked upon as the absolute minimal case where the synthesizer has no data to synthesize from.
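
In the festvox setup the closest voice is just a Scheme variable naming another voice. For the talking clock example below it might be set as follows (the KAL diphone voice is just an example of a freely available fall-back):

;; Fall-back voice, used when limited domain synthesis fails and
;; when generating the prompts to record.
(set! cmu_time_awb::closest_voice 'voice_kal_diphone)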

There are also issues of speed here, which we are still trying to improve. This technique should in fact be fast, but it is still slower than our diphone synthesizer. One significant reason is the cost of finding the optimal join point in the selected units. Also, this synthesis technique requires more memory than diphones, as the cepstrum parameters for the whole database are required at run time, in addition to the full waveforms. We feel these issues can and should be addressed, as the techniques are not fundamentally computationally expensive, and we intend to work on them in later releases.

10.6 Telling the time

Festival includes a very simple little script that speaks the current time (`festival/examples/saytime'). This section explains how to replace the synthesizer used by this script with one that talks with your own voice. This is an extreme example of a limited domain synthesizer, but a good one, as it allows us to walk through the stages involved in building one. The example is also small enough that it can be done in well under an hour.

Following through this example will give a reasonable understanding of the relative importance of many important steps in the voice building process.

The following tasks are required:

* Designing the prompts
* Recording the prompts
* Autolabelling the prompts
* Extracting pitchmarks and building MCEP coefficients
* Building a clunit based synthesizer from the utterances
* Testing and tuning

Before starting, set the environment variables `FESTVOXDIR' and `ESTDIR' to the directories which contain the festvox distribution and the Edinburgh Speech Tools respectively. Under bash and other good shells this may be done by commands like

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/1.4.1/speech_tools

In earlier releases we only offered a command-line based method for building voices and limited domain synthesizers. In order to make the process easier and less prone to error, we have introduced a graphical front end to these scripts, called pointyclicky (as it offers a pointy-clicky interface). It is particularly useful in the actual prompting and recording. Although pointyclicky is the recommended route, in this section we go through the process step by step, to give a better understanding of what is required and where problems that require attention may lie.

A simple script is provided that sets up the basic directory structure and copies in some default parameter files. The festvox distribution includes all the setup for the time domain. When building for your own domain, you will need to provide the file `etc/DOMAIN.data' containing your prompts (as described below).

mkdir ~/data/time
cd ~/data/time
$FESTVOXDIR/src/ldom/setup_ldom cmu time awb

As in the definition of diphone databases, we require three identifiers for the voice. These are (loosely) institution, domain and speaker. Use `net' if you feel there isn't an appropriate institution for you, though we have also used the name of the project the voice is being built for. The domain name seems well defined. For the speaker name we have also used a style name instead of the speaker's name. The primary reason for these identifiers is so that people do not all build limited domain synthesizers with the same name, which would make it impossible to load them into the same instance of Festival.

This setup script makes the directories and copies basic scheme files into the `festvox/' directory. You may need to edit these files later.

10.6.1 Designing the prompts

In this saytime example the basic format of the utterance is

The time is now, <exactness> <minute info> <hour info>, in the <day info>.

For example

The time is now, a little after five to ten, in the morning.

In all there are 1152 (4x12x12x2) utterances (although there are three possible day-info parts (morning, afternoon and evening), they only get 12 hours, 6 hours and 6 hours respectively). Although it would technically be possible to record all of these, we wish to reduce the amount of recording to a minimum. Thus what we actually do is ensure that there is at least one example of each value in each slot.

Here is a list of 24 utterances that should cover the main variations.

The time is now, exactly five past one, in the morning
The time is now, just after ten past two, in the morning
The time is now, a little after quarter past three, in the morning
The time is now, almost twenty past four, in the morning
The time is now, exactly twenty-five past five, in the morning
The time is now, just after half past six, in the morning
The time is now, a little after twenty-five to seven, in the morning
The time is now, almost twenty to eight, in the morning
The time is now, exactly quarter to nine, in the morning
The time is now, just after ten to ten, in the morning
The time is now, a little after five to eleven, in the morning
The time is now, almost twelve.
The time is now, just after five to one, in the afternoon
The time is now, a little after ten to two, in the afternoon
The time is now, exactly quarter to three, in the afternoon
The time is now, almost twenty to four, in the afternoon
The time is now, just after twenty-five to five, in the afternoon
The time is now, a little after half past six, in the evening
The time is now, exactly twenty-five past seven, in the evening
The time is now, almost twenty past eight, in the evening
The time is now, just after quarter past nine, in the evening
The time is now, almost ten past ten, in the evening
The time is now, exactly five past eleven, in the evening
The time is now, a little after quarter to midnight.

These examples are first put in the prompt file with an utterance number and the prompt in double quotes like this.

(time0001 "The time is now ...")
(time0002 "The time is now ...")
(time0003 "The time is now ...")
...

These prompts should be put into `etc/DOMAIN.data' (`etc/time.data' in this example). This file is used by many of the following sub-processes.

10.6.2 Recording the prompts

The best way to record the prompts is to use a professional speaker in a professional recording studio (ideally an anechoic chamber), recording dual channel (one for audio and the other for the electroglottograph signal) direct to digital media, with a high quality head-mounted microphone.

However, most of us don't have such equipment (or voice talent) so readily available, so whatever you do will probably have to be a compromise. The head-mounted microphone requirement is the cheapest to meet, and it is pretty important, so you should at least meet that one. Anechoic chambers are expensive, and even professional recording studios aren't easy to access (though most universities will have some such facilities). It is possible to do away with the EGG recording if a little care is taken to ensure pitchmarks are properly extracted from the waveform signal alone.

We have been successful in recording with a standard PC using a standard soundblaster-type 16bit audio card, though results do vary from machine to machine. Before attempting this, you should record a few examples on the PC to see how much noise is being picked up by the microphone. For example, try the following

$ESTDIR/bin/na_record -f 16000 -time 5 -o test.wav -otype riff

This will record 5 seconds from the microphone on the machine you run the command on. You should also do this to test that the microphone is plugged in (and switched on). Play back the recorded wave with `na_play' and perhaps adjust the mixer levels until you get the least background noise with the strongest spoken signal. Now you should display the waveform to see (as well as hear) how much noise is there.

$FESTVOXDIR/src/general/display_sg test.wav

This will display the waveform and its spectrogram. Noise will show up in the silence (and other) parts.

There are a few ways to reduce noise. Ensure the microphone cable isn't wrapped around other cables (especially power cables). Turning the computer 90 degrees or repositioning things may help. Moving the sound card to another slot in the machine can also help, as can trying a different microphone (even of the same make).

There is a large advantage in recording straight to disk, as it allows the recordings to go directly into the right files. Off-line recording (onto DAT) is better for reducing noise, but transferring the recordings to disk and segmenting them is a long and tedious process.

Once you have checked your recording environment you can proceed with the build process.

First generate the prompts with the command

festival -b festvox/build_ldom.scm '(build_prompts "etc/time.data")'

and prompt and record them with the command

bin/prompt_them etc/time.data

You may or may not find listening to the prompts before speaking useful. Simply displaying them may be adequate for you (if so, comment out the `na_play' line in `bin/prompt_them').

10.6.3 Autolabelling the prompts

The recorded prompts can be labelled by aligning them against the synthesized prompts. This is done with the command

bin/make_labs prompt-wav/*.wav

If the utterances are long (> 10 seconds of speech) you may require lots of swap space to do this stage (this could be fixed).

Once labelled, you should check that the labels are reasonable. The labeller typically gets things pretty much correct or very wrong, so a quick check can often save time later. You can check the database using the command

emulabel etc/emu_lab

Once you are happy with the labelling you can construct the whole utterance structure for the spoken utterances. This is done by combining the basic structure from the synthesized prompts and the actual times from the automatically labelled ones. This can be done with the command

festival -b festvox/build_ldom.scm '(build_utts "etc/time.data")'

10.6.4 Extracting pitchmarks and building MCEP coefficients

Getting good pitchmarks is important to the quality of the synthesis; see section 14.4 Extracting pitchmarks from waveforms for a more detailed discussion of this issue. For limited domain synthesizers the pitch extraction is a little less crucial than for diphone collection, though spending a little time on it does help.

If you have recorded EGG signals, then you can use `bin/make_pm' on the `.lar' files. Note that you may need to add (or remove) the option `-inv' depending on the polarity of your EGG signal. However, so far only the CSTR laryngograph seems to produce inverted signals, so the default should be adequate. Also note the parameters that specify the pitch period range, `-min' and `-max'. The default settings are suitable for a male speaker; for a female speaker you should modify these to something like

-min 0.0033 -max 0.00875 -def 0.005

This changes the range from the male defaults of 200Hz-80Hz with a default of 100Hz to a female range of 300Hz-120Hz with a default of 200Hz (the parameters are pitch periods in seconds, i.e. 1/frequency, so 300Hz becomes `-min 0.0033').

If you don't have an EGG signal you must extract the pitch from the waveform itself. This works, though it may require a little modification of the parameters, is computationally more expensive, and won't be as exact as extraction from an EGG signal. There are two methods: one uses Entropic's `epoch' program, which works pretty well without tuning parameters; the second uses the free Speech Tools program `pitchmark'. The first is very computationally expensive, and as Entropic is no longer in existence the program is no longer available (though rumours circulate that it may appear again for free). To use `epoch' use the command

bin/make_pm_epoch wav/*.wav

To use `pitchmark' use the command

bin/make_pm_wave wav/*.wav

As with the EGG extraction, `pitchmark' takes parameters specifying the range of pitch periods, and you should modify them to best match your speaker's range. The other filter parameters can also make a difference to the success of pitchmarking. Rather than try to explain what changing each figure means (I admit I don't fully know), the best solution is to explain what you need to obtain as a result.
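
For reference, `bin/make_pm_wave' calls the Speech Tools `pitchmark' program with a parameter set along the following lines (copied from the festvox scripts at the time of writing; check your own copy, as the exact values may differ):

$ESTDIR/bin/pitchmark wav/time0001.wav -o pm/time0001.pm -otype est \
    -min 0.005 -max 0.012 -fill -def 0.01 -wave_end \
    -lx_lf 200 -lx_lo 51 -lx_hf 80 -lx_ho 51 -med_o 0

The `-min', `-max' and `-def' values are the pitch periods discussed above; the `-lx_*' values are the filter frequencies and orders you may need to adjust.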

Irrespective of how you extract the pitchmarks, we have found that a post-processing stage that moves the pitchmarks to the nearest waveform peak is worthwhile. You can achieve this with

bin/make_pm_fix pm/*.pm

At this point you may find that your waveform files are upside down. Normally this wouldn't matter, but because of the basic signal processing techniques we use to find the pitch periods, upside-down signals confuse things. People tell me this shouldn't happen, but some recording devices return an inverted signal. From the cases we've seen, the same device always returns the same polarity, so if one of your recordings is upside down, all of them probably are (though there are some published speech databases, e.g. the BU Radio corpus, where a random half are upside down).

In general the higher peaks should be positive rather than negative. If not, you can invert the signals with the command

for i in wav/*.wav
do
   ch_wave -scale -1.0 $i -o $i
done

If they are upside down, invert them as above and re-run the pitchmarking. (If you do invert them, it is not necessary to re-run the segment labelling.)

Power normalization can help too. This can be done globally with the command

bin/simple_powernormalize wav/*.wav

This should be sufficient for full-sentence examples. In diphone collection we take greater care over power normalization, but that vowel-based technique would be confused by the longer, more varied examples here.

Once you have the pitchmarks, you next need to generate the pitch-synchronous MELCEP parameterization of the speech used in building the cluster synthesizer.

bin/make_mcep wav/*.wav

10.6.5 Building a clunit based synthesizer from the utterances

Building a full clunit synthesizer is probably a little bit of overkill, but the technique basically works. See section 9 Unit selection databases for a more detailed discussion of this technique. The basic parameter file `festvox/time_build.scm' is reasonable as a start.

festival -b festvox/build_ldom.scm '(build_clunits "etc/time.data")'

If all goes well this should create the file `festival/clunits/cmu_time_awb.catalogue' and a set of index trees in `festival/trees/cmu_time_awb_time.tree'.

10.6.6 Testing and tuning

To test the new voice start Festival as

festival festvox/cmu_time_awb_ldom.scm '(voice_cmu_time_awb_ldom)'

The function `(saytime)' can now be called and it should say the current time.

Note this synthesizer can only say the phrases that it has phones for, which basically means it can only say the time in the format given at the start of this chapter. Thus although you can use `SayText', you can only give it text in the right form if you expect it to work. That's what limited domain synthesis is.

A full directory structure of this example, with the recordings and parameter files, is available at http://www.festvox.org/examples/cmu_time_awb_ldom/, and an on-line demo of this voice is available at http://www.festvox.org/ldom/index.html.

10.7 Making it better

The above walkthrough is intended to give you a basic idea of the stages involved in building a limited domain synthesizer. The quality of such a synthesizer will most likely be excellent in parts and very bad in others, which is typical of techniques like this. Each stage is, of course, more complex than presented here, and there are a number of things that can be done to improve it.

For limited domain synthesis it should be possible to correct the errors so that it is always excellent. To do so, though, requires being able to diagnose where the problems are; the most likely problem areas are those discussed above, such as the autolabelling and the pitchmark extraction.

The line between limited domain synthesis and unit selection is fuzzy. The more complex and varied the phrases you synthesize are, the more difficult it is to produce reliable synthesis.

