

5 Diphone databases

This chapter describes the processes involved in designing, listing, recording, and using a diphone database for a language.

5.1 Diphone introduction

The basic idea behind building diphone databases is to explicitly list all possible phone-phone transitions in a language. This makes the wrong, but practical, assumption that co-articulatory effects never extend over more than two phones. The exact definition of phone here is in general non-trivial, as in addition to what one may define as a standard phone set, various allophonic variations may in some cases also be included. Unlike generalized unit selection, where multiple occurrences of phones may exist with various distinguishing features, in a diphone database only one occurrence of each diphone is recorded. This makes selection much easier, but also makes for a large, laborious collection task.

In general the number of diphones in a language is the number of phones squared. However, there are often various restrictions on phone-phone combinations, and whole classes of phone-phone combinations may be deemed never to exist. The exact definition of never exist is problematic: humans can often generate those so-called non-existent diphones if they try, and one must always think about phone-phone transitions over word boundaries as well. Even then certain combinations cannot exist; for example /hh/ /ng/ in English is probably impossible (if forced we would probably insert a schwa). /ng/ may really only appear in coda position; however, in foreign words it can appear in syllable initial position. /hh/ cannot appear in syllable final position, though sometimes it may be pronounced when trying to add aspiration to open vowels.

Diphone synthesis, and in general any concatenative synthesis method, makes an absolute fixed choice about which units exist, and in circumstances where something else is required a mapping is necessary. When humans speak and are given a context where an unusual phone is desired they will (often) attempt to produce it even though it falls outside their basic phonetic vocabulary. Humans of course have an articulatory system which allows them enough control to produce (or attempt to produce) unfamiliar phones. Concatenative synthesizers, however, make a fixed decision and cannot reasonably produce anything outside their pre-defined vocabulary. Formant and articulatory synthesizers of course have the advantage here.

However, because we wish to build voices for arbitrary text-to-speech systems which may include unusual phones, some mapping (typically at the lexical level) can be used to ensure all actually required diphones lie within the recorded inventory. The resulting voice will therefore be limited and will not compensate for unusual phones that lie outside its range. In many cases this is acceptable, though if the voice is specifically to be used for pronouncing Scottish place names it would be advisable to include the /X/ phone as in "loch".

In addition to what may be considered as basic phones, various allophonic variations may also be considered: flaps in American English (a good example of a reduction which makes speech more natural), stressed and unstressed vowels in Spanish, consonant cluster /r/ versus lone /r/ in English, inter-syllabic diphones versus intra-syllabic ones. All these variations are worth considering. Ideally all such possible variations should be included in a diphone list, but the more variations you include the larger the diphone set, which increases recording time, labelling time and ultimate database size. Adding an extra phone adds roughly another 2n diphones (where n is the number of phones). Duplicating all the vowels (e.g. stressed/unstressed versions) could significantly increase the database size.
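
As a rough sizing sketch (assuming a hypothetical 40 phone set; this just restates the arithmetic above in Festival's Scheme):

(set! nphones 40)
(* nphones nphones)   ;; => 1600 diphone types before impossible pairs are pruned
(* 2 nphones)         ;; => roughly 80 extra diphones for each phone added
                      ;;    (2n+1 = 81 counting the new phone paired with itself)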

These questions are open, and depending on the resources you are willing to devote, more and more variations can be added; however this list should be seen as a basic set. Alternative synthesis methods may produce better results for the amount of work (or data collected): demi-syllable based databases or mixed inventory methods such as Hadifix portele96 may give better results. Whistler huang97 still controls the inventory, but uses acoustic measures rather than linguistic knowledge to define the space of possible units. The most extreme view, where the unit inventory is not predefined at all but based solely on what is available in general speech databases, is CHATR campbell96.

Although generalized unit selection can produce much better synthesis than diphone techniques, the cost of using more units is the increased complexity of selecting appropriate ones. In the basic strategy presented in this section, selection of the appropriate unit from the diphone inventory is trivial, while in a system like CHATR selection of the appropriate unit is a significantly difficult problem. (See section 6 Unit selection databases for more discussion of such techniques.) With a harder selection task it is more likely that mistakes will be made, which in unit selection can give some selections that are significantly worse than diphones, even though other examples may be better.

5.2 Defining a diphone list

Because diphones need to be cleanly articulated, various techniques have been proposed to elicit them from subjects. One technique uses words within carrier sentences to ensure that the diphones are pronounced with acceptable duration and prosody (i.e. consistent). At the University of Edinburgh we have typically used nonsense words that iterate through all possible combinations. The advantage of nonsense words is that you don't need to search for natural examples that have the desired diphone, the list can be more easily checked, and the presentation is less prone to pronunciation errors than if real words were presented. The words look unnatural, but collecting all diphones is not a particularly natural thing to do anyway. See isard86 or stella83 for some more discussion on the use of nonsense words for collecting diphones.

For best results we believe the words should be pronounced with the same vocal effort, with as little prosodic variation as possible. In fact pronouncing them in a monotone is ideal. Our nonsense words consist of a simple carrier form with the diphones (where appropriate) being taken from a middle syllable. Except where schwa and syllabic consonants are involved that syllable should normally be a full stressed one.

Some example code is given in `src/diphone/darpaschema.scm'. The basic idea is to define classes of diphones, for example: vowel consonant, consonant vowel, vowel vowel and consonant consonant, then define carrier contexts for these and list the cases. Here we use Festival's Scheme interpreter to generate the list, though any scripting language is suitable. Our intention is that the diphone will come from a middle syllable of the nonsense word, so that it is fully articulated and articulatory effects at the start and end of the word are minimized.

For example, to generate all the vowel-vowel diphones we define a carrier

(set! vv-carrier '((pau t aa t) (t aa pau)))

And we define a simple function that will enumerate all vowel-vowel transitions

(define (list-vvs)
  (apply
   append
   (mapcar
    (lambda (v1)
      (mapcar 
       (lambda (v2) 
         (list
          (string-append v1 "-" v2)
          (append (car vv-carrier) (list v1 v2) (car (cdr vv-carrier)))))
       vowels))
    vowels)))
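
For instance, with a vowel set containing just aa and ae, the function returns the following structure (a hypothetical interactive session, output reformatted for readability):

festival> (set! vowels '(aa ae))
(aa ae)
festival> (list-vvs)
(("aa-aa" (pau t aa t aa aa t aa pau))
 ("aa-ae" (pau t aa t aa ae t aa pau))
 ("ae-aa" (pau t aa t ae aa t aa pau))
 ("ae-ae" (pau t aa t ae ae t aa pau)))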

For those of you who aren't used to reading Lisp, this simply lists all possible combinations. In a potentially more readable format (in an imaginary language):

for v1 in vowels
   for v2 in vowels
     print pau t aa t $v1 $v2 t aa pau

The actual Lisp code returns a list of diphone names and phone strings. To be more efficient, the darpa example produces the consonant-vowel and vowel-consonant diphones in the same nonsense word, thus reducing the number of words quite significantly.
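
As a sketch of that trick (the names cvc-carrier, consonants and list-cvcs are illustrative, not necessarily those used in `src/diphone/darpaschema.scm'), a single nonsense word can carry both the consonant-vowel and the vowel-consonant diphone:

(set! cvc-carrier '((pau t aa) (aa pau)))

(define (list-cvcs)
  (apply
   append
   (mapcar
    (lambda (c)
      (mapcar
       (lambda (v)
         (list
          (list (string-append c "-" v) (string-append v "-" c))
          (append (car cvc-carrier) (list c v c) (car (cdr cvc-carrier)))))
       vowels))
    consonants)))

With c = b and v = aa this yields the entry (("b-aa" "aa-b") (pau t aa b aa b aa pau)), matching the first line of the example list below.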

Although the idea seems simple to begin with, simply listing all contexts and pairs, there are other constraints. Some consonants can only appear in onset position while others are restricted to the coda.

Next there is the issue of collecting diphones for more than simply all phone-phone pairs. Consonant clusters are the obvious next set to consider: the /p/ in pat isn't really the same as the /p/ in prat. Thus the example darpa schema includes simple consonant clusters with explicit syllable boundaries. We also include syllabic consonants, though these may be harder to pronounce in all contexts. You can add other phenomena too, but this comes at the cost of not only making the list longer (and hence taking longer to record), but also you must consider how easy it is for your speaker to pronounce them (and how consistent they can be). For example, not all American speakers produce flaps (/dx/) in all contexts, and it's quite difficult for some to pronounce them, so some of the nonsense words they produce could be wrong.

A second related problem is language interference and phoneme crossover. Because of the prevalence of English, especially in electronic text, how many "foreign" phones should be considered? For example, should /w/ be included for German speakers (maybe), /t-i/ for Japanese (probably), or both /b/ and /v/ for Spanish speakers? This problem is made more difficult by the fact that the people you are recording will often be fluent or near fluent in English, and hence already have reasonable ability in phones that are not in their native language. To some degree foreign phones should be considered if the text that will be spoken will contain loan words that would normally require them; remember that in most languages, nowadays, making no attempt to accommodate foreign phones is considered ignorant at least and possibly even arrogant.

Ultimately, when more complex forms are desired, extending the "diphone" set becomes prohibitive and has diminishing returns. Obviously there are phonetic differences between onset and coda positions, co-articulatory effects which go over more than one phone, stress differences, intonational accent differences, and phrase final, middle and initial differences, to name but a few. Explicitly enumerating all these, or even deciding which is more important than the others, is a difficult research question, and arguably shouldn't be done in an abstract linguistically generated fashion. Identifying these potential differences and finding an inventory which takes into account the actual distinctions a speaker makes is far more productive, and is the fundamental part of many new research directions in concatenative speech synthesis. (See the discussion in the introduction above.)

However you choose to construct your diphone list, and whatever examples you choose to include, you should (if you wish to use the other tools and scripts included with this document) construct a file of the following format. Each line should contain a file id, a diphone name (or list of names if more than one diphone is being extracted from that file) and a list of phones in the nonsense word. The file id is used in the filename for the waveform, label file, and any other parameter files associated with the nonsense word. We usually make this distinct for the particular speaker we are going to record, e.g. their initials and possibly the language they are speaking. For example, the following is taken from the darpa generated list

( awb_0001 ("b-aa" "aa-b")  (pau t aa b aa b aa pau) )
( awb_0002 ("p-aa" "aa-p")  (pau t aa p aa p aa pau) )
( awb_0003 ("d-aa" "aa-d")  (pau t aa d aa d aa pau) )
( awb_0004 ("t-aa" "aa-t")  (pau t aa t aa t aa pau) )
( awb_0005 ("g-aa" "aa-g")  (pau t aa g aa g aa pau) )
( awb_0006 ("k-aa" "aa-k")  (pau t aa k aa k aa pau) )
...
( awb_0466 "eh-aa"          (pau t aa t eh aa t aa pau) )
( awb_0467 "eh-ae"          (pau t aa t eh ae t aa pau) )
( awb_0468 "eh-ah"          (pau t aa t eh ah t aa pau) )
...
( awb_0621 "p-v"            (pau t aa p - v aa t aa pau) )
( awb_0622 "p-s"            (pau t aa p - s aa t aa pau) )
( awb_0623 "p-z"            (pau t aa p - z aa t aa pau) )
( awb_0624 "p-sh"           (pau t aa p - sh aa t aa pau) )
( awb_0627 "p-r"            (pau t aa p - r aa t aa pau) )
...

Note the explicit syllable boundary marking (-) for the consonant-consonant diphones; it is used to distinguish them from the consonant cluster examples that appear later.

5.2.1 Synthesizing prompts

To help keep pronunciation consistent we suggest synthesizing prompts. This helps the speaker in two ways: if they mimic the prompt they are more likely to keep a fixed prosody, and secondly it reduces the number of errors where the speaker vocalizes the wrong diphone. Of course for new languages where a set of diphones doesn't already exist, producing prompts is not easy; however giving approximations with diphone sets from other languages may work. The problem then is that in producing prompts from a different phone set, the speaker is likely to mimic the prompts, hence the diphone set will probably seem to have a foreign pronunciation, especially for vowels (see section 11.1 Selecting a speaker).

Even when synthesizing prompts from an existing diphone set you must be aware that the diphone set may contain errors, or that certain examples will not be synthesized appropriately (e.g. consonant clusters). Because of this, it is still worthwhile monitoring the speaker to ensure they say things correctly.

The basic code for generating the prompts is in `src/diphone/diphlist.scm', and a specific example for the darpa phone set (American English) is in `src/diphone/us_schema.scm'. The prompts can be generated from the diphone list as described above (or at the same time). The example code produces the prompts and phone label files which can be used by the aligning tool described below.

Before synthesizing, the function Diphone_Prompt_Setup is called, if defined. You should define this to set up the appropriate voices in Festival, as well as any other setup that may be required, for example setting the F0 for the monotonic prosody. This value is set through the variable FP_F0 and should be in the middle of the range for the speaker. For example, for the darpa diphone list for KAL:

(define (Diphone_Prompt_Setup)
 "(Diphone_Prompt_Setup)
Called before synthesizing the prompt waveforms.  Defined for KAL
speaker using ked diphone set (US English) and setting F0."
 (voice_ked_diphone)  ;; US male voice
 (set! FP_F0 90)      ;; lower F0 than ked
 )

Also, if the function Diphone_Prompt_Word is defined, it will be called after the basic prompt word utterance has been created and before the actual waveform synthesis. This may be used to map phones to other phones, set durations, or whatever you feel is appropriate for your speaker/diphone set. For example, for the KAL set we redefined the syllabic consonants to their full consonant forms because the ked diphone database doesn't actually include syllabics. Also, in the example below, instead of using fixed (100ms) durations we make the diphones use their average durations (stretched by a factor of 1.2).

(define (Diphone_Prompt_Word utt)
  "(Diphone_Prompt_Word utt)
Specify specific modifications of the utterance before synthesis
specific to this particular phone set."
  ;; No syllabics in ked so flip them to non-syllabic form
  (mapcar
   (lambda (s)
     (let ((n (item.name s)))
       (cond
        ((string-equal n "el")
         (item.set_name s "l"))
        ((string-equal n "em")
         (item.set_name s "m"))
        ((string-equal n "en")
         (item.set_name s "n")))))
   (utt.relation.items utt 'Segment))
  (set! phoneme_durations kd_durs)
  (Parameter.set 'Duration_Stretch '1.2)
  (Duration_Averages utt))

By convention the prompt waveforms should be saved in `prompt-wav/' and their labels in `prompt-lab/'. The prompts may be generated when the diphone list is generated by the following command

$ festival bin/us_schema.scm bin/diphlist.scm
festival> (diphone-gen-schema "kal" "etc/kaldiph.list")

If you already have a diphone list schema generated in the file `etc/kaldiph.list' you can do the following

$ festival bin/us_schema.scm bin/diphlist.scm
festival> (diphone-gen-waves "prompt-wav" "prompt-lab" "etc/kaldiph.list")

Another example of the use of these setup functions is to generate prompts for a language for which there doesn't yet exist a synthesizer. A simple mapping can be provided between the target phoneset and an existing synthesizer's phone set. We don't know if the result is good enough to actually use as prompts for the speaker, but it appears suitable for automatic alignment.

The example here uses the voice_kal_diphone speaker, a US English speaker, to produce prompts for a Japanese phone set; this code is in `src/diphones/ja_schema.scm'.

The function Diphone_Prompt_Setup selects the kal (US) voice, sets a suitable F0 value, and sets the option diph_do_db_boundaries to nil. This option normally allows the diphone boundaries to be dumped into the prompt label files, but this doesn't work when cross-language prompting is done, as the actual phones don't match the desired ones.

(define (Diphone_Prompt_Setup)
 "(Diphone_Prompt_Setup)
Called before synthesizing the prompt waveforms.  Cross language prompts
from US male (for gaijin male)."
 (voice_kal_diphone)  ;; US male voice
 (set! FP_F0 90)
 (set! diph_do_db_boundaries nil) ;; cross-lang confuses this
 )

At synthesis time, each Japanese phone must be mapped to an equivalent US phone (or phones). This is done through a simple table, set in nhg2radio_map, which gives the closest phone or phones for each Japanese phone (those unlisted remain the same).

Our mapping table looks like this

(set! nhg2radio_map
      '((a aa)
	(i iy)
	(o ow)
	(u uw)
	(e eh)
	(ts t s)
	(N n)
	(h hh)
	(Qk k)
	(Qg g)
	(Qd d)
	(Qt t)
	(Qts t s)
	(Qch t ch)
	(Qj jh)
	(j jh)
	(Qs s)
	(Qsh sh)
	(Qz z)
	(Qp p)
	(Qb b)
	(Qky k y)
	(Qshy sh y)
	(Qchy ch y)
	(Qpy p y)
	(ky k y)
	(gy g y)
	(jy jh y)
	(chy ch y)
	(shy sh y)
	(hy hh y)
	(py p y)
	(by b y)
	(my m y)
	(ny n y)
	(ry r y)))

We assume that those phones not explicitly mentioned map to themselves (e.g. most of the consonants).
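
That fallback convention can be captured in a small helper (map_nhg_phone is a hypothetical name, not something defined in `ja_schema.scm'):

(define (map_nhg_phone p)
  "Return the list of US phones to be used for Japanese phone p."
  (let ((m (assoc_string p nhg2radio_map)))
    (if m
        (cdr m)       ;; one or more US phones from the table
        (list p))))   ;; unlisted phones map to themselves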

Finally we define Diphone_Prompt_Word to actually do the mapping. Where the mapping involves more than one US phone we add an extra segment to the Segment relation and split the duration equally between them. The basic function looks like

(define (Diphone_Prompt_Word utt)
  "(Diphone_Prompt_Word utt)
Specify specific modifications of the utterance before synthesis
specific to this particular phone set."
  (mapcar
   (lambda (s)
     (let ((n (item.name s))
	   (newn (cdr (assoc_string (item.name s) nhg2radio_map))))
       (cond
	((cdr newn)  ;; its a dual one
	 (let ((newi (item.insert s (list (car (cdr newn))) 'after)))
	   (item.set_feat newi "end" (item.feat s "end"))
	   (item.set_feat s "end"
			  (/ (+ (item.feat s "segment_start")
				(item.feat s "end"))
			     2))
	   (item.set_name s (car newn))))
	(newn
	 (item.set_name s (car newn)))
	(t
	 ;; as is
	 ))))
   (utt.relation.items utt 'Segment))
  utt)

The label file produced from this will have the original desired language phones, while the waveform will actually consist of phones from the foreign language. Although this may seem like cheating, we found that it works at least for Korean and Japanese prompted from English, and is likely to work for many other language pairs. For autolabelling, since the nonsense word phone names are pre-defined, alignment just needs to find the best matching path; as long as the phones are distinctive from the ones around them this alignment method is likely to work.

5.3 Recording the diphones

See the general notes on speaker selection and recording in the previous chapter, but let's reiterate some points. The object of recording diphones is to get as uniform a set of pronunciations as possible. Your speaker should be relaxed, and not be suffering from a cold, a cough, or a hangover. If something goes wrong with the recording and some of the examples need to be re-recorded, it is important that the speaker's voice is as similar as possible to the original recording; waiting for another cold to come along is not reasonable (though some may argue that the same hangover can easily be induced). Also, to try to keep the voice repeatable it is wise to record at the same time of day; morning is a good idea.

The recording environment should be repeatable, which basically means as controlled as possible. Anechoic chambers are best, but general recording studios will do. We've even done recordings in an open room; with care this works (make sure there's little background noise from computers, air conditioning, outside traffic etc.). Of course open rooms aren't ideal, but they are better than open noisy rooms.

The distance between the speaker and the microphone is crucial. A head-mounted mike keeps this constant. Considering the cost and availability of head-mounted mikes, you have no excuse not to meet this criterion.

Ultimately you need to split the recordings into individual files, one for each nonsense word. Ideally this can be done while recording, but as that may not be practical in some (many?) cases, some other technique is required. At CSTR we typically record onto DAT and transfer the data to disk (and downsample) later. Files typically contain 50-100 nonsense words each. We hand label the words, taking into account any duplicates caused by errors in the recording. The EST program `ch_wave' offers a function to split a large file into individual files based on a label file; we can use this to get our individual files. Others (OGI) add an identifiable noise during recording and automatically detect that as a split point. They typically use two different noises that can easily be distinguished, one for `OK' and one for `BAD', which can make the splitting of the files into the individual nonsense words easier. Note that you will also need to split the EGG signal in exactly the same way.

No matter how you split these you should be aware that there will still often be mistakes and checking by listening will help.

We are now moving towards recording directly to machines, see section 11.3 Recording under Unix. There is a cost in the (potential) quality of the recording due to poor quality audio hardware in computers (and often too much noise), but the advantage of being able to record directly into the appropriate files is not to be belittled.

5.4 Labelling the diphones

Labelling nonsense words is much easier than labelling continuous speech, whether by hand or automatically. With nonsense words it is completely defined which phones are there (and if not, it is an error) and they are clearly articulated.

We have had significant experience in hand labelling diphones, and with the right tools it can be done fairly quickly (e.g. 20 hours for 2500 nonsense words), even if it is mind-numbingly boring and can't realistically be done for more than an hour at any one time. As a minimum, the start of the phone preceding the first phone in the diphone, the changeover, and the end of the second phone in the diphone should be labelled. Note we recommend phone boundary labelling as that is much better defined than phone middle marking. The diphone will, by default, be extracted from the middle of phone one to the middle of phone two.

Our (hand) labelling conventions include labelling of closures within stops explicitly. Thus we expect the label tcl at the end of the silence part of a /t/ and a label t after the burst. This way we can automatically make the diphone boundary within the silence part of the stop. Also we support the label DB when explicit diphone boundaries are required. This is useful within phones such as diphthongs where the temporal middle need not be the most stable part.

Another place where specific diphone boundaries are recommended is in the phone-to-silence diphones. The phones at the end of words are typically longer than word internal phones, and they also tend to trail off in energy. Thus the mid-point of a phone immediately before a silence typically has much lower energy than the mid-point of a word internal phone, so when a diphone is to be concatenated to a diphone of the form phone-silence there would be a big jump in energy (as well as in other related spectral characteristics). Our solution is to explicitly label a diphone boundary near the beginning of the phone before the silence (about 20% in, e.g. at 1.06 seconds for a phone running from 1.00 to 1.30 seconds), where the energy is much closer to what it will be in the diphone that will precede it.

Another point worth noting is that stops at the start of words don't seem to have a closure part. However it is a good idea to actually label one anyway, if you are doing this by hand. Just "steal" a suitable short piece of silence from the preceding part of the waveform.

Because the words will often have very varying amounts of silence around them, it is a good idea to label multiple silences around the word, so that the silence immediately before the first phone is about 200-300 ms, with the silence before that labelled as another phone. Likewise with the final silence. Also, as the final phone before the end silence may trail off, it is recommended that the end of the last phone comes at the very end of any signal, thus appearing to include silence within it; then label the real silence (200-300 ms) after it. The reason for this is that if the end silence happens to include some part of the spoken signal and is duplicated (as is the case with duration modification), an audible buzz is often introduced.
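
Putting these conventions together, a hand label file for the nonsense word pau t aa t aa t aa pau might look like the following. The format is the xwaves-style label format used by the Edinburgh Speech Tools; the times and the colour field (26) are purely illustrative:

separator ;
nfields 1
#
    0.350  26  pau
    0.610  26  pau
    0.650  26  tcl
    0.700  26  t
    0.830  26  aa
    0.900  26  tcl
    0.950  26  t
    1.080  26  aa
    1.150  26  tcl
    1.200  26  t
    1.400  26  aa
    1.650  26  pau
    1.900  26  pau

Note the double pau at each end, the closure (tcl) "stolen" from the initial silence, and the final aa running right up to the start of the real end silence so that its trail-off is included within it.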

Because labelling of diphone nonsense words is such a constrained task we have included a program for automatically providing a labelling for the spoken prompts. This requires that prompts can be generated for the diphone database; the aligner uses those prompts to do the aligning. Though it is not actually necessary that the prompts were played to the speaker, they do need to be generated for the alignment process.

The idea behind the aligner is to take the prompt and the spoken form and build melcep (and delta melcep) parameterizations of the files. Then a DTW algorithm is used to find the best alignment between these two sets of parameters. Then the prompt label file is used to index through the alignment to give a label file for the spoken nonsense word. This is largely based on the techniques described in malfrere97.

We have tested this aligner on a number of existing hand labelled databases to compare how good its alignment is with respect to the hand labelling. We have also tested aligning prompts generated from a language different from that being recorded. To do this there needs to be a reasonable mapping between the language phonesets.

Here are the results for automatically finding labels for the ked (US English) database by aligning it against prompts generated by three different voices

ked itself
mean error 14.77ms stddev 17.08
mwm (US English)
mean error 27.23ms stddev 28.95
gsw (UK English)
mean error 25.25ms stddev 23.923

Note that gsw actually gives better results than mwm, even though it is a different dialect of English. We built three diphone index files from the label sets generated by these alignment processes. ked-to-ked was the best, and only marginally worse than the database made from the manually produced labels. The databases from the mwm and gsw produced labels were a little worse, but not unacceptably so. Considering that a significant amount of careful correction was made to the manually produced labels, these automatically produced labels are still significantly better than a first pass of hand labels.

A further experiment was made across languages. The ked diphones were used as prompts to align a set of Korean diphones. Even though there are a number of phones in Korean not present in English (various forms of aspirated consonants) the results are quite usable.

Whether you use hand labelling or aligning, it is always worthwhile doing some hand correction after the basic database is built. Mistakes (sometimes systematic) always occur, and listening to a substantial subset of the diphones (or all of them, if you resynthesize the nonsense words) is definitely worth the time in finding bad diphones.

The script `festvox/src/diphones/make_labs' will process a set of prompts and their spoken form generating a set of label files, to the best of its ability. The script expects the following to already exist

`prompt-wav/'
The waveforms as synthesized by Festival
`prompt-lab/'
The label files corresponding to the synthesized prompts in `prompt-wav'.
`prompt-cep/'
The directory where the cepstrum parameters for each prompt will be saved.
`wav/'
The directory holding the nonsense words spoken by your speaker. These should have the same fileids as the waveforms in `prompt-wav/'.
`cep/'
The directory where the cepstrum parameters for the spoken nonsense words will be saved.
`lab/'
The directory where the generated label files for the spoken words in `wav/' will be saved.

To run the script over the prompt waveforms

make_labs prompt-wav/*.wav

The script is written so it may be used at once on multiple machines if you want to parallelize the process, as shown below. On a Pentium Pro 200MHz, a 2000 word diphone database can be labelled in about 30 minutes; most of that time is in generating the cepstrum coefficients.
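
For example (the fileid ranges here are invented), two machines could each take roughly half of the prompts, running one of these commands each:

make_labs prompt-wav/awb_0[0-2]*.wav
make_labs prompt-wav/awb_0[3-9]*.wav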

Once the nonsense words are labelled, you need to build a diphone index. The index identifies which diphone comes from which file, and where. This can be (mostly) automatically built from the label files. The script `festvox/src/diphones/make_diph_index' is a Festival script that will take the diphone list (as used above), find the occurrence of each diphone in the label files, and build an index. The index consists of a simple header followed by a line for each diphone: the diphone name, fileid, start time, mid-point (i.e. the phone boundary) and end time. The times are given in seconds (note that previous versions of Festival, using a different diphone synthesizer module, used milliseconds at this point).

An example from the start of a diphone index file is

EST_File index
DataType ascii
NumEntries  1610
IndexName ked2_diphone
EST_Header_End
y-aw kd1_002 0.435 0.500 0.560
y-ao kd1_003 0.400 0.450 0.510
y-uw kd1_004 0.345 0.400 0.435
y-aa kd1_005 0.255 0.310 0.365
y-ey kd1_006 0.245 0.310 0.370
y-ay kd1_008 0.250 0.320 0.380
y-oy kd1_009 0.260 0.310 0.370
y-ow kd1_010 0.245 0.300 0.345
y-uh kd1_011 0.240 0.300 0.330
y-ih kd1_012 0.240 0.290 0.320
y-eh kd1_013 0.245 0.310 0.345
y-ah kd1_014 0.305 0.350 0.395
...

Note that the number of entries field must be correct; if it is too small, entries after that point will (often confusingly) be ignored.

This file can be created from a diphone list file and the label files in `lab/' by the command

make_diph_index kaldiph.list dic/kaldiphindex.est

You should check that this has successfully found all the named diphones. When a diphone is not found in a label file, an entry with zeros for the start, middle and end is generated, which will produce a warning when used in Festival; it is worthwhile checking for such entries beforehand.

The `make_diph_index' program will take the mid-point between phone boundaries as the diphone boundary unless otherwise specified (by the label DB). It will also automatically remove underscores and dollar symbols from diphone names before searching for the diphone in the label file, and it will only find the first occurrence of each diphone.

5.5 Extracting the pitchmarks

Festival, in its publicly distributed form, currently only supports residual excited LPC resynthesis hunt89. Festival also supports PSOLA moulines90, though this is not distributed in the public version. Both of these techniques are pitch synchronous, that is, they require information about where pitch periods occur in the waveform signal. Where possible it is better to record a laryngograph (electroglottograph -- EGG) signal at the same time as the voice signal. This signal, recorded from electrodes on the throat, registers the electrical activity of the glottis during speech.

Although extracting pitch periods from the LAR signal is not trivial, it is fairly straightforward. The Edinburgh Speech Tools provide a program `pitchmark' which will process the LAR signal giving a set of pitchmarks. However it is not fully automatic and requires someone to look at the result and make some decisions to change parameters that may improve the result.

The first major issue in processing the signal is deciding which way up it is. From our experience we have seen the signal inverted in some cases, and it is necessary to identify the direction in order for the rest of the processing to work properly. In general we've found that CSTR's LAR output is upside down while OGI's and CMU's output is the right way up. Thus if you are using CSTR's recording facilities you should add -inv to the arguments to `pitchmark'.

The object is to produce a single mark at the peak of each pitch period, and "fake" periods during unvoiced regions. The basic command we have found that works for us is

pitchmark lar/file001.lar -o pm/file001.pm -otype est \
     -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

It is worth doing one or two by hand and confirming that reasonable pitch periods are found. Note that the -min and -max arguments are speaker dependent. They can be moved towards the fixed F0 point used in the prompts, though remember the speaker will not have been exactly constant. The script `festvox/src/general/make_pm' can be copied, modified (for the particular pitch range) and run to generate the pitchmarks

make_pm lar/*.lar
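
The -min and -max values are pitch periods in seconds, i.e. the reciprocals of the highest and lowest F0 you expect from the speaker. A quick sanity check, assuming for illustration a range of 90-210 Hz:

(set! speaker_f0_max 210)        ;; highest expected F0 in Hz (assumed)
(set! speaker_f0_min 90)         ;; lowest expected F0 in Hz (assumed)
(print (/ 1.0 speaker_f0_max))   ;; about 0.0048: a candidate for -min
(print (/ 1.0 speaker_f0_min))   ;; about 0.0111: a candidate for -max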

If you don't have a LAR signal for your diphones, the alternative is to extract the pitch periods using some other signal processing function. Finding the pitch periods is similar to finding the F0 contour, and although it is harder than finding them from the LAR signal, with clean laboratory speech such as recorded diphones it is possible. The following script is a modification of the `make_pm' script above for extracting pitchmarks from a raw waveform signal. It is not as good as extracting from the LAR file, but it is better than nothing at all. It is more computationally intensive as it requires high order filters. The values (particularly -min, -max and the filter frequencies) should be changed depending on the speaker's pitch range.

for i in $*
do
   fname=`basename $i .wav`
   echo $i
   $ESTDIR/bin/ch_wave -scaleN 0.9 $i -o /tmp/tmp$$.wav
   $ESTDIR/bin/pitchmark /tmp/tmp$$.wav -o pm/$fname.pm \
             -otype est -min 0.005 -max 0.012 -fill -def 0.01 \
             -wave_end -lx_lf 200 -lx_lo 71 -lx_hf 80 -lx_ho 71 -med_o 0
done

If you are extracting pitch periods automatically it is worth taking extra care to check the signal. We have found recording inconsistency and bad pitch extraction to be the two major reasons for poor quality synthesis.

See section 11.4 Extracting pitchmarks from waveforms for a more detailed discussion on how to do this.

5.6 Building LPC parameters

As the only publicly distributed signal processing method in Festival is residual excited LPC, you must create LPC parameters and LPC residual files for each file in the diphone database. Ideally the LPC analysis should be done pitch synchronously, thus requiring that the pitchmarks are created before the LPC analysis takes place.

A script suitable for generating the LPC coefficients and residuals is given in `festvox/src/general/make_lpc' and is repeated here.

for i in $*
do
   fname=`basename $i .wav`
   echo $i

   # Potentially normalise the power (a hack)
   #$ESTDIR/bin/ch_wave -scaleN 0.5 $i -o /tmp/tmp$$.wav
   # resampling can be done now too
   #$ESTDIR/bin/ch_wave -F 11025 $i -o /tmp/tmp$$.wav
   # Or use as is
   cp -p $i /tmp/tmp$$.wav
   $ESTDIR/bin/sig2fv /tmp/tmp$$.wav -o lpc/$fname.lpc \
             -otype est -lpc_order 16 -coefs "lpc" \ 
             -pm pm/$fname.pm -preemph 0.95 -factor 3 \
             -window_type hamming
   $ESTDIR/bin/sigfilter /tmp/tmp$$.wav -o lpc/$fname.res \
              -otype nist -lpcfilter lpc/$fname.lpc -inv_filter
   rm /tmp/tmp$$.wav
done

Note the (optional) use of `ch_wave' to attempt to normalize the power in the wave to a percentage of its maximum. This is a very crude method for making the waveforms have reasonably equivalent power. Wildly different power between segments is likely to be noticed when they are joined. Differing power in the nonsense words may occur if not enough care has been taken in the recording: either the settings on the recording equipment have been changed (bad) or the speaker has changed their vocal effort (worse). It is important that this is avoided, as the above normalization does not make the problem of different power go away, it only makes it slightly less bad.

A more elaborate power normalization has been successful, though it is a little harder; it was definitely worthwhile for the KED US English voice, which had major power fluctuations over different recording sessions. The idea is to find the power during vowels in each nonsense word, then find the mean power for each vowel. Then for each file, find the average factor by which each actual vowel differs from the mean for that vowel, and scale the wave according to that value. We don't provide the hacky little shell scripts that do this (because they are hacky). We generate a set of `ch_wave' commands that extract the parts of the wave that are vowels (using the `-start' and `-end' options), make the output ascii (`-otype raw' `-ostype ascii'), and use a simple awk script to calculate the RMS power. We then calculate the mean power for each vowel with another awk script, using the result as a table, and finally process the fileid and actual vowel power information to generate a power factor, by averaging the ratio of each vowel's actual power to the mean power for that vowel. You may wish to modify the power further after this if it is still considered too low or high.

Note that power normalization is intended to remove artifacts caused by differences in the recording environment, i.e. the person moved from the microphone, the levels were changed, etc.; it should not modify the intrinsic power differences between the phones themselves. The above technique tries to preserve the intrinsic power (that's why we take the average over all vowels in a nonsense word), though you should listen to the results and make the ultimate decision yourself.

If all has been recorded properly, no individual power modification should be necessary.

Also, if you wish to generate a database at a different sample rate from that at which it was recorded, this is the time to resample. For example an 8kHz or 11.025kHz database will be smaller than a 16kHz one. If the eventual voice is to be played over the telephone, for example, there is little point in generating anything but 8kHz, and it will also be faster to synthesize 8kHz utterances than 16kHz ones.

The number of LPC coefficients used to represent each pitch period can be changed depending on the sample rate you choose. I have heard the number should be

(sample_rate/1000)+2

But that should only be taken as a rough guide, though a higher sample rate deserves a greater number of coefficients.
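
As a worked example of that rule of thumb (evaluated here in Festival's Scheme, though the arithmetic is the point):

(+ 2 (/ 16000 1000))   ;; => 18 coefficients for a 16kHz database
(+ 2 (/ 11025 1000))   ;; => about 13 for 11.025kHz
(+ 2 (/ 8000 1000))    ;; => 10 for 8kHz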

5.7 Defining a diphone voice

The easiest way to define a voice is to start from the skeleton scheme files distributed. For English voices see section 5.10 US/UK English Walkthrough, and for non-English voices see section 12 Full example.

Although in many cases you'll want to modify these files (sometimes quite substantially), the basic skeleton files will give you a good grounding, and they follow some basic conventions of voice files that will make it easier to integrate your voice into the Festival system.

5.8 Checking and correcting diphones

Once you have the basic diphone database working it is worthwhile systematically testing it, as it is common to have mistakes. These may be mislabelling or mispronunciation of the phones themselves. Two strategies are possible for testing, both of which have their advantages. The first is a simple exhaustive synthesis of all diphones. Ideally the diphone prompts are exactly the set of utterances that test each and every diphone; using the SayPhones function you can synthesize and listen to each prompt. For a first pass it may even be useful to synthesize each nonsense word without listening, as some of the problems (missing files, missing diphones, badly extracted pitchmarks) will show up without you having to listen at all.
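
A minimal sketch of such a first pass is given below; resynth_all and strip_boundaries are hypothetical names, the diphone list file format is that described in section 5.2, and depending on how SayPhones is defined each word may also be played as it is synthesized:

(define (strip_boundaries phones)
  "Remove explicit syllable boundary markers (-) from a phone list."
  (apply append
         (mapcar (lambda (p) (if (string-equal p "-") nil (list p)))
                 phones)))

(define (resynth_all listfile outdir)
  "Synthesize every nonsense word in listfile and save the waveform in
outdir.  Problems such as missing diphones show up as errors or warnings
during synthesis, before any listening is done."
  (mapcar
   (lambda (entry)
     (let ((fileid (car entry))
           (phones (strip_boundaries (car (cdr (cdr entry))))))  ;; third field
       (format t "synthesizing %s\n" fileid)
       (utt.save.wave (SayPhones phones)
                      (format nil "%s/%s.wav" outdir fileid))))
   (load listfile t)))   ;; load with a second argument of t returns the
                         ;; file's s-expressions unevaluated

;; e.g. (resynth_all "etc/kaldiph.list" "test-wav")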

When a problem occurs, trace back why: check the entry in the diphone index, then check the label for the nonsense word, then check how that label matches the actual waveform file itself (display the waveform with the label file and spectrogram to see if the label is correct).

Listing all the problems that could occur is impossible; what you need to do is break down the problem and find out where it might be occurring. If you just get apparent garbage being synthesized, take a look at the synthesized waveform

(set! utt1 (SayPhones '(pau hh ah l ow pau)))
(utt.save.wave utt1 "hello.wav")

Is it garbage, or can you recognize any part of it? It could be a byte swap problem or a format problem with your files. Can your nonsense word files be played and displayed as is? Can your LPC residual files be played and displayed? Residual files should look like very low powered waveform files and sound very buzzy when played, but basically recognizable if you know what is being said (sort of like Kenny from South Park).

If you can recognize some of what is being said but it is fairly uniformly garbled it is possible your pitchmarks are not being aligned properly. Use some display mechanism to see where the pitchmarks are. These should be aligned (during voiced speech) with the peaks in the signal.

If all is well except that some parts of the signal are bad or overflowed, then check the diphones where the errors occur.

There are a number of solutions to problems that may save you some time; for the most part they should be considered cheating, but they may save having to re-record, something that you will probably want to avoid if at all possible.

Note that some phones are very similar; in particular the left half of most stops is indistinguishable, as it consists mostly of silence. Thus if you find you didn't get a good <something>-p diphone, you can easily make it use the <something>-b diphone instead. You can do this by hand editing the diphone index file accordingly.
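
For example, if the aa-p diphone turned out to be unusable, its index line (the fileids and times here are invented)

aa-p kd1_123 0.255 0.310 0.365

could simply be replaced by a copy of the aa-b line with its name changed to aa-p:

aa-p kd1_122 0.250 0.305 0.365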

The linguists among you may not find that acceptable, but you can go further: the burst part of /p/ and /b/ isn't that different when it comes down to it, and if it is just one or two diphones you can simply map those too. Considering that problems are often in one or two badly articulated phones, replacing a /p/ with a /b/ (or similar) in one or two diphones may not be that bad.

Once the problems become systematic over a number of phones, however, re-recording them should be considered. Remember that if you do have to re-record you want to have as similar an environment as possible, which is not always easy. Eventually you may need to re-record the whole database again.

Recording diphone databases is not an exact science. Although we have a fair amount of experience in recording these databases, they never completely go as planned. Some apparently minor problem often occurs: noise on the channel, slightly different power over two sessions. Even when everything seems the same and we can't identify any difference between two recording environments, we have found that some voices are better than others for building diphone databases. We can't immediately say why; we discussed some of these issues above in selecting a speaker, but there are still other parameters we can't identify, so don't be disheartened when your database isn't as good as you hoped, ours sometimes fail too.

5.9 Diphone check list

This section contains a quick check list of the processes required to construct a working diphone database. Each stage is discussed in detail above.

Define the diphone list (the nonsense words) for your phone set.
Synthesize the prompts and their label files.
Record the speaker, and split the recordings into one file per nonsense word.
Label the nonsense words, by hand or with the aligner, and hand correct where necessary.
Build the diphone index.
Extract the pitchmarks, from the EGG signal or from the waveforms.
Build the pitch synchronous LPC parameters and residuals.
Define the voice, then test, check and correct the diphones.

5.10 US/UK English Walkthrough

When building a new diphone based voice for a supported language, such as English, the upper parts of the system can mostly be taken from existing voices, thus making the building task much more mechanistic. Of course things can still go wrong and it's worth checking everything at each stage. This section gives a basic walkthrough for building a new US English voice; support for building UK English (southern, RP dialect) voices is also provided this way. For building non-US/UK synthesizers see section 12 Full example for a similar, but less language specific, walkthrough.

Recording a whole diphone set usually takes a number of hours, if everything goes to plan. Construction of the voice after recording will take another couple of hours, though much of this is CPU bound. Hand correction may then take another few hours (depending on the quality). Thus if all goes well it is possible to construct a new voice in a day's work, though usually something goes wrong and it takes longer.

For those of you who have ignored the rest of this document and are just hoping to get by by reading this: good luck. It may be possible, but considering the time you'll need to invest to build a voice, becoming familiar with the comments, at least in the rest of this chapter, may well be worth the time invested.

The tasks you will need to do are described in turn below.

As with all parts of `festvox', you must set the following environment variables to where you have installed versions of the Edinburgh Speech Tools and the festvox distribution

export ESTDIR=/home/awb/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox

The next stage is to select a directory in which to build the voice. You will need on the order of 500M of disk space to do this; it could be done in less, but it's better to have enough to start with. Make a new directory and cd into it

mkdir ~/data/cmu_us_awb_diphone
cd ~/data/cmu_us_awb_diphone

By convention the directory is named for the institution, the language (`us' English) and the speaker (`awb', who actually speaks with a Scottish accent). Although it can be fixed later the directory name is used when festival searches for available voices so it is good to follow this convention.

Build the basic directory structure

$FESTVOXDIR/src/diphones/setup_diphone cmu us awb

The arguments to `setup_diphone' are the institution building the voice, the language, and the name of the speaker. If you don't have an institution we recommend you use `net'. There is an ISO standard for language names, though unfortunately it doesn't allow a distinction between US and UK English; in general we recommend you use the two letter form, though for US English use `us' and for UK English use `uk'. The speaker name may or may not be their actual name.

The setup script builds the basic directory structure and copies in various skeleton files. For languages `us' and `uk' it copies in files with much of the details filled in for those languages, for other languages the skeleton files are much more skeletal.

For constructing a `us' voice you must have the following installed in your version of festival

festvox_kallpc16k
festlex_POSLEX
festlex_CMU

And for a UK voice you need

festvox_rablpc16k
festlex_POSLEX
festlex_OALD

At run-time the two appropriate festlex packages (POSLEX + dialect specific lexicon) will be required but not the existing kal/rab voices.

To generate the nonsense word list

festival -b bin/diphlist.scm bin/us_schema.scm \
     "(diphone-gen-schema \"awb\" \"etc/awbdiph.list\")"

Then to synthesize the prompts

festival -b bin/diphlist.scm bin/us_schema.scm \
      "(diphone-gen-waves \"prompt-wav\" \"prompt-lab\" \"etc/awbdiph.list\")"

Now record the prompts. Care should be taken to set up the recording environment so that it is as good as possible. Note all power levels so that if more than one session is required you can continue and still get the same recording quality. Given the length of the US English list it's unlikely a person can say all of these in one session, so ensuring the environment can be duplicated is important.

bin/prompt_them etc/awbdiph.list

Note that a third argument can be given to state which nonsense word to begin prompting from. Thus if you have already recorded the first 100 you can continue with

bin/prompt_them etc/awbdiph.list 101

See section 15.1 US phoneset for notes on pronunciation (or section 15.2 UK phoneset for the UK version).

The recorded prompts can then be labelled by

bin/make_labs prompt-wav/*.wav

And the diphone index may be built by

bin/make_diph_index etc/awbdiph.list dic/awbdiph.est

If no EGG signal has been collected you can extract the pitchmarks by

bin/make_pm_wave wav/*.wav

Then build the pitch synchronous LPC coefficients

bin/make_lpc wav/*.wav

Now the database is ready for its initial tests.

festival festvox/cmu_us_awb_diphone.scm "(voice_cmu_us_awb_diphone)"

Test its basic functionality with

festival> (SayPhones '(pau hh ax l ow pau))

festival> (intro)

As the autolabelling is unlikely to work completely you should listen to a number of examples to find out what diphones have gone wrong.

Finally, once you have corrected the errors, you can build a final voice suitable for distribution. First you need to create a group file which contains only the subparts of the spoken words which contain the diphones.

festival festvox/cmu_us_awb_diphone.scm "(voice_cmu_us_awb_diphone)"
...
festival> (us_make_group_file "group/awblpc.group" nil)
...

The us_ in the function name stands for UniSyn (the unit concatenation subsystem in Festival) and has nothing to do with US English.

To test this, edit `festvox/cmu_us_awb_diphone.scm' and change the choice of database used from separate to grouped. This is done by commenting out the line (around line 81)

(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_sep))

and uncommenting the line (around line 84)

(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_group))

The next stage is to integrate this new voice so that festival may find it automatically. To do this you should add a symbolic link from the voice directory of Festival's English voices to the directory containing the new voice. First cd to festival's voice directory (this will vary depending on where you installed festival)

cd /home/awb/projects/1.4.1/festival/lib/voices/english/

add a symbolic link back to where your voice was built

ln -s /home/awb/data/cmu_us_awb_diphone

Now this new voice will be available for anyone running that version of festival (started from any directory)

festival
...
festival> (voice_cmu_us_awb_diphone)
...
festival> (intro)
...

The final stage is to generate a distribution file so the voice may be installed on others' festival installations. Before you do this you must add a file `COPYING' to the directory you built the diphone database in. This should state the terms and conditions under which people may use, distribute and modify the voice.

Generate the distribution tarfile in the directory above the festival installation (the one containing the `festival/' and `speech_tools/' directories).

cd /home/awb/projects/1.4.1/
tar zcvf festvox_cmu_us_awb_lpc.tar.gz \
  festival/lib/voices/english/cmu_us_awb_diphone/festvox/*.scm \
  festival/lib/voices/english/cmu_us_awb_diphone/COPYING \
  festival/lib/voices/english/cmu_us_awb_diphone/group/awblpc.group

The complete files from building an example US voice based on the KAL recordings are available at http://www.festvox.org/examples/cmu_us_kal_diphone/.

