

6 Unit selection databases

This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival. This is still very much an on-going research question and we are still adding new techniques as well as improving existing ones, so the techniques described here are not as mature as those described in the previous diphone chapter.

By "unit selection" we actually mean the selection of some unit of speech which may be anything from whole phrase down to diphone (or even smaller). Technically diphone selection is a simple case of this. However typically what we mean is unlike diphone selection, in unit selection there is more than one example of the unit and some mechanism is used to select between them at run-time.

ATR's CHATR hunt96 system and earlier work at that lab nuutalk92 is an excellent example of one particular method for selecting between multiple examples of a phone within a database. For a discussion of why a more generalized inventory of units is desired see campbell96, though we will reiterate some of the points here. With diphones a fixed view of the possible space of speech units has been made, which we all know is not ideal. There are articulatory effects which go over more than one phone, e.g. /s/ can take on artifacts of the roundness of the following vowel even over an intermediate stop, e.g. `spout' vs `spit'. But it's not just obvious segmental effects that cause variation in pronunciation: syllable position, and word/phrase initial and final position, typically have a different level of articulation from segments taken from word internal position. Stressing and accents also cause differences. Rather than try to explicitly list the desired inventory of all these phenomena and then have to record all of them, a potential alternative is to take a natural distribution of speech and (semi-)automatically find the distinctions that actually exist rather than predefining them.

The theory is obvious, but the design of such systems, finding the appropriate selection criteria and weighting the relative costs of candidates, is a non-trivial problem. However, techniques like this often produce very high quality, very natural sounding synthesis. They can also produce some very bad synthesis, when the database has unexpected holes and/or the selection costs fail.

Two forms of unit selection will be discussed here, not because we feel they are the best but simply because they are the ones actually implemented by us and hence can be distributed. These should still be considered research systems. Unless you are specifically interested in, or have the expertise for, developing new selection techniques it is not recommended that you try these; if you need a working voice within a month and can't afford to miss that deadline then the diphone option is safe, well tried and stable.

6.1 Cluster unit selection

This is a reimplementation of the techniques described in black97c. The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units based on the (non-acoustic) information available at synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher level features such as stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster.

In some sense this work builds on the results of both the CHATR selection algorithm hunt96 and the work of donovan95, but differs in some important and significant ways. Specifically, in contrast to hunt96 this cluster algorithm pre-builds CART trees to select the appropriate cluster of candidate phones, thus avoiding the computationally expensive function of calculating target costs (through linear regression) at selection time. Secondly, because the clusters are built directly from the acoustic scores and target features, a target estimation function isn't required, removing the need to calculate weights for each feature. This cluster method differs from the clustering method in donovan95 in that it can use more generalized features in clustering and uses a different acoustic cost function (Donovan uses HMMs); also his work is based on sub-phonetic units (HMM states). Also, Donovan selects one candidate while here we select a group of candidates and find the best overall selection by finding the best path through each set of candidates for each target phone, in a manner similar to hunt96 and iwahashi93 before.

The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows.

6.1.1 Collecting databases for unit selection

Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis.

As with diphone databases, the more cleanly and carefully the speech is recorded the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or more importantly will sound best when it is reading news stories).

Although it is too early to make definitive statements about what size and type of data is best for unit selection, we do have some rough guides. A TIMIT-like database of 460 phonetically balanced sentences (around 14,000 phones) is not an unreasonable first choice. If the text has not been specifically selected for phonetic coverage a larger database is probably required; for example the Boston University Radio News Corpus speaker f2b ostendorf95 has been used relatively successfully. Of course all this depends on what use you wish to make of the synthesizer. If it's to be used in more restrictive environments (as is often the case) tailoring the database for the task is a very good idea. If you are going to be reading a lot of telephone numbers, having a significant number of examples of read numbers will make synthesis of numbers sound much better.

The database used as an example here is a TIMIT 460 sentence database read by an American male speaker.

Again the notes about recording the database apply, though it will sometimes be the case that the database is already recorded and beyond your control, in that case you will always have something legitimate to blame for poor quality synthesis.

6.1.2 Preliminaries

Throughout our discussion we will assume the following database layout. It is highly recommended that you follow this format, otherwise scripts and examples will fail. There are many ways to organize databases and many such choices are arbitrary; here is our "arbitrary" layout.

The basic database directory should contain the following directories

bin/
Any database specific scripts for processing. Typically this first contains a copy of standard scripts that are then customized when necessary to the particular database.
wav/
The waveform files. These should be headered, one utterance per file, with a standard name convention. They should have the extension `.wav' and the fileid consistent with all other files throughout the database (labels, utterances, pitch marks etc).
lab/
The segmental labels. These are usually the master label files; they may contain more information than the labels used by festival, which will be in `festival/relations/Segment/'.
wrd/
Word label files.
lar/
The EGG files (laryngograph files) if collected.
pm/
Pitchmark files as generated from the lar files or from the signal directly.
festival/
Festival specific label files.
`festival/relations/'
The processed labelled files for building Festival utterances, held in directories whose name reflects the relation they represent: `Segment/', `Word/', `Syllable/' etc.
`festival/utts/'
The utterance files as generated from the `festival/relations/' label files.

Other directories will be created for various processing reasons.
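
If you are setting a database up from scratch, the layout above can be created with a few shell commands. The following is just a convenience sketch; create only the relation directories you will actually use:

mkdir -p bin wav lab wrd lar pm
mkdir -p festival/relations/Segment festival/relations/Word festival/relations/Syllable
mkdir -p festival/utts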

6.1.3 Building utterance structures for unit selection

In order to make access well defined you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for: segments, syllables, words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labelled, but in most cases that's impractical. There are ways to automatically obtain most of these labels, but you should be aware of the inherent errors in the labelling system you use (including labelling systems that involve human labellers). Note that when a unit selection method that fundamentally uses segment boundaries is to be used, its quality is going to be ultimately determined by the quality of the segmental labels in the databases.

For the unit selection algorithm described below the segmental labels should use the same phoneset as the actual synthesis voice. However a more detailed phonetic labelling may be more useful (e.g. marking closures in stops), mapping that information back to the phone labels before actual use. Autoaligned databases typically aren't accurate enough for use in unit selection. Most autoaligners are built using speech recognition technology where actual phone boundaries are not the primary measure of success. General speech recognition systems primarily measure words correct (or more usefully semantically correct) and do not require phone boundaries to be accurate. If the database is to be used for unit selection it is very important that the phone boundaries are accurate. Having said this though, we have successfully used the aligner described in the diphone chapter above to label general utterances where we knew which phone string we were looking for; using such an aligner may be a useful first pass, but the result should always be checked by hand.

It has been suggested that aligning techniques and unit selection training techniques can be used to judge the accuracy of the labels and basically exclude any segments that appear to fall outside the typical range for the segment type. Thus it is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labelling. This is the desire of researchers in the field, but we are some way from that, and the easiest way at present to improve the quality of unit selection algorithms is to ensure that segmental labelling is as accurate as possible. Once we have a better handle on selection techniques themselves it will then be possible to start experimenting with noisy labelling.

However it should be added that this unit selection technique (and many others) supports what is termed "optimal coupling" (conkie96) where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently robust to boundary labelling errors of at least a few tens of milliseconds.

For the cluster method defined here it is best to construct more than simply segments, durations and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing allow a much richer set of features to be used for clusters. See section 4.4 Utterance building for a more general discussion of how to build utterance structures for a database.
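
Once utterance structures have been built it is worth spot checking a few of them from a Festival prompt to confirm they contain the relations the cluster code will ask for. For example (assuming a fileid `kdt_001', one of the KED TIMIT files used in the examples later in this chapter):

festival> (set! utt1 (utt.load nil "festival/utts/kdt_001.utt"))
festival> (utt.relationnames utt1)

The returned list should include at least Segment, Syllable and Word, plus whatever intonation and phrasing relations your labelling provides.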

6.1.4 Making cepstrum parameter files

In order to cluster similar units in a database we build an acoustic representation of them. This is also still a research issue, but in the example here we will use Mel cepstrum plus delta Mel cepstrum plus F0, though this is open for change (and can easily be changed).

Here is an example script which will generate these parameters for a database; it is included in `festvox/src/unitsel/make_mcep'. The main loop here generates the cepstrum parameters and the F0 and then combines them into a single file with F0 as parameter 0. This format is assumed for the later acoustic measures, though the number of cepstrum/delta cepstrum parameters may be changed if desired.

ESTDIR=/usr/awb/projects/speech_tools/main
PDA_PARAMS="-fmax 180 -fmin 80"
SIG2FV=$ESTDIR/sig2fv
SIG2FVPARAMS='-coefs melcep -delta melcep -melcep_order 12
          -fbank_order 24 -shift 0.01 -factor 2.5 -preemph 0.97'

for i in $*
do
  fname=`basename $i .wav`
  echo $fname
  $SIG2FV $SIG2FVPARAMS -otype ascii $i -o /tmp/tmp.$$.ascii
  if [ ! -f festival/f0/$fname.f0 ]
  then
     $ESTDIR/pda -s 0.01 -o festival/f0/$fname.f0 -otype ascii \
               $PDA_PARAMS wav/$fname.wav
  fi
  $ESTDIR/ch_track -pc first -itype ascii -s 0.010 -otype htk \
     festival/f0/$fname.f0 /tmp/tmp.$$.ascii \
     -o festival/coeffs/$fname.dcoeffs
  rm /tmp/tmp.$$.*
done
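
Assuming the script above is saved as `bin/make_mcep' and made executable, and that the output directories exist, it can be run over the whole database with something like the following (the paths are just the conventions used in this chapter):

mkdir -p festival/f0 festival/coeffs
bin/make_mcep wav/*.wav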

The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure based on this. We do not pretend that this part is particularly neat in the system but it does work. When pitch synchronous parameters are built the clunits module will automatically put the local F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The script in `festvox/src/general/make_lpc' can be used to generate the parameters, assuming you have already generated pitch marks.

Note that a secondary advantage of using LPC coefficients is that they are required anyway for LPC resynthesis, so less information about the database is required at run time. We have not yet tried pitch synchronous Mel frequency cepstrum coefficients but that should be tried. Also a more general duration/number-of-pitch-periods match algorithm is worth defining.

6.1.5 Building the clusters

Cluster building is mostly automatic. Of course you need the clunits module compiled into your version of Festival. Version 1.3.1 or later is required; the version of clunits in 1.3.0 is buggy and incomplete and will not work. To compile in clunits, add

ALSO_INCLUDE += clunits

to the end of your `festival/config/config' file, and recompile. To check if an installation already has support for clunits check the value of the variable *modules*.
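
A quick way to make that check is to start Festival and print the variable; if support has been compiled in, clunits will appear in the list of module names:

festival> *modules*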

The file `festival/src/modules/clunits/acost.scm' contains the basic code to build a cluster model for a database that has utterance structures and acoustic parameters. The function do_all will build the distance tables, dump the features and build the cluster trees. The many parameters are set for the particular database (and instance of cluster building) through the Lisp variable clunits_params. An example is given in `festival/src/modules/clunits/ked_params.scm' for the KED TIMIT database.

The function do_all runs through all the steps, but as some of the steps are relatively time consuming there may be times when each of the steps needs to be run individually. We will go through each step and at that time explain which parameters affect the substep.
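
In practice the whole build can be run from a Festival prompt along the following lines. This is only a sketch, assuming you have made your own parameter file modelled on `ked_params.scm' (here hypothetically called `myvoice_params.scm') with clunits_params set for your database:

festival> (load "festival/src/modules/clunits/acost.scm")
festival> (load "myvoice_params.scm")
festival> (do_all)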

The first stage is to load in all the utterances in the database, sort them into segment types and name them with individual names (as <type>_<num>). This first stage is required for all other stages, so even if you are not running do_all you still need to run this stage first. This is done by the calls

    (format t "Loading utterances and sorting types\n")
    (set! utterances (acost:db_utts_load dt_params))
    (set! unittypes (acost:find_same_types utterances))
    (acost:name_units unittypes)

Alternatively, the function do_init will do the same thing.

This uses the following parameters

name
A name for this database.
db_dir
The full pathname of the database.
utts_dir
The directory containing the utterances.
utts_ext
The file extension for the utterance files.
files
The list of file ids in the database.

For example for the KED example these parameters are

       (name 'ked_timit)
       (db_dir "/usr/awb/data/timit/ked/")
       (utts_dir "festival/utts/")
       (utts_ext ".utt")
       (files ("kdt_001" "kdt_002" "kdt_003" ... ))

The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each segment of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time as the clustering will require these distances many times.

This is done by the following two function calls

    (format t "Loading coefficients\n")
    (acost:utts_load_coeffs utterances)
    (format t "Building distance tables\n")
    (acost:build_disttabs unittypes clunits_params)

The following parameters influence the behaviour.

coeffs_dir
The directory (from db_dir) that contains the acoustic coefficients as generated by the script `make_mcep'.
coeffs_ext
The file extension for the coefficient files.
get_std_per_unit
Takes the value t or nil. If t, the parameters for each segment type are normalized using the means and standard deviations for that class. Thus a Mahalanobis-style (normalized Euclidean) distance is found between units rather than a simple Euclidean distance.
ac_left_context <float>
The amount of the previous unit to be included in the distance: 1.0 means all, 0.0 means none. This parameter may be used to make the acoustic distance sensitive to the previous acoustic context.
ac_duration_penality <float>
The penalty factor for duration mismatch between units.
ac_weights (<float> <float> ...)
The weights for each parameter in the coefficient files used while finding the acoustic distance between segments. There must be the same number of weights as there are parameters in the coefficient files.

An example from KED is

       (coeffs_dir "festival/coeffs/")
       (coeffs_ext ".dcoeffs")
       (dur_pen_weight 0.1)
       (get_stds_per_unit t)
       (ac_left_context 0.8)
       (ac_weights
         (1.0
           0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
           2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0))

The next stage is to dump the features that will be used to index the clusters. Remember the clusters are defined with respect to the acoustic distance between each unit in the cluster, but they are indexed by these features. These features are those which will be available at text-to-speech time when no acoustic information is available. Thus they include things like phonetic and prosodic context rather than spectral information. The named features may (and probably should) be over-general, allowing the decision tree building program wagon to decide which of these features actually do have an acoustic distinction in the units.

The function to dump the features is

    (format t "Dumping features for clustering\n")
    (acost:dump_features unittypes utterances clunits_params)

The parameters which affect this function are

feats_dir
The directory where the features will be saved (by segment type).
feats
The list of features to be dumped. These are standard festival feature names with respect to the Segment relation.

For our KED example these values are

       (feats_dir "festival/feats/")
       (feats 
             (occurid
               p.name p.ph_vc p.ph_ctype 
                   p.ph_vheight p.ph_vlng 
                   p.ph_vfront  p.ph_vrnd 
                   p.ph_cplace  p.ph_cvox    
               n.name n.ph_vc n.ph_ctype 
                   n.ph_vheight n.ph_vlng 
                   n.ph_vfront  n.ph_vrnd 
                   n.ph_cplace  n.ph_cvox
              segment_duration 
              seg_pitch p.seg_pitch n.seg_pitch
              R:SylStructure.parent.stress 
              seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
              R:SylStructure.parent.accented 
              pos_in_syl 
              syl_initial
              syl_final
              R:SylStructure.parent.syl_break 
              R:SylStructure.parent.R:Syllable.p.syl_break
              pp.name pp.ph_vc pp.ph_ctype 
                  pp.ph_vheight pp.ph_vlng 
                  pp.ph_vfront  pp.ph_vrnd 
                  pp.ph_cplace pp.ph_cvox))

Now that we have the acoustic distances and the feature descriptions of each unit, the next stage is to find a relationship between those features and the acoustic distances. This we do using the CART tree builder wagon. It will find questions about the features that best minimize the acoustic distance between the units in each class. wagon has many options, many of which are appropriate to this task. Note that this learning task is closed: we are trying to classify all the units in the database, so there is no test set as such. However, in synthesis there will be desired units whose feature vectors didn't exist in the training set.

The clusters are built by the following function

    (format t "Building cluster trees\n")
    (acost:find_clusters (mapcar car unittypes) clunits_params)

The parameters that affect the tree building process are

tree_dir
the directory where the decision tree for each segment type will be saved
wagon_field_desc <file>
A filename of a wagon field descriptor file. This is a standard field description (field name plus field type) that is required by wagon. This building process doesn't include an explicit building of this file; you must create it yourself. The script `make_wagon_desc' can aid this, given the feature files in `festival/feats' and a file containing the feature names (one per line) as listed in the feats parameter above.
wagon_progname <file>
The pathname for the `wagon' CART building program. This is a string and may also include any extra parameters you wish to give to `wagon' e.g. -stepwise.
wagon_cluster_size <int>
The minimum cluster size (the wagon -stop value).
prune_reduce <int>
The number of elements to remove from each cluster in pruning. This removes the units in the cluster that are furthest from the centre.

Note that as the distance tables can be large there is an alternative function that does both the distance table and clustering in one, deleting the distance table immediately after use, so you only need enough disk space for the largest number of phones in any type. To do this, use

    (acost:disttabs_and_clusters unittypes clunits_params)

in place of the calls to acost:build_disttabs and acost:find_clusters.

In our KED example these have the values

       (trees_dir "festival/trees/")
       (wagon_field_desc "festival/clunits/all.desc")
       (wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
       (wagon_cluster_size 10)
       (prune_reduce 0)

The final stage in building a cluster model is to collect the generated trees into a single file and dump the unit catalogue, i.e. the list of unit names and their files and positions in them. This is done by the lisp functions

    (acost:collect_trees (mapcar car unittypes) clunits_params)
    (format t "Saving unit catalogue\n")
    (acost:save_catalogue utterances clunits_params)

The only parameter that affects this is

catalogue_dir
The directory where the catalogue will be saved (the name parameter is used to name the file).

In the KED example this is

       (catalogue_dir "festival/clunits/")

There are a number of parameters that are specified with a cluster voice. These are related to the run time aspects of the cluster model. They are listed below, with an illustrative sketch of typical settings after the list.

join_weights
These are a set of weights, in the same format as ac_weights, that are used in optimal coupling to find the best join point between two candidate units. This is different from ac_weights as it is likely different values are desired, particularly increasing the weight for F0 (column 0).
continuity_weight <float>
The factor by which the join cost is weighted relative to the target cost. This is probably not very relevant given that the target cost is merely the position from the cluster center.
optimal_coupling <int>
If 1 this uses optimal coupling and searches the cepstrum vectors at each join point to find the best possible join point. This is computationally expensive (as well as having to load in lots of cepstrum files), but does give better results.
extend_selections <int>
If 1, then the selected cluster will be extended to include any unit from the cluster of the previous segment's candidate units that has the correct phone type. This is experimental but has shown its worth and hence is recommended. This means that instead of selecting just single units, selection is effectively selecting the beginnings of multiple segment units. This option encourages far longer units.
pm_coeffs_dir <file>
The directory (from db_dir) where the pitchmarks are
pm_coeffs_ext <file>
The file extension for the pitchmark files.
sig_dir <file>
Directory containing the waveforms of the units (or residuals if residual LPC is being used, PCM waveforms if PSOLA is being used).
sig_ext <file>
File extension for waveforms/residuals.
join_method <method>
Specify the method used for joining the selected units. Currently it supports simple, a very naive joining mechanism, and windowed, where the ends of the units are windowed using a Hamming window then overlapped (no prosodic modification takes place though). The other two possible values for this feature are none, which does nothing, and modified_lpc, which uses the standard UniSyn module to modify the selected units to match the targets.
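
As a concrete illustration, such run time settings might appear in the voice's cluster parameter list as follows. The values are purely illustrative and not tuned recommendations; the join_weights simply mirror the shape of the ac_weights example above, with a heavier weight on F0 in column 0:

       (join_weights
         (10.0
           0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
           0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
       (continuity_weight 5)
       (optimal_coupling 1)
       (extend_selections 1)
       (sig_dir "wav/")
       (sig_ext ".wav")
       (join_method windowed)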

6.1.6 Defining a voice

This cluster method is just a waveform synthesizer; it still requires text analysis and prosodic components. The only restriction is that the front end must generate the same sort of utterance structures as in your database. This is because features from utterances of that type were used to train the selection trees. That is, you can't use a front end that uses different relation names and features.

Here we simply use the same front end as ked_diphone as it is basically the same speaker.

6.1.7 Cluster Example

A simple example of building a cluster unit selection synthesizer is given in section 7 Limited domain synthesis. In that example the features used in selection have been reduced and a few other simplifying assumptions have been made, but the underlying structure is the same. That is a good example to start from; then change the parameters, as fully described above, to improve the selection criteria.

6.2 Diphones from general databases

As touched on above, the choice of an inventory of units can be viewed as a line from a small inventory of phones, through diphones and triphones, to arbitrary units, though the direction you come from influences the selection of the units from the database. CHATR campbell96 lies firmly at the "arbitrary units" end of the spectrum. Although it can exclude bad units from its inventory it is very much an `everything minus some' view of the world. Microsoft's Whistler huang97, on the other hand, starts off with a general database but selects typical units from it. Thus its inventory is substantially smaller than the full general database the units are extracted from. At the other end of the spectrum we have the fixed pre-specified inventory, like the diphone synthesis described in the previous chapter.

In this section we'll give some examples of moving along the line from the fixed pre-specified inventory towards the more general inventories, though these techniques still have a strong component of prespecification.

Firstly let us assume you have a general database that is labelled with utterances as described above. We can extract a standard diphone database from this general database; however, unless the database was specifically designed, a general database is unlikely to have full diphone coverage. Even when phonetically rich databases are used, such as TIMIT, there are likely to be very few vowel-vowel diphones as they are comparatively rare. But as these diphones are rare we may be able to do without them, and hence it is at least an interesting exercise to extract as complete a diphone index as possible from a general database.

The simplest method is to linearly search for all phone-phone pairs in the phone set through all utterances, simply taking the first example. Some sample code is given in `src/diphone/make_diphs_index.scm'. The basic idea is to load in all the utterances in a database, and index each segment by its phone name and succeeding phone name. Then various selection techniques can be used to select from the multiple candidates of each diphone (or you can split the indexing further). After selection a diphone index file can be saved.

The utterances to load are identified by a list of fileids. For example, if the list of fileids (without parentheses) is in the file `etc/fileids', the following will build a diphone index.

festival .../make_diphs_utts.scm
...
festival> (set! fileids (load "etc/fileids" t))
...
festival> (make_diphone_index fileids "dic/f2bdiph.est")

Note that as this diphone index will contain a number of holes you will need to either augment it with `similar' diphones or process your diphone selections through UniSyn_module_hooks as described in the previous chapter.

As you complicate the selection, and increase the number of diphones you use from the database, you will need to complicate the names used to identify the diphones themselves. The convention of using underscores for syllable internal consonant clusters and dollars for syllable initial consonants can be followed, but you will need to go further if you wish to start introducing new features such as phrase finality and stress. Eventually going to a generalized naming scheme (type and number), as used by the cluster selection technique described above, will prove worthwhile. Also using CART trees, though hand written and fully deterministic (one candidate at the leaves), will be a reasonable way to select between hand stipulated alternatives with reasonable backoff strategies.

Another potential direction is to use the acoustic costs used in the clustering methods described in the previous section. These can be used to identify what the most typical units in a cluster are (the mean distances from all other units are given in the leaves). Pruning these trees until each cluster contains only a single example should help to improve synthesis, in that variation in the features in the "diphone" index will then be determined by the features specified in the cluster training algorithm. Of course, as you limit the number of distinct unit types, more prosodic modification will be required by your signal processing algorithm, which in turn requires that you have good pitch marks.

If you already have an existing database but don't wish to go to full unit selection, such techniques are probably quite feasible and worth further investigation.

