This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival. This is still very much an on-going research area; we are still adding new techniques and improving existing ones, so the techniques described here are not as mature as those described in the previous diphone chapter.
By "unit selection" we actually mean the selection of some unit of speech, which may be anything from a whole phrase down to a diphone (or even smaller). Technically diphone selection is a simple case of this. However, typically what we mean is that, unlike diphone selection, in unit selection there is more than one example of each unit and some mechanism is used to select between them at run-time.
ATR's CHATR [hunt96] system and earlier work at that lab [nuutalk92] are an excellent example of one particular method for selecting between multiple examples of a phone within a database. For a discussion of why a more generalized inventory of units is desired see [campbell96], though we will reiterate some of the points here. With diphones a fixed view of the possible space of speech units has been made, which we all know is not ideal. There are articulatory effects which extend over more than one phone; e.g. /s/ can take on artifacts of the roundness of the following vowel even over an intermediate stop, e.g. "spout" vs "spit". But it's not just obvious segmental effects that cause variation in pronunciation: syllable position, and word/phrase initial and final position, typically have a different level of articulation from segments taken from word-internal position. Stressing and accents also cause differences. Rather than try to explicitly list the desired inventory of all these phenomena and then have to record all of them, a potential alternative is to take a natural distribution of speech and (semi-)automatically find the distinctions that actually exist rather than predefining them.
The theory is obvious, but the design of such systems, and finding the appropriate selection criteria and weightings of the costs of relative candidates, is a non-trivial problem. Techniques like this often produce very high quality, very natural sounding synthesis. However they can also produce some very bad synthesis, when the database has unexpected holes and/or the selection costs fail.
Two forms of unit selection will be discussed here, not because we feel they are the best but simply because they are the ones actually implemented by us and hence can be distributed. These should still be considered research systems. Unless you are specifically interested in, or have the expertise for, developing new selection techniques it is not recommended that you try these: if you need a working voice within a month and can't afford to miss that deadline then the diphone option is safe, well tried and stable. If you need higher quality and know something about what you need to say, then we recommend the limited domain techniques discussed in the following chapter. Limited domain synthesis offers the high quality of unit selection but avoids much (all?) of the bad selections.
This is a reimplementation of the techniques described in [black97c]. The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units based on the (non-acoustic) information available at synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher level features such as stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster.
In some sense this work builds on the results of both the CHATR selection algorithm [hunt96] and the work of [donovan95], but differs in some important and significant ways. Specifically, in contrast to [hunt96], this cluster algorithm pre-builds CART trees to select the appropriate cluster of candidate phones, thus avoiding the computationally expensive function of calculating target costs (through linear regression) at selection time. Secondly, because the clusters are built directly from the acoustic scores and target features, a target estimation function isn't required, removing the need to calculate weights for each feature. This cluster method differs from the clustering method in [donovan95] in that it can use more generalized features in clustering and uses a different acoustic cost function (Donovan uses HMMs); also his work is based on sub-phonetic units (HMM states). Also Donovan selects one candidate, while here we select a group of candidates and find the best overall selection by finding the best path through each set of candidates for each target phone, in a manner similar to [hunt96] and [iwahashi93] before.
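The search just described, a group of candidates per target phone joined by the best overall path, can be sketched outside Festival. The following illustrative Python is not Festival's implementation; `target_cost` and `join_cost` are placeholder functions you would supply (in the real system they come from the distance to the cluster center and the cepstral distance at the join point):

```python
def best_path(candidates, target_cost, join_cost):
    """Viterbi search through per-target candidate lists.

    candidates: one list of candidate units per target phone.
    Returns the index of the chosen candidate for each target,
    minimizing the sum of target costs and join costs."""
    # trellis[i][j] = (cumulative cost, backpointer into previous column)
    trellis = [[(target_cost(0, j), None) for j in range(len(candidates[0]))]]
    for i in range(1, len(candidates)):
        column = []
        for j, unit in enumerate(candidates[i]):
            cost, back = min(
                (trellis[i - 1][k][0] + join_cost(candidates[i - 1][k], unit), k)
                for k in range(len(candidates[i - 1]))
            )
            column.append((cost + target_cost(i, j), back))
        trellis.append(column)
    # trace back from the lowest-cost final candidate
    j = min(range(len(trellis[-1])), key=lambda j: trellis[-1][j][0])
    path = []
    for i in range(len(trellis) - 1, -1, -1):
        path.append(j)
        j = trellis[i][j][1]
    return list(reversed(path))
```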
The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows. A high level walkthrough of the scripts to run is given after these lower level details.
Collect the database of general speech.
Build utterance structures for your database using the techniques discussed in the Section called Utterance building in the Chapter called A Practical Speech Synthesis System.
Build coefficients for acoustic distances, typically some form of cepstrum plus F0, or some pitch synchronous analysis (e.g. LPC).
Build distance tables, precalculating the acoustic distance between each unit of the same phone type.
Dump selection features (phone context, prosodic, positional and whatever) for each unit type.
Build cluster trees using wagon with the features and acoustic distances dumped by the previous two stages.
Build the voice description itself.
Before you start you must make a decision about what unit type you are going to use. Note there are two dimensions here. The first is size: phone, diphone, demi-syllable, etc. The second is the type itself, which may be simple phone, phone plus stress, phone plus word, etc. The code here and the related files basically assume the unit size is phone. However, because you may also include a percentage of the previous unit in the acoustic distance measure, this unit size is effectively phone plus previous phone, and thus somewhat diphone-like. The cluster method itself places no real restrictions on the unit size: it simply clusters the given acoustic units with the given features. The basic synthesis code, however, currently assumes phone-sized units.
The second dimension, type, is very open and we expect that
controlling this will be a good method for attaining high quality
general unit selection synthesis. The parameter
clunit_name_feat may be used to define the unit type.
The simplest conceptual example is the one used in the limited domain
synthesis. There we distinguish each phone with the word it comes
from, thus a d from the word
limited is distinct from the
d in the word domain. Such
distinctions can hard-partition the space of phones into types that
are more manageable.
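As an illustration only (the real naming is done by a Festival Scheme feature function, described below), this hard partitioning can be pictured as:

```python
# Illustrative only: naming a unit as phone-plus-word hard-partitions
# the space of phones into smaller types.
def clunit_name(phone, word):
    return phone + "_" + word

# the "d" of "limited" becomes a different unit type from the "d" of "domain"
print(clunit_name("d", "limited"))   # d_limited
print(clunit_name("d", "domain"))    # d_domain
```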
The decision of how to carve up that space depends largely on the
intended use of the database. The more distinctions you make, the less
you depend on the clustering acoustic distance, but the more you depend on
your labels (and the speech) being (absolutely) correct. The
mechanism to define the unit type is through a (typically) user
defined feature function. In the given setup scripts this feature
function will be called
lisp_INST_LANG_NAME::clunit_name. Thus the voice
simply defines the function
INST_LANG_NAME::clunit_name to return the
unit type for the given segment. If you wanted to make
a diphone unit selection voice this function could simply be
(define (INST_LANG_NAME::clunit_name i)
  (string-append
   (item.name i)
   "_"
   (item.feat i "p.name")))

Thus the unittype would be the phone plus its previous phone. Note that the first part of a unit name is assumed to be the phone name in various parts of the code; thus although you may think it would be neater to return previousphone_phone, that would mess up some other parts of the code.
In the limited domain case the word is attached to the phone. You can also consider including some demi-syllable information or more to differentiate between different instances of the same phone.
The important thing to remember is that at synthesis time the same function is called to identify the unittype, which is used to select the appropriate cluster tree to select from. Thus you need to ensure that if you use, say, diphones, your database really does have all diphones in it.
Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis.
As with diphone databases, the more cleanly and carefully the speech is recorded, the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will occur. However unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or more importantly will sound best when it is reading news stories).
Although it is too early to make definitive statements about what size
and type of data is best for unit selection we do have some rough
guides. A Timit like database of 460 phonetically balanced sentences
(around 14,000 phones) is not an unreasonable first choice. If the
text has not been specifically selected for phonetic coverage a larger
database is probably required, for example the Boston University Radio
News Corpus speaker
f2b [ostendorf95] has been used
relatively successfully. Of course all this depends on what use you
wish to make of the synthesizer; if it's to be used in more restrictive
environments (as is often the case) tailoring the database for the task
is a very good idea. If you are going to be reading a lot of telephone
numbers, having a significant number of examples of read numbers will
make synthesis of numbers sound much better (see the following
chapter on making such design more explicit).
The database used as an example here is a TIMIT 460 sentence database read by an American male speaker.
Again the notes about recording the database apply, though it will sometimes be the case that the database is already recorded and beyond your control, in that case you will always have something legitimate to blame for poor quality synthesis.
Throughout our discussion we will assume the following database layout. It is highly recommended that you follow this format, otherwise scripts and examples will fail. There are many ways to organize databases and many of such choices are arbitrary; here is our "arbitrary" layout.
The basic database directory should contain the following directories:

bin/
  Any database specific scripts for processing. Typically this first contains a copy of standard scripts that are then customized when necessary to the particular database.

wav/
  The waveform files. These should be headered, one utterance per file, with a standard name convention. They should have the extension .wav and the fileid consistent with all other files throughout the database (labels, utterances, pitch marks etc).

lab/
  The segmental labels. These are usually the master label files; they may contain more information than the labels used by Festival, which will be in festival/relations/Segment/.

lar/
  The EGG files (laryngograph files) if collected.

pm/
  Pitchmark files as generated from the lar files or from the signal directly.

festival/
  Festival specific label files.

festival/relations/
  The processed label files for building Festival utterances, held in directories whose name reflects the relation they represent: Segment/, Word/, Syllable/ etc.

festival/utts/
  The utterances files as generated from the festival/relations/ label files.
In order to make access well defined you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for: segments, syllables, words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labeled, but in most cases that's impractical. There are ways to automatically obtain most of these labels, but you should be aware of the inherent errors in the labeling system you use (including labeling systems that involve human labelers). Note that when a unit selection method fundamentally uses segment boundaries, its quality is going to be ultimately determined by the quality of the segmental labels in the database.
For the unit selection algorithm described below, the segmental labels should use the same phoneset as the actual synthesis voice. However a more detailed phonetic labeling may be more useful (e.g. marking closures in stops), mapping that information back to the phone labels before actual use. Autoaligned databases typically aren't accurate enough for use in unit selection. Most autoaligners are built using speech recognition technology where actual phone boundaries are not the primary measure of success. General speech recognition systems primarily measure words correct (or more usefully semantically correct) and do not require phone boundaries to be accurate. If the database is to be used for unit selection it is very important that the phone boundaries are accurate. Having said this though, we have successfully used the aligner described in the diphone chapter above to label general utterances where we knew which phone string we were looking for. Using such an aligner may be a useful first pass, but the result should always be checked by hand.
It has been suggested that aligning techniques and unit selection training techniques can be used to judge the accuracy of the labels and basically exclude any segments that appear to fall outside the typical range for the segment type. Thus it is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labeling. This is the desire of researchers in the field, but we are some way from that; at present the easiest way to improve the quality of unit selection algorithms is to ensure that segmental labeling is as accurate as possible. Once we have a better handle on selection techniques themselves it will then be possible to start experimenting with noisy labeling.
However it should be added that this unit selection technique (and many others) support what is termed "optimal coupling" [conkie96], where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently robust to boundary labeling errors of at least a few tens of milliseconds.
For the cluster method defined here it is best to construct more than simply segments, durations and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing allow a much richer set of features to be used for clusters. See the Section called Utterance building in the Chapter called A Practical Speech Synthesis System for a more general discussion of how to build utterance structures for a database.
In order to cluster similar units in a database we build an acoustic representation of them. This is also still a research issue, but in the example here we will use Mel cepstrum. Interestingly, we do not generate these at fixed intervals, but at pitch marks. Thus we have a parametric spectral representation of each pitch period. We have found this a better method, though it does require that pitchmarks are reasonably well identified.
for i in $*
do
   fname=`basename $i .wav`
   echo $fname MCEP
   $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
done
The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure based on this. We do not pretend that this part is particularly neat in the system but it does work. When pitch synchronous parameters are built, the clunits module will automatically put the local F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The script in festvox/src/general/make_lpc can be used to generate the parameters, assuming you have already generated pitch marks.
Note a secondary advantage of using LPC coefficients is that they are required anyway for LPC resynthesis, thus this allows less information about the database to be required at run time. We have not yet tried pitch synchronous Mel frequency cepstrum coefficients but that should be tried. Also a more general duration/number-of-pitch-periods match algorithm is worth defining.
Cluster building is mostly automatic. Of course you need the
clunits module compiled into your version of
Festival. Version 1.3.1 or later is required; the version of
clunits in 1.3.0 is buggy and incomplete and will
not work. To compile it in, add

ALSO_INCLUDE += clunits

to the end of your festival/config/config file, and recompile. To check if an installation already has support for clunits, check the value of the variable *modules*.
The file festvox/src/unitsel/build_clunits.scm
contains the basic parameters to build a cluster model for a database
that has utterance structures and acoustic parameters. The function
build_clunits will build the distance tables, dump
the features and build the cluster trees. There are many parameters
that are set for the particular database (and instance of cluster
building) through the Lisp variable clunits_params. A
reasonable set of defaults is given in that file, and reasonable
run-time parameters will be copied into
festvox/INST_LANG_VOX_clunits.scm when a
new voice is set up.
build_clunits runs through all the
steps, but in order to better explain what is going on, we will go
through each step and at that time explain which parameters affect it.
The first stage is to load in all the utterances in the database, sort
them into segment types and name them with individual names (as
TYPE_NUM). This first stage is required for all
other stages, so even if you are not running
build_clunits you still need to run this stage first. This is done by the calls

(format t "Loading utterances and sorting types\n")
(set! utterances (acost:db_utts_load dt_params))
(set! unittypes (acost:find_same_types utterances))

though the function build_clunits_init will do the same thing.
This uses the following parameters:

name STRING
  A name for this database.

db_dir FILENAME
  The pathname of the database, typically "." as in the current directory.

utts_dir FILENAME
  The directory that contains the utterances.

utts_ext FILENAME
  The file extension for the utterance files.

files
  The list of file ids in the database. In the examples below the list of fileids is extracted from the given prompt file at call time. For example

(files ("kdt_001" "kdt_002" "kdt_003" ... ))
The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each segment of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time, as the clustering will require these distances many times.
This is done by the calls

(format t "Loading coefficients\n")
(format t "Building distance tables\n")
(acost:build_disttabs unittypes clunits_params)

The following parameters influence the behaviour:

coeffs_dir
  The directory (from db_dir) that contains the acoustic coefficients as generated by the script make_mcep.

coeffs_ext
  The file extension for the coefficient files.
get_stds_per_unit
  Takes the value t or nil. If t, the parameters for the type of segment are
normalized by the means and standard deviations for the class. Thus a
weighted Mahalanobis-style distance is found between units rather than
simply a Euclidean distance. The recommended value is t.

ac_left_context FLOAT
  The amount of the previous unit to be included in the distance.
1.0 means all, 0.0 means none. This parameter may be used to make the
acoustic distance sensitive to the previous acoustic context. The
recommended value is 0.8.

dur_pen_weight FLOAT
  The penalty factor for duration mismatch between units.

f0_pen_weight FLOAT
  The penalty factor for F0 mismatch between units.
ac_weights (FLOAT FLOAT ...)
  The weights for each parameter in the coefficient files used while finding the acoustic distance between segments. There must be the same number of weights as there are parameters in the coefficient files. The first parameter is (in normal operations) F0. It is common to give proportionally more weight to F0 than to each individual other parameter. The remaining parameters are typically MFCCs (and possibly delta MFCCs). Finding the right parameters and weightings is one of the key goals in unit selection synthesis, so it is not easy to give concrete recommendations. The following aren't bad, though there may be better ones; we suspect that real human listening tests are probably the best way to find better values.
(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
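To make the role of the normalization and weights concrete, here is an illustrative Python sketch (not the actual clunits code) of a per-frame distance where each coefficient is normalized by class means and standard deviations (as with get_stds_per_unit set to t) and then weighted, with column 0 (F0) typically weighted more heavily:

```python
def weighted_distance(frame_a, frame_b, means, stds, weights):
    """Weighted Euclidean distance between two coefficient frames,
    z-normalized per coefficient using class means/standard deviations."""
    total = 0.0
    for a, b, m, s, w in zip(frame_a, frame_b, means, stds, weights):
        da = (a - m) / s          # normalize both frames for this class
        db = (b - m) / s
        total += w * (da - db) ** 2
    return total ** 0.5
```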
The next stage is to dump the features that will be used to index the
clusters. Remember the clusters are defined with respect to the
acoustic distance between each unit in the cluster, but they are
indexed by these features. These features are those which will be
available at text-to-speech time when no acoustic information is
available. Thus they include things like phonetic and prosodic context
rather than spectral information. The name features may (and probably
should) be over-general, allowing the decision tree building program
wagon to decide which of these features actually do
have an acoustic distinction in the units.
This is done by the call

(format t "Dumping features for clustering\n")
(acost:dump_features unittypes utterances clunits_params)

The parameters which affect this function are:

fests_dir
  The directory where the features will be saved (by segment type).

feats
  The list of features to be dumped. These are standard Festival feature names with respect to the Segment relation. For example
p.name p.ph_vc p.ph_ctype
n.name n.ph_vc n.ph_ctype
seg_pitch p.seg_pitch n.seg_pitch
seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
pp.name pp.ph_vc pp.ph_ctype
Now that we have the acoustic distances and the feature descriptions
of each unit the next stage is to find a relationship between those
features and the acoustic distances. This we do using the CART tree
builder wagon. It will find questions about
which features best minimize the acoustic distance between the units
in that class.
wagon has many options, many of
which are apposite to this task, though it is worth noting that this
learning task is closed. That is, we are trying to
classify all the units in the database; there is
no test set as such. However in synthesis there will be desired units
whose feature vectors didn't exist in the training set.
(format t "Building cluster trees\n")
(acost:find_clusters (mapcar car unittypes) clunits_params)
The parameters that affect the tree building process are:

trees_dir
  The directory where the decision tree for each segment type will be saved.

wagon_field_desc FILE
  A filename of a wagon field descriptor file. This is a standard field description (field name plus field type) that is required for wagon. An example is given in festival/clunits/all.desc which should be sufficient for the default feature list, though if you change the feature list (or the values those features can take) you may need to change this file.

wagon_progname FILENAME
  The pathname for the wagon CART building program. This is a string and may also include any extra parameters you wish to give to wagon.

wagon_cluster_size INT
  The minimum cluster size (the wagon -stop value).

prune_reduce INT
  The number of elements in each cluster to remove in pruning. This removes the units in the cluster that are furthest from the center. This is done within the wagon training.

cluster_prune_limit INT
  This is a post-wagon build operation on the generated trees (and perhaps a more reliable method of pruning). This defines the maximum number of units that will be in a cluster at a tree leaf (the wagon cluster size above is the minimum size). This is useful when there are large numbers of some particular unit type which cannot be differentiated, for example silence segments whose context is nothing but other silence. Another use of this is to cause only the center example units to be used. We have used this in building diphone databases from general databases by making the selection features only include phonetic context features and then restricting the number of diphones we take by making this number 5 or so.

unittype_prune_threshold INT
  When making complex unit types this defines the minimal number of units of that type required before building a tree. When doing cascaded unit selection synthesizers it is often not worth excluding large stages if there is, say, only one example of a particular demi-syllable.
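The effect of cluster_prune_limit can be sketched as follows. This illustrative Python (not the actual implementation) keeps only the units closest to the cluster center, using the kind of precomputed distance table built earlier:

```python
def prune_cluster(units, dist, limit):
    """Keep the `limit` most central units of a cluster.

    units: list of unit ids; dist(a, b): precomputed acoustic distance.
    Centrality is measured as mean distance to the other members."""
    def mean_dist(u):
        others = [v for v in units if v != u]
        return sum(dist(u, v) for v in others) / len(others)
    # the most "central" units are those with the smallest mean distance
    return sorted(units, key=mean_dist)[:limit]
```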
Note that as the distance tables can be large, there is an alternative function that does both the distance table and clustering in one, deleting the distance table immediately after use; thus you only need enough disk space for the largest number of phones in any type. To do this, use

(acost:disttabs_and_clusters unittypes clunits_params)

removing the calls to acost:build_disttabs and acost:find_clusters above.
The final stage in building a cluster model is to collect the generated trees into a single file and dump the unit catalogue, i.e. the list of unit names and their files and position in them. This is done by the calls

(acost:collect_trees (mapcar car unittypes) clunits_params)
(format t "Saving unit catalogue\n")
(acost:save_catalogue utterances clunits_params)

The only parameter that affects this is

catalogue_dir
  The directory where the catalogue will be saved (the name
parameter is used to name the file).
There are a number of parameters that are specified with a cluster voice. These are related to the run-time aspects of the cluster model. They are:

join_weights (FLOAT FLOAT ...)
  A set of weights, in the same format as
ac_weights, that are used in optimal coupling to find the best join point between two
candidate units. This is different from
ac_weights as it
is likely that different values are desired, particularly increasing the
F0 value (column 0).
continuity_weight FLOAT
  The factor to multiply the join cost by relative to the target cost. This is probably not very relevant given that the target cost is merely the position from the cluster center.
log_scores
  If set to 1 the join scores are converted to logs. For databases that have a tendency to contain non-optimal joins (probably any non-limited domain database), this may be useful to stop failed synthesis of longer sentences. The problem is that the sum of a very large number of scores can lead to overflow; taking logs reduces this. You could alternatively change continuity_weight to a number less than 1, which would also partially help. However such overflows are often a pointer to some other problem (poor distribution of phones in the db), so this is probably just a hack.
optimal_coupling INT
  If set to 1 this uses optimal coupling and searches the cepstrum
vectors at each join point to find the best possible join point.
This is computationally expensive (as well as having to load in lots
of cepstrum files), but does give better results. If set to
2 this only checks the coupling distance at the
given boundary (and doesn't move it); this is often adequate in
good databases (e.g. limited domain), and is certainly faster.
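The idea behind the optimal coupling search can be sketched in Python as follows (illustrative only; the search window size and weights here are arbitrary, and the real system works on the loaded cepstrum files):

```python
def frame_dist(f1, f2, weights):
    """Weighted Euclidean distance between two cepstral frames."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(f1, f2, weights)) ** 0.5

def best_join(left_frames, right_frames, weights, search=3):
    """Search frames near the end of the left unit and the start of the
    right unit for the closest pair, and return (i, j): the frame
    indices at which to join them."""
    best = None
    for i in range(max(0, len(left_frames) - search), len(left_frames)):
        for j in range(min(search, len(right_frames))):
            d = frame_dist(left_frames[i], right_frames[j], weights)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]
```

Setting optimal_coupling to 2 corresponds to skipping this search and only scoring the fixed labeled boundary.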
extend_selections INT
  If set to 1 then the selected cluster will be extended
to include any unit from the cluster of the previous segment's
candidate units that has the correct phone type (and isn't already included
in the current cluster). This is experimental
but has shown its worth and hence is recommended. This means
that instead of selecting just units, selection is effectively
selecting the beginnings of multiple-segment units. This option
encourages far longer units.
pm_coeffs_dir
  The directory (from db_dir) where the pitchmarks are.

pm_coeffs_ext
  The file extension for the pitchmark files.
sig_dir
  Directory containing the waveforms of the units (or residuals if residual LPC is being used, PCM waveforms if PSOLA is being used).

sig_ext
  File extension for the waveforms/residuals.
join_method
  Specify the method used for joining the selected units. Currently it
supports simple, a very naive joining mechanism, and
windowed, where the ends of the units are
windowed using a hamming window then overlapped (no prosodic
modification takes place though). The other two possible values for
this feature are none, which does nothing, and
modified_lpc, which uses the standard UniSyn module
to modify the selected units to match the targets.
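The windowed join method can be sketched as follows. This illustrative Python (not the UniSyn code; the overlap length here is arbitrary) tapers the overlapping samples with the two halves of a Hamming window and adds them:

```python
import math

def hamming(n):
    """A length-n Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_join(unit_a, unit_b, overlap):
    """Overlap-add two unit waveforms (lists of samples): the last
    `overlap` samples of unit_a fade out while the first `overlap`
    samples of unit_b fade in."""
    win = hamming(2 * overlap)
    fade_out, fade_in = win[overlap:], win[:overlap]  # falling and rising halves
    joined = list(unit_a[:-overlap])
    for i in range(overlap):
        joined.append(unit_a[len(unit_a) - overlap + i] * fade_out[i]
                      + unit_b[i] * fade_in[i])
    joined.extend(unit_b[overlap:])
    return joined
```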
clunits_debug INT
  With a value of 1 some debugging information is
printed during synthesis, particularly how many candidate phones
are available at each stage (and any extended ones), and also where
each phone is coming from. With a value of 2 more
debugging information is given, including the above plus the joining
costs (which are very readable).