As with the prosody phenomena above, very simple solutions to predicting durations work surprisingly well, though very good solutions are extremely difficult to achieve.
Again the basic strategy is to move through fixed models, simple rule models, complex rule models, and trained models using the features in the complex rule models. The choice of where to stop depends on the resources available to you and the time you wish to spend on the problem. Given a reasonably sized database, training a simple CART tree for durations achieves quite acceptable results. This is currently what we do for our English voices in Festival. There are better models out there but we have not fully investigated them or included easy scripts to customize them.
The simplest model for duration is a fixed duration for each phone. A
value of 100 milliseconds is a reasonable start. This type of model is
only of use for initial testing of a diphone database; beyond that it
sounds too artificial. The Festival function SayPhones uses a
fixed duration model, controlled by the value (in ms) in the variable
FP_duration. Although there is a fixed duration module in
Festival (see the manual), it is worthwhile starting off with something
a little more interesting.
The next level for duration models is to use average durations for the phones. Even when real data isn't available to calculate averages, writing values by hand can be acceptable: basically vowels are longer than consonants, and stops are the shortest. Estimating values for a set of phones can be done by looking at data from another language (if you are really stuck, see festival/lib/mrpa_durs.scm) to get the basic idea of average phone lengths.
In most languages phones are longer in phrase final and, to a lesser extent, phrase initial positions. A simple multiplicative factor can be defined for these positions. The next stage from this is a set of rules that modify the basic average based on the context the phones occur in. For English the best definition of such rules is the set of duration rules given in chapter 9 of [allen87] (often referred to as the Klatt duration model). The factors used in this may also apply to other languages. A simplified form of this, which we have used successfully for a number of languages and often use as our first approximation for a duration rule set, is as follows.
((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
((R:SylStructure.parent.stress is 1)
((R:SylStructure.parent.syl_break > 1) ;; clause final
((R:SylStructure.parent.stress is 1)
((R:SylStructure.parent.stress is 1)
((ph_vc is +)
You may modify this, adding as many more conditions as you want. In addition to the tree you need to define the averages for each phone in your phone set. For reasons we will explain below, the format of this information is "segname 0.0 average", as in
(# 0.0 0.250)
(a 0.0 0.080)
(e 0.0 0.080)
(i 0.0 0.070)
(o 0.0 0.080)
(u 0.0 0.070)
(i0 0.0 0.040)
With both these expressions loaded in your voice you may set the following in your voice definition function, setting up this tree and data as the standard and selecting the appropriate duration module.
;; Duration prediction
(set! duration_cart_tree simple_dur_tree)
(set! duration_ph_info simple_phone_data)
(Parameter.set 'Duration_Method 'Tree_ZScores)
In your own voice, though, use voice-specific names for the simple_ variables, otherwise you may clash with other voices.
It has been shown [campbell91] that a better representation of
duration for modeling is zscores, that is, the number of standard
deviations from the mean. The duration module used in the above is
actually designed to take a CART tree that returns zscores and uses the
duration_ph_info to change that into an absolute
duration. The two fields after the phone name are mean and standard
deviation. The interpretation of this tree and this phone info happens
to give the right result when we use the tree to predict factors and
have the stddev field contain the average duration, as we did above.
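As a rough illustration of why this trick works, here is a minimal Python sketch, not Festival's own code. It assumes the module maps a zscore back to seconds as mean + zscore * stddev, which follows from the definition of a zscore. The factor 1.5 is a hypothetical tree output, and the mean and standard deviation in the second call are made-up values.

```python
def zscore_to_duration(zscore, mean, stddev):
    """Convert a predicted zscore into an absolute duration in seconds.

    Assumed conversion: a zscore is the number of standard deviations
    from the mean, so the absolute value is mean + zscore * stddev.
    """
    return mean + zscore * stddev


# Factor-style use, as in the phone data above: the mean slot holds 0.0
# and the stddev slot holds the average duration (0.080 s for "a").
# The tree output 1.5 is a hypothetical multiplicative factor.
factor_style = zscore_to_duration(1.5, 0.0, 0.080)

# True zscore use: hypothetical mean 0.080 s and stddev 0.020 s, with the
# tree predicting one standard deviation above the mean.
zscore_style = zscore_to_duration(1.0, 0.080, 0.020)

print(round(factor_style, 3), round(zscore_style, 3))
```

In the first call the result is simply the factor times the average duration, which is why a tree of factors can be run through the zscore module unchanged.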
However, whether we use zscores or absolute values, a better way to build a duration model is to train from data rather than arbitrarily selecting modification factors.
Given a reasonably sized database we can dump durations and features for each segment in the database. Then we can train a model using those samples. For our English voices we have trained regression models using wagon, though we include the tools for linear regression models too.
By convention we build duration models in festival/dur/. We will save the above feature names in dur.featnames. We can dump the features with the command
dumpfeats -relation Segment -feats dur.featnames -output dur.feats \
This will put all the features in the file dur.feats. For wagon we need to build a feature description file; we can build a first approximation with the make_wagon_desc script available with the speech tools
make_wagon_desc dur.feats dur.featnames dur.desc
You will then need to edit dur.desc to change a number of features from their categorical list (lots of numbers) into type float. Specifically, for the above list, features such as R:SylStructure.parent.ssyl_out should be declared as floats.
We can now build a model using wagon
wagon -data dur.feat.train.train -desc dur.desc \
-test dur.feats.train.test -stop 10 -stepwise \
wagon_test -data dur.feats.test -tree dur.10.tree -desc dur.desc
You may wish to remove all examples of silence from the data, as silence durations typically have quite a different distribution from other phones. In fact it is common for databases to include many examples of silence which are not of natural length, as they are arbitrary parts of the initial and following silence around the spoken utterances. Their durations are not something that should be trained for.
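The silence filtering mentioned above could be sketched in Python as follows. This is only an illustration under stated assumptions: each sample in dur.feats is taken to be one line of whitespace-separated values, the column holding the phone name (name_column) depends on the order of your dur.featnames, and the set of silence names depends on your phone set. The sample lines are made up.

```python
def drop_silence(lines, name_column, silence_names):
    """Return only the feature lines whose phone name is not a silence.

    name_column and silence_names are assumptions about your own data:
    the index of the phone-name feature and your phone set's silences.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[name_column] in silence_names:
            continue  # drop silence samples
        kept.append(line)
    return kept


# Made-up samples: duration, phone name, then two other feature values.
sample_lines = [
    "0.110 a 1 0",
    "0.250 # 0 0",
    "0.065 i 0 1",
]
filtered = drop_silence(sample_lines, name_column=1, silence_names={"#"})
print(filtered)
```

The same function can be applied to the lines read from dur.feats before the data is split into training and test sets.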
The instructions above will build a tree that predicts absolute values. To get such a tree to work with the zscore module, simply set the stddev field above to 1, leaving the 0.0 mean in place. As stated above, using zscores typically gives better results. Although the correlation of these duration models in the zscore domain may not be as good as that of models trained to predict absolute durations, when the predicted zscores are converted back into the absolute domain we have found (for English) that the correlations are better and the RMSE smaller.
In order to train a zscore model you need to convert the absolute segment durations into zscores. To do that you need the mean and standard deviation of the duration of each segment type in your phoneset.
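A minimal Python sketch of that conversion is given below. It is an illustration only, not one of the supplied scripts: the (phone, duration) pairs are made-up values, and the population standard deviation is used here as an assumption (a sample standard deviation would work equally well in principle).

```python
import math
from collections import defaultdict


def phone_stats(samples):
    """Return {phone: (mean, stddev)} from (phone, duration) pairs."""
    by_phone = defaultdict(list)
    for phone, dur in samples:
        by_phone[phone].append(dur)
    stats = {}
    for phone, durs in by_phone.items():
        mean = sum(durs) / len(durs)
        # Population variance (an assumption of this sketch).
        var = sum((d - mean) ** 2 for d in durs) / len(durs)
        stats[phone] = (mean, math.sqrt(var))
    return stats


def to_zscore(phone, dur, stats):
    """Express a duration as standard deviations from its phone's mean."""
    mean, stddev = stats[phone]
    if stddev == 0.0:
        return 0.0  # no spread observed, so nothing to normalize by
    return (dur - mean) / stddev


# Made-up durations in seconds.
samples = [("a", 0.06), ("a", 0.10), ("t", 0.03), ("t", 0.05)]
stats = phone_stats(samples)
zscores = [to_zscore(p, d, stats) for p, d in samples]
print(stats, zscores)
```

The means and standard deviations computed this way are the values that go into the phone information used by the zscore duration module, replacing the "0.0 average" entries of the simple model.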
There is a whole range of possible mappings for the distribution of durations: zscores, logs, log-zscores, etc., or even more complex functions [bellegarda98]. These variations do give some improvements. The intention is to map the distribution to a normal distribution, which makes it easier to learn.
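As one possible reading of the log-zscore mapping, the following Python sketch takes the logarithm of the duration first and then normalizes it by the mean and standard deviation of the log durations. It is not the function from [bellegarda98], and the log-domain mean and standard deviation used here are hypothetical values.

```python
import math


def to_log_zscore(dur, log_mean, log_stddev):
    """Map a duration in seconds to a zscore in the log domain."""
    return (math.log(dur) - log_mean) / log_stddev


def from_log_zscore(z, log_mean, log_stddev):
    """Invert the mapping to recover a duration in seconds."""
    return math.exp(log_mean + z * log_stddev)


# Hypothetical log-domain statistics for one phone.
log_mean = math.log(0.080)
log_stddev = 0.3

z = to_log_zscore(0.120, log_mean, log_stddev)
recovered = from_log_zscore(z, log_mean, log_stddev)
print(z, recovered)
```

Whatever mapping is chosen, the model is trained on the mapped values and its predictions are passed through the inverse mapping to get durations back.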
Other learning techniques are also possible, particularly the Sums of Products model ([sproat98] chapter 5), which has been shown to train better even on small amounts of data.
Another technique, although it should not really work, is to borrow a model trained for another language for which data is available. In fact the duration model used in Festival for the US and UK voices is the same; it was trained from the f2b database, a US English database. As the phone sets are different for US and UK English, we trained the models using phonetic features rather than phone names, and trained them in the zscore domain, keeping the actual phone names and their means and standard deviations separate. Although the models were slightly better if we included the phone names themselves, the improvement was small and the models were also substantially larger (and took longer to train). Using phonetic features offers a more general model (it works for UK English) that is more compact and quicker to train, with only a small cost in performance.
The same English duration model was also used in the German voice developed at OGI. The results are acceptable and are at least better than any hand-written rule system that could be written. Improvements in that model are probably only possible by training on real German data. Note, however, that such cross-language borrowing of models is unlikely to work in general, but there may be cases where it is a reasonable fallback position.