Duration

As with the prosody phenomena above, very simple solutions to predicting durations work surprisingly well, though very good solutions are extremely difficult to achieve.

Again the basic strategy is to progress through fixed models, simple rule models, complex rule models, and trained models that use the features of the complex rule models. Where you stop depends on the resources available to you and the time you wish to spend on the problem. Given a reasonably sized database, training a simple CART tree for durations achieves quite acceptable results; this is currently what we do for our English voices in Festival. There are better models out there, but we have not fully investigated them or included easy scripts to customize them.

The simplest model for duration is a fixed duration for each phone; a value of 100 milliseconds is a reasonable start. This type of model is only of use for initial testing of a diphone database; beyond that it sounds too artificial. The Festival function SayPhones uses a fixed duration model, controlled by the value (in milliseconds) in the variable FP_duration. Although there is a fixed-duration module in Festival (see the manual), it is worthwhile to start off with something a little more interesting.
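For example, a quick test of a diphone database with fixed durations might look like this (the phone names are purely illustrative; substitute names from your own phone set):

```scheme
;; Give each phone a fixed 200ms duration (the value is in milliseconds)
(set! FP_duration 200)
;; Phone names here are illustrative examples only
(SayPhones '(# hh ax l ow #))
```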

The next level for duration models is to use average durations for the phones. Even when real data isn't available to calculate averages, values written by hand can be acceptable: basically, vowels are longer than consonants, and stops are the shortest. If you are stuck, you can estimate values for a set of phones by looking at data from another language (see festival/lib/mrpa_durs.scm) to get a basic idea of average phone lengths.

In most languages phones are longer in phrase-final and, to a lesser extent, phrase-initial positions. A simple multiplicative factor can be defined for these positions. The next stage is a set of rules that modify the basic average based on the context in which the phone occurs. For English the best definition of such rules is the set of duration rules given in chapter 9 of [allen87] (often referred to as the Klatt duration model). The factors used there may also apply to other languages. A simplified form, which we have successfully used for a number of languages and which often serves as our first approximation for a duration rule set, is as follows.

Here we define a simple decision tree that returns a multiplication factor for a segment:

(set! simple_dur_tree
 '
   ((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
    ((R:SylStructure.parent.stress is 1)
     ((1.5))
     ((1.2)))
    ((R:SylStructure.parent.syl_break > 1)   ;; clause final
     ((R:SylStructure.parent.stress is 1)
      ((1.5))
      ((1.2)))
     ((R:SylStructure.parent.stress is 1)
      ((ph_vc is +)
       ((1.2))
       ((1.0)))
      ((1.0))))))

You may modify this tree, adding as many further conditions as you wish. In addition to the tree you need to define the averages for each phone in your phone set. For reasons we will explain below, the format of this information is "segname 0.0 average", as in

(set! simple_phone_data
'(
   (# 0.0 0.250)
   (a 0.0 0.080)
   (e 0.0 0.080)
   (i 0.0 0.070)
   (o 0.0 0.080)
   (u 0.0 0.070)
   (i0 0.0 0.040)
   ...
 ))

With both these expressions loaded in your voice, you may add the following to your voice definition function, setting up this tree and data as standard and selecting the appropriate duration module.

  ;; Duration prediction
  (set! duration_cart_tree simple_dur_tree)
  (set! duration_ph_info simple_phone_data)
  (Parameter.set 'Duration_Method 'Tree_ZScores)

Note, though, that in your voice you should use voice-specific names for the simple_ variables, otherwise you may clash with other voices.

It has been shown [campbell91] that a better representation of duration for modeling is zscores, that is, the number of standard deviations from the mean. The duration module used above is actually designed to take a CART tree that returns zscores, and to use the information in duration_ph_info to convert them into absolute durations. The two fields after the phone name are the mean and the standard deviation. This interpretation happens to give the right result when we use the tree to predict factors and put the average duration in the stddev field (with a mean of 0.0), as we did above.
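To make that interpretation concrete, here is a sketch (a hypothetical helper, not part of Festival) of the computation the Tree_ZScores method effectively performs for each segment:

```scheme
;; Hypothetical helper, not part of Festival: what Tree_ZScores
;; effectively computes for each segment
(define (zscore_to_duration mean stddev zscore)
  (+ mean (* zscore stddev)))

;; With simple_phone_data above, mean is 0.0 and the "stddev" field
;; holds the average duration, so a tree factor of 1.5 for phone a
;; gives 0.0 + 1.5 * 0.080 = 0.120 seconds, i.e. 1.5 times the average
```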

However, whether we use zscores or absolute values, a better way to build a duration model is to train it from data rather than select modification factors by hand.

Given a reasonably sized database, we can dump durations and features for each segment in the database, then train a model on those samples. For our English voices we have trained regression models using wagon, though we include tools for linear regression models too.

An initial set of features to dump might be

segment_duration
name
p.name
n.name
R:SylStructure.parent.syl_onsetsize
R:SylStructure.parent.syl_codasize
R:SylStructure.parent.R:Syllable.n.syl_onsetsize
R:SylStructure.parent.R:Syllable.p.syl_codasize
R:SylStructure.parent.position_type
R:SylStructure.parent.parent.word_numsyls
pos_in_syl
syl_initial
syl_final
R:SylStructure.parent.pos_in_word
p.seg_onsetcoda
seg_onsetcoda
n.seg_onsetcoda
pp.ph_vc 
p.ph_vc 
ph_vc 
n.ph_vc 
nn.ph_vc
pp.ph_vlng 
p.ph_vlng 
ph_vlng 
n.ph_vlng 
nn.ph_vlng
pp.ph_vheight
p.ph_vheight
ph_vheight
n.ph_vheight
nn.ph_vheight
pp.ph_vfront
p.ph_vfront
ph_vfront
n.ph_vfront
nn.ph_vfront
pp.ph_vrnd
p.ph_vrnd
ph_vrnd
n.ph_vrnd
nn.ph_vrnd
pp.ph_ctype 
p.ph_ctype 
ph_ctype 
n.ph_ctype 
nn.ph_ctype
pp.ph_cplace
p.ph_cplace
ph_cplace
n.ph_cplace
nn.ph_cplace
pp.ph_cvox 
p.ph_cvox 
ph_cvox 
n.ph_cvox 
nn.ph_cvox
R:SylStructure.parent.R:Syllable.pp.syl_break
R:SylStructure.parent.R:Syllable.p.syl_break
R:SylStructure.parent.syl_break
R:SylStructure.parent.R:Syllable.n.syl_break
R:SylStructure.parent.R:Syllable.nn.syl_break
R:SylStructure.parent.R:Syllable.pp.stress 
R:SylStructure.parent.R:Syllable.p.stress 
R:SylStructure.parent.stress 
R:SylStructure.parent.R:Syllable.n.stress 
R:SylStructure.parent.R:Syllable.nn.stress 
R:SylStructure.parent.syl_in
R:SylStructure.parent.syl_out
R:SylStructure.parent.ssyl_in
R:SylStructure.parent.ssyl_out
R:SylStructure.parent.parent.gpos

By convention we build duration models in festival/dur/. We will save the above feature names in dur.featnames. We can dump the features with the command

dumpfeats -relation Segment -feats dur.featnames -output dur.feats \
         ../utts/*.utt

This will put all the features in the file dur.feats. For wagon we need to build a feature description file; a first approximation can be built with the make_wagon_desc script, available with the speech tools:

make_wagon_desc dur.feats dur.featnames dur.desc

You will then need to edit dur.desc to change a number of features from a categorical list (lots of numbers) to type float. Specifically, for the above list, the features segment_duration, R:SylStructure.parent.parent.word_numsyls, pos_in_syl, R:SylStructure.parent.pos_in_word, R:SylStructure.parent.syl_in, R:SylStructure.parent.syl_out, R:SylStructure.parent.ssyl_in and R:SylStructure.parent.ssyl_out should be declared as floats.
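After editing, the entries for those features in dur.desc should look something like this (a sketch; the categorical entries keep the value lists that make_wagon_desc generated, elided here):

```
((segment_duration float)
 (name # a e i o u ...)
 ...
 (pos_in_syl float)
 (R:SylStructure.parent.syl_in float)
 ...)
```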

We then need to split the data into training and test sets (and further split the training set if we are going to use stepwise CART building):

traintest dur.feats
traintest dur.feats.train

We can now build a model using wagon:

wagon -data dur.feats.train.train -desc dur.desc \
        -test dur.feats.train.test -stop 10 -stepwise \
        -output dur.10.tree 
wagon_test -data dur.feats.test -tree dur.10.tree -desc dur.desc

You may wish to remove all examples of silence from the data, as silence durations typically have quite a different distribution from other phones. In fact, databases commonly include many examples of silence that are not of natural length, being arbitrary parts of the initial and trailing silence around the spoken utterances; their durations are not something that should be trained on.
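One way to do this (a sketch, assuming, as in the feature list above, that segment_duration is the first field, the segment name is the second, and silence is written '#') is to filter the dumped feature file before training. The sample file here is made up for illustration; on a real build the input would be dur.feats:

```shell
# Toy feature file: duration in field 1, segment name in field 2
printf '0.250 #\n0.080 a\n0.300 #\n' > sample.feats
# Keep only rows whose segment name is not the silence symbol '#'
awk '$2 != "#"' sample.feats > sample.feats.nosil
cat sample.feats.nosil
```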

The instructions above will build a tree that predicts absolute values. To get such a tree to work with the zscore module, simply set the mean field above to 0.0 and the stddev field to 1. As stated above, using zscores typically gives better results: although the correlation of these duration models in the zscore domain may not be as good as that of models trained to predict absolute scores, once the zscore predictions are converted back into the absolute domain we have found (for English) that the correlations are better and the RMSE smaller.

In order to train a zscore model you need to convert the absolute segment durations into zscores; to do that you need the mean and standard deviation for each phone in your phoneset.
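These statistics can be computed directly from the dumped feature file. The following is a sketch (the sample file is made up; on a real build the input would be dur.feats, with segment_duration in field 1 and the phone name in field 2), printing each phone in the "(name mean stddev)" format that duration_ph_info expects:

```shell
# Toy feature file: two examples of phone a, durations 0.1 and 0.3
printf '0.1 a\n0.3 a\n' > sample.feats
# Accumulate count, sum, and sum of squares per phone, then print
# mean and standard deviation in "(name mean stddev)" form
awk '{ n[$2]++; s[$2] += $1; ss[$2] += $1 * $1 }
     END { for (p in n) {
             m = s[p] / n[p]
             printf "(%s %f %f)\n", p, m, sqrt(ss[p] / n[p] - m * m)
           } }' sample.feats > sample.means
cat sample.means
```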

There is a whole family of possible mappings for the distribution of durations: zscores, logs, log-zscores, etc., or even more complex functions [bellegarda98]. These variations do give some improvement. The intention is to map the duration distribution to a normal distribution, which makes it easier to learn.

Other learning techniques are also worth considering, particularly the Sums of Products model ([sproat98], chapter 5), which has been shown to train well even on small amounts of data.

Another technique, which arguably shouldn't work, is to borrow a model trained for another language for which data is available. In fact the duration model used in Festival for the US and UK English voices is the same; it was trained from f2b, a US English database. As the phone sets differ between US and UK English, we trained the models using phonetic features rather than phone names, and trained them in the zscore domain, keeping the actual phone names, means, and standard deviations separate. Models that included the phone names themselves were only slightly better, and they were substantially larger and took longer to train. Using phonetic features gives a more general model (it works for UK English) that is more compact and quicker to train, at only a small cost in performance.

The same English duration model was also used in the German voice developed at OGI. The results are acceptable, and at least better than any hand-written rule system we could produce. Improvements to that model are probably only possible by training on real German data. Note, however, that such cross-language borrowing of models is unlikely to work in general, though there may be cases where it is a reasonable fallback.