

8 Building prosodic models

8.1 Phrasing

Prosodic phrasing in speech synthesis makes the whole speech more understandable. Due to the size of people's lungs there is a finite length of time people can talk before they must take a breath, which defines an upper bound on the length of prosodic phrases. However we rarely make our phrases this maximum length; we use phrasing to mark groups within the speech. There is an apocryphal story of a speech synthesis example with an unnaturally long prosodic phrase being played at a conference presentation: at the end of the phrase the audience all took a large intake of breath.

For the most part very simple prosodic phrasing is sufficient. A comparison of various prosodic phrasing techniques is given in taylor98a, though we will cover some of them here also.

For English (and most likely many other languages too) simple rules based on punctuation are a very good predictor of prosodic phrase boundaries. It is rare that punctuation occurs where there is no boundary, but there will be a substantial number of prosodic boundaries which are not explicitly marked with punctuation. Thus a prosodic phrasing algorithm based solely on punctuation will typically under-predict but rarely make a false insertion. However, depending on the application you wish to use the synthesizer for, it may be possible to explicitly add punctuation at desired phrase breaks, in which case a prediction system based solely on punctuation is adequate.

Festival basically supports two methods for predicting prosodic phrases, though any other method can easily be used. Note that these do not necessarily entail pauses in the synthesized output; pauses are further predicted from prosodic phrase information.

The first basic method is by CART tree. A test is made on each word to predict whether it is at the end of a prosodic phrase. The basic CART tree returns B or BB, with NB denoting no break (though it may return whatever break labels you consider appropriate, as long as the rest of your models support them). The two levels identify different strengths of break, BB being used to denote a bigger break (and the end of an utterance).

The following tree is very simple and adds a break after the last word of a token that has following punctuation. Note the first condition is done by a Lisp function, as we want to ensure that only the last word in a token gets the break. (Earlier erroneous versions of this would insert breaks after each word in `1984'.)

(set! simple_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((NB))))))

This tree is defined in `festival/lib/phrase.scm' in the standard distribution and is certainly a good first step in defining a phrasing model for a new language.
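
The tree relies on the feature function token_end_punc (accessed as lisp_token_end_punc). A rough sketch of such a function follows; the real definition lives in `festival/lib/phrase.scm', and the details here are illustrative rather than copied from the distribution. It returns the token's punctuation only for the last word in the token:

(define (token_end_punc word)
 "(token_end_punc WORD)
Returns the punctuation of WORD's token if WORD is the last word
in that token, otherwise 0."
 (if (item.next (item.relation word 'Token))
     "0"  ;; not the last word in the token
     (item.feat word "R:Token.parent.punc")))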

To make a better phrasing model requires more information. As the basic punctuation model under-predicts, we need information that will find reasonable boundaries within strings of words. In English, boundaries are more likely between content words and function words, because most function words precede the words they relate to; in Japanese, function words typically follow their related content words, so breaks are more likely between function words and content words. If you have no data to train from, hand-written rules in a CART tree can exploit this fact and give a phrasing model that is better than punctuation alone. Basically, a rule could be: if the current word is a content word and the next is a function word (or the reverse, if that is appropriate for the language) and we are more than 5 words from a punctuation symbol, then predict a break. We may also want to ensure that we are at least five words from a predicted break too.

Note the above basic rules aren't optimal, but when you are building a new voice in a new language and have no data to train from, you will get reasonably far with simple rules like these, such that phrasing prediction will be less of a problem than the other problems you will find in your voice.

To implement such a scheme we need three basic functions: one to determine whether the current word is a function or content word, one to determine the number of words since the previous punctuation (or start of utterance), and one to determine the number of words to the next punctuation (or end of utterance). The first of these is already provided for by the feature function gpos. This uses the word lists in the Lisp variable guess_pos to determine the basic category of a word. Because in most languages the set of function words is very nearly a closed class, they can usually be explicitly listed. The format of the guess_pos variable is a list of lists whose first element is the set name and the rest of the list is the words that are part of that set. Any word not a member of any of these sets is defined to be in the set content. For example the basic definition of this for English, given in `festival/lib/pos.scm', is

(set! english_guess_pos
      '((in of for in on that with by at from as if that against about 
	    before because if under after over into while without
	    through new between among until per up down)
	(to to)
	(det the a an no some this that each another those every all any 
	     these both neither no many)
	(md will may would can could should must ought might)
	(cc and but or plus yet nor)
	(wp who what where how when)
	(pps her his their its our their its mine)
	(aux is am are was were has have had be)
	(punc "." "," ":" ";" "\"" "'" "(" "?" ")" "!")
	))

The punctuation distance check can be written as a Lisp feature function

(define (since_punctuation word)
 "(since_punctuation word)
Number of words since last punctuation or beginning of utterance."
 (cond
   ((null word) 0) ;; beginning of utterance
   ((string-equal "0" (item.feat word "p.lisp_token_end_punc")) 0)
   (t
    (+ 1 (since_punctuation (item.prev word))))))

The function looking forward would be

(define (until_punctuation word)
 "(until_punctuation word)
Number of words until next punctuation or end of utterance."
 (cond
   ((null word) 0) ;; end of utterance
   ((string-equal "0" (token_end_punc word)) 0)
   (t
    (+ 1 (until_punctuation (item.next word))))))

The whole tree, using these features, which will insert a break at punctuation or between content and function words more than 5 words from a punctuation symbol, is as follows

(set! simple_phrase_cart_tree_2
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((lisp_since_punctuation > 5)
     ((lisp_until_punctuation > 5)
      ((gpos is content)
       ((n.gpos is content)
        ((NB))
        ((B)))   ;; not content so a function word
       ((NB)))   ;; this is a function word
      ((NB)))    ;; too close to punctuation
     ((NB)))     ;; too soon after punctuation
    ((NB))))))

To use this, add the above to a file in your `festvox/' directory and ensure it is loaded by your standard voice file. Then, in your voice definition function, add the following

   (set! guess_pos english_guess_pos) ;; or appropriate for your language
 
   (Parameter.set 'Phrase_Method 'cart_tree)
   (set! phrase_cart_tree simple_phrase_cart_tree_2)

A much better method for predicting phrase breaks is to use a full statistical model trained from data. The problem is that you need a lot of data to train phrase break models. Elsewhere in this document we suggest the use of a timit-style database of around 460 sentences (around 14,500 segments) for training models. However a database such as this has very few utterance-internal phrase breaks. An almost perfect model would predict breaks at the end of each utterance and never internally. Even the f2b database from the Boston University Radio News Corpus ostendorf95, which does have a number of utterance-internal breaks, isn't really big enough. For English we used the MARSEC database roach93, which is much larger (around 37,000 words). Finding such a database for your language will not be easy and you may need to fall back on a purely hand-written rule system.

Syntax is often suggested as a strong correlate of prosodic phrasing. Although there is evidence that it influences prosodic phrasing, there are notable exceptions bachenko90. Also, considering how difficult it is to get a reliable parse tree, it is probably not worth the effort; training a reliable parser is non-trivial (though we provide a method for training stochastic context free grammars in the speech tools, see the manual for details). Of course if your text to be synthesized is coming from a language system, such as machine translation or language generation, then a syntax tree may be readily available, and in that case a simple rule mechanism taking syntactic phrasing into account may be useful.

When only moderate amounts of data are available for training, a simple CART tree may be able to tease out a reasonable model. See hirschberg94 for some discussion of this. Here is a short example of building a CART tree for phrase break prediction. Let us assume you have a database of utterances as described previously. By convention we build models in directories under `festival/' in the main database directory. Thus let us create `festival/phrbrk'.

First we need to list the features that are likely to be good predictors of phrase breaks. Add these to a file `phrbrk.feats'; what goes in here will depend on what you have. Full part of speech helps a lot, but you may not have that for your language; the gpos described above is a good cheap alternative. Possible features may be

word_break
lisp_token_end_punc
lisp_until_punctuation
lisp_since_punctuation
p.gpos
gpos
n.gpos

Given this list you can extract features from your database of utterances with the Festival script `dumpfeats'

dumpfeats -eval ../../festvox/phrbrk.scm -feats phrbrk.feats \
   -relation Word -output phrbrk.data ../utts/*.utt

`festvox/phrbrk.scm' should contain the definitions of the functions until_punctuation, since_punctuation and any other Lisp feature functions you define.

Next we want to split this data into test and train sets. We provide a simple shell script called `traintest' which splits a given file 9:1, i.e. every 10th line is put in the test set, producing `phrbrk.data.train' and `phrbrk.data.test'.

traintest phrbrk.data

As we intend to run `wagon', the CART tree builder, on this data, we also need to create a feature description file for it. The feature description file consists of a bracketed list of feature names and types. A type may be int, float or categorical, in which case a list of possible values is given. The script `make_wagon_desc' (distributed with the speech tools) will make a reasonable first approximation of this file

make_wagon_desc phrbrk.data phrbrk.feats phrbrk.desc

This script treats all features as categorical. Thus any float or int features will be treated categorically, and each value found in the data will be listed as a separate item. In our example lisp_since_punctuation and lisp_until_punctuation are actually float (well, maybe even int) but they will be listed as categorical in `phrbrk.desc', something like

...
(lisp_since_punctuation
0
1
2
4
3
5
6
7
8)
...

You should change this entry (by hand) to be

...
(lisp_since_punctuation float )
...

The script cannot work out the type of a feature automatically so you must make this decision yourself.

Now that we have the data and description we can build a CART tree. The basic command for `wagon' will be

wagon -desc phrbrk.desc -data phrbrk.data.train -test phrbrk.data.test \
   -output phrbrk.tree

You will probably also want to set a stop value. The default stop value is 50, which means there must be at least 50 examples in a group before it will consider looking for a question to split it. Unless you have a lot of data this is probably too large and a value of 10 to 20 is probably more reasonable.

Other arguments to `wagon' should also be considered. A stepwise approach, where features are tested incrementally to find the subset which gives the best tree, can give better results than simply using all features. Care should be taken with this, though, as the generated tree becomes optimized for the given test set; a further held-out test set is required to properly test the accuracy of the result. In the stepwise case it is normal to split the train set again and call wagon as follows

traintest phrbrk.data.train
wagon -desc phrbrk.desc -data phrbrk.data.train.train \
   -test phrbrk.data.train.test \
   -output phrbrk.tree -stepwise
wagon_test -data phrbrk.data.test -desc phrbrk.desc \
   -tree phrbrk.tree

Stepwise is particularly useful when features are highly correlated with each other and it is not clear which is the best general predictor. Note that stepwise will take much longer to run, as it potentially must build a large number of trees.

Other arguments to `wagon' can also be considered; refer to the relevant chapter in the speech tools manual for details.

However it should be noted that without good intonation and duration models, spending time on producing good phrasing is probably not worth it. The quality of these three prosodic components is closely related, such that if one is much better than the others there may not be any real benefit.

8.2 Accent/Boundary Assignment

Accent and boundary tones are what we will use, hopefully in a theory-independent way, to refer to the two main types of intonation event. For English, and for many other languages, predicting the positions of accents and boundaries can be done as a process independent of F0 contour generation itself. This is certainly true for the major theories we will be considering.

As with phrase break prediction, there are some simple rules that will go a surprisingly long way. And as with most of the other statistical learning techniques, simple rules cover most of the work; more complex rules work better, but the best results come from taking the sorts of information you were using in rules and training on them statistically from appropriate data.

For English, placing accents on the stressed syllables of all content words is a quite reasonable approximation, achieving about 80% accuracy on typical databases. hirschberg90 is probably the best example of a detailed rule-driven approach (for English). CART trees based on the sorts of features Hirschberg uses are quite reasonable. Eventually, though, such rules become limiting and a richer knowledge source is required to assign accent patterns to complex nominals (see sproat90).

However all these techniques quickly hit the stumbling block that although simple so-called discourse-neutral intonation is relatively easy to achieve, realistic, natural accent placement is still beyond our synthesis systems (though perhaps not for much longer).

The simplest rule for English may also be reasonable for other languages. There are even simpler solutions, such as fixed prosody or a fixed declination, but apart from debugging a voice these are simpler than is required even for the most basic voices.

For English, adding a simple hat accent on the lexically stressed syllable of every content word works surprisingly well. To do this in Festival you need a CART tree to predict accentedness, and rules to add the hat accent (though we will leave F0 generation until the next section).

A basic tree that predicts accents of stressed syllables in content words is

(set! simple_accent_cart_tree
  '
  (
   (R:SylStructure.parent.gpos is content)
    ( (stress is 1)
       ((Accented))
       ((NONE))
    )
    ((NONE))  ;; function words are not accented
  )
)

The above tree simply distinguishes accented syllables from non-accented ones. In theories like ToBI (silverman92), a number of different types of accent are supported. ToBI, with variations, has been applied to a number of languages and may be suitable for yours. However, although accent and boundary types have been identified for various languages and dialects, a computational mechanism for generating an F0 contour from an accent specification often has not yet been specified (we will discuss this more fully below).
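
A tree returning more than one accent type takes exactly the same form. The following hand-written sketch, invented purely for illustration (it is not a trained or validated model), distinguishes two ToBI accent types:

(set! illustrative_tobi_accent_cart_tree
  '((R:SylStructure.parent.gpos is content)
    ((stress is 1)
     ((p.stress is 1)  ;; previous syllable also stressed
      ((L+H*))
      ((H*)))
     ((NONE)))
    ((NONE))))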

If the above is considered too naive, a more elaborate hand-specified tree can also be written, using relevant factors, probably similar to those used in hirschberg90. Beyond that, training from data is the next option. Assuming a database exists and has been labelled with discrete accent classifications, we can extract data from it for training a CART tree with `wagon'. We will build the tree in `festival/accents/'. First we need a file listing the features that are felt to affect accenting. Here we predict accents on syllables, as that is what has been used for the English voices created so far, though there is an argument for predicting accent placement on a word basis: although accents ultimately need to be syllable-aligned, which syllable in a word gets the accent is reasonably well defined (at least compared with predicting accent placement itself).

A possible list of features for accent prediction is put in the file `accent.feats'.

R:Intonation.daughter1.name
R:SylStructure.parent.R:Word.p.gpos
R:SylStructure.parent.gpos
R:SylStructure.parent.R:Word.n.gpos
ssyl_in
syl_in
ssyl_out
syl_out
p.stress
stress
n.stress
pp.syl_break
p.syl_break
syl_break
n.syl_break
nn.syl_break
pos_in_word
position_type

We can extract these features from the utterances using the Festival script `dumpfeats'

dumpfeats -feats accent.feats -relation Syllable \
      -output accent.data ../utts/*.utt

We now need a description file for the features, which can be approximated by the speech tools script `make_wagon_desc'

make_wagon_desc accent.data accent.feats accent.desc

Because this script cannot determine whether a feature is categorical or takes a range of continuous values, you must hand edit the output file and change any feature to float or int if that is what it is.

The next stage is to split the data into training and test sets. If stepwise training is to be used for building the CART tree (which is recommended) then the training data should be further split

traintest accent.data
traintest accent.data.train

Deciding on a stop value for training depends on the number of examples, though this can be tuned to ensure over-training isn't happening.

wagon -data accent.data.train.train -desc accent.desc \
   -test accent.data.train.test -stop 10 -stepwise -output accent.tree
wagon_test -data accent.data.test -desc  accent.desc \
   -tree accent.tree

The above is designed to predict accents; a similar tree should be used to predict boundary tones as well. For the most part intonation boundaries are defined to occur at prosodic phrase boundaries, so that task is somewhat easier, though if you have a number of boundary tone types in your inventory the prediction is not so straightforward.
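
The voice setup below refers to a tone tree via int_tone_cart_tree. A minimal hand-written sketch of such a tree, predicting a single illustrative BOUNDARY label on phrase-final syllables (substitute whatever endtone labels your models actually support), might be:

(set! simple_tone_cart_tree
  '((syl_break > 1)   ;; phrase-final syllable
    ((BOUNDARY))
    ((NONE))))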

When training ToBI-type accent models it is not easy to get the right variation of accent types. Although some ToBI labels have been associated with semantic intentions, and including discourse information has been shown to help prediction (e.g. black97a), getting this acceptably correct is not easy. Various techniques for modifying the training data do seem to help. Because of the low incidence of `L*' labels in at least the f2b data, duplicating all sample points in the training data that contain L's does increase their likelihood of prediction and does seem to give a more varied distribution. Alternatively, wagon returns a probability distribution over the accents; normally the most probable is selected, but this could be modified to select from the distribution randomly, based on the probabilities.
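
A sketch of that sampling idea follows. It assumes the leaf distribution is available as a list of (class probability) pairs, and it assumes a random number primitive; the random function used here is hypothetical, so check what your build actually provides:

(define (sample_from_dist dist)
 "(sample_from_dist DIST)
DIST is a list of (class prob) pairs as found in a wagon leaf.
Picks a class at random according to the probabilities."
 (let ((r (/ (random 1000) 1000.0))  ;; hypothetical random primitive
       (sum 0.0)
       (chosen nil))
   (mapcar
    (lambda (cp)
      (set! sum (+ sum (cadr cp)))
      (if (and (null chosen) (<= r sum))
          (set! chosen (car cp))))
    dist)
   (if chosen chosen (car (car dist)))))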

Once trees have been built they can be used in a voice as follows, within the voice definition function

   (set! int_accent_cart_tree simple_accent_cart_tree)
   (set! int_tone_cart_tree simple_tone_cart_tree)
   (Parameter.set 'Int_Method Intonation_Tree)

or if only one tree is required you can use the simpler intonation method

   (set! int_accent_cart_tree simple_accent_cart_tree)
   (Parameter.set 'Int_Method Intonation_Simple)

8.3 F0 Generation

Predicting where accents go (and their types) is only half of the problem. We also have to build an F0 contour based on them. Note that intonation is split between accent placement and F0 generation because accent position influences durations, and an F0 contour cannot be generated without knowing the durations of the segments the contour is to be generated over.

There are three basic F0 generation modules available in Festival, though others could be added: by general rule, by linear regression/CART, and by Tilt.

8.3.1 F0 by rule

The first is designed to be the most general and will always allow some form of F0 generation. This method allows target points to be programmatically created for each syllable in an utterance. The idea follows closely a generalization of the implementation of ToBI-type accents in anderson84, where n points are predicted for each accent. They (and others in intonation research) appeal to the notion of a baseline and place target F0 points above and below that line based on accent type and position in phrase. The baseline itself is often defined to decline over the phrase, reflecting the general declination of F0 over time.

The simple idea behind this general method is that a Lisp function is called for each syllable in the utterance. That Lisp function returns a list of target F0 points that lie within that syllable. Thus the generality of this method actually lies in the fact that it simply allows the user to program anything they want. For example our simple hat accent can be generated using this technique as follows.

This fixes the F0 range of the speaker, so it would need to be changed for different speakers.

(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list
         (list start 110)
         (list (/ (+ start end) 2.0) 140)
         (list end 100)))))

It simply checks if the current syllable is accented and if so returns a list of position/target pairs: a value of 110Hz at the start of the syllable, a value of 140Hz at the mid-point of the syllable, and a value of 100Hz at the end.
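
To plug such a function in as the F0 target module, something along the following lines goes in the voice definition (check the Festival manual for the exact parameter names):

(set! int_general_params
      (list
       (list 'targ_func targ_func1)))
(Parameter.set 'Int_Target_Method Int_Targets_General)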

This general technique can be expanded with other rules as necessary. Festival includes an implementation of ToBI using exactly this technique; it is based on the rules described in jilka96 and is in the file `festival/lib/tobi_f0.scm'.

8.3.2 F0 by linear regression

This technique was developed specifically to avoid the difficult decisions of exactly which parameters with which values should be used in rules like those of anderson84. The first implementation of this work is presented in black96. The idea is to find the appropriate F0 target value for each syllable based on available features, by training from data. A set of features is collected for each syllable and a linear regression model is used to model three points on each syllable. The technique produces reasonable synthesis and requires less analysis of the intonation models than would be needed to write a rule system using the general F0 target method described in the previous section.

To be fair, however, this technique is also much simpler and there are obviously a number of intonational phenomena that it cannot capture (e.g. multiple accents on a syllable; nor will it ever really capture accent placement with respect to the vowel). The previous technique allows specification of structure but without explicit training from data (though it doesn't exclude that), while this technique imposes almost no structure but depends solely on data. The Tilt modelling discussed in the following section tries to balance these two extremes.

The advantage of the linear regression method is that very little needs to be known about the intonation of the language under study. Of course if there are knowledge and theories, it is usually better to follow them (or at least find the features which influence the F0 in that language). Extracting features for F0 modelling is similar to extracting features for the other models. This time we want the mean F0 at the start, middle and end of each syllable. The Festival features syl_startpitch, syl_midpitch and syl_endpitch provide this. Note that syl_midpitch returns the pitch at the mid-point of the vowel in the syllable rather than the middle of the syllable.

For a linear regression model all features must be continuous. Thus categorical features which influence F0 need to be converted. The standard technique is to introduce new features, one for each possible value in the class, and output values of 0 or 1 for these modified features depending on the value of the base feature. For example in a ToBI environment the output of the feature tobi_accent will include H*, L*, L+H* etc. In the modified form you would have features of the form tobi_accent_H*, tobi_accent_L*, tobi_accent_L_H*, etc.

The program `ols' in the speech tools takes feature files and description files in exactly the same format as `wagon', except that all features must be declared as type `float'. The standard ordinary least squares algorithm used to find the coefficients cannot, in general, deal with features that are directly correlated with others, as this causes a singularity when inverting the matrix. The solution is to exclude such features; the option -robust handles this, though at the expense of a longer compute time. Again, as with `wagon', a stepwise option is included so that the best subset of features may be found.

The resulting models may be used by the Int_Targets_LR module, which takes its LR models from the variables f0_lr_start, f0_lr_mid and f0_lr_end. The output of ols is a list of coefficients (with the Intercept first). These need to be converted to the appropriate bracketed form, including their feature names. An example is in `festival/lib/f2bf0lr.scm'.
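
The bracketed form pairs each feature name with its coefficient, with the Intercept entry first. The coefficients below are invented purely to show the shape; see `festival/lib/f2bf0lr.scm' for a real model:

(set! f0_lr_start
      '((Intercept 160.0)
        (syl_break 4.0)
        (stress 5.5)
        (tobi_accent_H* 10.0)
        (tobi_accent_L* -8.0)))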

If the conversion of categoricals to floats seems too much work, or would prohibitively increase the number of features, you could use `wagon' to generate trees to predict F0 values. The advantage of a decision tree over the LR model is that it can deal with data in a non-linear fashion, but this is also the disadvantage. The decision tree technique may also split the data sub-optimally. The LR model is probably more theoretically appropriate, but ultimately the results depend on how good the models sound.

Dump features as with the LR models, but this time there is no need to convert categorical features to floats. A potential set of features to predict from (for the other two models, substitute syl_startpitch or syl_midpitch as the first, predictee, feature) is

syl_endpitch
pp.tobi_accent
p.tobi_accent
tobi_accent
n.tobi_accent
nn.tobi_accent
pp.tobi_endtone
R:Syllable.p.tobi_endtone
tobi_endtone
n.tobi_endtone
nn.tobi_endtone
pp.syl_break
p.syl_break
syl_break
n.syl_break
nn.syl_break
pp.stress
p.stress
stress
n.stress
nn.stress
syl_in
syl_out
ssyl_in
ssyl_out
asyl_in
asyl_out
last_accent
next_accent
sub_phrases

The above, of course, assumes a ToBI accent labelling; modify it as appropriate for your actual labelling.

Once you have generated three trees predicting values for the start, mid and end points of each syllable, you will need some Scheme code to use them appropriately. Suitable code is provided in `src/intonation/tree_f0.scm'; you will need to include that in your voice. To use it as the intonation target module, add something like the following to your voice function

(set! F0start_tree f2b_F0start_tree)
(set! F0mid_tree f2b_F0mid_tree)
(set! F0end_tree f2b_F0end_tree)
(set! int_params
	'((target_f0_mean 110) (target_f0_std 10)
	  (model_f0_mean 170) (model_f0_std 40)))
(Parameter.set 'Int_Target_Method Int_Targets_Tree)

The int_params values allow you to use the model with a speaker of a different pitch range. That is, all predicted values are converted using the formula

   (+ (* (/ (- value model_f0_mean) model_f0_stddev)
       target_f0_stddev) target_f0_mean)

Or for those of you who can't read Lisp expressions

   (((value - model_f0_mean) / model_f0_stddev) * target_f0_stddev) +
      target_f0_mean

The values in the example above are for converting a female speaker (used for training) to a male pitch range.
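
For example, with these values a predicted F0 of 210Hz, one model standard deviation above the model mean, maps to ((210 - 170) / 40) * 10 + 110 = 120Hz, one target standard deviation above the target mean.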

8.3.3 Tilt modelling

Tilt modelling is still under development and not as mature as the other methods described above, but it potentially offers a more consistent solution to the problem. A Tilt parameterization of a natural F0 contour can be automatically derived from a waveform and a labelling of accent placements (a simple `a' for accents and `b' for boundaries) taylor99. Further work is being done on automatically finding the accent placements too.

For each `a' in a labelling, four continuous parameters are found: height, duration, peak position with respect to vowel start, and tilt. Prediction models may then be generated for these parameters, which we feel better capture the dimensions of the F0 contour itself. We have had success in building models for these parameters, dusterhoff97a, with better results than the linear regression model on comparable data. However, so far we have not done any tests with Tilt on languages other than English.

The speech tools include the programs `tilt_analysis' and `tilt_synthesis' to aid model building, but we do not yet include full Festival-side support for using the generated models.

8.4 Duration

As with the prosodic phenomena above, very simple solutions to predicting durations work surprisingly well, though very good solutions are extremely difficult to achieve.

Again the basic strategy is: fixed models, simple rule models, complex rule models, and trained models using the features of the complex rule models. The choice of where to stop depends on the resources available to you and the time you wish to spend on the problem. Given a reasonably sized database, training a simple CART tree for durations achieves quite acceptable results. This is currently what we do for our English voices in Festival. There are better models out there, but we have not fully investigated them or included easy scripts to customize them.

The simplest duration model is a fixed duration for each phone; a value of 100 milliseconds is a reasonable start. This type of model is only of use for initial testing of a diphone database; beyond that it sounds too artificial. The Festival function SayPhones uses a fixed duration model, controlled by the value (in ms) in the variable FP_duration. Although there is a fixed duration module in Festival (see the manual), it is worthwhile starting off with something a little more interesting.

The next level of duration model uses average durations for the phones. Even when real data isn't available to calculate averages, writing values by hand can be acceptable; basically vowels are longer than consonants, and stops are the shortest. Values for a set of phones can be estimated by looking at data from another language (if you are really stuck, see `festival/lib/mrpa_durs.scm') to get the basic idea of average phone lengths.

In most languages phones are longer in phrase-final and, to a lesser extent, phrase-initial positions. A simple multiplicative factor can be defined for these positions. The next stage is a set of rules that modify the basic average based on the context it occurs in. For English the best definition of such rules is the set of duration rules given in chapter 9 of allen87 (often referred to as the Klatt duration model). The factors used there may also apply to other languages. A simplified form of this, which we have successfully used for a number of languages and often use as our first approximation for a duration rule set, is as follows.

Here we define a simple decision tree that returns a multiplication factor for a segment

(set! simple_dur_tree
 '
   ((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
    ((R:SylStructure.parent.stress is 1)
     ((1.5))
     ((1.2)))
    ((R:SylStructure.parent.syl_break > 1)   ;; clause final
     ((R:SylStructure.parent.stress is 1)
      ((1.5))
      ((1.2)))
     ((R:SylStructure.parent.stress is 1)
      ((ph_vc is +)
       ((1.2))
       ((1.0)))
      ((1.0))))))

You may modify this, adding more conditions, as much as you want. In addition to the tree you need to define the averages for each phone in your phone set. For reasons we will explain below, the format of this information is `(segname 0.0 average)' as in

(set! simple_phone_data
'(
   (# 0.0 0.250)
   (a 0.0 0.080)
   (e 0.0 0.080)
   (i 0.0 0.070)
   (o 0.0 0.080)
   (u 0.0 0.070)
   (i0 0.0 0.040)
   ...
 ))

With both these expressions loaded in your voice, you may set the following in your voice definition function, setting up this tree and data as the standard and selecting the appropriate duration module.

  ;; Duration prediction
  (set! duration_cart_tree simple_dur_tree)
  (set! duration_ph_info simple_phone_data)
  (Parameter.set 'Duration_Method 'Tree_ZScores)

Though in your voice you should use voice-specific names for the simple_ variables, otherwise you may clash with other voices.

It has been shown campbell91 that a better representation of duration for modeling is zscores, that is, the number of standard deviations from the mean. The duration module used above is actually designed to take a CART tree that returns zscores and use the information in duration_ph_info to change that into an absolute duration. The two fields after the phone name are mean and standard deviation. The interpretation of this tree and this phone info happens to give the right result when we use the tree to predict factors and have the stddev field contain the average duration, as we did above.
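
In other words the module computes duration = mean + zscore * stddev from the (name mean stddev) entries. The following helper is a minimal sketch of that arithmetic (the name is ours; the module does this internally):

(define (zscore_to_duration segname zscore)
 "(zscore_to_duration SEGNAME ZSCORE)
Convert a predicted zscore to an absolute duration using the
(name mean stddev) entries in duration_ph_info."
 (let ((info (assoc segname duration_ph_info)))
   (+ (cadr info)                 ;; mean
      (* zscore (caddr info)))))  ;; standard deviation

With the factor tree above this reduces to factor times average, since the mean field is 0.0 and the stddev field holds the average duration.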

However, no matter whether we use zscores or absolute values, a better way to build a duration model is to train from data rather than arbitrarily selecting modification factors.

Given a reasonably sized database, we can dump durations and features for each segment in the database and then train a model using those samples. For our English voices we have trained regression models using `wagon', though we include the tools for building linear regression models too.

An initial set of features to dump might be

segment_duration
name
p.name
n.name
R:SylStructure.parent.syl_onsetsize
R:SylStructure.parent.syl_codasize
R:SylStructure.parent.R:Syllable.n.syl_onsetsize
R:SylStructure.parent.R:Syllable.p.syl_codasize
R:SylStructure.parent.position_type
R:SylStructure.parent.parent.word_numsyls
pos_in_syl
syl_initial
syl_final
R:SylStructure.parent.pos_in_word
p.seg_onsetcoda
seg_onsetcoda
n.seg_onsetcoda
pp.ph_vc 
p.ph_vc 
ph_vc 
n.ph_vc 
nn.ph_vc
pp.ph_vlng 
p.ph_vlng 
ph_vlng 
n.ph_vlng 
nn.ph_vlng
pp.ph_vheight
p.ph_vheight
ph_vheight
n.ph_vheight
nn.ph_vheight
pp.ph_vfront
p.ph_vfront
ph_vfront
n.ph_vfront
nn.ph_vfront
pp.ph_vrnd
p.ph_vrnd
ph_vrnd
n.ph_vrnd
nn.ph_vrnd
pp.ph_ctype 
p.ph_ctype 
ph_ctype 
n.ph_ctype 
nn.ph_ctype
pp.ph_cplace
p.ph_cplace
ph_cplace
n.ph_cplace
nn.ph_cplace
pp.ph_cvox 
p.ph_cvox 
ph_cvox 
n.ph_cvox 
nn.ph_cvox
R:SylStructure.parent.R:Syllable.pp.syl_break
R:SylStructure.parent.R:Syllable.p.syl_break
R:SylStructure.parent.syl_break
R:SylStructure.parent.R:Syllable.n.syl_break
R:SylStructure.parent.R:Syllable.nn.syl_break
R:SylStructure.parent.R:Syllable.pp.stress 
R:SylStructure.parent.R:Syllable.p.stress 
R:SylStructure.parent.stress 
R:SylStructure.parent.R:Syllable.n.stress 
R:SylStructure.parent.R:Syllable.nn.stress 
R:SylStructure.parent.syl_in
R:SylStructure.parent.syl_out
R:SylStructure.parent.ssyl_in
R:SylStructure.parent.ssyl_out
R:SylStructure.parent.parent.gpos

By convention we build duration models in `festival/dur/'. We will save the above feature names in `dur.featnames'. We can dump the features with the command

dumpfeats -relation Segment -feats dur.featnames -output dur.feats \
         ../utts/*.utt

This will put all the features in the file `dur.feats'. For wagon we need to build a feature description file; we can build a first approximation with the `make_wagon_desc' script available with the speech tools

make_wagon_desc dur.feats dur.featnames dur.desc

You will then need to edit `dur.desc' to change a number of features from their categorical list (lots of numbers) into type float. Specifically for the above list the features segment_duration, R:SylStructure.parent.parent.word_numsyls, pos_in_syl, R:SylStructure.parent.pos_in_word, R:SylStructure.parent.syl_in, R:SylStructure.parent.syl_out, R:SylStructure.parent.ssyl_in and R:SylStructure.parent.ssyl_out should be declared as floats.

We then need to split the data into training and test sets (and further split the train set if we are going to use stepwise CART building).

traintest dur.feats
traintest dur.feats.train

We can now build a model using wagon

wagon -data dur.feats.train.train -desc dur.desc \
        -test dur.feats.train.test -stop 10 -stepwise \
        -output dur.10.tree 
wagon_test -data dur.feats.test -tree dur.10.tree -desc dur.desc

You may wish to remove all examples of silence from the data, as silence durations typically have quite a different distribution from other phones. In fact it is common for databases to include many examples of silence which are not of natural length, as they are arbitrary parts of the initial and final silence around the spoken utterances. Their durations are not something that should be trained on.

The instructions above will build a tree that predicts absolute values. To get such a tree to work with the zscore module, simply set the stddev field described above to 1 (leaving the mean field at 0.0). As stated above, using zscores typically gives better results. Although the correlation of these duration models in the zscore domain may not be as good as that of models trained to predict absolute values, when the predicted zscores are converted back into the absolute domain we have found (for English) that the correlations are better, and the RMSE smaller.
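
That is, an absolute-value tree can be run through the same module with degenerate phone info, so that duration = 0.0 + prediction * 1. The entries below are just to show the shape:

(set! abs_phone_data
 '(
   (# 0.0 1.0)
   (a 0.0 1.0)
   (e 0.0 1.0)
   (i 0.0 1.0)  ;; and so on for the whole phone set
  ))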

In order to train a zscore model you need to convert the absolute segment durations to zscores; to do that you need the mean and standard deviation for each segment type in your phoneset.
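
Any script that produces per-phone means and standard deviations will do; as a hypothetical illustration of the statistics involved (the helper name and approach are ours, not part of the distribution):

(define (phone_stats durations)
 "(phone_stats DURATIONS)
DURATIONS is a list of observed durations (in seconds) for one
phone; returns (mean stddev) for a duration_ph_info entry."
 (let ((n (length durations))
       (sum 0.0)
       (sumsq 0.0))
   (mapcar
    (lambda (d)
      (set! sum (+ sum d))
      (set! sumsq (+ sumsq (* d d))))
    durations)
   (let ((mean (/ sum n)))
     (list mean (sqrt (- (/ sumsq n) (* mean mean)))))))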

There is a whole range of possible mappings for the distribution of durations: zscores, logs, log-zscores, etc., or even more complex functions bellegarda98. These variations do give some improvements. The intention is to map the distribution to a normal distribution, which makes it easier to learn.

Other learning techniques may also be considered, particularly the Sums of Products model (sproat98, chapter 5), which has been shown to train better even on small amounts of data.

Another technique, which arguably shouldn't work, is to borrow a model trained for another language for which data is available. In fact the duration model used in Festival for the US and UK voices is the same; it was trained from the f2b database, a US English database. As the phone sets are different for US and UK English, we trained the models using phonetic features rather than phone names, and trained them in the zscore domain, keeping the actual phone names, means and standard deviations separate. Although the models were slightly better if we included the phone names themselves, it was only slightly better, and the models were also substantially larger (and took longer to train). Using phonetic features offers a more general model (it works for UK English), more compact, quicker to learn, and with only a small cost in performance.

The German voice developed at OGI also used the same English duration model. The results are acceptable and are at least better than any hand-written rule system that could easily be produced. Improvements to that model are probably only possible by training on real German data. Note, however, that such cross-language borrowing of models is unlikely to work in general, though there may be cases where it is a reasonable fallback position.

8.5 Prosody Research

Note that the above descriptions are aimed at easy implementation of prosody models, which unfortunately means the models will not be perfect. Of course no models will be perfect, but with some work it is often possible to improve the basic models or at least make them more appropriate for the synthesis task. For example, if the intended use of your synthesis voice is primarily for dialog systems, training on newscaster speech will not give the best effect. Festival is designed as a research system as well as a tool for building voices in new languages, so it is well adapted to prosody research.

One thing which clearly shows how impoverished our prosodic models are is comparing predicted prosody with natural prosody. Given a label file and an F0 target file, the following code will generate that utterance using the current voice

(define (resynth labfile f0file)
  (let ((utt (Utterance SegF0))) ; need some utterance to start with
    (utt.relation.load utt 'Segment labfile)
    (utt.relation.load utt 'Target f0file)
    (Wave_Synth utt))
)

The format of the label file should be one that can be read into Festival (e.g. the XLabel format). For example

#
	 0.02000 26 	pau ; 
	 0.09000 26 	ih ; 
	 0.17500 26 	z ; 
	 0.22500 26 	dh ; 
	 0.32500 26 	ae ; 
	 0.35000 26 	t ; 
	 0.44500 26 	ow ; 
	 0.54000 26 	k ; 
	 0.75500 26 	ey ; 
	 0.79000 26 	pau ; 

The target file is a little more complex. Again it is a label file, but with features "pos" and "f0" at each point. Thus the format for a naturally rendered version of the above would be.

#
0.070000 124 0 ; pos 0.070000 ; f0 133.045230 ; 
0.080000 124 0 ; pos 0.080000 ; f0 129.067890 ; 
0.090000 124 0 ; pos 0.090000 ; f0 125.364600 ; 
0.100000 124 0 ; pos 0.100000 ; f0 121.554800 ; 
0.110000 124 0 ; pos 0.110000 ; f0 117.248260 ; 
0.120000 124 0 ; pos 0.120000 ; f0 115.534490 ; 
0.130000 124 0 ; pos 0.130000 ; f0 113.769620 ; 
0.140000 124 0 ; pos 0.140000 ; f0 111.513180 ; 
0.240000 124 0 ; pos 0.240000 ; f0 108.386380 ; 
0.250000 124 0 ; pos 0.250000 ; f0 102.564100 ; 
0.260000 124 0 ; pos 0.260000 ; f0 97.383600 ; 
0.270000 124 0 ; pos 0.270000 ; f0 97.199710 ; 
0.280000 124 0 ; pos 0.280000 ; f0 96.537280 ; 
0.290000 124 0 ; pos 0.290000 ; f0 96.784970 ; 
0.300000 124 0 ; pos 0.300000 ; f0 98.328150 ; 
0.310000 124 0 ; pos 0.310000 ; f0 100.950830 ; 
0.320000 124 0 ; pos 0.320000 ; f0 102.853580 ; 
0.370000 124 0 ; pos 0.370000 ; f0 117.105770 ; 
0.380000 124 0 ; pos 0.380000 ; f0 116.747730 ; 
0.390000 124 0 ; pos 0.390000 ; f0 119.252310 ; 
0.400000 124 0 ; pos 0.400000 ; f0 120.735070 ; 
0.410000 124 0 ; pos 0.410000 ; f0 122.259190 ; 
0.420000 124 0 ; pos 0.420000 ; f0 124.512020 ; 
0.430000 124 0 ; pos 0.430000 ; f0 126.476430 ; 
0.440000 124 0 ; pos 0.440000 ; f0 121.600880 ; 
0.450000 124 0 ; pos 0.450000 ; f0 109.589040 ; 
0.560000 124 0 ; pos 0.560000 ; f0 148.519490 ; 
0.570000 124 0 ; pos 0.570000 ; f0 147.093260 ; 
0.580000 124 0 ; pos 0.580000 ; f0 149.393750 ; 
0.590000 124 0 ; pos 0.590000 ; f0 152.566530 ; 
0.670000 124 0 ; pos 0.670000 ; f0 114.544910 ; 
0.680000 124 0 ; pos 0.680000 ; f0 119.156750 ; 
0.690000 124 0 ; pos 0.690000 ; f0 120.519990 ; 
0.700000 124 0 ; pos 0.700000 ; f0 121.357320 ; 
0.710000 124 0 ; pos 0.710000 ; f0 121.615970 ; 
0.720000 124 0 ; pos 0.720000 ; f0 120.752700 ; 

This file was generated from a waveform using the following command

pda -s 0.01 -otype ascii -fmax 160 -fmin 70 wav/utt003.wav | 
awk 'BEGIN { printf("#\n") }
     { if ($1 > 0)
         printf("%f 124 0 ; pos %f ; f0 %f ; \n",
                NR*0.010,NR*0.010,$1) }' >Targets/utt003.Target

The utterance may then be rendered as

festival> (set! utt1 (resynth "lab/utt003.lab" "Targets/utt003.Target"))

Note that this method will lose a little in diphone selection. If your diphone database uses consonant cluster allophones, it won't be possible to properly detect these, as there is no syllabic structure in this representation. That may or may not be important to you. Even this simple method, however, clearly shows how important the right prosody is to the understandability of a string of phones.

We have successfully done this on a number of natural utterances. We extracted the labels automatically by using the aligner discussed in the diphone chapter. As we were using diphones from the same speaker as the natural utterances (KAL), the alignment is surprisingly good and trivial to do. You must, however, synthesize the utterance first and save the waveform and labels. Note you should listen to the result to ensure that the synthesizer has generated the right labels (as much as that is possible), including breaks in the same places. Comparing synthesized utterances with natural ones quickly shows up many problems in synthesis.

