Prosodic phrasing in speech synthesis makes the whole speech more understandable. Due to the size of people's lungs there is a finite length of time people can talk before they must take a breath, which defines an upper bound on prosodic phrases. However, we rarely make our phrases this maximum length, and we use phrasing to mark groups within the speech. There is the apocryphal story of the speech synthesis example with an unnaturally long prosodic phrase played at a conference presentation. At the end of the phrase the whole audience took a large intake of breath.
In most cases very simple prosodic phrasing is sufficient. A comparison of various prosodic phrasing techniques is given in taylor98a, though we will cover some of them here too.
For English (and most likely many other languages too), simple rules based on punctuation are a very good predictor of prosodic phrase boundaries. It is rare that punctuation exists where there is no boundary, but there will be a substantial number of prosodic boundaries that are not explicitly marked with punctuation. Thus a prosodic phrasing algorithm based solely on punctuation will typically underpredict but rarely make a false insertion. However, depending on the application you wish to use the synthesizer for, it may be possible to add punctuation explicitly at the desired phrase breaks, in which case a prediction system based solely on punctuation is adequate.
Festival basically supports two methods for predicting prosodic phrases, though any other method can easily be used. Note that these do not necessarily entail pauses in the synthesized output. Pauses are further predicted from prosodic phrase information.
The first basic method is by CART tree. A test is made on each word to predict whether it is at the end of a prosodic phrase. The basic CART tree returns B or BB (though it may return whatever break labels you consider appropriate, as long as the rest of your models support them). The two labels identify different levels of break, BB being used to denote a bigger break (and end of utterance).
The following tree is very simple and simply adds a break after the last word of a token that has following punctuation. Note the first condition is done by a Lisp function as we want to ensure that only the last word in a token gets the break. (Earlier erroneous versions of this would insert breaks after each word in `1984.')
(set! simple_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((NB))))))
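The nesting of question, yes-branch and no-branch in such a tree can be easier to follow when it is walked by hand. The following Python sketch is not Festival's interpreter: the tuple encoding, the names `ask` and `predict`, and the plain feature dictionary are all assumptions made purely for illustration.

```python
# Illustrative sketch only; this is NOT Festival's CART interpreter.
# A node is (question, yes_subtree, no_subtree); a leaf is a label string.
PUNC_TREE = (
    ("token_end_punc", "in", ("?", ".", ":")),
    "BB",
    (
        ("token_end_punc", "in", ("'", '"', ",", ";")),
        "B",
        (("n.name", "is", "0"), "BB", "NB"),  # "0" means there is no next word
    ),
)


def ask(question, feats):
    """Evaluate one (feature, operator, value) question against a feature dict."""
    name, op, value = question
    actual = feats[name]
    if op == "in":
        return actual in value
    if op == "is":
        return actual == value
    if op == ">":
        return float(actual) > float(value)
    raise ValueError("unsupported operator: " + op)


def predict(tree, feats):
    """Descend from the root, taking the yes or no branch, until a leaf is reached."""
    while not isinstance(tree, str):
        question, yes, no = tree
        tree = yes if ask(question, feats) else no
    return tree


if __name__ == "__main__":
    print(predict(PUNC_TREE, {"token_end_punc": ",", "n.name": "then"}))
```

A word followed by a comma falls through the first question and is caught by the second, so it receives the smaller break label.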
This tree is defined in `festival/lib/phrase.scm' in the standard distribution and is certainly a good first step in defining a phrasing model for a new language.
To make a better phrasing model requires more information. As the basic punctuation model underpredicts, we need information that will find reasonable boundaries within strings of words. In English, boundaries are more likely between content words and function words, because most function words come before the words they relate to. In Japanese, function words typically come after their related content words, so breaks are more likely between function words and content words. If you have no data to train from, hand-written rules in a CART tree can exploit this fact and give a phrasing model that is better than a punctuation-only one. Basically, a rule could be: if the current word is a content word and the next is a function word (or the reverse if that is appropriate for the language), and we are more than 5 words from a punctuation symbol, then predict a break. We may also want to ensure that we are at least five words from a predicted break too.
Note the above basic rules aren't optimal, but when you are building a new voice in a new language and have no data to train from you will get reasonably far with simple rules like these, such that phrasing prediction will be less of a problem than the other problems you will find in your voice.
To implement such a scheme we need three basic functions: one to determine if the current word is a function or content word, one to determine the number of words since the previous punctuation (or start of utterance), and one to determine the number of words to the next punctuation (or end of utterance). The first of these is already provided through the feature function gpos. This uses the word list in the Lisp variable guess_pos to determine the basic category of a word. Because in most languages the set of function words is very nearly a closed class, they can usually be listed explicitly. The format of the guess_pos variable is a list of lists whose first element is the set name and whose remaining elements are the words that are part of that set. Any word not a member of any of these sets is defined to be in the set content. For example, the basic definition for English, given in `festival/lib/pos.scm', is
(set! english_guess_pos
 '((in of for in on that with by at from as if that against about
       before because if under after over into while without
       through new between among until per up down)
   (to to)
   (det the a an no some this that each another those every all any
        these both neither no many)
   (md will may would can could should must ought might)
   (cc and but or plus yet nor)
   (wp who what where how when)
   (pps her his their its our their its mine)
   (aux is am are was were has have had be)
   (punc "." "," ":" ";" "\"" "'" "(" "?" ")" "!")
   ))
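The lookup that this list supports is simple enough to sketch outside Festival. The Python below is an illustration of the idea only: the dictionary holds a small subset of the English sets, and the function name and the lower-casing of the word are our assumptions, not a description of Festival's own implementation.

```python
# Illustrative subset of a guess_pos-style table: set name -> member words.
GUESS_POS = {
    "to": ["to"],
    "det": ["the", "a", "an", "no", "some", "this", "that", "each"],
    "md": ["will", "may", "would", "can", "could", "should", "must"],
    "cc": ["and", "but", "or", "plus", "yet", "nor"],
    "aux": ["is", "am", "are", "was", "were", "has", "have", "had", "be"],
}


def guess_gpos(word, table=GUESS_POS):
    """Return the first set that lists the word, or 'content' if none does."""
    lowered = word.lower()  # assumption: lookup ignores case
    for set_name, members in table.items():
        if lowered in members:
            return set_name
    return "content"


if __name__ == "__main__":
    for w in ["The", "synthesizer", "and", "was"]:
        print(w, guess_gpos(w))
```

Because the function-word sets are closed, anything unlisted simply falls through to `content`, which is what makes this such a cheap substitute for a real part of speech tagger.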
The punctuation distance check can be written as a Lisp feature function
(define (since_punctuation word)
 "(since_punctuation word)
Number of words since last punctuation or beginning of utterance."
 (cond
   ((null word) 0) ;; beginning of utterance
   ;; stop counting when the previous word ends in punctuation
   ((not (string-equal "0" (item.feat word "p.lisp_token_end_punc"))) 0)
   (t
    (+ 1 (since_punctuation (item.prev word))))))
The function looking forward would be
(define (until_punctuation word)
 "(until_punctuation word)
Number of words until next punctuation or end of utterance."
 (cond
   ((null word) 0) ;; end of utterance
   ;; stop counting when this word ends in punctuation
   ((not (string-equal "0" (token_end_punc word))) 0)
   (t
    (+ 1 (until_punctuation (item.next word))))))
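To check the intended counting by hand, the two distance features can be sketched over a plain list. This Python is only an illustration: the `(word, punctuation)` pair representation, with "0" meaning no punctuation, and the index-based signatures are assumptions made for the example rather than anything Festival provides.

```python
NO_PUNC = "0"  # assumed marker for "no punctuation after this word"


def since_punctuation(words, i):
    """Words counted back from position i to the last punctuation or utterance start."""
    if i < 0:
        return 0  # ran off the beginning of the utterance
    if i > 0 and words[i - 1][1] != NO_PUNC:
        return 0  # the previous word ends in punctuation
    return 1 + since_punctuation(words, i - 1)


def until_punctuation(words, i):
    """Words counted forward from position i to the next punctuation or utterance end."""
    if i >= len(words):
        return 0  # ran off the end of the utterance
    if words[i][1] != NO_PUNC:
        return 0  # this word ends in punctuation
    return 1 + until_punctuation(words, i + 1)


if __name__ == "__main__":
    utt = [("the", "0"), ("cat", "0"), ("sat", ","),
           ("then", "0"), ("it", "0"), ("slept", ".")]
    print([since_punctuation(utt, i) for i in range(len(utt))])
    print([until_punctuation(utt, i) for i in range(len(utt))])
```

Note that under this convention the first word of an utterance counts as one word from the start, while the word directly after a punctuation mark counts as zero.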
The whole tree using these features that will insert a break at punctuation or between content and function words more than 5 words from a punctuation symbol is as follows
(set! simple_phrase_cart_tree_2
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((lisp_since_punctuation > 5)
     ((lisp_until_punctuation > 5)
      ((gpos is content)
       ((n.gpos is content)
        ((NB))
        ((B)))   ;; next word is not content, so a function word
       ((NB)))   ;; this is a function word
      ((NB)))    ;; too close to punctuation
     ((NB)))))))  ;; too soon after punctuation
To use this, add the above to a file in your `festvox/' directory and ensure it is loaded by your standard voice file. Then, in your voice definition function, add the following
(set! guess_pos english_guess_pos) ;; or appropriate for your language
(Parameter.set 'Phrase_Method 'cart_tree)
(set! phrase_cart_tree simple_phrase_cart_tree_2)
A much better method for predicting phrase breaks is using a full statistical model trained from data. The problem is that you need a lot of data to train phrase break models. Elsewhere in this document we suggest the use of a timit style database of around 460 sentences (around 14500 segments) for training models. However, a database such as this has very few utterance internal phrase breaks. An almost perfect model would predict breaks at the end of each utterance and never internally. Even the f2b database from the Boston University Radio News Corpus ostendorf95, which does have a number of utterance internal breaks, isn't really big enough. For English we used the MARSEC database roach93, which is much larger (around 37,000 words). Finding such a database for your language will not be easy and you may need to fall back on a purely hand written rule system.
Often syntax is suggested as a strong correlate of prosodic phrasing. Although there is evidence that it influences prosodic phrasing, there are notable exceptions bachenko90. Also, considering how difficult it is to get a reliable parse tree, it is probably not worth the effort: training a reliable parser is non-trivial (though we provide a method for training stochastic context free grammars in the speech tools, see the manual for details). Of course, if your text to be synthesized is coming from a language system such as machine translation or language generation then a syntax tree may be readily available. In that case a simple rule mechanism taking into account syntactic phrasing may be useful.
When only moderate amounts of data are available for training, a simple CART tree may be able to tease out a reasonable model. See hirschberg94 for some discussion on this. Here is a short example of building a CART tree for phrase prediction. Let us assume you have a database of utterances as described previously. By convention we build models in directories under `festival/' in the main database directory. Thus let us create `festival/phrbrk'.
First we need to list the features that are likely to be suitable predictors for phrase breaks and add these to a file `phrbrk.feats'. What goes in here will depend on what you have: full part of speech helps a lot, but you may not have that for your language. The gpos feature described above is a good cheap alternative. Possible features may be
word_break
lisp_token_end_punc
lisp_until_punctuation
lisp_since_punctuation
p.gpos
gpos
n.gpos
Given this list you can extract features from your database of utterances with the Festival script `dumpfeats'
dumpfeats -eval ../../festvox/phrbrk.scm -feats phrbrk.feats \
    -relation Word -output phrbrk.data ../utts/*.utts
`festvox/phrbrk.scm' should contain the definitions of the functions until_punctuation, since_punctuation and any other Lisp feature functions you define.
Next we want to split this data into test and train data. We provide a simple shell script called `traintest' which splits a given file 9:1, i.e. every 10th line is put in the test set.
traintest phrbrk.data
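For readers who want to see exactly what a 9:1 split of this kind does, here is a minimal Python sketch of the idea. It is not the `traintest' script itself; the function name and the `every` parameter are our own, and it assumes, as the description above states, that every 10th line goes to the test set.

```python
def split_train_test(lines, every=10):
    """Send every `every`-th line (counting from 1) to test, the rest to train."""
    train, test = [], []
    for number, line in enumerate(lines, start=1):
        if number % every == 0:
            test.append(line)
        else:
            train.append(line)
    return train, test


if __name__ == "__main__":
    data = ["sample %d" % n for n in range(1, 21)]
    train, test = split_train_test(data)
    print(len(train), len(test))
    print(test)
```

Taking every 10th line, rather than the last tenth of the file, spreads the test samples across the whole database, so that one unusual utterance does not dominate the test set.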
As we intend to run `wagon', the CART tree builder, on this data we also need to create the feature description file for the data. The feature description file consists of a bracketed list of feature names and types. The type may be int, float, or categorical, where a list of possible values is given. The script `make_wagon_desc' (distributed with the speech tools) will make a reasonable approximation for this file
make_wagon_desc phrbrk.data phrbrk.feats phrbrk.desc
This script will treat all features as categorical. Thus any float or int features will be treated categorically and each value found in the data will be listed as a separate item. In our example lisp_since_punctuation and lisp_until_punctuation are actually float (well, maybe even int) but they will be listed as categorical in `phrbrk.desc', something like
... (lisp_since_punctuation 0 1 2 4 3 5 6 7 8) ...
You should change this entry (by hand) to be
... (lisp_since_punctuation float ) ...
The script cannot work out the type of a feature automatically so you must make this decision yourself.
Now that we have the data and description we can build a CART tree. The basic command for `wagon' will be
wagon -desc phrbrk.desc -data phrbrk.data.train -test phrbrk.data.test \
    -output phrbrk.tree
You will probably also want to set a stop value. The default stop value is 50, which means there must be at least 50 examples in a group before it will consider looking for a question to split it. Unless you have a lot of data this is probably too large and a value of 10 to 20 is probably more reasonable.
Other arguments to `wagon' should also be considered. A stepwise approach, where all features are tested incrementally to find the set of features which gives the best tree, can give better results than simply using all features. Care should be taken with this, though, as the generated tree becomes optimized for the given test set. Thus a further held out test set is required to properly test the accuracy of the result. In the stepwise case it is normal to split the train set again and call wagon as follows
traintest phrbrk.data.train
wagon -desc phrbrk.desc -data phrbrk.data.train.train \
    -test phrbrk.data.train.test \
    -output phrbrk.tree -stepwise
wagon_test -data phrbrk.data.test -desc phrbrk.desc \
    -tree phrbrk.tree
Stepwise is particularly useful when features are highly correlated with each other and it's not clear which is the best general predictor. Note that stepwise will take much longer to run as it potentially must build a large number of trees.
Other arguments to `wagon' can be considered; refer to the relevant chapter in the speech tools manual for their details.
However, it should be noted that without a good intonation and duration model, spending time on producing good phrasing is probably not worth it. The quality of these three prosodic components is closely related, such that if one is much better than the others there may not be any real benefit.
Accent and boundary tones are what we will use, hopefully in a theory independent way, to refer to the two main types of intonation event. For English, and for many other languages, the prediction of the position of the accents and boundaries can be done as an independent process from F0 contour generation itself. This is definitely true for the major theories we will be considering.
As with phrase break prediction there are some simple rules that will go a surprisingly long way. And as with most of the other statistical learning techniques, simple rules cover most of the work and more complex rules work better, but the best results come from using the sorts of information you were using in rules and statistically training them from appropriate data.
For English the placement of accents on stressed syllables in all content words is a quite reasonable approximation, achieving about 80% accuracy on typical databases. hirschberg90 is probably the best example of a detailed rule driven approach (for English). CART trees based on the sorts of features Hirschberg uses are quite reasonable. Eventually, though, these rules become limiting and a richer knowledge source is required to assign accent patterns to complex nominals (see sproat90).
However, all these techniques quickly come to the stumbling block that although simple so-called discourse neutral intonation is relatively easy to achieve, achieving realistic, natural accent placement is still beyond our synthesis systems (though perhaps not for much longer).
The simplest rule for English may be reasonable for other languages. There are even simpler solutions to this, such as fixed prosody, or fixed declination, but apart from debugging a voice these are simpler than is required even for the most basic voices.
For English, adding a simple hat accent on lexically stressed syllables in all content words works surprisingly well. To do this in Festival you need a CART tree to predict accentedness, and rules to add the hat accent (though we will leave out F0 generation until the next section).
A basic tree that predicts accents of stressed syllables in content words is
(set! simple_accent_cart_tree
 '
  ((R:SylStructure.parent.gpos is content)
   ((stress is 1)
    ((Accented))
    ((NONE)))
   ((NONE))))  ;; syllables in function words are not accented
The above tree simply distinguishes accented syllables from non-accented. In theories like ToBI (silverman92), a number of different types of accent are supported. ToBI, with variations, has been applied to a number of languages and may be suitable for yours. However, although accent and boundary types have been identified for various languages and dialects, a computational mechanism for generating an F0 contour from an accent specification often has not yet been specified (we will discuss this more fully below).
If the above is considered too naive, a more elaborate hand specified tree can also be written, using relevant factors, probably similar to those used in hirschberg90. Following that, training from data is the next option. Assuming a database exists and has been labelled with discrete accent classifications, we can extract data from it for training a CART tree with `wagon'. We will build the tree in `festival/accents/'. First we need a file listing the features that are felt to affect accenting. For this we will predict accents on syllables, as that has been used for the English voices created so far, but there is an argument for predicting accent placement on a word basis: although accents will need to be syllable aligned, which syllable in a word gets the accent is reasonably well defined (at least compared with predicting accent placement).
A possible list of features for accent prediction is put in the file `accent.feats'.
R:Intonation.daughter1.name
R:SylStructure.parent.R:Word.p.gpos
R:SylStructure.parent.gpos
R:SylStructure.parent.R:Word.n.gpos
ssyl_in
syl_in
ssyl_out
syl_out
p.stress
stress
n.stress
pp.syl_break
p.syl_break
syl_break
n.syl_break
nn.syl_break
pos_in_word
position_type
We can extract these features from the utterances using the Festival script `dumpfeats'
dumpfeats -feats accent.feats -relation Syllable \
    -output accent.data ../utts/*.utts
We now need a description file for the features which can be approximated by the speech tools script `make_wagon_desc'
make_wagon_desc accent.data accent.feats accent.desc
Because this script cannot determine whether a feature is categorical or takes a range of values, you must hand edit the output file and change any feature to float or int if that is what it is.
The next stage is to split the data into training and test sets. If stepwise training is to be used for building the CART tree (which is recommended) then the training data should be further split
traintest accent.data
traintest accent.data.train
Deciding on a stop value for training depends on the number of examples, though this can be tuned to ensure over-training isn't happening.
wagon -data accent.data.train.train -desc accent.desc \
    -test accent.data.train.test -stop 10 -stepwise -output accent.tree
wagon_test -data accent.data.test -desc accent.desc \
    -tree accent.tree
The above is designed to predict accents, and a similar tree should be used to predict boundary tones as well. For the most part intonation boundaries are defined to occur at prosodic phrase boundaries, so that task is somewhat easier, though if you have a number of boundary tone types in your inventory then the prediction is not so straightforward.
When training ToBI type accent types it is not easy to get the right type of variation in the accent types. Although some ToBI labels have been associated with semantic intentions, and including discourse information has been shown to help prediction (e.g. black97a), getting this acceptably correct is not easy. Various techniques for modifying the training data do seem to help. Because of the low incidence of `L*' labels in at least the f2b data, duplicating all sample points in the training data with L's does increase the likelihood of their prediction and does seem to give a more varied distribution. Alternatively, wagon returns a probability distribution for the accents. Normally the most probable is selected, but this could be modified to select from the distribution randomly based on their probabilities.
Once trees have been built they can be used in a voice as follows. Within the voice definition function
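Both ideas, duplicating rare samples and drawing from the predicted distribution rather than always taking its peak, are easy to try outside Festival first. The Python sketch below is only an illustration: the function names, the row layout (label in the first field) and the example probabilities are assumptions, and nothing here reflects how Festival itself selects a label.

```python
import random


def duplicate_label(rows, label, field=0):
    """Return the training rows with every row carrying `label` repeated once more."""
    return rows + [row for row in rows if row[field] == label]


def most_probable(dist):
    """The usual choice: the label with the highest probability."""
    return max(dist, key=dist.get)


def sample_label(dist, rng):
    """Alternative choice: draw a label at random, weighted by its probability."""
    labels = list(dist)
    weights = [dist[name] for name in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


if __name__ == "__main__":
    predicted = {"H*": 0.7, "L*": 0.2, "NONE": 0.1}  # made-up example values
    rng = random.Random(0)  # seeded so that runs are repeatable
    print(most_probable(predicted))
    print([sample_label(predicted, rng) for _ in range(10)])
```

Always taking the most probable label can never produce `L*` from a distribution like the one above, whereas sampling will produce it roughly in proportion to its probability.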
(set! int_accent_cart_tree simple_accent_cart_tree)
(set! int_tone_cart_tree simple_tone_cart_tree)
(Parameter.set 'Int_Method Intonation_Tree)
or if only one tree is required you can use the simpler intonation method
(set! int_accent_cart_tree simple_accent_cart_tree)
(Parameter.set 'Int_Method Intonation_Simple)
Predicting where accents go (and their types) is only half of the problem. We also have to build an F0 contour based on these. Note intonation is split between accent placement and F0 generation because accent position clearly influences durations, and an F0 contour cannot be generated without knowing the durations of the segments the contour is to be generated over.
There are three basic F0 generation modules available in Festival, though others could be added: by general rule, by linear regression/CART, and by Tilt.
The first is designed to be the most general and will always allow some form of F0 generation. This method allows target points to be programmatically created for each syllable in an utterance. The idea follows closely a generalization of the implementation of ToBI type accents in anderson84, where n-points are predicted for each accent. They (and others in intonation) appeal to the notion of a baseline and place target F0 points above and below that line based on accent type and position in phrase. The baseline itself is often defined to decline over the phrase, reflecting the general declination of F0 over time.
The simple idea behind this general method is that a Lisp function is called for each syllable in the utterance. That Lisp function returns a list of target F0 points that lie within that syllable. Thus the generality of this method actually lies in the fact that it simply allows the user to program anything they want. For example, our simple hat accent can be generated using this technique as follows.
This fixes the F0 range of the speaker so would need to be changed for different speakers.
(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list
         (list start 110)
         (list (/ (+ start end) 2.0) 140)
         (list end 100)))))
It simply checks if the current syllable is accented and if so returns a list of position/target pairs: a value of 110Hz at the start of the syllable, a value of 140Hz at the mid-point of the syllable, and a value of 100Hz at the end.
This general technique can be expanded with other rules as necessary. Festival includes an implementation of ToBI using exactly this technique; it is based on the rules described in jilka96 and is in the file `festival/lib/tobi_f0.scm'.
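The arithmetic of the hat shape can be checked in isolation. The following Python sketch mirrors the same three targets outside Festival; the function name and the representation of a syllable as a start time, an end time and an accented flag are assumptions for illustration only.

```python
def hat_targets(start, end, accented):
    """Return (time, F0 in Hz) targets for one syllable; none if it is unaccented."""
    if not accented:
        return []
    mid = (start + end) / 2.0
    # Rise from 110Hz to a 140Hz peak at the mid-point, then fall to 100Hz.
    return [(start, 110), (mid, 140), (end, 100)]


if __name__ == "__main__":
    print(hat_targets(1.0, 1.5, True))
    print(hat_targets(1.5, 1.8, False))
```

As in the Scheme version, the 110, 140 and 100 values are fixed, so the range would need changing for a speaker with a different pitch range.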
The second technique, F0 generation by linear regression, was developed specifically to avoid the difficult decisions of exactly what parameters with what values should be used in rules like those of anderson84. The first implementation of this work is presented in black96. The idea is to find the appropriate F0 target value for each syllable based on available features by training from data. A set of features is collected for each syllable and a linear regression model is used to model three points on each syllable. The technique produces reasonable synthesis and requires less analysis of the intonation models than would be required to write a rule system using the general F0 target method described above.
However, to be fair, this technique is also much simpler and there are obviously a number of intonational phenomena which it cannot capture (e.g. multiple accents on syllables, and it will never really capture accent placement with respect to the vowel). The previous technique allows specification of structure but without explicit training from data (though it doesn't exclude that), while this technique imposes almost no structure but depends solely on data. The Tilt modelling discussed later tries to balance these two extremes.
The advantage of the linear regression method is that very little knowledge about the intonation of the language under study needs to be known. Of course if there is knowledge and there are theories it is usually better to follow them (or at least find the features which influence the F0 in that language). Extracting features for F0 modelling is similar to extracting features for the other models. This time we want the mean F0 at the start, middle and end of each syllable. The Festival features syl_startpitch, syl_midpitch and syl_endpitch provide this. Note that syl_midpitch returns the pitch at the middle of the vowel in the syllable rather than the middle of the syllable.
For a linear regression model all features must be continuous. Thus features which are categorical that influence F0 need to be converted. The standard technique for this is to introduce new features, one for each possible value in the class, and output values of 0 or 1 for these modified features depending on the value of the base features. For example, in a ToBI environment the output of the feature tobi_accent will include H*, L*, L+H* etc. In the modified form you would have features of the form tobi_accent_H*, tobi_accent_L*, tobi_accent_L_H*, etc.
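This conversion is mechanical and can be done on the dumped feature rows before they are given to the regression. A minimal Python sketch of the idea follows; the function name, the dictionary row format and the replacement of `+` by `_` in the new feature names (following the tobi_accent_L_H* example above) are assumptions for illustration.

```python
def expand_categorical(rows, feature, values):
    """Replace one categorical feature by a 0/1 feature per possible value."""
    expanded = []
    for row in rows:
        new_row = {key: val for key, val in row.items() if key != feature}
        for value in values:
            name = feature + "_" + value.replace("+", "_")
            new_row[name] = 1.0 if row[feature] == value else 0.0
        expanded.append(new_row)
    return expanded


if __name__ == "__main__":
    rows = [
        {"syl_endpitch": 180.0, "tobi_accent": "H*"},
        {"syl_endpitch": 150.0, "tobi_accent": "L+H*"},
    ]
    for row in expand_categorical(rows, "tobi_accent", ["H*", "L*", "L+H*"]):
        print(row)
```

Each row ends up with exactly one of the new features set to 1, which is what lets a purely numeric model distinguish the accent types.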
The program `ols' in the speech tools takes feature files and description files in exactly the same format as `wagon', except that all features must be declared as type `float'. The standard ordinary least squares algorithm used to find the coefficients cannot, in general, deal with features that are directly correlated with others, as this causes a singularity when inverting the matrix. The solution to this is to exclude such features. The option -robust enables that, though at the expense of a longer compute time. Again like `wagon', a stepwise option is included so that the best subset of features may be found.
The resulting models may be used by the Int_Targets_LR module, which takes its LR models from the variables f0_lr_start, f0_lr_mid and f0_lr_end. The output of ols is a list of coefficients (with the Intercept first). These need to be converted to the appropriate bracket form including their feature names. An example of this is in `festival/lib/f2bf0lr.scm'.
If the conversion of categoricals to floats seems too much work, or would prohibitively increase the number of features, you could use `wagon' to generate trees to predict F0 values. The advantage of a decision tree over the LR model is that it can deal with data in a non-linear fashion, but this is also its disadvantage. Also, the decision tree technique may split the data sub-optimally. The LR model is probably more theoretically appropriate, but ultimately the results depend on how good the models sound.
Dump features as with the LR models, but this time there is no need to convert categorical features to floats. A potential set of features to do this from (substitute syl_startpitch and syl_midpitch for the other two models) is
syl_endpitch
pp.tobi_accent
p.tobi_accent
tobi_accent
n.tobi_accent
nn.tobi_accent
pp.tobi_endtone
R:Syllable.p.tobi_endtone
tobi_endtone
n.tobi_endtone
nn.tobi_endtone
pp.syl_break
p.syl_break
syl_break
n.syl_break
nn.syl_break
pp.stress
p.stress
stress
n.stress
nn.stress
syl_in
syl_out
ssyl_in
ssyl_out
asyl_in
asyl_out
last_accent
next_accent
sub_phrases
The above, of course, assumes a ToBI accent labelling; modify that as appropriate for your actual labelling.
Once you have generated three trees predicting values for start, mid and end points in each syllable you will need to add some Scheme code to use these appropriately. Suitable code is provided in `src/intonation/tree_f0.scm'; you will need to include that in your voice. To use it as the intonation target module you will need to add something like the following to your voice function
(set! F0start_tree f2b_F0start_tree)
(set! F0mid_tree f2b_F0mid_tree)
(set! F0end_tree f2b_F0end_tree)
(set! int_params
      '((target_f0_mean 110) (target_f0_std 10)
        (model_f0_mean 170) (model_f0_std 40)))
(Parameter.set 'Int_Target_Method Int_Targets_Tree)
The int_params values allow you to use the model with a speaker of a different pitch range. That is, all predicted values are converted using the formula
(+ (* (/ (- value model_f0_mean) model_f0_stddev) target_f0_stddev) target_f0_mean)
Or for those of you who can't read Lisp expressions
(((value - model_f0_mean) / model_f0_stddev) * target_f0_stddev) + target_f0_mean
The values in the example above are for converting a female speaker (used for training) to a male pitch range.
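A quick worked check of this conversion, using the example values from above (model mean 170 and standard deviation 40, target mean 110 and standard deviation 10), is sketched here in Python. The function and argument names are our own and only illustrate the arithmetic.

```python
def map_f0(value, model_mean=170.0, model_std=40.0,
           target_mean=110.0, target_std=10.0):
    """Express value as a z-score in the model range, then rescale to the target range."""
    zscore = (value - model_mean) / model_std
    return zscore * target_std + target_mean


if __name__ == "__main__":
    for hz in (130.0, 170.0, 210.0):
        print(hz, "->", map_f0(hz))
```

A predicted value at the model speaker's mean maps to the target speaker's mean, and a value one model standard deviation higher (210Hz) maps to one target standard deviation higher (120Hz).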
Tilt modelling is still under development and not as mature as the other methods described above, but it potentially offers a more consistent solution to the problem. A Tilt parameterization of a natural F0 contour can be automatically derived from a waveform and a labelling of accent placements (a simple `a' for accents and `b' for boundaries) taylor99. Further work is being done on trying to find the accent placements automatically too.
For each `a' in a labelling four continuous parameters are found: height, duration, peak position with respect to vowel start, and tilt. Prediction models may then be generated to predict these parameters, which we feel better capture the dimensions of the F0 contour itself. We have had success in building models for these parameters, dusterhoff97a, with better results than the linear regression model on comparable data. However, so far we have not done any tests with Tilt on languages other than English.
The speech tools include the programs `tilt_analyse' and `tilt_synthesize' to aid model building, but we do not yet include full Festival end support for using the generated models.
Like the above prosody phenomena, very simple solutions to predicting durations work surprisingly well, though very good solutions are extremely difficult to achieve.
Again the basic strategy is assigning fixed models, simple rule models, complex rule models, and trained models using the features in the complex rule models. The choice of where to stop depends on the resources available to you and the time you wish to spend on the problem. Given a reasonably sized database, training a simple CART tree for durations achieves quite acceptable results. This is currently what we do for our English voices in Festival. There are better models out there but we have not fully investigated them or included easy scripts to customize them.
The simplest model for duration is a fixed duration for each phone. A value of 100 milliseconds is a reasonable start. This type of model is only of use at initial testing of a diphone database; beyond that it sounds too artificial. The Festival function SayPhones uses a fixed duration model, controlled by the value (in ms) in the variable FP_duration. Although there is a fixed duration module in Festival (see the manual), it's worthwhile starting off with something a little more interesting.
The next level for duration models is to use average durations for the phones. Even when real data isn't available to calculate averages, writing values by hand can be acceptable: basically vowels are longer than consonants, and stops are the shortest. Estimating values for a set of phones can be done by looking at data from another language (if you are really stuck, see `festival/lib/mrpa_durs.scm') to get the basic idea of average phone lengths.
In most languages phones are longer at the phrase final and to a lesser extent phrase initial positions. A simple multiplicative factor can be defined for these positions. The next stage from this is a set of rules that modify the basic average based on the context they occur in. For English the best definition of such rules is the duration rules given in chapter 9, allen87 (often referred to as the Klatt duration model). The factors used in this may also apply to other languages. A simplified form of this, that we have successfully used for a number of languages, and is often used as our first approximation for a duration rule set is as follows.
Here we define a simple decision tree that returns a multiplication factor for a segment
(set! simple_dur_tree
 '
   ((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
    ((R:SylStructure.parent.stress is 1)
     ((1.5))
     ((1.2)))
    ((R:SylStructure.parent.syl_break > 1)   ;; clause final
     ((R:SylStructure.parent.stress is 1)
      ((1.5))
      ((1.2)))
     ((R:SylStructure.parent.stress is 1)
      ((ph_vc is +)
       ((1.2))
       ((1.0)))
      ((1.0))))))
You may modify this adding more conditions as much as you want. In addition to the tree you need to define the averages for each phone in your phone set. For reasons we will explain below the format of this information is `segname 0.0 average' as in
(set! simple_phone_data
'(
   (# 0.0 0.250)
   (a 0.0 0.080)
   (e 0.0 0.080)
   (i 0.0 0.070)
   (o 0.0 0.080)
   (u 0.0 0.070)
   (i0 0.0 0.040)
   ...
 ))
With both these expressions loaded in your voice, you may set the following in your voice definition function, setting up this tree and data as the standard and selecting the appropriate duration module.
;; Duration prediction
(set! duration_cart_tree simple_dur_tree)
(set! duration_ph_info simple_phone_data)
(Parameter.set 'Duration_Method 'Tree_ZScores)
In your own voice, though, use voice-specific names for the simple_ variables, otherwise you may clash with other voices.
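To see how a tree of this shape is read, the following Python sketch walks a hand-translated copy of simple_dur_tree. It is only an illustration and not Festival's own interpreter: the tuple layout, the function name duration_factor and the shortened feature names (prev_syl_break standing for R:SylStructure.parent.R:Syllable.p.syl_break, and so on) are our assumptions.

```python
# Hypothetical Python mirror of simple_dur_tree, for illustration only.
# Internal node: (feature, operator, value, yes_branch, no_branch).
# Leaf: a float holding the multiplication factor.
SIMPLE_DUR_TREE = (
    "prev_syl_break", ">", 1,                 # clause initial
    ("stress", "is", 1, 1.5, 1.2),
    ("syl_break", ">", 1,                     # clause final
     ("stress", "is", 1, 1.5, 1.2),
     ("stress", "is", 1,
      ("ph_vc", "is", "+", 1.2, 1.0),
      1.0)),
)


def duration_factor(node, feats):
    """Walk the tree until a leaf (a float) is reached and return it."""
    while not isinstance(node, float):
        feature, op, value, yes_branch, no_branch = node
        actual = feats[feature]
        if op == ">":
            matched = actual > value
        elif op == "is":
            matched = actual == value
        else:
            raise ValueError(f"unsupported operator: {op}")
        node = yes_branch if matched else no_branch
    return node


# A stressed vowel in a clause-final syllable.
clause_final_vowel = {"prev_syl_break": 0, "syl_break": 4,
                      "stress": 1, "ph_vc": "+"}
print(duration_factor(SIMPLE_DUR_TREE, clause_final_vowel))
```

Each question is tried in turn, the first branch is taken when the test succeeds and the second otherwise, and the value at the leaf is the factor applied to the phone's average duration.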
It has been shown (campbell91) that a better representation of duration for modeling is zscores, that is, the number of standard deviations from the mean. The duration module used above is actually designed to take a CART tree that returns zscores, and it uses the information in duration_ph_info to change these into absolute durations. The two fields after the phone name are the mean and standard deviation. The interpretation of this tree and this phone info happens to give the right result when we use the tree to predict factors and have the stddev field contain the average duration, as we did above.
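The arithmetic behind this can be sketched in a few lines of Python. We assume here that the module computes mean + zscore * stddev for each segment, which is what the description above implies; the function name and the 0.020 standard deviation are hypothetical values chosen for illustration.

```python
def zscore_to_duration(zscore, mean, stddev):
    """Map a predicted zscore back to an absolute duration in seconds."""
    return mean + zscore * stddev


# Factor-style entry, as in (a 0.0 0.080): the mean is 0.0 and the
# "stddev" field actually holds the average, so the tree's output
# behaves as a plain multiplication factor.
factor_style = zscore_to_duration(1.5, 0.0, 0.080)

# Genuine zscore entry (hypothetical values): mean 0.080, stddev 0.020.
zscore_style = zscore_to_duration(1.5, 0.080, 0.020)

print(factor_style, zscore_style)
```

With a mean of 0.0 the formula reduces to factor times average, which is why the factor tree and the `segname 0.0 average' data work with a module built for zscores.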
However, whether we use zscores or absolute durations, a better way to build a duration model is to train it from data rather than arbitrarily selecting modification factors.
Given a reasonably sized database we can dump durations and features for each segment in the database. Then we can train a model using those samples. For our English voices we have trained regression models using `wagon', though we include the tools for linear regression models too.
An initial set of features to dump might be:
segment_duration
name
p.name
n.name
R:SylStructure.parent.syl_onsetsize
R:SylStructure.parent.syl_codasize
R:SylStructure.parent.R:Syllable.n.syl_onsetsize
R:SylStructure.parent.R:Syllable.p.syl_codasize
R:SylStructure.parent.position_type
R:SylStructure.parent.parent.word_numsyls
pos_in_syl
syl_initial
syl_final
R:SylStructure.parent.pos_in_word
p.seg_onsetcoda
seg_onsetcoda
n.seg_onsetcoda
pp.ph_vc
p.ph_vc
ph_vc
n.ph_vc
nn.ph_vc
pp.ph_vlng
p.ph_vlng
ph_vlng
n.ph_vlng
nn.ph_vlng
pp.ph_vheight
p.ph_vheight
ph_vheight
n.ph_vheight
nn.ph_vheight
pp.ph_vfront
p.ph_vfront
ph_vfront
n.ph_vfront
nn.ph_vfront
pp.ph_vrnd
p.ph_vrnd
ph_vrnd
n.ph_vrnd
nn.ph_vrnd
pp.ph_ctype
p.ph_ctype
ph_ctype
n.ph_ctype
nn.ph_ctype
pp.ph_cplace
p.ph_cplace
ph_cplace
n.ph_cplace
nn.ph_cplace
pp.ph_cvox
p.ph_cvox
ph_cvox
n.ph_cvox
nn.ph_cvox
R:SylStructure.parent.R:Syllable.pp.syl_break
R:SylStructure.parent.R:Syllable.p.syl_break
R:SylStructure.parent.syl_break
R:SylStructure.parent.R:Syllable.n.syl_break
R:SylStructure.parent.R:Syllable.nn.syl_break
R:SylStructure.parent.R:Syllable.pp.stress
R:SylStructure.parent.R:Syllable.p.stress
R:SylStructure.parent.stress
R:SylStructure.parent.R:Syllable.n.stress
R:SylStructure.parent.R:Syllable.nn.stress
R:SylStructure.parent.syl_in
R:SylStructure.parent.syl_out
R:SylStructure.parent.ssyl_in
R:SylStructure.parent.ssyl_out
R:SylStructure.parent.parent.gpos
By convention we build duration models in `festival/dur/'. We will save the above feature names in `dur.featnames'. We can dump the features with the command
dumpfeats -relation Segment -feats dur.featnames -output dur.feats \
          ../utts/*.utt
This will put all the features in the file `dur.feats'. For wagon we need to build a feature description file. We can build a first approximation with the `make_wagon_desc' script available with the speech tools:
make_wagon_desc dur.feats dur.featnames dur.desc
You will then need to edit `dur.desc' to change a number of features from their categorical list (lots of numbers) into type float. Specifically, for the above list the features segment_duration, R:SylStructure.parent.parent.word_numsyls, pos_in_syl, R:SylStructure.parent.pos_in_word, R:SylStructure.parent.syl_in, R:SylStructure.parent.syl_out, R:SylStructure.parent.ssyl_in and R:SylStructure.parent.ssyl_out should be declared as floats.
We then need to split the data into training and test sets (and further split the training set if we are going to use stepwise CART building).
traintest dur.feats
traintest dur.feats.train
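The nested split can be pictured with the following Python sketch. The `.train' and `.test' naming follows the commands above, but the one-in-ten ratio is an assumption made for illustration, not something stated here about the traintest script.

```python
def split_train_test(samples, every=10):
    """Send every `every`-th sample to the test set, the rest to train.

    The ratio is an assumption; adjust it to match the split you use.
    """
    train, test = [], []
    for index, sample in enumerate(samples, start=1):
        if index % every == 0:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


# Stand-in for the lines of dur.feats.
samples = [f"sample{i}" for i in range(1, 101)]

# First split: dur.feats -> dur.feats.train and dur.feats.test
train, test = split_train_test(samples)

# Second split: dur.feats.train -> dur.feats.train.train and
# dur.feats.train.test (the held-out set used by stepwise building)
train_train, train_test = split_train_test(train)

print(len(train_train), len(train_test), len(test))
```

The point of the second split is that the final test set is never seen while features are being selected, so the score it gives remains a fair estimate.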
We can now build a model using wagon
wagon -data dur.feats.train.train -desc dur.desc \
      -test dur.feats.train.test -stop 10 -stepwise \
      -output dur.10.tree
wagon_test -data dur.feats.test -tree dur.10.tree -desc dur.desc
You may wish to remove all examples of silence from the data, as silence durations typically have quite a different distribution from those of other phones. In fact it is common for databases to include many examples of silence which are not of natural length, as they are arbitrary parts of the initial and following silence around the spoken utterances. Their durations are not something that should be trained for.
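One way to do this is to filter the dumped feature file before training. The Python sketch below assumes that the phone name is the second field on each line (as it is in the feature list above) and that silence is named pau or #; the function name and the set of silence names are assumptions to adapt to your own phone set.

```python
def drop_silence(lines, silence_names=("pau", "#"), name_field=1):
    """Return the feature lines whose phone name is not a silence."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) > name_field and fields[name_field] in silence_names:
            continue  # skip silence samples
        kept.append(line)
    return kept


# Stand-in for a few lines of dur.feats: duration, name, p.name, n.name
example = [
    "0.250 pau 0 ih",
    "0.070 ih pau z",
    "0.085 z ih pau",
    "0.300 pau z 0",
]
for line in drop_silence(example):
    print(line)
```

Only the samples whose own name is a silence are removed; segments that merely have a silence as a neighbour are kept, since that context is useful to the model.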
The instructions above will build a tree that predicts absolute values. To get such a tree to work with the zscore module, simply set the stddev field described above to 1 (with the mean field left at 0.0, as in the phone data above). As stated above, using zscores typically gives better results. Although the correlation of these duration models in the zscore domain may not be as good as that of models trained to predict absolute durations, when the predicted zscores are converted back into the absolute domain we have found (for English) that the correlations are better and the RMSE smaller.
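Comparisons like this need both measures computed on durations in the absolute domain. As a minimal sketch, the two measures could be computed in Python as follows; the function names and the three sample durations are made up for illustration and are not results from any of our models.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(predicted, actual):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


# Hypothetical durations in seconds.
predicted = [0.05, 0.10, 0.15]
actual = [0.06, 0.10, 0.14]
print(rmse(predicted, actual), correlation(predicted, actual))
```

When comparing a zscore model against an absolute one, both sets of predictions should be converted to seconds first so that the figures are directly comparable.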
In order to train a zscore model you need to convert the absolute segment durations into zscores. To do that you need the mean and standard deviation for each segment in your phoneset.
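The conversion can be sketched in Python as below. This is an illustration under stated assumptions rather than the tooling we use: the function names are ours, the samples are invented, and we use the population standard deviation where you might prefer the sample one.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phone_stats(samples):
    """Map each phone name to (mean, stddev) of its durations."""
    by_phone = defaultdict(list)
    for phone, duration in samples:
        by_phone[phone].append(duration)
    return {p: (mean(d), pstdev(d)) for p, d in by_phone.items()}


def to_zscores(samples, stats):
    """Replace each absolute duration by its zscore for that phone."""
    converted = []
    for phone, duration in samples:
        m, sd = stats[phone]
        # A phone with no variation gets zscore 0 rather than a division by zero.
        z = (duration - m) / sd if sd > 0 else 0.0
        converted.append((phone, z))
    return converted


def ph_info_entry(phone, stats):
    """Format one entry in the `segname mean stddev' style used above."""
    m, sd = stats[phone]
    return "(%s %.3f %.3f)" % (phone, m, sd)


# Invented (name, segment_duration) pairs.
samples = [("a", 0.06), ("a", 0.10), ("t", 0.03), ("t", 0.05)]
stats = phone_stats(samples)
print(ph_info_entry("a", stats))
print(to_zscores(samples, stats))
```

The zscores replace segment_duration in the training data, while the means and standard deviations go into the phone data so that predictions can be mapped back to seconds.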
There is a whole range of possible mappings for the distribution of durations: zscores, logs, log-zscores, etc., or even more complex functions (bellegarda98). These variations do give some improvements. The intention is to map the distribution to a normal distribution, which makes it easier to learn.
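As one example of such a mapping, a log-zscore takes the zscore of the log duration instead of the raw duration. The Python sketch below shows the mapping and its inverse for a single phone; the function names and the sample durations are hypothetical.

```python
import math
from statistics import mean, pstdev


def log_zscore_params(durations):
    """Mean and standard deviation of the log durations of one phone."""
    logs = [math.log(d) for d in durations]
    return mean(logs), pstdev(logs)


def to_log_zscore(duration, log_mean, log_stddev):
    """Map an absolute duration (seconds) into the log-zscore domain."""
    return (math.log(duration) - log_mean) / log_stddev


def from_log_zscore(zscore, log_mean, log_stddev):
    """Map a predicted log-zscore back to an absolute duration."""
    return math.exp(log_mean + zscore * log_stddev)


# Invented durations for one phone, skewed towards long values.
durations = [0.05, 0.08, 0.20]
log_mean, log_stddev = log_zscore_params(durations)
z = to_log_zscore(0.08, log_mean, log_stddev)
print(z, from_log_zscore(z, log_mean, log_stddev))
```

Because the inverse goes through an exponential, a predicted duration can never be negative, which is not guaranteed for plain zscores.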
Other learning techniques are also possible, particularly the Sums of Products model (sproat98 chapter 5), which has been shown to train better even on small amounts of data.
Another technique, which works although it shouldn't, is to borrow a model trained for another language for which data is available. In fact the duration model used in Festival for the US and UK voices is the same; it was trained from the f2b database, a US English database. As the phone sets are different for US and UK English, we trained the models using phonetic features rather than phone names, and trained them in the zscore domain, keeping the actual phone names and means and standard deviations separate. Although the models were slightly better if we included the phone names themselves, the improvement was small and the models were also substantially larger (and took longer to train). Using the phonetic features offers a more general model (it works for UK English) that is more compact and quicker to train, at only a small cost in performance.
Also, in the German voice developed at OGI the same English duration model was used. The results are acceptable and are at least better than any hand-written rule system that could be written. Improvements in that model are probably only possible by training on real German data. Note, however, that such cross-language borrowing of models is unlikely to work in general, but there may be cases where it is a reasonable fallback position.
Note that the above descriptions are for the easy implementation of prosody models, which unfortunately means that the models will not be perfect. Of course no models will be perfect, but with some work it is often possible to improve the basic models or at least make them more appropriate to the synthesis task. For example, if the intended use of your synthesis voice is primarily for dialog systems, training on newscaster speech will not give the best effect. Festival is designed as a research system as well as a tool to build languages, so it is well adapted to prosody research.
One thing which clearly shows how impoverished our prosodic models are is the comparison of predicted prosody with natural prosody. Given a label file and an F0 Target file, the following code will generate that utterance using the current voice:
(define (resynth labfile f0file)
  (let ((utt (Utterance SegF0))) ; need some u to start with
    (utt.relation.load utt 'Segment labfile)
    (utt.relation.load utt 'Target f0file)
    (Wave_Synth utt))
)
The format of the label file should be one that can be read into Festival (e.g. the XLabel format). For example:
#
0.02000 26 pau ;
0.09000 26 ih ;
0.17500 26 z ;
0.22500 26 dh ;
0.32500 26 ae ;
0.35000 26 t ;
0.44500 26 ow ;
0.54000 26 k ;
0.75500 26 ey ;
0.79000 26 pau ;
The target file is a little more complex. Again it is a label file, but with features "pos" and "f0" at each stage. Thus the format for a naturally rendered version of the above would be:
#
0.070000 124 0 ; pos 0.070000 ; f0 133.045230 ;
0.080000 124 0 ; pos 0.080000 ; f0 129.067890 ;
0.090000 124 0 ; pos 0.090000 ; f0 125.364600 ;
0.100000 124 0 ; pos 0.100000 ; f0 121.554800 ;
0.110000 124 0 ; pos 0.110000 ; f0 117.248260 ;
0.120000 124 0 ; pos 0.120000 ; f0 115.534490 ;
0.130000 124 0 ; pos 0.130000 ; f0 113.769620 ;
0.140000 124 0 ; pos 0.140000 ; f0 111.513180 ;
0.240000 124 0 ; pos 0.240000 ; f0 108.386380 ;
0.250000 124 0 ; pos 0.250000 ; f0 102.564100 ;
0.260000 124 0 ; pos 0.260000 ; f0 97.383600 ;
0.270000 124 0 ; pos 0.270000 ; f0 97.199710 ;
0.280000 124 0 ; pos 0.280000 ; f0 96.537280 ;
0.290000 124 0 ; pos 0.290000 ; f0 96.784970 ;
0.300000 124 0 ; pos 0.300000 ; f0 98.328150 ;
0.310000 124 0 ; pos 0.310000 ; f0 100.950830 ;
0.320000 124 0 ; pos 0.320000 ; f0 102.853580 ;
0.370000 124 0 ; pos 0.370000 ; f0 117.105770 ;
0.380000 124 0 ; pos 0.380000 ; f0 116.747730 ;
0.390000 124 0 ; pos 0.390000 ; f0 119.252310 ;
0.400000 124 0 ; pos 0.400000 ; f0 120.735070 ;
0.410000 124 0 ; pos 0.410000 ; f0 122.259190 ;
0.420000 124 0 ; pos 0.420000 ; f0 124.512020 ;
0.430000 124 0 ; pos 0.430000 ; f0 126.476430 ;
0.440000 124 0 ; pos 0.440000 ; f0 121.600880 ;
0.450000 124 0 ; pos 0.450000 ; f0 109.589040 ;
0.560000 124 0 ; pos 0.560000 ; f0 148.519490 ;
0.570000 124 0 ; pos 0.570000 ; f0 147.093260 ;
0.580000 124 0 ; pos 0.580000 ; f0 149.393750 ;
0.590000 124 0 ; pos 0.590000 ; f0 152.566530 ;
0.670000 124 0 ; pos 0.670000 ; f0 114.544910 ;
0.680000 124 0 ; pos 0.680000 ; f0 119.156750 ;
0.690000 124 0 ; pos 0.690000 ; f0 120.519990 ;
0.700000 124 0 ; pos 0.700000 ; f0 121.357320 ;
0.710000 124 0 ; pos 0.710000 ; f0 121.615970 ;
0.720000 124 0 ; pos 0.720000 ; f0 120.752700 ;
This file was generated from a waveform using the following command:
pda -s 0.01 -otype ascii -fmax 160 -fmin 70 wav/utt003.wav |
awk 'BEGIN { printf("#\n") }
     { if ($1 > 0)
         printf("%f 124 0 ; pos %f ; f0 %f ; \n",
                NR*0.010,NR*0.010,$1) }' >Targets/utt003.Target
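For readers more comfortable with Python than awk, the same conversion can be sketched as below. It assumes, as the awk script does, one F0 value per 10 ms frame with unvoiced frames reported as 0; the function name is ours, and the constant fields 124 and 0 are simply copied from the command above.

```python
def f0_to_target_lines(f0_values, frame_shift=0.010):
    """Turn per-frame F0 values into the lines of a Target label file.

    Frames are numbered from 1, as awk's NR is, and frames with an
    F0 of 0 or less (unvoiced) produce no target.
    """
    lines = ["#"]
    for frame_number, f0 in enumerate(f0_values, start=1):
        if f0 > 0:
            time = frame_number * frame_shift
            lines.append("%f 124 0 ; pos %f ; f0 %f ; " % (time, time, f0))
    return lines


# Four frames: unvoiced, voiced, unvoiced, voiced.
for line in f0_to_target_lines([0.0, 133.04523, 0.0, 129.06789]):
    print(line)
```

Skipping the unvoiced frames is what leaves the gaps in time seen in the example target file above.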
The utterance may then be rendered as
festival> (set! utt1 (resynth "lab/utt003.lab" "Targets/utt003.Target"))
Note that this method will lose a little in diphone selection. If your diphone database uses consonant cluster allophones it won't be possible to properly detect these, as there is no syllabic structure in this utterance. That may or may not be important to you. Even this simple method, however, clearly shows how important the right prosody is to the understandability of a string of phones.
We have successfully done this on a number of natural utterances. We extracted the labels automatically by using the aligner discussed in the diphone chapter. As we were using diphones from the same speaker as the natural utterances (KAL), the alignment is surprisingly good and trivial to do. You must, however, synthesize the utterance first and save the waveform and labels. Note that you should listen to ensure that the synthesizer has generated the right labels (as much as that is possible), including breaks in the same places. Comparing synthesized utterances with natural ones quickly shows up many problems in synthesis.