Prosody Research

Note that the above descriptions are aimed at easy implementation of prosody models, which unfortunately means that the models will not be perfect. Of course no model will be perfect, but with some work it is often possible to improve the basic models, or at least make them more appropriate for the synthesis task. For example, if the intended use of your synthesis voice is primarily dialog systems, training on newscaster speech will not give the best results. Festival is designed as a research system as well as a tool for building voices in new languages, so it is well suited to prosody research.

One thing which clearly shows how impoverished our prosodic models are is comparing predicted prosody with natural prosody. Given a label file and an F0 target file, the following code will generate that utterance using the current voice.

(define (resynth labfile f0file)
  (let ((utt (Utterance SegF0))) ; need some utterance to start with
    (utt.relation.load utt 'Segment labfile)
    (utt.relation.load utt 'Target f0file)
    (Wave_Synth utt)))

The format of the label file should be one that can be read into Festival (e.g. the XLabel format). For example:

#
 0.02000 26  pau ; 
 0.09000 26  ih ; 
 0.17500 26  z ; 
 0.22500 26  dh ; 
 0.32500 26  ae ; 
 0.35000 26  t ; 
 0.44500 26  ow ; 
 0.54000 26  k ; 
 0.75500 26  ey ; 
 0.79000 26  pau ; 
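For clarity, the XLabel format above can be read as "end time, display colour, label" with a trailing semicolon, after a `#` header line. The following is a small illustrative sketch in Python (not part of Festival; the function name `parse_xlabel` is our own) of reading such a file into (end time, phone) pairs:

```python
def parse_xlabel(lines):
    """Parse XLabel-style lines: a '#' header, then
    'end_time colour phone ;' entries.  Returns (end_time, phone) pairs."""
    entries = []
    in_body = False
    for line in lines:
        line = line.strip()
        if line == "#":          # header marks the start of the body
            in_body = True
            continue
        if not in_body or not line:
            continue
        fields = line.split()    # [end_time, colour, phone, ';']
        entries.append((float(fields[0]), fields[2]))
    return entries

example = [
    "#",
    " 0.02000 26  pau ; ",
    " 0.09000 26  ih ; ",
    " 0.17500 26  z ; ",
]
print(parse_xlabel(example))
```

The middle field (26 here) is only a display colour used by the Xwaves labeller and is ignored by this sketch.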

The target file is a little more complex. It too is a label file, but with the features "pos" and "f0" given at each point. Thus the format for a naturally rendered version of the above would be:

#
0.070000 124 0 ; pos 0.070000 ; f0 133.045230 ; 
0.080000 124 0 ; pos 0.080000 ; f0 129.067890 ; 
0.090000 124 0 ; pos 0.090000 ; f0 125.364600 ; 
0.100000 124 0 ; pos 0.100000 ; f0 121.554800 ; 
0.110000 124 0 ; pos 0.110000 ; f0 117.248260 ; 
0.120000 124 0 ; pos 0.120000 ; f0 115.534490 ; 
0.130000 124 0 ; pos 0.130000 ; f0 113.769620 ; 
0.140000 124 0 ; pos 0.140000 ; f0 111.513180 ; 
0.240000 124 0 ; pos 0.240000 ; f0 108.386380 ; 
0.250000 124 0 ; pos 0.250000 ; f0 102.564100 ; 
0.260000 124 0 ; pos 0.260000 ; f0 97.383600 ; 
0.270000 124 0 ; pos 0.270000 ; f0 97.199710 ; 
0.280000 124 0 ; pos 0.280000 ; f0 96.537280 ; 
0.290000 124 0 ; pos 0.290000 ; f0 96.784970 ; 
0.300000 124 0 ; pos 0.300000 ; f0 98.328150 ; 
0.310000 124 0 ; pos 0.310000 ; f0 100.950830 ; 
0.320000 124 0 ; pos 0.320000 ; f0 102.853580 ; 
0.370000 124 0 ; pos 0.370000 ; f0 117.105770 ; 
0.380000 124 0 ; pos 0.380000 ; f0 116.747730 ; 
0.390000 124 0 ; pos 0.390000 ; f0 119.252310 ; 
0.400000 124 0 ; pos 0.400000 ; f0 120.735070 ; 
0.410000 124 0 ; pos 0.410000 ; f0 122.259190 ; 
0.420000 124 0 ; pos 0.420000 ; f0 124.512020 ; 
0.430000 124 0 ; pos 0.430000 ; f0 126.476430 ; 
0.440000 124 0 ; pos 0.440000 ; f0 121.600880 ; 
0.450000 124 0 ; pos 0.450000 ; f0 109.589040 ; 
0.560000 124 0 ; pos 0.560000 ; f0 148.519490 ; 
0.570000 124 0 ; pos 0.570000 ; f0 147.093260 ; 
0.580000 124 0 ; pos 0.580000 ; f0 149.393750 ; 
0.590000 124 0 ; pos 0.590000 ; f0 152.566530 ; 
0.670000 124 0 ; pos 0.670000 ; f0 114.544910 ; 
0.680000 124 0 ; pos 0.680000 ; f0 119.156750 ; 
0.690000 124 0 ; pos 0.690000 ; f0 120.519990 ; 
0.700000 124 0 ; pos 0.700000 ; f0 121.357320 ; 
0.710000 124 0 ; pos 0.710000 ; f0 121.615970 ; 
0.720000 124 0 ; pos 0.720000 ; f0 120.752700 ; 
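Each body line carries its feature values after semicolons, as "pos" and "f0" name/value pairs. The following Python sketch (again our own illustration, not Festival code) reads such a Target file back into (position, F0) tuples:

```python
def parse_targets(lines):
    """Parse Target label lines of the form
    'time 124 0 ; pos time ; f0 value ;' into (pos, f0) tuples."""
    targets = []
    in_body = False
    for line in lines:
        line = line.strip()
        if line == "#":
            in_body = True
            continue
        if not in_body or not line:
            continue
        # split on ';' -> ['time 124 0', 'pos time', 'f0 value']
        fields = [f.split() for f in line.split(";") if f.strip()]
        # feature name/value pairs follow the initial time field
        feats = {f[0]: float(f[1]) for f in fields[1:]}
        targets.append((feats["pos"], feats["f0"]))
    return targets

example = [
    "#",
    "0.070000 124 0 ; pos 0.070000 ; f0 133.045230 ; ",
    "0.080000 124 0 ; pos 0.080000 ; f0 129.067890 ; ",
]
print(parse_targets(example))
```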

This file was generated from a waveform using the following command:

pda -s 0.01 -otype ascii -fmax 160 -fmin 70 wav/utt003.wav |
awk 'BEGIN { printf("#\n") }
     { if ($1 > 0)
         printf("%f 124 0 ; pos %f ; f0 %f ; \n",
                NR*0.010,NR*0.010,$1) }' >Targets/utt003.Target
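The awk step above simply converts one F0 value per 0.01-second frame (0 meaning unvoiced) into the Target format, skipping unvoiced frames. It can be sketched equivalently in Python (an illustrative rewrite under that assumption; `f0_to_targets` is our own name):

```python
def f0_to_targets(f0_values, frame_shift=0.01):
    """Convert per-frame F0 values (0 = unvoiced) into the Target
    label format used above, dropping unvoiced frames."""
    out = ["#"]
    for i, f0 in enumerate(f0_values, start=1):
        if f0 > 0:  # awk: if ($1 > 0)
            t = i * frame_shift  # awk: NR*0.010
            out.append("%f 124 0 ; pos %f ; f0 %f ; " % (t, t, f0))
    return "\n".join(out) + "\n"

print(f0_to_targets([0.0, 133.045230, 129.067890]))
```

The 0.01 s frame shift matches the `-s 0.01` option given to pda.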

The utterance may then be rendered as:

festival> (set! utt1 (resynth "lab/utt003.lab" "Targets/utt003.Target"))

Note that this method will lose a little in diphone selection. If your diphone database uses consonant cluster allophones, it won't be possible to detect these properly, as there is no syllabic structure in the label file. That may or may not be important to you. Even this simple method, however, clearly shows how important the right prosody is to the understandability of a string of phones.

We have successfully done this on a number of natural utterances. We extracted the labels automatically using the aligner discussed in the diphone chapter. As we were using diphones from the same speaker as the natural utterances (KAL), the alignment is surprisingly good and trivial to do. You must, however, synthesize the utterance first and save the waveform and labels. Note that you should listen to the result to ensure that the synthesizer has generated the right labels (as far as that is possible), including breaks in the same places. Comparing synthesized utterances with natural ones quickly shows up many problems in synthesis.