

8 Lexicons

This chapter covers methods for finding the pronunciation of a word. The pronunciation comes either from a lexicon (a large list of words and their pronunciations) or from letter to sound rules.

8.1 Word pronunciations

A pronunciation in Festival requires not just a list of phones but also a syllabic structure. In some languages the syllabic structure is very simple and well defined, and can be unambiguously derived from a phone string. In English, however, this may not always be the case (compound nouns being the difficult case).

The basic lexicon structure available in Festival takes both a word and a part of speech (an arbitrary token) to find the given pronunciation. For English this is probably the optimal form: although there are homographs in the language, the word itself and a fairly broad part of speech tag will mostly identify the proper pronunciation.

An example entry is

("photography"
 n
 (((f @) 0) ((t o g) 1) ((r @ f) 0) ((ii) 0)))

Note that in addition to the explicit marking of syllables a stress value is also given (0 or 1). In some languages lexical stress is fully predictable, in others it is highly irregular. In some languages this field may be more appropriately used for another purpose, e.g. tone type in Chinese.
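
Because an entry is an ordinary Scheme list, its parts can be pulled out with standard list operations. The following is a minimal sketch; the helper names `entry_syllables', `entry_phones' and `entry_stress' are hypothetical and not part of Festival, and the entry uses mrpa-style phones purely for illustration.

```scheme
;; Hypothetical helpers, not part of Festival itself.
(define (entry_syllables entry)
  "Return the syllable list (the third field) of a lexical ENTRY."
  (car (cdr (cdr entry))))

(define (entry_phones entry)
  "Return the flat list of phones in ENTRY, ignoring syllable boundaries."
  (apply append (mapcar car (entry_syllables entry))))

(define (entry_stress entry)
  "Return the stress values of ENTRY, one per syllable."
  (mapcar (lambda (syl) (car (cdr syl))) (entry_syllables entry)))

;; An entry in the format described above: (WORD POS SYLLABLES)
(define example_entry '("unknown" n (((uh n) 1) ((n ou n) 1))))

(entry_phones example_entry)   ; => (uh n n ou n)
(entry_stress example_entry)   ; => (1 1)
```

Keeping the stress value inside each syllable, rather than in a parallel list, means that any code that walks the syllables sees the phones and their stress together.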

There may be other languages which require a more complex (or less complex) format, and the decision to use some other format rather than this one is up to you.

Currently there is only residual support for morphological analysis in Festival. A finite state transducer based analyser for English, based on the work in ritchie92, is included in `festival/lib/engmorph.scm' and `festival/lib/engmorphsyn.scm', but this should be considered experimental at best. Given the lack of such an analyser, our lexicons need to list not only the base forms of words but also all their morphological variants. This is (more or less) acceptable in languages such as English or French, but for languages with richer morphology such as German it may seem an unnecessary requirement. For agglutinative languages such as Finnish and Turkish it appears to be even more of a restriction. This is probably true, but the current restriction is not necessarily hopeless. We have successfully built very good letter to sound rules for German, a language with a rich morphology, which allow the system to properly predict pronunciations of morphological variants of root words it has not seen before. We have not yet done any experiments with Finnish or Turkish but believe this technique would work (though of course developing a proper morphological analyser would be better).

8.2 Lexicons and addenda

The basic assumption in Festival is that you will have a large lexicon, tens of thousands of entries, that is used as a standard part of an implementation of a voice. Letter to sound rules are used as a back up when a word is not explicitly listed. This view is based on how English is best dealt with. However, this is a very flexible view. An explicit lexicon isn't necessary in Festival and it may be possible to do much of the work in letter to sound rules. This is how we have implemented Spanish. However, even when there is a strong relationship between the letters in a word and their pronunciation, we still find a lexicon useful. For Spanish we still use the lexicon for symbols such as `$', `%', individual letters, as well as irregular pronunciations.

In addition to a large lexicon Festival also supports a smaller list called an addenda. This is primarily provided to allow specific applications and users to add entries that aren't in the existing lexicon.
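
As a rough illustration, an addenda entry can be added with Festival's `lex.add.entry' and checked with `lex.lookup'. This sketch assumes a lexicon is already selected (for example because a voice has been loaded) and that its phone set includes the mrpa-style phones used here. The word and its pronunciation are made up for the example.

```scheme
;; Assumption: a lexicon is currently selected and uses a phone set
;; containing f, e, s, t, v, o and k (as mrpa does).
(lex.add.entry
 '("festvox" n (((f e s t) 1) ((v o k s) 0))))

;; Look the word up again; the second argument is the part of speech.
(lex.lookup "festvox" 'n)
```

Entries added this way have the same format as entries in the main lexicon, so an application can override or extend pronunciations without rebuilding the large lexicon.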

8.3 Out of vocabulary words

Because it's impossible to list all words in a natural language, for general text to speech you will need to provide some way to pronounce out of vocabulary words. In some languages this is easy but in others it is very hard. No matter what you do you must provide something, even if it is simply replacing the unknown word with the word `unknown' (or its local language equivalent). By default a lexicon in Festival will throw an error if a requested word isn't found. To change this you can set the lts_method. Most usefully you can set this to the name of a function, which takes a word and a part of speech specification and returns a word pronunciation as described above.

For example, if we are always going to return the word `unknown' but print a warning that the word is being ignored, a suitable function is

(define (mylex_lts_function word feats)
"Deal with out of vocabulary word."
  (format t "unknown word: %s\n" word)
  '("unknown" n (((uh n) 1) ((n ou n) 1))))

Note the pronunciation of `unknown' must be in the appropriate phone set. Also the syllabic structure is required. You need to specify this function for your lexicon as follows

(lex.set.lts.method 'mylex_lts_function)
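
To show where this call fits, here is a minimal sketch of setting up a small lexicon from scratch. The lexicon name "mylex" and the single entry are hypothetical. The sketch assumes that `mylex_lts_function' has been defined as above and that loading `mrpa_phones' from the Festival library defines a phone set named "mrpa", which may differ in your installation.

```scheme
;; Assumption: this library file defines the "mrpa" phone set.
(require 'mrpa_phones)

(lex.create "mylex")                      ; hypothetical lexicon name
(lex.set.phoneset "mrpa")
(lex.set.lts.method 'mylex_lts_function)  ; used when a word is not found

;; One explicit entry, stored in the addenda of "mylex".
(lex.add.entry '("hello" n (((h @) 0) ((l ou) 1))))

(lex.select "mylex")
(lex.lookup "hello" 'n)   ; found in the lexicon
(lex.lookup "zzyzx" 'n)   ; not found, so mylex_lts_function is called
```

The settings apply to the current lexicon, so the `lex.set.*' calls come after `lex.create' and before any lookups that rely on them.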

One level above merely identifying out of vocabulary words is to spell them. This of course isn't ideal, but it will allow the basic information to be passed over to the listener. This can be done with the out of vocabulary function, as follows.

(define (mylex_lts_function word feats)
"Deal with out of vocabulary words by spelling out the letters in the
word."
 (if (equal? 1 (length word))
     (begin
       ;; A single character that is itself missing from the lexicon:
       ;; stop here so that we do not recurse for ever.
       (format t "the character %s is missing from the lexicon\n" word)
       '("unknown" n (((uh n) 1) ((n ou n) 1))))
     ;; Build an entry (WORD POS SYLLABLES) from the letters' syllables.
     (list
      word
      'n
      (apply
       append
       (mapcar
        (lambda (letter)
         (car (cdr (cdr (lex.lookup letter 'n)))))
        (symbolexplode word))))))

A few points are worth noting in this function. It recursively calls the lexical lookup function on the characters in a word. Each letter should appear in the lexicon with its pronunciation (in isolation), but a check is made to ensure we don't recurse for ever. The symbolexplode function assumes that letters are single bytes, which may not be true for some languages, and that function would need to be replaced for such a language. Note that we append the syllables of each of the letters in the word. For long words this might be too naive, as there could be internal prosodic structure in such a spelling that this method would not allow for. In that case you would want the letters to be words, so the symbol explosion should happen at the token to word level. Also, the above function assumes that the part of speech for letters is n. This is only really important where letters are homographs in a language, where it can be used to distinguish which pronunciation you require (cf. `a' in English or `y' in French).

8.4 Letter to sound rules

For many languages there is a systematic relationship between the written form of a word and its pronunciation. For some languages this can be fairly easy to write down by hand. In Festival there is a letter to sound rule system that allows rules to be written. This rule system, described in detail in the Festival manual itself, is what you should use if you are going to write rules by hand. The automatic training method below produces CART trees which, although easy to interpret, are probably unsuitable as a notation for hand specification.

When writing a rule system it is often useful to do it in multiple passes. The Spanish diphone voice distributed as `festvox_ellpc11k.tar.gz' offers a good example of such a use. A set of cascaded LTS rule sets is used to transform the basic word into a fully accented, syllabified string of symbols, which is then converted into the bracketed form used by Festival. The levels are normalisation (downcasing and accent normalisation), conversion to pronunciation, syllabification, stress and finally identifying weak vowels. Splitting the conversion tasks like this can often make writing the rules much easier, though care should be taken to ensure you don't mix up what you think are letters and what you think are phones.

The LTS rule system is a little primitive and lacks some syntactic sugar (sets etc.) that would make writing rules easier. In their present form you need to be very explicit. Testing your rule set can be done in Festival in isolation (and should be done so, rather than by actual synthesis). The function lts.apply allows you to apply an LTS rule set to a word or list of symbols. See the manual and the Spanish example for more details.
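
The multi-pass idea and the use of lts.apply for testing can be sketched with two toy rule sets. This is an illustration only: the rule set names `toy_downcase' and `toy_pron' are hypothetical, only the letters needed for one word are covered, the phones are mrpa-style, and the rule syntax follows the form described in the Festival manual, so check it against your version.

```scheme
;; Pass 1: normalise case (only the letters used in the example).
(lts.ruleset
 toy_downcase
 ( )                      ; no sets defined
 (
  ( [ C ] = c )
  ( [ c ] = c )
  ( [ h ] = h )
  ( [ a ] = a )
  ( [ t ] = t )
 ))

;; Pass 2: letters to phones.  Rules are tried in order, so the
;; two-letter rule for "ch" comes before the rule for "c" alone.
(lts.ruleset
 toy_pron
 ( )
 (
  ( [ c h ] = ch )
  ( [ c ] = k )
  ( [ h ] = h )
  ( [ a ] = a )
  ( [ t ] = t )
 ))

;; Test the passes in isolation, then cascaded.
(lts.apply "Chat" 'toy_downcase)
(lts.apply (lts.apply "Chat" 'toy_downcase) 'toy_pron)
```

Testing each pass separately makes it easier to see whether a wrong result comes from the normalisation step or from the pronunciation rules, and keeps the letter symbols of one pass distinct from the phone symbols of the next.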

For some languages the writing of a rule system is too difficult. Although there have been many valiant attempts to do so for languages like English, life is basically too short to do this. Therefore we also include a method for automatically building LTS rule sets from a lexicon of pronunciations. The reasoning behind this method and some of its results are discussed in pagel98 and black98b.

The method produces a set of CART trees that predict a phone based on the letter context. It is not fully automatic but nearly so: Festival's implementation requires the hand seeding of which letters can go to which phones (irrespective of context). A full walk through is given in the Festival manual. That section may in fact be more appropriate for this document.

8.5 Post-lexical rules

In fluent speech word boundaries are often degraded in a way that causes co-articulation across boundaries. A lexical entry should normally provide pronunciations as if the word is being spoken in isolation. It is only once the word has been inserted into the context in which it is going to be spoken that co-articulatory effects can be applied.

Post-lexical rules are a general set of rules which can modify the segment relation (or any other part of the utterance for that matter) after the basic pronunciations have been found. In Festival post-lexical rules are defined as functions which are applied to the utterance after intonational accents have been assigned.

For example in British English word final /r/ is only produced when the following word starts with a vowel. Thus all other word final /r/s need to be deleted. A Scheme function that implements this is as follows

(define (plr_rp_final_r utt)
  (mapcar
   (lambda (s)
    (if (and (string-equal "r" (item.name s))  ;; this is an r
             ;; it is syllable final
             (string-equal "1" (item.feat s "syl_final"))
             ;; the syllable is word final
             (not (string-equal "0" 
                   (item.feat s "R:SylStructure.parent.syl_break")))
             ;; The next segment is not a vowel
             (string-equal "-" (item.feat s "n.ph_vc")))
        (item.delete s)))
   (utt.relation.items utt 'Segment)))
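
A rule like this only takes effect once it is registered with the post-lexical module. The sketch below assumes that your Festival version consults the variable `postlex_rules_hooks' for this, as the standard distribution does. The wrapper name `plr_rp_final_r_hook' is hypothetical. It returns the utterance as a precaution, in case the value returned by one hook is passed on to the next.

```scheme
;; Hypothetical wrapper: run the rule, then return the utterance so
;; that any hook applied afterwards still receives an utterance.
(define (plr_rp_final_r_hook utt)
  "Apply plr_rp_final_r to UTT and return UTT."
  (plr_rp_final_r utt)
  utt)

;; Assumption: postlex_rules_hooks holds the function (or list of
;; functions) applied as post-lexical rules.  Note that this set!
;; replaces any hooks a voice has already installed.
(set! postlex_rules_hooks (list plr_rp_final_r_hook))
```

In a real voice this setting would normally live in the voice definition, so that each voice installs only the rules appropriate to its dialect.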

In English we also use post-lexical rules for phenomena such as vowel reduction and schwa deletion in the possessive `'s'.

