The basic building block for Festival is the utterance. Its
structure consists of a set of relations over a set of
items. Each item represents an object such as a word, segment
or syllable, while relations relate these items together. An item may
appear in multiple relations; for example, a segment will be in a
Segment relation and also in the SylStructure relation.
Relations define an ordered structure over the items within them. In
general these may be arbitrary graphs, but in practice so far we have
only used lists and trees. Items may contain a number of features.
There are no built-in relations in Festival; their names and use are controlled by the particular modules used to do synthesis. Language-, voice- and module-specific relations can easily be created and manipulated. However, within our basic voices we have followed a number of conventions that you should also follow if you wish to use some of the existing modules.
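The idea that one item can belong to several relations at once can be illustrated with a small model. The following is only a sketch in Python, not Festival's own API: the `Item` and `Utterance` classes, the `append` and `relations_of` helpers and the `end` feature are all hypothetical names chosen for illustration, and every relation is simplified to a flat list.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Item:
    """An utterance object (word, syllable, segment, ...) with named features."""
    features: dict = field(default_factory=dict)


class Utterance:
    """A set of named relations, each an ordered list over shared items."""

    def __init__(self):
        self.relations = {}  # relation name -> ordered list of items

    def append(self, relation, item):
        # Relations are created on demand; none are built in.
        self.relations.setdefault(relation, []).append(item)
        return item

    def relations_of(self, item):
        # Names of every relation the given item appears in (by identity).
        return [name for name, items in self.relations.items()
                if any(i is item for i in items)]


utt = Utterance()
seg = Item({"name": "k"})
utt.append("Segment", seg)
utt.append("SylStructure", seg)  # same item, second relation
seg.features["end"] = 0.12       # illustrative feature, set once on the shared item

print(utt.relations_of(seg))     # ['Segment', 'SylStructure']
```

Because both relations hold the same object rather than a copy, a feature set through one relation is visible through the other.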
The relation names used will depend on the particular structure chosen for your voice. So far most of our released voices have the same basic structure, though some of our research voices contain quite a different set of relations. For our basic English voices the relations used are as follows.
Text
Contains a single item with a feature holding the input character string that is being synthesized.
Token
A list of trees where the root of each tree is a whitespace-separated token from the input character string. Punctuation and whitespace have been stripped and placed in features on these token items. The daughters of each root are the words that the token is associated with. In many cases this is a one-to-one relationship, but in general it is one to zero or more. For example, tokens consisting of digits will typically be associated with a number of words.
Word
The words in the utterance. By word we typically mean something
that can be given a pronunciation from a lexicon (or letter-to-sound
rules). However, in most of our voices pronunciation is distinguished by
the word together with a part-of-speech feature. Words will also be
leaves of the Token relation, leaves of the
Phrase relation and roots of the SylStructure relation.
Phrase
A simple list of trees representing the prosodic phrasing of the
utterance. In our voices we only have one level of prosodic phrase
below the utterance (though you can easily add a deeper hierarchy
if your models require it). The tree roots are labeled with
the phrase type and the leaves of these trees are in the Word relation.
Syllable
A simple list of syllable items. These syllable items are intermediate
nodes in the
SylStructure relation, allowing access to the words
these syllables are in and the segments that are in these syllables.
In this format no further onset/coda distinction is made explicit, but it
can be derived from this information.
Segment
A simple list of segment (phone) items. These form the leaves of the
SylStructure relation, through which we can find where each
segment is placed within its syllable and word. By convention
silence phones do not appear in any syllable (or word) but will
exist in the Segment relation.
SylStructure
A list of tree structures over the items in the Word, Syllable and Segment relations.
IntEvent
A simple list of intonation events (accents and boundaries).
These are related to syllables through the Intonation relation.
Intonation
A list of trees whose roots are items in the Syllable relation
and whose daughters are in the
IntEvent relation. It is assumed that a
syllable may have a number of intonation events associated with it (at
least accents and boundaries), but an intonation event may only be
associated with one syllable.
Wave
A relation consisting of a single item that has a feature with the synthesized waveform.
Target
A list of trees whose roots are segments and whose daughters are F0 target points. This is only used by some intonation modules.
Unit, SourceSegments, Frames, SourceCoef, TargetCoef
A number of relations used by the UniSyn module.
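The tree-shaped relations described above can be sketched in the same spirit. The following Python model is an illustration under stated assumptions, not Festival code: the `Item` class, `add_daughter` and `leaves` are hypothetical helpers, and the token "21" with its phone strings is made-up example data. It shows a digit token expanding to two words in Token, the word/syllable/segment tree of SylStructure, and a silence phone that sits in the Segment list only.

```python
class Item:
    """A shared utterance object; tree links are stored per relation name."""

    def __init__(self, name):
        self.name = name
        self.parent = {}     # relation name -> parent item
        self.daughters = {}  # relation name -> ordered list of daughter items

    def add_daughter(self, relation, child):
        self.daughters.setdefault(relation, []).append(child)
        child.parent[relation] = self
        return child


def leaves(item, relation):
    """Leaves below `item` in `relation`, in left-to-right order."""
    kids = item.daughters.get(relation, [])
    if not kids:
        return [item]
    return [leaf for kid in kids for leaf in leaves(kid, relation)]


# Token relation: one token root whose daughters are the words it expands to.
token = Item("21")
words = [token.add_daughter("Token", Item(w)) for w in ("twenty", "one")]

# Illustrative syllabified phone strings for each word (assumed data).
syllabified = (
    [["t", "w", "eh", "n"], ["t", "iy"]],  # "twenty"
    [["w", "ah", "n"]],                     # "one"
)

# SylStructure relation: word roots, syllable nodes, segment leaves.
segment_list = [Item("pau")]  # leading silence: in the Segment list only
for word_item, syllables in zip(words, syllabified):
    for phones in syllables:
        syl = word_item.add_daughter("SylStructure", Item("syl"))
        for p in phones:
            segment_list.append(syl.add_daughter("SylStructure", Item(p)))

# From a segment, climb SylStructure to its syllable and then to its word.
seg = segment_list[5]  # the "t" that starts the second syllable of "twenty"
owner = seg.parent["SylStructure"].parent["SylStructure"]

print(owner.name)                                           # twenty
print([s.name for s in leaves(words[0], "SylStructure")])   # ['t', 'w', 'eh', 'n', 't', 'iy']
print("SylStructure" in segment_list[0].parent)             # False
```

Keeping the parent and daughter links keyed by relation name is one way to let the same word item be a leaf in Token and a root in SylStructure without the two trees interfering.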