[ diagram: text going in and moving around, coming out audio ]
Within Festival we can identify three basic parts of the TTS process:
- From raw text to identified words and basic utterances.
- Finding pronunciations of the words and assigning prosodic structure to them: phrasing, intonation and durations.
- From a fully specified form (pronunciation and prosody) generate a waveform.
There is another part to TTS which is normally not mentioned. We mention it here because it is the aspect of Festival that does most to make building new voices possible: the system architecture. Festival provides a basic utterance structure, a language to manipulate it, and methods for construction and deletion; it also interacts with your audio system in an efficient way, spooling audio files while the rest of the synthesis process continues. With the Edinburgh Speech Tools, it offers basic analysis tools (pitch trackers, classification and regression tree builders, waveform I/O, etc.) and a simple but powerful scripting language. All of these functions let you get on with the task of building a voice, rather than worrying too much about the underlying software.
We try to model the voice independently of the meaning, with machine learning techniques and statistical methods. This is an important abstraction, as it moves us from the realm of "all human thought" to "all possible sequences." Rather than asking "when and why should this be said," we ask "how is this performed, as a series of speech sounds?" In general, we'll discuss this under the heading of text analysis -- going from written text, possibly with some mark-up, to a set of words and their relationships in an internal representation, called an utterance structure.
Text analysis is the task of identifying the words in the text. By words, we mean tokens for which there is a well defined method of finding their pronunciation, i.e. from a lexicon, or using letter-to-sound rules. The first task in text analysis is to make chunks out of the input text -- tokenizing it. In Festival, at this stage, we also chunk the text into more reasonably sized utterances. An utterance structure is used to hold the information for what might most simply be described as a sentence. We use the term loosely, as it need not be anything syntactic in the traditional linguistic sense, though it most often has prosodic boundaries or edge effects. Separating a text into utterances is important, as it allows synthesis to work bit by bit, making the waveform of the first utterance available more quickly than if the whole file were processed as one. Otherwise, one would simply play an entire recorded utterance -- which is not nearly as flexible, and in some domains is even impossible.
Utterance chunking is an externally specifiable part of Festival, as it may vary from language to language. For many languages, tokens are white-space separated and utterances can, to a first approximation, be separated after full stops (periods), question marks, or exclamation points. Further complications, such as abbreviations, other-end punctuation (as the upside-down question mark in Spanish), blank lines and so on, make the definition harder. For languages such as Japanese and Chinese, where white space is not normally used to separate what we would term words, a different strategy must be used, though both these languages still use punctuation that can be used to identify utterance boundaries, and word segmentation can be a second process.
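The first-approximation strategy described above (white-space tokens, utterance breaks after full stops, question marks or exclamation points) can be sketched in a few lines of Python. This is only an illustration, not Festival's implementation: the function names and the tiny abbreviation list are hypothetical, and a real system would need language-specific rules for the complications just mentioned.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for this
# sketch); a real system would use a larger, language-specific one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}


def tokenize(text):
    """Split text into white-space separated tokens."""
    return text.split()


def chunk_utterances(text):
    """Group tokens into utterances, ending one after '.', '?' or '!'.

    A token that is a known abbreviation does not end an utterance.
    """
    utterances = []
    current = []
    for token in tokenize(text):
        current.append(token)
        ends_sentence = token[-1] in ".?!"
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            utterances.append(current)
            current = []
    if current:
        # Trailing tokens without final punctuation form a last utterance.
        utterances.append(current)
    return utterances


if __name__ == "__main__":
    for utt in chunk_utterances("Dr. Smith arrived. Was he late? No!"):
        print(utt)
```

Because each utterance is returned as its own list, later stages could start synthesizing the first one before the rest of the text has been looked at, which is the point of chunking made above.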
Apart from chunking, text analysis also does text normalization. There are many tokens which appear in text that do not have a direct relationship to their pronunciation. Numbers are perhaps the most obvious example. Consider the following sentence:
On May 5 1996, the university bought 1996 computers.
In English, tokens consisting solely of digits have a number of different pronunciations. The "5" above is pronounced "fifth", an ordinal, because it is the day of a month. The first "1996" is pronounced "nineteen ninety six" because it is a year, and the second "1996" is pronounced "one thousand nine hundred and ninety six" (British English) because it is a quantity.
Two problems turn up here: the non-trivial relationship of tokens to words, and homographs, where the same token may have alternate pronunciations in different contexts. In Festival, homograph disambiguation is considered part of text analysis. In addition to numbers, there are many other symbols with internal structure that require special processing, such as money, times, addresses, etc. All of these can be dealt with in Festival by what are termed token-to-word rules. These are language specific (and sometimes text mode specific). Detailed examples will be given in the text analysis chapter below.
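To make the idea of token-to-word rules concrete, here is a minimal Python sketch that expands the digit tokens in the example sentence "On May 5 1996, the university bought 1996 computers." It is not Festival's rule format (Festival's own rules are written in its scripting language); every name here is hypothetical, and the context rules are assumptions chosen only to cover this example: a number after a month name is read as a day, a four-digit number after a day is read as a year, and anything else is read as a quantity.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def below_100(n):
    """Words for 0..99."""
    if n < 20:
        return [ONES[n]]
    return [TENS[n // 10]] + ([ONES[n % 10]] if n % 10 else [])


def cardinal(n):
    """British-style cardinal for 0..9999; larger numbers are read digit by digit."""
    if n > 9999:
        return [ONES[int(d)] for d in str(n)]
    words = []
    if n >= 1000:
        words += [ONES[n // 1000], "thousand"]
        n %= 1000
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n or not words:
        if words:
            words.append("and")
        words += below_100(n)
    return words


def ordinal(n):
    """Ordinal words, formed by changing the last word of the cardinal."""
    words = cardinal(n)
    last = words[-1]
    if last in IRREGULAR_ORDINALS:
        words[-1] = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        words[-1] = last[:-1] + "ieth"
    else:
        words[-1] = last + "th"
    return words


def year(n):
    """Read 1100..1999 as two pairs ("nineteen ninety six"); otherwise as a cardinal."""
    if 1100 <= n <= 1999 and n % 100 >= 10:
        return below_100(n // 100) + below_100(n % 100)
    return cardinal(n)


def tokens_to_words(text):
    """Expand digit tokens to words using the previous token as context."""
    words, previous = [], "other"
    for token in text.split():
        token = token.strip(".,;:!?")
        if not token:
            continue
        kind = "other"
        if token.isascii() and token.isdigit():
            n = int(token)
            if previous == "month" and 1 <= n <= 31:
                words += ordinal(n)      # day of the month
                kind = "day"
            elif previous == "day" and len(token) == 4:
                words += year(n)         # year following a day
            else:
                words += cardinal(n)     # plain quantity
        else:
            words.append(token)
            if token.lower() in MONTHS:
                kind = "month"
        previous = kind
    return words


if __name__ == "__main__":
    print(" ".join(tokens_to_words(
        "On May 5 1996, the university bought 1996 computers.")))
```

Note that the sketch treats every "May" as a month name, which is itself a homograph problem ("you may"); real token-to-word rules need more context than one preceding token.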
After we have a set of words to be spoken, we have to decide what the sounds should be -- what phonemes, or basic speech sounds, are spoken. Each language and dialect has a phoneme set associated with it, and the choice of this inventory is still not agreed upon; different theories posit different feature geometries. Given a set of units, we can, once again, train models from them, but it is up to linguistics (and practice) to help us find good levels of structure and the units at each.
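As a rough illustration of going from words to phonemes, the following Python sketch looks a word up in a lexicon and falls back to letter-to-sound rules when the word is missing. The three-word lexicon, the phoneme symbols and the one-letter-per-phoneme rules are all invented for this example; they are far cruder than anything a real voice would use, and the function name is hypothetical.

```python
# Toy lexicon: word -> phoneme list. Entries and symbols are assumptions
# made for this sketch only.
LEXICON = {
    "the": ["dh", "ax"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

# Naive letter-to-sound table: one phoneme per letter (an assumption; real
# rules depend on the surrounding letters).
LETTER_TO_SOUND = {
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f",
    "g": "g", "h": "hh", "i": "ih", "j": "jh", "k": "k", "l": "l",
    "m": "m", "n": "n", "o": "aa", "p": "p", "q": "k", "r": "r",
    "s": "s", "t": "t", "u": "ah", "v": "v", "w": "w", "x": "k",
    "y": "y", "z": "z",
}


def pronounce(word):
    """Return a phoneme list: lexicon first, letter-to-sound rules otherwise."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    # Fallback: map each known letter to a phoneme, skipping anything else.
    return [LETTER_TO_SOUND[ch] for ch in key if ch in LETTER_TO_SOUND]


if __name__ == "__main__":
    for w in ["The", "cat", "bat"]:
        print(w, pronounce(w))
```

The lexicon is consulted first because listed pronunciations are normally more reliable than rules; the rules exist so that an unknown word still receives some pronunciation.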
Prosody, or the way things are spoken, is an extremely important part of the speech message. Changing the placement of emphasis in a sentence can change the meaning of a word, and this emphasis might be revealed as a change in pitch, volume, voice quality, or timing.
We'll present two approaches to taming the prosodic beast: limiting the domain to be spoken, and intonation modeling. By limiting the domain, we can collect enough data to cover the whole output. For some things, like weather or stock quotes, very high quality can be produced, since these are rather contained. For general synthesis, however, we need to be able to turn any text, or perhaps concept, into a spoken form, and we can never collect all the sentences anyone could ever say. To handle this, we break the prosody into a set of features, which we predict using statistically trained models:
- phrasing
- duration
- intonation
- energy
- voice quality
For the case of concatenative synthesis, we actually collect recordings of voice talent, and this captures the voice quality to some degree. This way, we avoid detailed physical simulation of the oral tract, and perform synthesis by integrating pieces that we have in our inventory; as we don't have to produce the precisely controlled articulatory motion, we can model the speech using the units available in the sound alone -- though these are the surface realization of an underlying, physically generated signal, and knowledge of that system informs what we do. During waveform generation, the system assembles the units into an audio file or stream, and that can be finally "spoken." There can be some distortion as these units are joined together, but the results can also be quite good.
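The joining step can be pictured with a small Python sketch that concatenates units held as lists of samples. To soften the join it blends a few samples at each boundary with a linear crossfade. The crossfade is an assumption chosen here for illustration, not a description of what Festival does at unit boundaries, and the function names are hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return list(left) + list(right)
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        # Weight moves from mostly-left to mostly-right across the overlap.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * left[start + i] + w * right[i])
    return list(left[:start]) + blended + list(right[overlap:])


def concatenate_units(units, overlap=2):
    """Assemble a waveform from a sequence of units (lists of samples)."""
    waveform = []
    for unit in units:
        waveform = crossfade_join(waveform, unit, overlap)
    return waveform


if __name__ == "__main__":
    unit_a = [1.0, 1.0, 1.0, 1.0]
    unit_b = [0.0, 0.0, 0.0, 0.0]
    print(concatenate_units([unit_a, unit_b]))
```

Each join shortens the result by `overlap` samples, since the overlapping regions of the two units are merged rather than played one after the other.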
We systematically collect the units, in all variations, so as to be able to reproduce them later as needed. To do this, we design a set of utterances that contain all of the variation that produces meaningful or apparent contrast in the language, and record it. Of course, this requires a theory of how to break speech into relevant parts and their associated features; various linguistic theories predict these for us, though none are undisputed. There are several different possible unit inventories, and each has tradeoffs, in terms of size, speed, and quality; we will discuss these in some detail.
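One simple way to picture the design of such a set of utterances is as a coverage problem: given candidate prompts with known phoneme sequences, pick a small subset that contains every unit at least once. The Python sketch below uses adjacent phoneme pairs as the units and a greedy selection strategy. Both choices are assumptions made for illustration, the candidate prompts are invented, and the text above does not say that any particular prompt list was designed this way.

```python
def unit_pairs(phones):
    """Return the set of adjacent phoneme pairs in a phoneme sequence."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def select_prompts(candidates):
    """Greedily choose prompts until every pair seen in any candidate is covered.

    `candidates` maps a prompt's text to its phoneme list.
    """
    remaining = set().union(*(unit_pairs(p) for p in candidates.values()))
    chosen = []
    while remaining:
        # Pick the prompt that covers the most still-uncovered pairs.
        best = max(candidates,
                   key=lambda text: len(unit_pairs(candidates[text]) & remaining))
        chosen.append(best)
        remaining -= unit_pairs(candidates[best])
    return chosen


if __name__ == "__main__":
    # Hypothetical candidate prompts with invented phoneme strings.
    prompts = {
        "cat": ["k", "ae", "t"],
        "cats": ["k", "ae", "t", "s"],
        "sat": ["s", "ae", "t"],
    }
    print(select_prompts(prompts))
```

Greedy selection does not guarantee the smallest possible prompt list, but it is easy to reason about, and the tradeoff it exposes (fewer prompts to record versus more coverage per prompt) is the same size-versus-quality tradeoff mentioned above.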