

3 Introduction

In this chapter, we discuss some of the general aspects of building a new voice for the Festival Speech Synthesis System, and the requirements involved. We will outline each part that you must provide in order to have a complete voice. There are a number of choices for each part, each requiring a different amount of work; which you choose is ultimately your own decision, based on the resources you are willing to commit and the quality of voice you wish to build.

The development of the Festival Speech Synthesis System started in April 1996, and it quickly became a tool in which new voices could be built. That summer, a Masters student (Maria Walters) at Edinburgh tried to build a Scots Gaelic synthesizer, and, although the project was never finished, it did point the way to building new voices in Festival. Later that summer, a Spanish synthesizer was built rather quickly using an existing diphone set, and an existing Welsh synthesizer was migrated into Festival without changing any of the basic architecture of Festival, or of the existing synthesizer, though all parts became external data rather than embedded C code. The following year, another Masters student (Argyris Biris) built a Greek synthesizer, and a visiting scholar (Borja Etxebarria) at CSTR, Edinburgh improved the Spanish synthesizer and built a Basque one. The Oregon Graduate Institute (under the supervision of Mike Macon) developed further new voices, including American English and Mexican Spanish voices; unlike the earlier developments, they also included a new waveform synthesizer (in C++), which they distribute as a plug-in for Festival. In the summer of '98, Dominika Oliver built a Polish synthesizer, and OGI hosted a six-week workshop in which Karin Mueller, Bettina Saeuberlich and Horst Meyer built two German synthesis voices within the system.

In all cases, these new voices consist of a set of diphones and some scheme code to provide a front end, including text analysis, pronunciation, and prosody prediction. The voices are quite separate from Festival itself, and can be distributed as packages that can be installed against any installation of Festival. The voices do not interfere with any existing, installed voices.

The problem is that once people see that new voices can be created easily, they want to make their own too. Over the initial two and a half years, the processes involved have become easier and more automatic, to the point where a document specifically describing the processes from a practical point of view can be written.

Note that we do not consider the process of building a new voice (either in an existing supported language or in a new language) to be trivial. It requires choices that we cannot make for you. We think that a dedicated person with existing experience in speech processing, computational linguistics and/or programming can probably build a new voice in a week, though it will probably take longer than that. We are still very far from a state where we can capture a voice fully automatically from a poor quality microphone on a standard PC (with the TV on in the background) and come up with a voice that sounds anything near acceptable, though it can be done even on a laptop in a quiet environment.

We should state that this document does not cover all the background necessary to understand speech synthesis as a whole, as it is specifically about building voices in Festival. This touches on many research issues in the field, and appropriate references are given throughout. For a good background in speech synthesis and text-to-speech (TTS) systems in general, see dutoit97; for a more detailed description of a particular text-to-speech system (the Bell Labs text-to-speech system), see sproat98, which takes a much more research-oriented point of view.

This document by itself is not really suitable as the only text for a course on speech synthesis, but it should be a good adjunct to other materials on synthesis, or in a general speech processing course. It is also useful to anyone who would like to build limited domain synthesizers, or synthesizers from their own voices, and this can be done with a limited amount of knowledge about the field. Actually building a voice is a practical and useful way to understand the relative complexities of the various modules in the text-to-speech process. We do consider this document suitable for students to use in synthesis projects, and we have tried to provide at least some references and background material wherever it seems appropriate.

Although we hope that the voices built by following the instructions in this document will be useful in themselves, we also expect that those who build them will learn much more about the text-to-speech process.

3.1 Text-to-speech process

Within Festival we can identify three basic parts of the TTS process:

Text analysis:
From raw text to identified words and basic utterances.
Linguistic analysis:
Finding pronunciations of the words and assigning prosodic structure to them: phrasing, intonation and durations.
Waveform generation:
From a fully specified form (pronunciation and prosody) generate a waveform.

These partitions are not absolute, but they are a good way of chunking the problem. Of course, different waveform generation techniques may need different types of information. Pronunciation may not always use standard phones, and intonation need not necessarily mean an F0 contour. For the most part, at least along the path that is likely to produce a working voice (rather than the more research-oriented techniques described), the above three-way split will be fairly cleanly adhered to.

There is another part of TTS that is normally not mentioned; we will mention it here because it is the most important aspect of Festival that makes building new voices possible -- the system architecture. Festival provides a basic utterance structure, a language to manipulate it, and methods for construction and deletion; it also interacts with your audio system in an efficient way, spooling audio files while the rest of the synthesis process continues. With the Edinburgh Speech Tools, it offers basic analysis tools (pitch trackers, classification and regression tree builders, waveform I/O, etc.) and a simple but powerful scripting language. All of these functions let you get on with the task of building a voice, rather than worrying too much about the underlying software.
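
To give a flavour of that architecture, here is a minimal Scheme session at the Festival prompt that builds, synthesizes and inspects a single utterance; it assumes a default English voice such as voice_kal_diphone is installed.

     (voice_kal_diphone)                          ; select an installed voice
     (set! utt1 (Utterance Text "Hello world."))  ; build an utterance structure
     (utt.synth utt1)                             ; run the text, linguistic and waveform modules
     (utt.play utt1)                              ; send the waveform to the audio device
     ;; list the words that text analysis produced
     (mapcar (lambda (w) (item.name w))
             (utt.relation.items utt1 'Word))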

3.1.1 Text analysis

Text analysis is the task of identifying the words in the text. By words, we mean tokens for which there is a well defined method of finding their pronunciation, i.e. from a lexicon or using letter-to-sound rules. The first task in text analysis is to make chunks out of the input text -- tokenizing it. In Festival, at this stage, we also chunk the text into more reasonably sized utterances. An utterance structure is used to hold the information for what might most simply be described as a sentence. We use the term loosely, as it need not be anything syntactic in the traditional linguistic sense, though it most often has prosodic boundaries or edge effects. Separating a text into utterances is important, as it allows synthesis to work bit by bit, allowing the waveform of the first utterance to be available more quickly than if the whole file were processed as one. Simply playing back entire pre-recorded utterances, by contrast, is not nearly as flexible, and in some domains is impossible.

Utterance chunking is an externally specifiable part of Festival, as it may vary from language to language. For many languages, tokens are white-space separated, and utterances can, to a first approximation, be separated after full stops (periods), question marks, or exclamation points. Further complications, such as abbreviations, other-end punctuation (such as the upside-down question mark in Spanish), blank lines and so on, make the definition harder. For languages such as Japanese and Chinese, where white space is not normally used to separate what we would term words, a different strategy must be used, though both these languages still use punctuation that can identify utterance boundaries, and word segmentation can be a second process.
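
As a concrete illustration, utterance chunking in Festival is driven by a decision tree over the token stream. The following is a much-simplified sketch in the style of the default tree; the variable name eou_tree and the exact feature names are assumptions here, and a real tree must also deal with abbreviations before trusting a full stop.

     (defvar eou_tree
       '((punc in ("?" "!" ":"))
         ((1))                          ; certainly an utterance boundary
         ((punc is ".")
          ((n.name matches "[A-Z].*")   ; "." followed by a capitalised token
           ((1))
           ((0)))
          ((0)))))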

Apart from chunking, text analysis also does text normalization. There are many tokens which appear in text that do not have a direct relationship to their pronunciation. Numbers are perhaps the most obvious example. Consider the following sentence

On May 5 1996, the university bought 1996 computers.

In English, tokens consisting solely of digits have a number of different pronunciations. The `5' above is pronounced `fifth', an ordinal, because it is a day of the month. The first `1996' is pronounced `nineteen ninety-six' because it is a year, and the second `1996' is pronounced `one thousand nine hundred and ninety-six' (British English) because it is a quantity.

Two problems turn up here: the non-trivial relationship of tokens to words, and homographs, where the same token may have alternative pronunciations in different contexts. In Festival, homograph disambiguation is considered part of text analysis. In addition to numbers, there are many other symbols with internal structure that require special processing -- such as money, times, addresses, etc. All of these can be dealt with in Festival by what are termed token-to-word rules. These are language specific (and sometimes text mode specific). Detailed examples will be given in the text analysis chapter below.
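
Token-to-word rules are written in Scheme as a function from a token to a list of words. The sketch below shows the general shape such a function takes for the digit cases above; the helpers number_to_year and number_to_cardinal are hypothetical and would have to be written for the language concerned, and deciding between them in practice also requires looking at neighbouring tokens (e.g. a preceding month name).

     (define (mylang_token_to_words token name)
       "(mylang_token_to_words TOKEN NAME)
     Return a list of words for NAME, a string taken from TOKEN."
       (cond
        ((string-matches name "1[0-9][0-9][0-9]")   ; plausibly a year
         (number_to_year name))                     ; hypothetical helper
        ((string-matches name "[0-9]+")             ; any other digit string
         (number_to_cardinal name))                 ; hypothetical helper
        (t
         (list name))))                             ; by default, the token is the word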

3.1.2 Linguistic analysis

In this section, we consider both word pronunciation and prosody.

We assume that words have (largely) been properly identified at this stage, and that their pronunciation can be found by looking them up in a lexicon, or by applying some form of letter-to-sound rules to the letters in the word. We will present methods for automatically building letter-to-sound rules later in this document. For most languages, a machine-readable lexicon with pronunciations (and possibly lexical stress) will be necessary. A second stage in pronunciation involves modifications to standard pronunciations when they appear in continuous speech. Some pronunciations change depending on their context; for example, in British English, word-final /r/ is only pronounced if the following word is vowel-initial. These phenomena are dealt with by post-lexical rules, where modification of the standard lexical form depends on the context the word appears in.
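
For example, explicit entries can be added to the currently selected lexicon from Scheme; the entry below is only an illustration, and the phones and stress pattern assume an English (radio-style) phone set.

     ;; add an entry: word, part of speech, and syllables as (phones stress) pairs
     (lex.add.entry
      '("festival" n (((f eh s) 1) ((t ax) 0) ((v ax l) 0))))
     ;; look it up again; nil means any part of speech
     (lex.lookup "festival" nil)
     ;; words not found in the lexicon fall back to the letter-to-sound
     ;; rules associated with that lexicon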

By prosody we mean, to put it far too simply, phrasing, duration, intonation, and power. For many languages, intonation can be split into two stages: accent placement and F0 contour generation. Prosodic models are both language- and speaker-dependent, and we present methods to help build such models. Some of the models we present are very simple, and don't necessarily sound good, but they may be adequate for your task; other methods are more involved and require more effort. Considering that, even for well-researched languages like English, good, lively prosodic modeling is still an unreached goal, simpler, more limited models are often reasonable, unless you wish to undertake a significant amount of new research.
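
To give an idea of what a very simple model looks like, the following sketch assigns phrase breaks purely from punctuation using a hand-written CART. The tree format is Festival's, but the feature names and the phrase_cart_tree / Phrase_Method settings shown are assumptions based on current Festival voices rather than something introduced in this document yet.

     ;; big break (BB) at major punctuation, break (B) at minor punctuation,
     ;; no break (NB) elsewhere except at the end of the utterance
     (set! simple_phrase_cart_tree
      '((R:Token.parent.punc in ("?" "." ":"))
        ((BB))
        ((R:Token.parent.punc in ("'" "\"" "," ";"))
         ((B))
         ((n.name is 0)      ; no following word: end of utterance
          ((BB))
          ((NB))))))
     (set! phrase_cart_tree simple_phrase_cart_tree)
     (Parameter.set 'Phrase_Method 'cart_tree)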

3.1.3 Waveform generation

We will primarily be presenting concatenative methods for waveform generation, where we collect databases of real speech, select appropriate units, and concatenate them. These selected units are then typically altered by some form of signal processing to change their pitch and duration. Concatenative synthesis is not the only method of waveform synthesis; two other methods are formant synthesis, as in MITalk allen87 and hertz90, and articulatory synthesis. These three methods come from quite different directions, though ultimately, we believe, they will be joined together in a hybrid model of speech parameterization, trained from real data and combined in non-trivial ways. There's plenty of research to be done.

The methods presented in this document are less ambitious in their research goals. We cover the tasks involved in building diphone databases, limited domain synthesizers, and more general databases. It is also possible to use processes external to Festival to perform waveform synthesis -- the MBROLA system dutoit96 offers diphone databases for many languages. Festival can be used to provide text and linguistic analysis while MBROLA generates the waveform, if MBROLA already supports the language you wish to synthesize.

Given that database collection requires significant resources, an existing voice can, in a pinch, be made to speak another language. It retains many properties of its original language, but it may offer a quick and easy way to get synthesis in the new language. New languages can also be created with the aid of an existing synthesizer, by using it to obtain alignments for the new language.

3.2 Requirements

This section identifies the basic requirements for building a voice in a new language, and adding a new voice in a language already supported by Festival.

3.2.1 Hardware/software requirements

Because we are most familiar with a Unix environment, the scripts, tools, etc. assume such a basic environment. This is not to say you couldn't run these scripts on other platforms, as many of these tools are supported on platforms like WIN32; it's just that in our normal work environment Unix is ubiquitous and we like working in it. Festival also runs on Win32 platforms.

Much of the testing was done under Linux; wherever possible, we are using freely available tools. We are happy to say that no non-free tools are required to build voices, and we have included citations and/or links to everything needed in this document.

We assume Festival 1.4.1 and the Edinburgh Speech Tools 1.2.1.

Note that we make extensive use of the Speech Tools programs, and you will need the full distribution of them as well as Festival, rather than the run-time (binary-only) versions which are available for some Linux platforms. If you find the task of compiling Festival and the speech tools daunting, you will probably find the rest of the tasks specified in this document more so. However, it is not necessary to have any knowledge of C++ to make voices, though familiarity with text processing techniques (e.g. awk, sed, perl) will make understanding the examples given much easier.

We also assume a basic knowledge of Festival, and of speech processing in general. We expect the reader to be familiar with basic terms such as F0, phoneme, and cepstrum, but not in any real detail. References to general texts are given (when we know them to exist). A basic knowledge of programming in Scheme (and/or Lisp) will also make things easier. A basic capability in programming in general will make defining rules, etc., much easier.

If you are going to record your own database, you will need recording equipment: the higher the quality, the better. A proper recording studio is ideal, though one may not be available to everyone. A cheap microphone stuck on the back of a standard PC is not ideal, though we know most of you will end up doing that. A high-quality sound board, a close-talking, high-quality microphone and a nearly soundproof recording environment is often a workable compromise between these two extremes.

Many of the techniques described here require a fair amount of processing time, though machines are getting faster and this is becoming less of an issue. If you use the provided aligner for labelling diphones, you will need a processor of reasonable speed, and likewise for the various training techniques for intonation, duration modeling and letter-to-sound rules. Nothing presented here takes weeks, though a number of processes may be overnight jobs, depending on the speed of your machine.

Also, we think you will need a little patience. The process of building a voice is not necessarily going to work the first time. It may even fail completely, so if you don't expect anything special, you won't be disappointed.

3.2.2 Voice in a new language

The following list is a basic check list of the core areas you will need to provide pieces for. You may, in some cases, get away with very simple solutions (e.g. fixed phone durations), or be able to borrow from other voices/languages, but whatever you end up doing, you will need to provide something for each part.

You will need to define the following (a sketch of how these pieces come together in a voice definition function is given after the list):

   a phone set for the language
   token processing rules (numbers, abbreviations, etc.)
   a prosodic phrasing method
   word pronunciation (a lexicon and/or letter-to-sound rules)
   intonation (accent placement and F0 contour generation)
   durations
   a waveform synthesizer (e.g. a diphone database)
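
The following Scheme skeleton shows, very roughly, where each of these pieces plugs in when a voice is defined. The names here (mylang, the token and phrase functions, the speaker) are placeholders rather than part of any released voice, and the exact set of calls varies with the waveform synthesis method chosen.

     (define (voice_mylang_speaker_diphone)
       "(voice_mylang_speaker_diphone)
     Select this (hypothetical) voice, setting up each module in turn."
       (voice_reset)
       (Parameter.set 'Language 'mylang)
       ;; phone set
       (Parameter.set 'PhoneSet 'mylang)
       (PhoneSet.select 'mylang)
       ;; text analysis: token-to-word rules
       (set! token_to_words mylang_token_to_words)
       ;; pronunciation: lexicon and letter-to-sound rules
       (lex.select "mylang")
       ;; prosody: phrasing, intonation and duration (simple defaults)
       (Parameter.set 'Phrase_Method 'cart_tree)
       (set! phrase_cart_tree simple_phrase_cart_tree)
       ;; waveform generation: diphone database setup would go here
       (set! current-voice 'mylang_speaker_diphone))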

3.2.3 Voice in an existing language

The most common case is when someone wants to make a synthesizer from their own voice. Note that the issues in modeling a particular speaker's voice are still open research problems. Much of the quality of a particular voice comes from the waveform generation method, but other aspects of a speaker, such as intonation, duration and pronunciation, are all part of what makes that person's voice sound like them. All of the general-purpose voices we have heard in Festival sound like the speaker they were recorded from (at least to those of us who know the speakers), but they do not have all the qualities of that person's voice, though they can be quite convincing for limited-domain synthesizers.

As a practical recommendation for making a new speaker in an existing supported language, you will need to consider

   the waveform generation method: recording, labelling and building a new diphone (or other unit) database from the speaker
   optionally, speaker-specific prosodic models (intonation and duration)
   optionally, speaker-specific pronunciation (lexicon additions and post-lexical rules)

Section 8.10 (US/UK English Walkthrough) deals specifically with building a new US or UK English voice. This is a relatively easy place to start, though of course we encourage reading this entire document.

Another possible solution to getting a new or particular voice is to do voice conversion, as is done at the Oregon Graduate Institute (OGI) kain98 and elsewhere. OGI have already released new voices based on this conversion and may release the conversion code itself, though the license terms are not the same as those of Festival or this document.

Another aspect of a new voice in an existing language is a voice in a new dialect. The requirements are similar to those of creating a voice in a new language. The lexicon and intonation probably need to change, as well as the waveform generation method (a new diphone database). Although much of the text analysis can probably be borrowed, be aware that simple things like number pronunciation can often change between dialects (cf. US and UK English).

We also do work on limited domain synthesis in the same framework. For limited domain synthesis, a reasonably small corpus is collected, and used to synthesize a much larger range of utterances in the same basic style. We give an example of recording a talking clock, which, although built from only 24 recordings, generates over a thousand unique utterances; these capture a lot of the latent speaker characteristics from the data.

3.3 Future

There is still much to be added to this document, both from the practical aspect of documenting currently known techniques for modeling voices, and also new research to make such modeling both better and more reliable.

Both these aspects are being considered, and we intend to regularly update this document as new techniques become more stable, and we get around to documenting and testing things which should be in this document already.

