I. Speech Synthesis

This book is about spoken language output -- how to build a synthetic voice that can express a range of concepts in a given domain, whether the system covers a small, easily describable domain or must handle an unlimited set of possible inputs.

The Festival Speech Synthesis System, and more specifically, the FestVox voice building tools, provide the framework for the discussion. All these tools, Festival, and all other applications and databases given in examples here are available as open source and/or free software.

This version is still a very early draft: it changes at random from high-level discussion to low-level code aspects without the appropriate introduction of the material. As it stands, this version is most useful to people with some speech processing experience who want to build voices in the Festival Speech Synthesis System and have some patience to work through these notes. These sections will become more general introductions in later versions.

Over the last several years, speech technology and the computational resources needed to carry out the necessary (or popular) algorithms have advanced considerably. We can build systems that interact through speech: they can listen to what was said using speech recognition, compute or do something, and then speak back using spoken language generation.

Language is a fundamental part of everyday life. Whether we are using speech, sign language, or perhaps a coding system that conveys meaning through touch, we use language to express our thoughts, intentions, reactions, and experiences -- often scarcely considering that we are even speaking, but rather feeling as if we were engaging in direct manipulation of the notions with our conversation partner.

The complexity of the underlying systems involved in speech rarely enters our minds as we go about our business, and it seems as if this fosters an unspoken attitude: "It's easy to talk, it should be easy to make a machine that talks." By the time you finish this book, you should be aware of the components of the process, the relative effort and expertise involved in each component, and where the trade-offs lie in speech synthesis today. Our hope is that you will take away, even with a light reading, an appreciation of the fundamental aspects of speech communication, and how one might, with this modality, make machines more useful.

The book is organized into five major parts: Overview and Use of Speech Output, Building Synthetic Voices, Interfacing and Integration, detailed Recipes for building voices, and our concluding remarks, which critique the state of the art and discuss a number of interesting issues that remain open in speech science and synthesis technology.

Part I is a speech science and technology overview, primarily as it relates to speech-interactive systems, and how to use speech synthesis. We will be using the Festival Speech Synthesis System throughout the book, and it is presented here. If you are familiar with speech synthesis, you might want to skip to the discussion of Festival in the Chapter called A Practical Speech Synthesis System; if you're familiar with Festival, you may wish to go directly to Part II.

In the Chapter called Overview of Speech Synthesis, we start with an introduction to speech in general, the role of spoken language generation, and, in particular, the basic issues in speech synthesis: text analysis, prosody, and audio waveform generation. Then, in the Chapter called Speech Science, we discuss speech science and technology in greater detail, introducing the basic tools, concepts, and terms involved. The Chapter called A Practical Speech Synthesis System is on using the Festival Speech Synthesis System -- as a system, its basic use, and the modules and architecture. The utterance structure we use is also discussed here.

Part II is on Building Synthetic Voices; here, we start looking at the process of building a voice, and illustrate it with demonstrations and walk-throughs using the FestVox voice building tools. The Chapter called Limited domain synthesis covers limited domain synthesis, in which high quality can be achieved at the expense of flexibility, if you know what the system is going to say in advance. In the Chapter called Text analysis, we then go into general synthesis, which includes Text-to-Speech (TTS); the system should be able to speak, no matter what it is given, with some expected level of quality -- even if given only a string of text. It is in this chapter that we expose most of the components of a general purpose synthesizer, such as the text processing module, prosodic models that predict tune and timing, lexicons and pronunciation modeling, and issues with data collection and recording in order to create voices.

The Chapter called Waveform Synthesis is on various techniques for audio waveform generation, once the system knows the final parameters of the utterance. Evaluation and tuning of voices are discussed in the Chapter called Evaluation and Improvements.

In Part III, we get to Interfacing and Integration issues. The Chapter called Markup covers mark-up languages, or mark-up components of standards, such as SABLE, JSML, and VoiceXML, and how these can be used and implemented for synthesis. Then, in the Chapter called Concept-to-speech, we go into some detail on Concept-to-Speech systems, and so touch on language generation and the benefits of coupling the language generation and synthesis components in terms of spoken output quality. The Chapter called Deployment is a discussion of deployment issues, such as client/server issues, the memory and disk footprint, runtime constraints, scaling up and pruning systems down to meet the available resources, and signal compression.

Although step-by-step examples will be discussed inline throughout the various chapters, Part IV offers complete, detailed walkthroughs of the actual steps involved in a number of larger examples. The Chapter called A Japanese Diphone Voice gives a complete example of building a Japanese diphone synthesizer, and the Chapter called US/UK English Diphone Synthesizer covers UK and US English diphone synthesizers, while the Chapter called ldom full example covers the design and execution of building a limited domain synthesizer for ...

We follow with Part V, with a discussion of the state of the art and the future of spoken language output, and then a set of appendices.