Building Synthetic Voices

Alan W Black and Kevin A. Lenzo

Language Technologies Institute, Carnegie Mellon University
Wednesday, 24th December, 2014

Full distribution: festvox-2.7.0-release.tar.gz
Document alone: bsv.pdf
Updates and news on FestVox will be posted on

NOTE: this document is incomplete

Copyright © 1999-2014 Alan W Black & Kevin A. Lenzo

Table of Contents
I. Speech Synthesis
Overview of Speech Synthesis
Uses of Speech Synthesis
General Anatomy of a Synthesizer
Speech Science
A Practical Speech Synthesis System
Basic Use
Utterance structure
Utterance access
Utterance building
Extracting features from utterances
II. Building Synthetic Voices
Basic Requirements
Hardware/software requirements
Voice in a new language
Voice in an existing language
Selecting a speaker
Who owns a voice
Recording under Unix
Extracting pitchmarks from waveforms
Limited domain synthesis
Designing the prompts
Customizing the synthesizer front end
Autolabeling issues
Unit size and type
Using limited domain synthesizers
Telling the time
Making it better
Text analysis
Non-standard words analysis
Token to word rules
Number pronunciation
Homograph disambiguation
TTS modes
Mark-up modes
Word pronunciations
Lexicons and addenda
Out of vocabulary words
Building letter-to-sound rules by hand
Building letter-to-sound rules automatically
Post-lexical rules
Building lexicons for new languages
Building prosodic models
Accent/Boundary Assignment
F0 Generation
Prosody Research
Prosody Walkthrough
Corpus development
Non-Latin-script languages
Waveform Synthesis
Diphone databases
Diphone introduction
Defining a diphone list
Recording the diphones
Labeling the diphones
Extracting the pitchmarks
Building LPC parameters
Defining a diphone voice
Checking and correcting diphones
Diphone check list
Unit selection databases
Cluster unit selection
Building a Unit Selection Cluster Voice
Diphones from general databases
Statistical Parametric Synthesis
Building a CLUSTERGEN Statistical Parametric Synthesizer
Making it better: Mixed excitation and Random Forests
Labeling Speech
Labeling with Dynamic Time Warping
Labeling with Full Acoustic Models
Prosodic Labeling
Evaluation and Improvements
Does it work at all?
Formal Evaluation Tests
Debugging voices
III. Interfacing and Integration
IV. Recipes
Grapheme-based Synthesizer
General Grapheme-based Voices
Building Indic voices
Creating support for new Indic languages
A Japanese Diphone Voice
US/UK English Diphone Synthesizer
ldom full example
Non-English ldom example
V. Concluding Remarks
Concluding remarks and future
Festival Details
Festival's Scheme Programming Language
Data Types
Core functions
List functions
Arithmetic functions
I/O functions
String functions
System functions
Utterance Functions
Synthesis Functions
Debugging and Help
Adding new C++ functions to Scheme
Regular Expressions
Some Examples
Edinburgh Speech Tools
Machine Learning
Festival resources
General speech resources
Tools Installation
English phone lists
US phoneset
UK phoneset