Labeling Speech

In the early days of concatenative speech synthesis every recorded prompt had to be hand labeled. Although a significant task, requiring considerable skill and mind-bogglingly tedious, it was feasible when databases were relatively small and the time to build a voice was measured in years. With the increase in database size and the demand for much faster turnaround, we have moved away from hand labeling to automatic labeling.

In this section we only touch on the aspects of what we need labeled in recorded data, but discuss in more depth the techniques available for labeling it. As discussed before, phonemes are a useful but incomplete inventory of units to be identified; other aspects such as lexical stress, prosody, and allophonic variation are certainly worthy of consideration.

In labeling recorded prompts for synthesis we rely heavily on the work that has been done in the speech recognition community. For synthesis, however, we have different goals. In ASR (automatic speech recognition) we are trying to find the most likely sequence of phones in a given acoustic observation. In synthesis labeling we already know the sequence of phones spoken, assuming the voice talent spoke the prompt properly, and wish to find out where those phones are in the signal. We care very deeply about the boundaries of segments, while ASR can achieve adequate performance by concerning itself only with their centers, and has rightly been optimized for that.

There are other distinctions from the ASR task. In synthesis labeling we are concerned with a single speaker whose speech, if the synthesizer is going to work well, has been very carefully performed and consistently recorded. This makes the labeling task easier. However, in synthesis labeling we are also concerned with prosody and spectral variation, much more so than in ASR.

We discuss two specific techniques for labeling recorded prompts here, each of which has its advantages and limitations. The procedures for running them are discussed at the end of each section.

The first technique uses dynamic time warping to find the phone boundaries in a recorded prompt by aligning it against a synthesized utterance where the phone boundaries are known. This is computationally cheaper than the second technique and works well for small databases which do not have full phonetic coverage.

The second technique uses Baum-Welch training to build complete ASR acoustic models from the database. This takes some time, but if the database is phonetically balanced, as should be the case in databases designed for speech synthesis voices, it can work well. This technique can also work on databases in languages that do not yet have a synthesizer, for which the dynamic time warping technique is hard to apply without cross-language phone mapping techniques.
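Baum-Welch itself is standard material in the ASR literature; as a reminder of its shape, here is a minimal Python sketch of one re-estimation pass for a discrete-observation HMM. This is purely illustrative: real labeling systems train continuous-density HMMs over cepstral vectors and use scaling or log arithmetic for numerical stability, and the names here (pi, A, B, obs) are generic placeholders, not any toolkit's API.

  import numpy as np

  def forward(pi, A, B, obs):
      # alpha[t, j]: probability of obs[0..t] ending in state j
      N, T = len(pi), len(obs)
      alpha = np.zeros((T, N))
      alpha[0] = pi * B[:, obs[0]]
      for t in range(1, T):
          alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
      return alpha

  def backward(A, B, obs):
      # beta[t, i]: probability of obs[t+1..T-1] given state i at t
      N, T = A.shape[0], len(obs)
      beta = np.zeros((T, N))
      beta[T - 1] = 1.0
      for t in range(T - 2, -1, -1):
          beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
      return beta

  def baum_welch_step(pi, A, B, obs):
      # One EM re-estimation of (pi, A, B) from a single observation
      # sequence; real training iterates this over the whole database.
      alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
      T, N = len(obs), len(pi)
      likelihood = alpha[-1].sum()
      gamma = alpha * beta / likelihood          # state occupation
      xi = np.zeros((T - 1, N, N))               # transition counts
      for t in range(T - 1):
          xi[t] = (alpha[t][:, None] * A *
                   (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood
      new_pi = gamma[0]
      new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
      new_B = np.zeros_like(B)
      for k in range(B.shape[1]):
          new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
      new_B /= gamma.sum(axis=0)[:, None]
      return new_pi, new_A, new_B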

Labeling with Dynamic Time Warping

DTW (dynamic time warping) is a technique for aligning a new recording with a known one. It was used in early speech recognition systems, which had limited vocabularies, as it requires an acoustic signal for each word or phrase to be recognized. The technique is sometimes still used for matching two audio signals in command and control situations, for example in some cell phones for voice dialing.

What is important in DTW alignment is that it can deal with signals of varying durations. The idea has been around for many years, though its application to labeling in synthesis is relatively new. The work here is based on the details published in [malfrere].

Comparing raw acoustic signals is unlikely to give good results, so comparisons are done in the spectral domain. Following ASR techniques we use Mel Frequency Cepstral Coefficients (MFCCs) to represent the signal, and, also following ASR, we include delta MFCCs (the difference between the current MFCC vector and the previous one). For the DTW algorithm itself, however, the content of the vectors is somewhat irrelevant; they are merely treated as vectors.
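As an illustration, the following Python sketch computes such vectors. It assumes the librosa library and a hypothetical filename; the actual build tools do their own signal processing, and any comparable package would serve, since DTW treats the output merely as vectors.

  import numpy as np
  import librosa

  # Load the recorded prompt (filename is hypothetical)
  y, sr = librosa.load("prompt001.wav", sr=16000)
  # 13 MFCCs per frame, shape (13, num_frames)
  mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
  # Delta MFCCs: frame-to-frame differences of the MFCC track
  delta = librosa.feature.delta(mfcc)
  # Stack into one 26-dimensional vector per frame
  frames = np.vstack([mfcc, delta]).T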

The next stage is to define a distance function between two vectors. Conventionally we use the Euclidean distance, defined as

  d(v0, v1) = sqrt( sum_{i=1}^{n} (v0_i - v1_i)^2 )

Per-coefficient weights could be introduced into this distance too.
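In code the distance is a one-liner; this sketch (Python, like the other examples here) also shows where such optional weights would enter:

  import numpy as np

  def dist(v0, v1, w=None):
      # Euclidean distance between two frame vectors; w, if given,
      # holds per-coefficient weights.
      d = v0 - v1
      if w is not None:
          d = np.sqrt(w) * d
      return np.sqrt(np.sum(d * d))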

The search itself is best pictured as a large matrix, with the frames of one signal along one axis and the frames of the other along the other. The algorithm searches for the best path through this matrix. At each node it finds the distance between the two current vectors and sums it with the smallest of the three potential previous states: (i-1, j), (i, j-1), or (i-1, j-1). If the two signals were identical, the best path would be the diagonal through the matrix; if one part of a signal is shorter or longer than the corresponding part of the other, the best path will contain horizontal or vertical sections.

  [Figure: DTW search matrices showing example alignment paths]
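A minimal sketch of this search, with the backtrace that recovers the best path, might look as follows; nothing here is specific to any particular toolkit:

  import numpy as np

  def dtw(ref, new):
      # Fill the DTW cost matrix between two sequences of frame
      # vectors and return the best path as (i, j) index pairs.
      n, m = len(ref), len(new)
      cost = np.full((n + 1, m + 1), np.inf)
      cost[0, 0] = 0.0
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              d = np.linalg.norm(ref[i - 1] - new[j - 1])
              # Sum the local distance with the cheapest of the three
              # possible predecessors: (i-1,j), (i,j-1), (i-1,j-1)
              cost[i, j] = d + min(cost[i - 1, j],
                                   cost[i, j - 1],
                                   cost[i - 1, j - 1])
      # Trace back from the end to recover the alignment path
      path, i, j = [], n, m
      while i > 0 and j > 0:
          path.append((i - 1, j - 1))
          step = np.argmin([cost[i - 1, j - 1],
                            cost[i - 1, j],
                            cost[i, j - 1]])
          if step == 0:
              i, j = i - 1, j - 1
          elif step == 1:
              i -= 1
          else:
              j -= 1
      return path[::-1]

Given the path, each phone boundary known in the synthesized utterance (a frame index into ref) maps to the corresponding frame index in the recording; multiplying that index by the frame shift gives the boundary time in the recorded prompt.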
