
From text to speech: The MITalk system

came from a more extensive analysis of the speech of one of the male subjects
(since data averaged across several male and female talkers would probably not
make for a very good synthetic talker). Subjects read a list of 336 different CVC
nonsense syllables once, except for the designated talker (DHK) who read the list
twice on three separate occasions.

The kind of analysis that was performed on the data base is illustrated in
Figure 11-1. The speech was low-pass filtered at 4.9 kHz and digitized at 10,000
samples per second. Linear prediction spectra were computed at a number of
(hand-selected) locations in a syllable. The waveform segment, such as the one
shown at the top in Figure 11-1, was first differenced (to attenuate very
low-frequency background noise) and multiplied by a Kaiser window (Beta=7.0) prior to
11-pole linear prediction analysis. The linear prediction spectrum is shown at the
bottom of the figure along with the discrete Fourier transform. The 25.6 msec
time-weighting window has an effective averaging duration of about 10 msec. The
same window was used at all analysis points, except during the sustained frication
noise of fricatives, where the window duration was increased so as to better
estimate the spectral characteristics of the noise.
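
The analysis chain described above (differencing, Kaiser weighting, 11-pole linear prediction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which linear prediction formulation was used, so the autocorrelation method with the Levinson-Durbin recursion is assumed here, and the test signal and all function names (make_test_segment, lpc_frame, and so on) are hypothetical. The sampling rate, window duration, Beta, and pole count are taken from the text.

```python
import math
import random

FS = 10_000        # samples per second (from the text)
WINDOW_MS = 25.6   # window duration in msec (from the text)
ORDER = 11         # number of poles (from the text)
BETA = 7.0         # Kaiser window parameter (from the text)
N_SAMPLES = round(FS * WINDOW_MS / 1000)  # 25.6 msec at 10 kHz


def bessel_i0(x, terms=40):
    """Zeroth-order modified Bessel function, summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def kaiser_window(n, beta):
    """Kaiser window of length n with shape parameter beta."""
    denom = bessel_i0(beta)
    window = []
    for i in range(n):
        t = 2.0 * i / (n - 1) - 1.0          # runs from -1 to +1
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - t * t))) / denom)
    return window


def autocorrelation(x, max_lag):
    """Autocorrelation of x at lags 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                        # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lpc_frame(segment, order=ORDER, beta=BETA):
    """Difference, Kaiser-weight, and fit an all-pole model to one segment."""
    diffed = [segment[i] - segment[i - 1] for i in range(1, len(segment))]
    window = kaiser_window(len(diffed), beta)
    weighted = [s * w for s, w in zip(diffed, window)]
    return levinson_durbin(autocorrelation(weighted, order), order)


def make_test_segment(n, fs=FS, seed=0):
    """Hypothetical stand-in signal: two sinusoids plus a little noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * 500 * i / fs)
            + 0.5 * math.sin(2 * math.pi * 1500 * i / fs)
            + 0.05 * rng.gauss(0.0, 1.0)
            for i in range(n)]


# One extra sample is taken so that differencing leaves a full-length frame.
segment = make_test_segment(N_SAMPLES + 1)
coeffs, residual_energy = lpc_frame(segment)
print(len(coeffs) - 1, "poles; residual energy", residual_energy)
```

The returned coefficients define the predictor polynomial A(z); the linear prediction spectrum is the magnitude of 1/A(z) evaluated around the unit circle.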

Spectral samples were obtained: 1) during the consonantal steady state (or at
burst onset for a plosive), 2) at voicing onset (or early in the consonant-vowel
transition for voiced consonants), and 3) shortly after the end of the consonant-vowel
transition. Formant frequencies were also estimated by locating the peaks in a
linear prediction spectrum. Formant motions were plotted every 10 msec during
voiced portions of syllables. Intensity and fundamental frequency were also
estimated and plotted as a function of time.
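
Estimating formants by locating peaks in a linear prediction spectrum can be illustrated with a short Python sketch. Everything here is an assumption for demonstration purposes: the all-pole model is built directly from three made-up resonances (500, 1500, and 2500 Hz, with made-up bandwidths) rather than from measured speech, and the helper names (resonator_poly, lp_spectrum_db, pick_peaks) and the 10 Hz frequency grid are hypothetical. Only the 10 kHz sampling rate comes from the text.

```python
import cmath
import math

FS = 10_000  # samples per second (from the text)


def resonator_poly(freq_hz, bw_hz, fs=FS):
    """Denominator [1, a1, a2] of a two-pole resonance."""
    radius = math.exp(-math.pi * bw_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return [1.0, -2.0 * radius * math.cos(theta), radius * radius]


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def lp_spectrum_db(a, freqs_hz, fs=FS):
    """Magnitude of 1/A(z) on the unit circle, in dB."""
    spectrum = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / fs
        a_of_z = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        spectrum.append(-20.0 * math.log10(abs(a_of_z)))
    return spectrum


def pick_peaks(freqs_hz, spectrum_db):
    """Frequencies of the interior local maxima of the spectrum."""
    return [freqs_hz[i] for i in range(1, len(spectrum_db) - 1)
            if spectrum_db[i - 1] < spectrum_db[i] >= spectrum_db[i + 1]]


# Illustrative (formant, bandwidth) pairs in Hz; not values from the text.
RESONANCES = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]

a = [1.0]
for freq, bw in RESONANCES:
    a = poly_mul(a, resonator_poly(freq, bw))

freqs = [10.0 * i for i in range(501)]  # 0 to 5000 Hz in 10 Hz steps
peaks = pick_peaks(freqs, lp_spectrum_db(a, freqs))
print("estimated formants (Hz):", peaks)
```

Simple peak picking of this kind can miss a formant when two resonances lie close enough together to merge into a single spectral peak, which is one reason hand inspection of the spectra remains useful.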

In this chapter, it is only possible to present some of the highlights of the
analyses. For example, Figure 11-2 presents first and second formant frequency
trajectories of sixteen vowel nuclei, as averaged across all consonantal
environments for the designated talker. Most of the vowels appear to be diphthongized to
some extent. (The true diphthongs are shown with dashed vectors.) In particular,
it is a characteristic of this common midwestern dialect to terminate the short
vowels IH, EH, AE, and UH in a schwa-like offglide. These average data for
vowels are used as a starting point for consonant-vowel synthesis.

Analysis of consonants led to two major conclusions concerning the form
of rules appropriate for synthesis of a consonant before any vowel:

1. Some consonants, particularly obstruents, take on significantly
different characteristics depending on whether the following vowel is a
front vowel, a back unrounded vowel, or a back rounded vowel.
