From text to speech: The MITalk system

came from a more extensive analysis of the speech of one of the male subjects
(since data averaged across several male and female talkers would probably not
make for a very good synthetic talker). Subjects read a list of 336 different CVC
nonsense syllables once, except for the designated talker (DHK) who read the list
twice on three separate occasions.

The kind of analysis that was performed on the data base is illustrated in
Figure 11-1. The speech was low-pass filtered at 4.9 kHz and digitized at 10k
samples per second. Linear prediction spectra were computed at a number of
(hand-selected) locations in a syllable. The waveform segment, such as the one
shown at the top in Figure 11-1, was first differenced (to attenuate very low
frequency background noise) and multiplied by a Kaiser window (Beta=7.0) prior to
11-pole linear prediction analysis. The linear prediction spectrum is shown at the
bottom of the figure along with the discrete Fourier transform. The 25.6 msec
time-weighting window has an effective averaging duration of about 10 msec. The
same window was used at all analysis points, except during the sustained frication
noise of fricatives, where the window duration was increased so as to better
estimate the spectral characteristics of the noise.
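
The analysis chain described above (differencing, Kaiser weighting, 11-pole linear prediction) can be sketched as follows. This is only a minimal illustration, not the original analysis program. The sampling rate, the window parameter, the number of poles, and the 25.6 msec window (256 samples at 10k samples per second) come from the text. The autocorrelation method with a Levinson-Durbin recursion is an assumption, since the text does not say how the predictor was solved, and the test segment is a synthetic sinusoid plus noise rather than data from the study. All function names are hypothetical.

```python
import cmath
import math
import random

FS = 10_000      # sampling rate given in the text (samples per second)
WIN_LEN = 256    # 25.6 msec at 10k samples per second
BETA = 7.0       # Kaiser window parameter given in the text
ORDER = 11       # number of poles given in the text


def bessel_i0(x):
    """Zeroth-order modified Bessel function, summed from its power series."""
    term, total, k = 1.0, 1.0, 1
    while term > 1e-12 * total:
        term *= (x / (2.0 * k)) ** 2
        total += term
        k += 1
    return total


def kaiser(n, beta):
    """Kaiser window of n points with shape parameter beta."""
    denom = bessel_i0(beta)
    window = []
    for i in range(n):
        r = 2.0 * i / (n - 1) - 1.0  # runs from -1 to +1 across the window
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / denom)
    return window


def first_difference(x):
    """y[n] = x[n] - x[n-1], which attenuates very low frequencies."""
    return [0.0] + [x[i] - x[i - 1] for i in range(1, len(x))]


def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Assumed solver: returns (a, err) with a[0] == 1 and residual energy err."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


def lpc_spectrum_db(a, err, n_points=256, fs=FS):
    """Sample the all-pole spectrum err / |A|^2 in dB from 0 Hz up to fs/2."""
    freqs, level_db = [], []
    for k in range(n_points):
        f = k * (fs / 2.0) / n_points
        w = 2.0 * math.pi * f / fs
        A = sum(a[j] * cmath.exp(-1j * w * j) for j in range(len(a)))
        freqs.append(f)
        level_db.append(10.0 * math.log10(err / abs(A) ** 2))
    return freqs, level_db


def analyze(segment):
    """Difference, window, and fit an ORDER-pole predictor to one segment."""
    differenced = first_difference(segment)
    window = kaiser(len(differenced), BETA)
    windowed = [s * w for s, w in zip(differenced, window)]
    a, err = levinson_durbin(autocorrelation(windowed, ORDER), ORDER)
    return lpc_spectrum_db(a, err)


# Synthetic stand-in for a waveform segment (not data from the study).
rng = random.Random(0)
segment = [math.sin(2.0 * math.pi * 1000.0 * n / FS) + rng.uniform(-0.05, 0.05)
           for n in range(WIN_LEN)]
freqs, level_db = analyze(segment)
peak_hz = freqs[level_db.index(max(level_db))]
print(f"strongest LPC peak near {peak_hz:.0f} Hz")
```

A longer window for sustained frication noise, as the text describes, would correspond here to passing a longer segment to the hypothetical `analyze` function.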

Spectral samples were obtained: 1) during the consonantal steady state (or at
burst onset for a plosive), 2) at voicing onset (or early in the consonant-vowel
transition for voiced consonants), and 3) shortly after the end of the
consonant-vowel transition. Formant frequencies were also estimated by locating
the peaks in a linear prediction spectrum. Formant motions were plotted every
10 msec during voiced portions of syllables. Intensity and fundamental frequency
were also estimated and plotted as a function of time.
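
Locating formants as peaks in a linear prediction spectrum can be illustrated with a short sketch. It assumes a predictor polynomial is already available. Here one is built from three hypothetical resonances (center frequency and bandwidth pairs chosen only to exercise the code, not measurements from the study), and formant candidates are taken as the interior local maxima of the all-pole spectrum. The function names and the 512-point frequency grid are likewise assumptions, and the original analysis may have picked peaks differently.

```python
import cmath
import math

FS = 10_000  # sampling rate given in the text (samples per second)


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def predictor_from_resonances(resonances, fs=FS):
    """Build A(z) coefficients from (center_hz, bandwidth_hz) pairs."""
    a = [1.0]
    for center_hz, bandwidth_hz in resonances:
        radius = math.exp(-math.pi * bandwidth_hz / fs)
        theta = 2.0 * math.pi * center_hz / fs
        a = poly_mul(a, [1.0, -2.0 * radius * math.cos(theta), radius * radius])
    return a


def spectrum_db(a, n_points=512, fs=FS):
    """Sample 1 / |A|^2 in dB from 0 Hz up to (but not including) fs/2."""
    freqs, level_db = [], []
    for k in range(n_points):
        f = k * (fs / 2.0) / n_points
        w = 2.0 * math.pi * f / fs
        A = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
        freqs.append(f)
        level_db.append(-20.0 * math.log10(abs(A)))
    return freqs, level_db


def pick_peaks(freqs, level_db):
    """Return the frequencies of interior local maxima of the spectrum."""
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if level_db[i - 1] < level_db[i] >= level_db[i + 1]]


# Hypothetical resonances, chosen only to exercise the code.
a = predictor_from_resonances([(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)])
formants = pick_peaks(*spectrum_db(a))
print([round(f) for f in formants])
```

Repeating such a peak search on successive analysis frames spaced 10 msec apart would give formant tracks of the kind the text describes. The grid spacing (about 10 Hz here) limits the precision of each estimate.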

In this chapter, it is only possible to present some of the highlights of the
analyses. For example, Figure 11-2 presents first and second formant frequency
trajectories of sixteen vowel nuclei, as averaged across all consonantal
environments for the designated talker. Most of the vowels appear to be
diphthongized to some extent. (The true diphthongs are shown with dashed vectors.)
In particular, it is a characteristic of this common midwestern dialect to
terminate the short vowels IH, EH, AE, and UH in a schwa-like offglide. These
average data for vowels are used as a starting point for consonant-vowel synthesis.

Analysis of consonants led to two major conclusions concerning the form
of rules appropriate for synthesis of a consonant before any vowel:

1. Some consonants, particularly obstruents, take on significantly different
characteristics depending on whether the following vowel is a
front vowel, a back unrounded vowel, or a back rounded vowel.