Survey of speech synthesis technology

serted between words, and a reasonable sentence intonation contour was realized
by restricting a given prerecorded element to only certain utterance positions. A
great deal of care was taken in speaking, recording, and editing the basic
vocabulary items.

Word storage has involved various analog and digital techniques that range
from recording each word into a half-second slot on a rotating drum, to
sophisticated digital techniques for reducing the number of bits that must be stored.
Digital methods for representing speech waveforms are reviewed by Rabiner and
Schafer (1976) and by Jayant (1974). One remarkable technique developed at
Texas Instruments (Wiggins, 1979) involves storing a 1000 bit-per-second
linear-prediction representation for each word on integrated circuit chips having a
capacity of 200 seconds of speech, and using an IC linear-prediction synthesizer to
play selected words (all of this circuitry being offered at $50 in the Speak-’N-Spell
children’s toy).
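
As a rough check on the figures quoted above, the storage implied by a 1000 bit-per-second code and a 200-second capacity can be worked out directly. The sketch below is only back-of-the-envelope arithmetic. The half-second word length (borrowed from the drum slot mentioned earlier) and the 64,000 bit-per-second telephone-quality PCM rate used for comparison are assumptions made here, not figures given for the Texas Instruments chips.

```python
# Back-of-the-envelope storage arithmetic for the figures quoted in the text.
LPC_BITS_PER_SECOND = 1000   # linear-prediction bit rate quoted in the text
CAPACITY_SECONDS = 200       # speech capacity quoted in the text

# Assumptions for illustration only (not from the text):
WORD_SECONDS = 0.5           # typical word length, borrowed from the drum slot size
PCM_BITS_PER_SECOND = 64000  # 8 kHz x 8 bits, telephone-quality PCM


def storage_bits(bits_per_second, seconds):
    """Return the number of bits needed to store `seconds` of coded speech."""
    return bits_per_second * seconds


total_bits = storage_bits(LPC_BITS_PER_SECOND, CAPACITY_SECONDS)
total_bytes = total_bits // 8
words_stored = int(CAPACITY_SECONDS / WORD_SECONDS)
pcm_ratio = PCM_BITS_PER_SECOND // LPC_BITS_PER_SECOND

print("total bits:", total_bits)
print("total bytes:", total_bytes)
print("half-second words that fit:", words_stored)
print("reduction relative to assumed PCM rate:", pcm_ratio)
```

Under these assumptions the whole vocabulary occupies about 25 kilobytes, which suggests why a low bit-rate representation made word storage on integrated circuits practical.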

7.3.1.2 Formant vocoding of words

Rabiner et al. (1971a) suggested that one
could get rid of the choppiness of waveform concatenation by extracting formant
trajectories for each prerecorded word and smoothing formant parameter tracks
across word boundaries before formant vocoder resynthesis. A second advantage
of formant analysis-synthesis of the words that make up a synthetic utterance is
that the duration pattern and fundamental frequency contour can be adjusted to
match the accent pattern, rhythm, and intonation requirements of the sentence to be
produced. The technique has been used successfully in telephone number
synthesis where a known prosodic contour could be superimposed (for example, a
pause and a “continuation rise” intonation can be placed just before the fourth digit
of a seven digit telephone number). However, the authors did not offer general
prosodic rules for sentence synthesis.
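
The idea of smoothing formant parameter tracks across a word boundary can be illustrated with a minimal sketch. It is not the procedure of Rabiner et al. (1971a), whose details are not given here. It simply replaces a few frames on either side of the boundary with a straight line between the nearest untouched frames. The function name `smooth_boundary`, the `span` parameter, and the sample formant values are all hypothetical.

```python
def smooth_boundary(left, right, span=3):
    """Join two formant tracks (one value in Hz per frame).

    The last `span` frames of `left` and the first `span` frames of `right`
    are replaced by a linear ramp between the untouched frames on either
    side, which removes the step that plain concatenation would leave.
    Linear interpolation is an illustrative choice, not the published method.
    """
    track = list(left) + list(right)
    boundary = len(left)
    start = max(boundary - span - 1, 0)          # last untouched frame on the left
    end = min(boundary + span, len(track) - 1)   # first untouched frame on the right
    f_start, f_end = track[start], track[end]
    n = end - start
    for i in range(start + 1, end):
        track[i] = f_start + (f_end - f_start) * (i - start) / n
    return track


# Hypothetical first-formant tracks for two adjacent words.
word_a = [500, 500, 500, 500, 500]
word_b = [700, 700, 700, 700, 700]
print(smooth_boundary(word_a, word_b, span=2))
```

In a full system the same operation would be applied to each formant parameter before the tracks are passed to the formant vocoder for resynthesis.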

7.3.1.3 Linear-prediction coded words

Olive (1974) later showed that a similar
system could be based on linear prediction encoding. Furthermore, it was
determined that a correct fundamental frequency contour for a sentence was
perceptually more important than the exact duplication of the durational pattern or careful
smoothing of the formant transitions between words.
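
Because the fundamental frequency contour is reported as the perceptually dominant factor, it may help to see how a known contour could be superimposed on a sequence of stored words. The sketch below uses the seven-digit telephone number pattern described earlier: a gradual fall across the utterance, with a continuation rise and a pause just before the fourth digit. Every numeric value (starting and ending frequency, size of the rise, pause length) is an assumption chosen for illustration, and `telephone_prosody` is a hypothetical name, not a routine from either cited system.

```python
def telephone_prosody(digits, start_hz=130.0, end_hz=100.0,
                      rise_hz=20.0, pause_ms=200):
    """Return a per-digit prosodic plan for a seven-digit telephone number.

    Each digit gets a starting F0 on a linearly falling baseline. The third
    digit ends with a continuation rise and is followed by a pause, so that
    both fall just before the fourth digit. All default values are
    illustrative assumptions.
    """
    if len(digits) != 7:
        raise ValueError("expected exactly seven digits")
    plan = []
    for i, digit in enumerate(digits):
        base = start_hz + (end_hz - start_hz) * i / 6
        ends_group = (i == 2)  # third digit closes the first group
        plan.append({
            "digit": digit,
            "f0_start": base,
            "f0_end": base + rise_hz if ends_group else base,
            "pause_after_ms": pause_ms if ends_group else 0,
        })
    return plan


for entry in telephone_prosody("5551234"):
    print(entry)
```

Each entry could then drive the resynthesis of the corresponding stored word, with the word's own recorded pitch replaced by the planned values.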

The advantage of the prerecorded word as a unit is ease of bringing up a
limited audio response unit. The disadvantages are that: 1) large vocabularies are
impractical, and 2) general timing and fundamental frequency rules that adjust the
prosodic characteristics of a word as a function of sentence structure are more
easily defined at a segmental level. For example, only the final vowel and
postvocalic consonants of a word are lengthened at phrase and clause boundaries
(Klatt, 1976b).
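
The lengthening example shows why such rules are easier to state over segments than over whole words. A minimal sketch follows, assuming a word is available as a list of (phoneme, is_vowel, duration in ms) tuples. The representation, the function name `lengthen_phrase_final`, the sample durations, and the lengthening factor are hypothetical and are not taken from Klatt (1976b).

```python
def lengthen_phrase_final(segments, factor=1.4):
    """Lengthen only the final vowel and the consonants that follow it.

    `segments` is a list of (phoneme, is_vowel, duration_ms) tuples for one
    word at a phrase or clause boundary. Segments before the final vowel
    keep their durations. The factor is an illustrative assumption.
    """
    last_vowel = max(
        (i for i, (_, is_vowel, _) in enumerate(segments) if is_vowel),
        default=None,
    )
    if last_vowel is None:
        return list(segments)  # no vowel found: leave the word unchanged
    return [
        (phoneme, is_vowel, duration * factor if i >= last_vowel else duration)
        for i, (phoneme, is_vowel, duration) in enumerate(segments)
    ]


# Hypothetical segment durations for the word "basket".
basket = [("b", False, 60), ("ae", True, 120), ("s", False, 90),
          ("k", False, 60), ("ih", True, 80), ("t", False, 70)]
for segment in lengthen_phrase_final(basket, factor=1.5):
    print(segment)
```

A stored whole-word recording offers no such handle: stretching the entire word would also lengthen the initial consonant and first vowel, which the rule leaves untouched.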
