Survey of speech synthesis technology
serted between words, and a reasonable sentence intonation contour was realized by restricting a given prerecorded element to only certain utterance positions. A great deal of care was taken in speaking, recording, and editing the basic vocabulary items.

Word storage has involved various analog and digital techniques that range from recording each word into a half-second slot on a rotating drum, to sophisticated digital techniques for reducing the number of bits that must be stored. Digital methods for representing speech waveforms are reviewed by Rabiner and Schafer (1976) and by Jayant (1974). One remarkable technique developed at Texas Instruments (Wiggins, 1979) involves storing a 1000 bit-per-second linear-prediction representation for each word on integrated circuit chips having a capacity of 200 seconds of speech, and using an IC linear-prediction synthesizer to play selected words (all of this circuitry being offered at $50 in the Speak-’N-Spell children’s toy).

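The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the 1000 bit-per-second rate, the 200-second capacity, and the half-second word slot mentioned in the text; the 64,000 bit-per-second PCM rate used for comparison is an assumption introduced here for illustration and does not come from the source.

```python
# Storage arithmetic for the linear-prediction word store described above.
BIT_RATE_BPS = 1000       # bits per second of speech (from the text)
CAPACITY_SECONDS = 200    # seconds of speech held on the chips (from the text)

total_bits = BIT_RATE_BPS * CAPACITY_SECONDS
total_bytes = total_bits // 8

# Assumption for comparison only: uncompressed PCM at 8000 samples/s and
# 8 bits/sample. This reference rate is not given in the text.
PCM_BPS = 8000 * 8
compression_ratio = PCM_BPS / BIT_RATE_BPS


def words_that_fit(mean_word_seconds: float) -> int:
    """Number of words of a given mean duration that fit in the store."""
    return int(CAPACITY_SECONDS / mean_word_seconds)


if __name__ == "__main__":
    print(total_bits, "bits =", total_bytes, "bytes")
    print("ratio versus assumed PCM rate:", compression_ratio)
    print("half-second words that fit:", words_that_fit(0.5))
```

Under these assumptions the whole vocabulary occupies 200,000 bits, which is why a single-chip store was practical at the time.
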
7.3.1.2 Formant vocoding of words
Rabiner et al. (1971a) suggested that one could get rid of the choppiness of waveform concatenation by extracting formant trajectories for each prerecorded word and smoothing formant parameter tracks across word boundaries before formant vocoder resynthesis. A second advantage of formant analysis-synthesis of the words that make up a synthetic utterance is that the duration pattern and fundamental frequency contour can be adjusted to match the accent pattern, rhythm, and intonation requirements of the sentence to be produced. The technique has been used successfully in telephone number synthesis where a known prosodic contour could be superimposed (for example, a pause and a “continuation rise” intonation can be placed just before the fourth digit of a seven-digit telephone number). However, the authors did not offer general prosodic rules for sentence synthesis.

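The idea of smoothing a formant parameter track across a word boundary can be illustrated with a minimal sketch. It assumes each word's formant track is a list of frequencies in Hz, one per analysis frame, and it replaces a few frames on either side of the boundary with a straight-line transition. The function name `smooth_boundary`, the `span` parameter, and the choice of linear interpolation are all assumptions made for illustration; the source does not specify the smoothing method that was actually used.

```python
from typing import List


def smooth_boundary(left: List[float], right: List[float], span: int) -> List[float]:
    """Concatenate two formant tracks (Hz per frame) and replace the last
    `span` frames of `left` plus the first `span` frames of `right` with a
    straight-line transition, so the track has no jump at the word boundary.
    """
    if span < 1 or span > len(left) or span > len(right):
        raise ValueError("span must be between 1 and the shorter track length")
    start = left[-span]        # first frame of the transition region
    end = right[span - 1]      # last frame of the transition region
    n = 2 * span               # total frames in the transition region
    ramp = [start + (end - start) * i / (n - 1) for i in range(n)]
    return left[:-span] + ramp + right[span:]


if __name__ == "__main__":
    # Hypothetical first-formant tracks for two adjacent words.
    word_a = [500.0, 500.0, 500.0, 500.0]
    word_b = [700.0, 700.0, 700.0, 700.0]
    print(smooth_boundary(word_a, word_b, span=2))
```

Plain concatenation of the two hypothetical tracks would jump 200 Hz in a single frame; the smoothed track spreads that change over the four frames around the boundary.
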
7.3.1.3 Linear-prediction coded words
Olive (1974) later showed that a similar system could be based on linear prediction encoding. Furthermore, it was determined that a correct fundamental frequency contour for a sentence was perceptually more important than the exact duplication of the durational pattern or careful smoothing of the formant transitions between words.

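Because the fundamental frequency contour matters so much, it may help to see how a known contour could be attached to a fixed-format utterance such as the seven-digit telephone number mentioned earlier. The sketch below builds a per-digit prosodic plan with a continuation rise on the third digit and a pause after it, that is, just before the fourth digit. The function name, the contour labels, the 200 ms pause, and the final fall on the last digit are illustrative assumptions and are not taken from the source.

```python
from typing import Dict, List


def telephone_prosody(digits: str) -> List[Dict[str, object]]:
    """Return one prosodic plan entry per digit of a seven-digit number.

    All contour labels and the pause length are illustrative assumptions.
    """
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError("expected exactly seven digits")
    plan = []
    for i, d in enumerate(digits):
        if i == 2:
            # Third digit: continuation rise, followed by a pause that
            # falls just before the fourth digit.
            contour, pause_ms = "continuation_rise", 200
        elif i == 6:
            # Assumed utterance-final fall on the last digit.
            contour, pause_ms = "final_fall", 0
        else:
            contour, pause_ms = "level", 0
        plan.append({"digit": d, "contour": contour, "pause_after_ms": pause_ms})
    return plan


if __name__ == "__main__":
    for entry in telephone_prosody("5551234"):
        print(entry)
```

A plan of this kind works only because the utterance format is known in advance, which is the limitation the text notes for word-based systems.
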
The advantage of the prerecorded word as a unit is the ease of bringing up a limited audio response unit. The disadvantages are that: 1) large vocabularies are impractical, and 2) general timing and fundamental frequency rules that adjust the prosodic characteristics of a word as a function of sentence structure are more easily defined at a segmental level. For example, only the final vowel and postvocalic consonants of a word are lengthened at phrase and clause boundaries (Klatt, 1976b).
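
The lengthening example shows why a segmental representation is convenient: the rule touches only part of the word. A minimal sketch follows, assuming a word is stored as a list of (phone label, is-vowel flag, duration in ms) tuples. The `Segment` layout, the function name, the phone labels, and the default factor of 1.3 are hypothetical choices for illustration, not values from the source.

```python
from typing import List, Tuple

Segment = Tuple[str, bool, float]  # (phone label, is_vowel, duration in ms)


def lengthen_phrase_final(word: List[Segment], factor: float = 1.3) -> List[Segment]:
    """Lengthen only the final vowel of a word and the consonants after it.

    Earlier segments keep their original durations. The default factor is
    an illustrative assumption.
    """
    last_vowel = max((i for i, seg in enumerate(word) if seg[1]), default=None)
    if last_vowel is None:
        return list(word)  # no vowel found: leave the word unchanged
    return [
        (phone, is_vowel, dur * factor if i >= last_vowel else dur)
        for i, (phone, is_vowel, dur) in enumerate(word)
    ]


if __name__ == "__main__":
    # Hypothetical segment durations for a two-syllable word.
    basket = [
        ("b", False, 60.0), ("ae", True, 120.0), ("s", False, 90.0),
        ("k", False, 50.0), ("ih", True, 80.0), ("t", False, 70.0),
    ]
    for seg in lengthen_phrase_final(basket, factor=1.5):
        print(seg)
```

With a stored whole-word recording there are no segment boundaries to apply such a rule to, so a comparable adjustment would have to stretch the entire word uniformly.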