From text to speech: The MITalk system

waveform, but the storage requirements per message are cut down. More importantly,
the parametric representation abstracts the speech waveform to a level at which
the attributes that contribute to speech quality (e.g. formant frequencies and
bandwidths, pitch, excitation amplitudes) can be insightfully manipulated. This
allows elementary messages to be concatenated
in a way that provides for smooth transitions at the boundaries. It also allows for
changes (e.g. in pitch) well within the individual message units, so that substantial
changes in prosodic parameters (pitch and timing) can be made. The most popular
parametric representations in use today are based on formants or linear predictive
coding (LPC), although vocal tract articulatory models are also used. Message
units of widely varying sizes are employed, ranging from paragraphs through
sentences, phrases, words, syllables, and demisyllables down to diphones. As the size of the
message unit goes down, fewer basic messages are needed for a large message set,
but more computation is required, and the difficulties of correctly representing the
coarticulation across message boundaries go up. Clearly, these schemes aim to
preserve as much of the quality of natural speech as possible, but to permit the
flexible construction of a large set of messages using elements which require little
storage. With the current level of knowledge of digital signal processing
techniques, and the accompanying technology, these schemes have become very
important for practical applications. It is well to remember, however, that parametric
representation systems seek to match the task with the available processing and
memory technology by using a knowledge of models for the human production of
speech, but little (if any) use is made of the linguistic structure of the language.
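
The concatenation of parametric message units described above can be illustrated with a minimal sketch. It is not the MITalk implementation: the Frame layout (two formants and a pitch value), the function names join_units and impose_pitch, and the use of a straight-line ramp at the boundary are all assumptions chosen for illustration. The sketch shows the two manipulations the text mentions, smoothing the parameters where two units meet, and replacing the pitch within the units without touching the formants.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Frame:
    """One parameter frame. The field layout is hypothetical."""
    f1: float  # first formant frequency, Hz
    f2: float  # second formant frequency, Hz
    f0: float  # pitch (fundamental frequency), Hz


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation from a (t=0) to b (t=1)."""
    return a + (b - a) * t


def join_units(left: List[Frame], right: List[Frame], n_smooth: int) -> List[Frame]:
    """Concatenate two units and ramp the parameters across the boundary.

    The last n_smooth frames of `left` and the first n_smooth frames of
    `right` are replaced by a straight line between the frames just
    outside that window (an assumed smoothing rule).
    """
    if n_smooth < 0 or len(left) <= n_smooth or len(right) <= n_smooth:
        raise ValueError("each unit must be longer than the smoothing window")
    joined = list(left) + list(right)
    start_idx = len(left) - n_smooth - 1
    end_idx = len(left) + n_smooth
    start, end = joined[start_idx], joined[end_idx]
    span = end_idx - start_idx
    for k in range(1, span):
        t = k / span
        joined[start_idx + k] = Frame(
            f1=lerp(start.f1, end.f1, t),
            f2=lerp(start.f2, end.f2, t),
            f0=lerp(start.f0, end.f0, t),
        )
    return joined


def impose_pitch(frames: List[Frame], f0_start: float, f0_end: float) -> List[Frame]:
    """Overwrite the pitch with a linear contour, leaving formants untouched."""
    last = max(len(frames) - 1, 1)
    return [
        replace(frame, f0=lerp(f0_start, f0_end, i / last))
        for i, frame in enumerate(frames)
    ]


# Two made-up units of three frames each, with mismatched formants and pitch.
unit_a = [Frame(500.0, 1500.0, 110.0)] * 3
unit_b = [Frame(700.0, 1100.0, 130.0)] * 3

utterance = impose_pitch(join_units(unit_a, unit_b, n_smooth=1), 120.0, 100.0)
for frame in utterance:
    print(frame)
```

Because the units are stored as parameters rather than waveform samples, both operations reduce to simple arithmetic on a few numbers per frame. The same adjustments would be far harder to make on a stored waveform.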

1.2.3 Synthesis-by-rule

When message units are concatenated using parametric representations, there is a
tradeoff between speech quality and the need to vary the parameters to adapt the
message to varying environments. Researchers have found that many allophonic
variations of a message unit (e.g. diphone) may be needed to achieve good quality
speech, and that while the vocabulary of needed units is thus expanding, little basic
understanding of the role of structural language constraints in determining aspects
of the speech waveform is obtained. For this reason, the synthesis process has
been abstracted even further beyond the level of parametric representation to a set
of rules which seek to compute the needed parameters for the speech production
model from an input phonetic description. This input representation contains, in
itself, very little information. Usually the names of the phonetic segments, along
with stress marks and pitch and timing, are provided. The latter prosodic correlates
are often computed from segmental and syntactic structure and stress marks, plus
