From text to speech: The MITalk system

waveform, but the storage requirements per message are cut down. More importantly, the parametric representation represents an abstraction on the speech waveform to a level of representation where the attributes that contribute to speech quality (e.g. formant frequencies and bandwidths, pitch, excitation amplitudes) can be insightfully manipulated. This allows elementary messages to be concatenated in a way that provides for smooth transitions at the boundaries. It also allows for changes (e.g. in pitch) well within the individual message units, so that substantial changes in prosodic parameters (pitch and timing) can be made. The most popular parametric representations in use today are based on formants or linear predictive coding (LPC), although vocal tract articulatory models are also used. Message units of widely varying sizes are employed, ranging from paragraphs, through sentences, phrases, words, syllables, demisyllables, and diphones. As the size of the message unit goes down, fewer basic messages are needed for a large message set, but more computation is required, and the difficulties of correctly representing the coarticulation across message boundaries go up. Clearly, these schemes aim to preserve as much of the quality of natural speech as possible, but to permit the flexible construction of a large set of messages using elements which require little storage. With the current level of knowledge of digital signal processing techniques, and the accompanying technology, these schemes have become very important for practical applications. It is well to remember, however, that parametric representation systems seek to match the task with the available processing and memory technology by using a knowledge of models for the human production of speech, but little (if any) use is made of the linguistic structure of the language.
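As a rough illustration of why a parametric representation makes concatenation and prosodic changes tractable, the following Python sketch joins two message units stored as frames of parameters and smooths the tracks across the joint. It is only a minimal sketch under stated assumptions, not the method of any particular system: the frame layout (a dictionary with keys such as "f1" and "f0"), the function names `join_units` and `scale_pitch`, and the linear ramp used for smoothing are all hypothetical choices made for this example.

```python
from typing import Dict, List

# Hypothetical frame layout: one dictionary of parameter values per time step,
# e.g. {"f1": 500.0, "f0": 100.0} for first formant and pitch in Hz.
Frame = Dict[str, float]


def scale_pitch(unit: List[Frame], factor: float) -> List[Frame]:
    """Return a copy of the unit with its pitch track ("f0") multiplied by factor."""
    return [{**frame, "f0": frame["f0"] * factor} for frame in unit]


def join_units(unit_a: List[Frame], unit_b: List[Frame], span: int) -> List[Frame]:
    """Concatenate two units, ramping up to `span` frames on each side of the
    joint so that every parameter track meets at the midpoint of the gap.

    Assumes all frames carry the same parameter keys.
    """
    n = min(span, len(unit_a), len(unit_b))
    a = [dict(frame) for frame in unit_a]  # copy so the inputs stay untouched
    b = [dict(frame) for frame in unit_b]
    if n == 0:
        return a + b
    for key in a[-1]:
        # Half of the discontinuity between the end of A and the start of B.
        half_gap = (b[0][key] - a[-1][key]) / 2.0
        for i in range(n):
            # Ramp A toward the midpoint; the last frame of A gets the full shift.
            a[len(a) - n + i][key] += half_gap * (i + 1) / n
            # Ramp B away from the midpoint; the first frame of B gets the full shift.
            b[i][key] -= half_gap * (n - i) / n
    return a + b


if __name__ == "__main__":
    first = [{"f1": 500.0, "f0": 100.0} for _ in range(3)]
    second = [{"f1": 700.0, "f0": 100.0} for _ in range(3)]
    joined = join_units(first, second, span=2)
    print([frame["f1"] for frame in joined])
    print([frame["f0"] for frame in scale_pitch(first, 1.5)])
```

In this toy case the 200 Hz jump in the "f1" track at the joint is spread over four frames. Neither operation would be straightforward on raw waveform samples, which is the point the paragraph above makes about working at the parameter level.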

1.2.3 Synthesis-by-rule

When message units are concatenated using parametric representations, there is a tradeoff between speech quality and the need to vary the parameters to adapt the message to varying environments. Researchers have found that many allophonic variations of a message unit (e.g. diphone) may be needed to achieve good quality speech, and that while the vocabulary of needed units is thus expanding, little basic understanding of the role of structural language constraints in determining aspects of the speech waveform is obtained. For this reason, the synthesis process has been abstracted even further beyond the level of parametric representation to a set of rules which seek to compute the needed parameters for the speech production model from an input phonetic description. This input representation contains, in itself, very little information. Usually the names of the phonetic segments, along with stress marks and pitch and timing, are provided. The latter prosodic correlates are often computed from segmental and syntactic structure and stress marks, plus