Introduction

semantic information if available. In this way, synthesis-by-rule techniques can
utilize a very low bit-rate message description (<100 bits/sec) as input, but sub-
stantial computation must be used to compute the model parameters and then
produce the speech waveform. Clearly there is complete freedom to specify the
model parameters, but of course also the need to control these parameters cor-
rectly. Since the rules are still imperfect, the resulting speech quality is not as
good as recorded human speech, but recent tests have shown that high intelligibility
and comprehensibility can be obtained, and when sentence- and
paragraph-level messages must be synthesized, the rule system provides the necessary
degrees of freedom to produce smooth-flowing, good-quality speech. It is interesting
to consider that synthesis-by-rule systems delay the binding of the speech
parameter set and waveform to the input message by using very deep language
abstractions, and hence provide a maximum of flexibility, and are thus well suited
to the needs of converting unrestricted text to speech. The designers of these sys-
tems must, however, discover the relationship between the underlying linguistic
specification of the message and the resulting speech signal, a topic which has
been central to speech science and linguistics for several decades. Thus synthesis-
by-rule both benefits from and contributes to our general knowledge of speech and
linguistics, and the steady improvement in speech synthesis-by-rule quality reflects
this joint progress. While it is believed that current synthetic speech quality is ac-
ceptable for many applications, it can certainly be expected to continue to improve
with our increasing knowledge.
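
The low bit-rate figure quoted above (<100 bits/sec) can be made concrete with a little arithmetic. The sketch below is only illustrative: the inventory size, symbol rate, and waveform sampling figures are assumptions chosen for the example and are not taken from the text, and the function names are hypothetical.

```python
import math


def message_bit_rate(inventory_size: int, symbols_per_second: float) -> float:
    """Bits per second for a stream of symbols drawn from a fixed inventory,
    using a fixed-length code for each symbol."""
    bits_per_symbol = math.ceil(math.log2(inventory_size))
    return bits_per_symbol * symbols_per_second


def waveform_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bits per second for an uncompressed sampled waveform."""
    return sample_rate_hz * bits_per_sample


# Assumed figures, for illustration only (not from the source text).
PHONETIC_INVENTORY = 64     # phonetic symbols plus stress and boundary marks
SYMBOLS_PER_SECOND = 12     # rough symbol rate of the phonetic transcription
SAMPLE_RATE_HZ = 8000       # telephone-quality sampling rate
BITS_PER_SAMPLE = 8         # quantization of each waveform sample

rule_input_rate = message_bit_rate(PHONETIC_INVENTORY, SYMBOLS_PER_SECOND)
stored_waveform_rate = waveform_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)

print(f"phonetic message description: {rule_input_rate:.0f} bits/sec")
print(f"stored waveform:              {stored_waveform_rate} bits/sec")
```

Under these assumptions the phonetic message description needs 6 bits per symbol, or 72 bits/sec, which is below the 100 bits/sec bound, while the stored waveform needs 64000 bits/sec. The difference in storage is paid for by the computation needed to derive the model parameters and the waveform from the symbols.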

1.2.4 Text-to-speech conversion

The synthesis-by-rule techniques described above require a detailed phonetic
transcription as input. While this input requires very little memory for message
storage, a frequent requirement is to convert text to speech. When it is desired to
convert unrestricted English text to speech, the flexibility of synthesis-by-rule is
needed, so that means must be afforded to convert the input text to the phonetic
transcription needed by the synthesis-by-rule techniques. It is clear, then, that first
the text must be analyzed to obtain the phonetic transcription, which is then subjected
to a synthesis procedure to yield the output speech waveform. The analysis
of the text is heavily linguistic in nature, involving a determination of the under-
lying phonemic, syllabic, morphemic and syntactic form of the message, plus
whatever semantic and pragmatic information can be gleaned. Text-to-speech con-
version can thus be seen as a collection of techniques requiring the successful in-
tegration of the task constraints with other constraints provided by the nature of the
human vocal apparatus, the linguistic structure of the language, and the implemen-
