
From text to speech: The MITalk system

tation technology. It is thus the most complex form of speech synthesis system,
but also the most fundamental in design and useful in application, since it seeks to
mirror the human cognitive capability for reading aloud. Other cognitive models
attempt to synthesize speech directly from “concept” for those applications where
the underlying linguistic structure is already available (Young and Fallside, 1979).
These schemes have the advantage of (presumably) more detailed syntactic and
semantic structures than can be obtained from text, and are hence of great interest
for high-quality synthesis, but the pervading presence of text in our culture makes
the text-to-speech capability of great practical importance. It is worth emphasizing
that both text and speech are surface manifestations of underlying linguistic form,
and hence that text-to-speech conversion consists first of discovering that under-
lying form, and then utilizing it to form the output speech.

In the chapters that follow, we will discuss the MITalk text-to-speech system
in detail. The aim of this system is to provide high-quality speech from un-
restricted English text using the fundamental results of speech science, computing,
and linguistics. We aim to do it “right”, in the belief that adherence to basic prin-
ciples will provide more insightful methods, avoid ad hoc “fixes”, and produce the
best possible quality of speech. We will also discuss the range of possible
applications, and the implementation base for both a research system and a compact,
low-cost module utilizing state-of-the-art integrated circuit technology. First, however,
a brief outline of the parts of the system will be presented.

1.3 Functional outline of MITalk

At the highest level, the system consists of an analysis phase, followed by a
synthesis phase. Each of these phases is in turn broken down into a cascaded set of
modules. Each module is described functionally as a set of algorithms
operating on well-defined input and output data structures, and each
module is given its own chapter later in the book. In this introduction,
we briefly summarize the functional content of the modules.
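
The cascade organization described above can be pictured with a minimal Python sketch. This is only an illustration of the idea that each module's output data structure is the next module's input. The stage names (standardize, transcribe, render) and their trivial bodies are hypothetical placeholders, not the actual MITalk modules.

```python
def cascade(modules):
    """Compose modules so that each one's output becomes the next one's input."""
    def run(data):
        for module in modules:
            data = module(data)
        return data
    return run


# Hypothetical placeholder stages; real modules would be far more elaborate.
def standardize(text):
    """Text in, list of word strings out."""
    return text.split()


def transcribe(words):
    """List of words in, list of stand-in 'phonetic' strings out."""
    return ["/" + word.lower() + "/" for word in words]


def render(phonetic):
    """Stand-in for synthesis: joins the transcription into one string."""
    return " ".join(phonetic)


# An analysis phase followed by a synthesis phase, each itself a cascade.
analysis = cascade([standardize, transcribe])
synthesis = cascade([render])
system = cascade([analysis, synthesis])

print(system("Hello world"))
```

Because a cascade is itself a module in this sketch, the two phases compose in the same way as the modules inside them.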

1.3.1 Analysis of text

1.3.1.1 Symbols to standard form

A preprocessor is used to convert symbol
strings such as “$3.17”, “Mr.”, “M.I.T.”, and “1979” to text suitable for linguistic
analysis by the remainder of the system.
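
As a rough illustration of this kind of preprocessing, the following Python sketch expands the example strings into ordinary words. It is not the MITalk preprocessor: the abbreviation table, the function names, and the rules for amounts and years are all assumptions made for illustration. It handles only dollar amounts under $100 and four-digit years beginning with 1.

```python
import re

# Hypothetical abbreviation table; the entries are illustrative only.
ABBREVIATIONS = {"Mr.": "mister", "M.I.T.": "M I T"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

DOLLARS = re.compile(r"\$(\d{1,2})\.(\d{2})")
YEAR = re.compile(r"1\d{3}")


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand_dollars(token):
    """Expand a string such as '$3.17' into dollars and cents."""
    match = DOLLARS.fullmatch(token)
    dollars, cents = int(match.group(1)), int(match.group(2))
    dollar_word = "dollar" if dollars == 1 else "dollars"
    cent_word = "cent" if cents == 1 else "cents"
    return (f"{two_digits(dollars)} {dollar_word} and "
            f"{two_digits(cents)} {cent_word}")


def expand_year(token):
    """Read a four-digit year as two pairs of digits."""
    high, low = int(token[:2]), int(token[2:])
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + two_digits(low)
    return two_digits(high) + " " + two_digits(low)


def normalize(text):
    """Replace recognized symbol strings; leave every other token unchanged."""
    out = []
    for token in text.split():
        core = token.rstrip(",;:!?")   # keep trailing punctuation aside
        tail = token[len(core):]
        if core in ABBREVIATIONS:
            core = ABBREVIATIONS[core]
        elif DOLLARS.fullmatch(core):
            core = expand_dollars(core)
        elif YEAR.fullmatch(core):
            core = expand_year(core)
        out.append(core + tail)
    return " ".join(out)


print(normalize("Mr. Smith paid $3.17 in 1979"))
```

A table lookup is tried before the pattern rules here so that an abbreviation containing periods is never mistaken for a sentence boundary or a number. That ordering is a design choice of this sketch, not something stated in the text.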

1.3.1.2 Phonetic transcription

For each word, a phonetic transcription is computed.
A dictionary of 12,000 morphs (prefixes, roots, and suffixes) is used, which
contains the spelling, pronunciation, and part-of-speech information for each
morph. Most words are analyzed into a string of morphs. In this way, more than
