From text to speech: The MITalk system

tation technology. It is thus the most complex form of speech synthesis system,
but also the most fundamental in design and useful in application, since it seeks to
mirror the human cognitive capability for reading aloud. Other cognitive models
attempt to synthesize speech directly from “concept” for those applications where
the underlying linguistic structure is already available (Young and Fallside, 1979).
These schemes have the advantage of (presumably) more detailed syntactic and
semantic structures than can be obtained from text, and are hence of great interest
for high-quality synthesis, but the pervading presence of text in our culture makes
the text-to-speech capability of great practical importance. It is worth emphasizing
that both text and speech are surface manifestations of underlying linguistic form,
and hence that text-to-speech conversion consists first of discovering that
underlying form, and then utilizing it to form the output speech.

In the chapters that follow, we will discuss the MITalk text-to-speech system
in detail. The aim of this system is to provide high-quality speech from
unrestricted English text using the fundamental results of speech science, computing,
and linguistics. We aim to do it “right”, in the belief that adherence to basic
principles will provide more insightful methods, avoid ad hoc “fixes”, and produce the
best possible quality of speech. We will also discuss the range of possible
applications, and the implementation base for both a research system and a compact,
low-cost module utilizing state-of-the-art integrated circuit technology. First, however,
a brief outline of the parts of the system will be presented.

1.3 Functional outline of MITalk

At the highest level, the system consists of an analysis phase, followed by a
synthesis phase. Each of these processes is in turn broken down into a cascaded set of
modules. In turn, each module has been described functionally as a set of
algorithms operating on well-defined input and output data structures, and each
module is afforded a chapter in the sequel for its exposition. In this introduction,
we summarize briefly the functional content of the modules.
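
The cascaded organization described above can be illustrated with a minimal Python sketch. It is not the MITalk implementation: the helper `cascade` and the three stand-in modules (`split_words`, `lowercase`, `mark_words`) are hypothetical names invented here, and the "synthesis" stage merely brackets each word rather than producing speech. The only point shown is that each module consumes the well-defined output of the one before it.

```python
from functools import reduce
from typing import Callable, List, Sequence

# A module is any function from one data structure to the next.
Module = Callable[[object], object]


def cascade(modules: Sequence[Module]) -> Module:
    """Chain modules so that each one receives the previous one's output."""
    def run(data: object) -> object:
        return reduce(lambda value, module: module(value), modules, data)
    return run


# Hypothetical stand-in modules (assumptions, not MITalk's modules).
def split_words(text: str) -> List[str]:
    return text.split()


def lowercase(words: List[str]) -> List[str]:
    return [word.lower() for word in words]


def mark_words(words: List[str]) -> str:
    # Placeholder for synthesis: bracket each word instead of speaking it.
    return " ".join(f"[{word}]" for word in words)


analysis = cascade([split_words, lowercase])   # analysis phase
synthesis = cascade([mark_words])              # synthesis phase
system = cascade([analysis, synthesis])        # analysis, then synthesis

print(system("Hello World"))
```

Because a cascade is itself a module, the two phases compose in the same way as the modules inside them.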

1.3.1 Analysis of text

1.3.1.1 Symbols to standard form A preprocessor is used to convert symbol
strings such as “$3.17”, “Mr.”, “M.I.T.”, and “1979” to text suitable for linguistic
analysis by the remainder of the system.
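
As a rough illustration of this kind of preprocessing, the following Python sketch expands the four example strings into ordinary words. The rules are assumptions made for the example, not MITalk's actual rules: the abbreviation table, the reading of four-digit numbers from 1100 to 1999 as years, and the spelling of initialisms letter by letter are all hypothetical choices, and every function name is invented here.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical abbreviation table; a real system would need a larger one.
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor"}


def two_digits(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def number_words(n: int) -> str:
    """Spell out an integer in the range 0..9999 as a cardinal number."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def year_words(n: int) -> str:
    """Read a year as two pairs of digits (assumed convention)."""
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def money_words(dollars: int, cents: int) -> str:
    """Spell out a dollar amount, omitting zero cents."""
    text = f"{number_words(dollars)} {'dollar' if dollars == 1 else 'dollars'}"
    if cents:
        text += f" and {two_digits(cents)} {'cent' if cents == 1 else 'cents'}"
    return text


def normalize_token(token: str) -> str:
    """Convert one whitespace-delimited token to plain words."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    money = re.fullmatch(r"\$(\d{1,4})\.(\d{2})", token)
    if money:
        return money_words(int(money.group(1)), int(money.group(2)))
    if re.fullmatch(r"1[1-9]\d{2}", token):          # assumed to be a year
        return year_words(int(token))
    if re.fullmatch(r"\d{1,4}", token):
        return number_words(int(token))
    if re.fullmatch(r"(?:[A-Z]\.){2,}", token):      # initialism such as M.I.T.
        return " ".join(token.replace(".", ""))
    return token


def normalize(text: str) -> str:
    return " ".join(normalize_token(token) for token in text.split())


print(normalize("Mr. Smith paid $3.17 in 1979"))
```

The sketch splits on whitespace only, so a token with trailing punctuation (for example “1979,”) passes through unchanged; a fuller preprocessor would have to separate punctuation first and decide, from context, whether “1979” is a year or a quantity.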

1.3.1.2 Phonetic transcription For each word, a phonetic transcription is
computed. A dictionary of 12,000 morphs (prefixes, roots, and suffixes) is used, which
contains the spelling, pronunciation, and part-of-speech information for each
morph. Most words are analyzed into a string of morphs. In this way, more than