
7

Survey of speech synthesis technology

7.1 Overview
This brief review of speech synthesis technology is concerned primarily with
practical methods of generating spoken messages by computers or special-purpose
devices. Basic research directed at modeling articulatory-to-acoustic
transformations (Flanagan et al., 1975; Flanagan and Ishizaka, 1976) will not
be reviewed.

7.1.1 Applications
Applications for synthetic speech output fall into four broad categories:

1. Single word responses (e.g. Speak-'N-Spell)

2. A limited set of messages within a rigid syntactic framework (e.g.
   telephone number information)

3. Large, fixed vocabulary with general English syntax (e.g. teaching
   machine lessons)

4. Unrestricted text to speech (e.g. a reading machine for the blind)

The degree of generality and difficulty increases considerably from 1 to 4.
Prerecorded messages work well for single-word response applications, whereas
an increasing knowledge of the acoustic-phonetic characteristics of speech,
phonology, and syntax is required for satisfactory synthesis of general English.

7.1.2 Three methods of employing MITalk modules

The entire MITalk text-to-speech system can be used in applications falling in
category 4 above, or pieces of the MITalk synthesis routines might be used in
other applications. For example, if an abstract phonemic and syntactic
representation for an utterance can be stored in the computer or derived by
linguistic rules, only modules beginning with PHONO2 in Figure 7-1 are needed.
Speech represented in this way requires storage of only about 100 bits per
second.

Another way to use the synthesis routines to produce even more natural-sounding
speech (at a cost in bits and human intervention) is to begin by specifying the
input to the phonetic component PHONET in Figure 7-1. If durations and
fundamental frequency values are taken from a natural recording rather than
being computed by rule, a remarkably human voice quality is achieved. Storage
of about 250 bits per second of speech is required, and of course, considerable
effort is required to prepare the input representation.
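
To give a feel for the two storage figures quoted above (about 100 and about
250 bits per second of speech), the following minimal sketch converts them into
bytes for an utterance of a given length. The function and constant names are
illustrative assumptions and are not part of MITalk; the rates are the
approximate values stated in the text.

```python
# Approximate storage rates quoted in the text, in bits per second of speech.
# The constant names are hypothetical labels chosen for this sketch.
PHONEMIC_BPS = 100  # abstract phonemic and syntactic input (modules from PHONO2 on)
PHONETIC_BPS = 250  # input to PHONET with durations and F0 from a natural recording


def storage_bytes(seconds: float, bits_per_second: float) -> float:
    """Return the storage needed, in bytes, for `seconds` of speech."""
    return seconds * bits_per_second / 8


if __name__ == "__main__":
    duration = 60  # one minute of speech
    print("phonemic input:", storage_bytes(duration, PHONEMIC_BPS), "bytes")
    print("phonetic input:", storage_bytes(duration, PHONETIC_BPS), "bytes")
```

Under these assumptions, one minute of speech needs roughly 750 bytes in the
phonemic form and roughly 1875 bytes in the phonetic form, so the more natural
voice quality costs about 2.5 times the storage.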