7 Survey of speech synthesis technology

7.1 Overview

This brief review of speech synthesis technology is concerned primarily with
practical methods of generating spoken messages by computers or special-purpose
devices. Basic research directed at modeling articulatory-to-acoustic
transformations (Flanagan et al., 1975; Flanagan and Ishizaka, 1976) will not be
reviewed.

7.1.1 Applications

Applications for synthetic speech output fall into four broad categories:

1. Single word responses (e.g. Speak-'N-Spell)
2. A limited set of messages within a rigid syntactic framework (e.g.
   telephone number information)
3. Large, fixed vocabulary with general English syntax (e.g. teaching
   machine lessons)
4. Unrestricted text to speech (e.g. a reading machine for the blind)

The degree of generality and difficulty increases considerably from 1 to 4.
Prerecorded messages work well for single-word response applications, whereas
an increasing knowledge of the acoustic-phonetic characteristics of speech,
phonology, and syntax is required for satisfactory synthesis of general English.

7.1.2 Three methods of employing MITalk modules

The entire MITalk text-to-speech system can be used in applications falling in
category 4 above, or pieces of the MITalk synthesis routines might be used in
other applications. For example, if an abstract phonemic and syntactic
representation for an utterance can be stored in the computer or derived by
linguistic rules, only modules beginning with PHONO2 in Figure 7-1 are needed.
Speech represented in this way requires storage of only about 100 bits per
second.

Another way to use the synthesis routines to produce even more natural-sounding
speech (at a cost in bits and human intervention) is to begin by specifying the
input to the phonetic component PHONET in Figure 7-1. If durations and
fundamental frequency values are taken from a natural recording rather than
being computed by rule, a remarkably human voice quality is achieved. Storage of
about 250 bits per second of speech is required, and of course, considerable
effort is required to prepare the input representation.
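
The two storage figures quoted above can be turned into rough per-message
estimates. The following is a minimal sketch of that arithmetic only. The
constant and function names are illustrative assumptions, not part of MITalk,
and the rates are the approximate values given in the text.

```python
# Approximate storage rates quoted in the text, in bits per second of speech.
# The names below are hypothetical, chosen for this illustration only.
PHONEMIC_BPS = 100  # abstract phonemic and syntactic representation
PHONETIC_BPS = 250  # PHONET input with natural durations and fundamental frequency


def storage_bytes(seconds: float, bits_per_second: int) -> float:
    """Return the approximate storage, in bytes, for `seconds` of speech."""
    return seconds * bits_per_second / 8


if __name__ == "__main__":
    minute = 60.0
    for label, rate in (("phonemic", PHONEMIC_BPS), ("phonetic", PHONETIC_BPS)):
        print(f"{label}: {storage_bytes(minute, rate):.0f} bytes per minute")
```

Under these figures, one minute of speech needs roughly 750 bytes in the
phonemic form and roughly 1875 bytes in the phonetic form.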