
Survey of speech synthesis technology

The input to the rules includes phonemes, stress, word and morpheme boundaries,
and syntactic structure.

In time, these methods ought to be able to produce highly intelligible natural
speech, but present results are frequently perceived to be somewhat unnatural and
machine-like. This appears to be due mainly to the intricate complexity of the
speech code and the fact that not all of the rules are known at this time. There is a
particular need to improve on the specification of fundamental frequency and
duration algorithms, perhaps by making incremental improvements to current
algorithms (Umeda, 1976; Klatt, 1979a; Maeda, 1974; O’Shaughnessy, 1977;
Pierrehumbert, 1979).

7.4 Applications

7.4.1 Synthesis of arbitrary English sentences

From the above discussion, it should be clear that there are a number of promising
methods for synthesizing general English. To generate a particular utterance, one
must know 1) the phonemic (or phonetic) representation for each word, 2) the
stress pattern for each word, 3) aspects of the syntactic structure of the sentence,
and 4) the locations of any words that are to receive semantic focus. This
information would have to be stored in the computer for each utterance to be
synthesized, or it might be generated from a deep-structure representation of the
concept to be expressed (Woods et al., 1976; Young and Fallside, 1979).
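
The four kinds of information listed above can be pictured as a small record
stored per utterance. The following is only a minimal sketch under stated
assumptions: the class names, the field names, the ARPAbet-style phoneme labels,
and the reduction of syntactic structure to a list of phrase-final word indices
are all hypothetical choices made for illustration, not a format described in
the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordSpec:
    """One word of an utterance (hypothetical structure, for illustration)."""
    phonemes: List[str]   # 1) phonemic representation, e.g. ["K", "AE", "T"]
    stress: List[int]     # 2) stress per syllable (assumed: 0 = unstressed, 1 = primary)
    focus: bool = False   # 4) True if the word is to receive semantic focus


@dataclass
class UtteranceSpec:
    """Everything assumed to be stored for one utterance to be synthesized."""
    words: List[WordSpec]
    # 3) syntactic structure, simplified here to the indices of the words
    #    that end a phrase (an assumption; real systems carry richer structure)
    phrase_boundaries: List[int] = field(default_factory=list)

    def focused_words(self) -> List[int]:
        """Return the indices of the words marked for semantic focus."""
        return [i for i, w in enumerate(self.words) if w.focus]
```

A record of this kind would either be stored for each utterance or filled in by
a generator working from a deeper representation of the concept to be expressed.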

7.4.2 Synthesis of arbitrary English names

Research at Bell Laboratories (Denes, 1979; Liberman, 1979; Olive, 1979) is
directed at the ability to synthesize any name from a telephone directory for
application in automated directory assistance. The linguistic problems associated
with converting spelling to a phonetic representation and stress pattern are severe,
since it is sometimes necessary to guess the native language of the individual
before a good rendering of the pronunciation is possible (Liberman, 1979). Once a
phonetic representation has been derived, this experimental system uses diphone
synthesis (Olive, 1979) to generate a waveform.
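
In diphone synthesis, the stored units run from the middle of one phone to the
middle of the next, so a phonetic string must first be turned into a sequence of
adjacent-phone pairs. The sketch below illustrates only that bookkeeping step.
The function name, the "#" silence symbol, the hyphenated unit names, and the
ARPAbet-style labels are assumptions for illustration and do not reflect the
unit inventory of the Bell Laboratories system.

```python
from typing import List

SILENCE = "#"  # hypothetical symbol for the silence at utterance edges


def phonemes_to_diphones(phonemes: List[str]) -> List[str]:
    """Map a phoneme sequence to the names of the diphone units that span it.

    The sequence is padded with silence on both sides so that the first and
    last phones each get a transition unit as well.
    """
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    # Pair every symbol with its right-hand neighbor.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(phonemes_to_diphones(["K", "AE", "T"]))
# ['#-K', 'K-AE', 'AE-T', 'T-#']
```

A sequence of n phonemes yields n + 1 units, each of which would be looked up in
a stored inventory and concatenated to form the waveform.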

7.4.3 Text-to-speech conversion

The transformation of English text to speech is a much more formidable problem
than the synthesis of an arbitrary sentence from a knowledge of its underlying
linguistic representation. The text does not indicate everything that one would
like to know, unless one builds a machine that can recognize the meaning of the
text and thereby resolve the frequently occurring syntactic ambiguities and
determine semantic focus relations.
