From text to speech: The MITalk system

We have argued in Chapters 2-6 that in order to transform English text to
speech, one must first try to derive an underlying abstract linguistic representation
for the text. There are at least two reasons why a direct approach is suboptimal: 1)
rules for pronouncing words must take into consideration morphemic structure
(e.g. consider the pronunciation of the th of outhouse) and syntactic structure (e.g.
there exist many noun-verb ambiguities in English such as perm’it - p’ermit), and
2) the sentence duration pattern and fundamental frequency contour depend, to a
major extent, on the syntactic structure of the sentence.

There are currently several text-to-speech systems under development in the
United States (Nye et al., 1973; Kurzweil, 1976; Caldwell, 1979; Morris, 1979)
and elsewhere (Carlson and Granstrom, 1976). The simplest approach is to devise
a set of heuristic letter-to-sound rules and then create an exceptions dictionary for
frequently occurring words that are processed incorrectly by the letter-to-sound
rules (Kurzweil, 1976). The exceptions dictionary is then augmented by function
words that are useful for parsing strategies. The phonetic representation for a
sentence that is derived in this way serves as the input to a synthesis-by-rule device
such as Votrax (Gagnon, 1978) or a software synthesis-by-rule program.
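
The simplest approach described above can be illustrated with a minimal sketch. The rule tables, the exceptions entries, the ARPAbet-style phoneme symbols, and the function name to_phonemes are all hypothetical and chosen only for illustration; they are not the rules of any of the cited systems.

```python
# Hypothetical exceptions dictionary: frequent words that the toy rules
# below would pronounce incorrectly, so they are looked up directly.
EXCEPTIONS = {
    "the": "DH AH",
    "of": "AH V",
    "one": "W AH N",
}

# Toy letter-to-sound rules (assumed for illustration): two-letter
# graphemes are tried before single letters.
DIGRAPHS = {"th": "TH", "sh": "SH", "ch": "CH", "ph": "F", "ou": "AW"}
SINGLE = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def to_phonemes(word):
    """Return a space-separated phoneme string for one word."""
    word = word.lower()
    # Step 1: the exceptions dictionary overrides the rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Step 2: scan left to right, preferring the longer grapheme.
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in SINGLE:
            phones.append(SINGLE[word[i]])
            i += 1
        else:
            i += 1  # skip characters the toy rules do not cover
    return " ".join(phones)


print(to_phonemes("the"))       # found in the exceptions dictionary
print(to_phonemes("ship"))      # handled by the rules
print(to_phonemes("outhouse"))  # rules wrongly merge t+h across the morph boundary
```

With these toy tables, "outhouse" comes out with a single TH phoneme, because the rules see only letters and not the boundary between "out" and "house". This is the kind of error that motivates morphemic analysis.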

The MITalk system represents a more ambitious approach based on generalized
morphemic analysis, which aims to determine the pronunciation of words more
reliably and to assign parts of speech to each word more accurately, and thereby to
compute phrase and clause boundaries with greater accuracy. The real question is
whether current algorithms are good enough to make automatic text-to-speech output
acceptable to the user. There is clear indication that motivated users (such as the
blind) benefit from these devices after a period of acclimation, but considerable
concentration is required.
