From text to speech: The MITalk system

7.3.3 Phonemic synthesis-by-rule

7.3.3.1 Phoneme synthesis from natural speech Phonemes have been considered
as basic speech units because there are only about 40 of them in English, and there
are good linguistic reasons for representing speech by phonemes. Unfortunately,
phonemic-sized chunks cannot be extracted from natural speech and reassembled
into new utterances, because the acoustic realization of a phoneme changes
greatly from one phonetic environment to another.
Phonemes are a good starting point for terminal-analog speech synthesis-by-rule
programs (discussed below) because the rule programs can utilize a complex set of
rules to predict acoustic changes in different phonetic environments, but some
other unit is needed for a concatenation scheme.

7.3.3.2 Formant-based synthesis strategies Synthesis-by-rule schemes employing
a formant-resonator speech synthesizer range from the excellent early work of
Holmes et al. (1964) to the synthesis of intelligible speech from an input
representation consisting of an abstract linguistic description (Mattingly, 1968b;
Coker, 1967; Coker et al., 1973; Klatt, 1976a). The formant synthesizer accepts
input time functions that determine formant frequencies; voicing, friction, and
aspiration source amplitudes; fundamental frequency; and individual formant
amplitudes for fricatives. The synthesizer produces an output waveform that is
intended to approximate the perceptually most relevant acoustic characteristics
of speech.
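
As a rough illustration of how time-varying control parameters can drive a
formant synthesizer, the sketch below passes an impulse-train voicing source
through a cascade of second-order digital resonators, one per formant. It is
a minimal sketch under stated assumptions, not the synthesizer described in
the works cited: the cascade arrangement, the 10 kHz sample rate, the 5 ms
frame length, the impulse source, and all names (resonator_coefficients,
synthesize, the "f0", "voicing_amp", and "formants" keys) are choices made
here for illustration. Friction and aspiration sources are omitted.

```python
import math

def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (two-pole resonator)."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c

def synthesize(frames, sample_rate_hz=10000, frame_ms=5.0):
    """Render a waveform from per-frame control parameters (assumed layout).

    Each frame is a dict with:
      "f0"          fundamental frequency in Hz
      "voicing_amp" amplitude of each voicing impulse
      "formants"    list of (frequency_hz, bandwidth_hz) pairs
    """
    samples_per_frame = int(sample_rate_hz * frame_ms / 1000.0)
    n_formants = len(frames[0]["formants"])
    # Per-resonator memory: [y[n-1], y[n-2]], carried across frame boundaries.
    state = [[0.0, 0.0] for _ in range(n_formants)]
    phase = 0.0
    out = []
    for frame in frames:
        coeffs = [resonator_coefficients(f, bw, sample_rate_hz)
                  for f, bw in frame["formants"]]
        for _ in range(samples_per_frame):
            # Impulse-train source: emit one impulse per pitch period.
            phase += frame["f0"] / sample_rate_hz
            if phase >= 1.0:
                phase -= 1.0
                x = frame["voicing_amp"]
            else:
                x = 0.0
            # Cascade: the output of each resonator feeds the next one.
            for (a, b, c), s in zip(coeffs, state):
                y = a * x + b * s[0] + c * s[1]
                s[1], s[0] = s[0], y
                x = y
            out.append(x)
    return out

# Usage: 100 ms of a steady vowel-like sound (illustrative values only).
steady_frame = {"f0": 100.0, "voicing_amp": 1.0,
                "formants": [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]}
waveform = synthesize([steady_frame] * 20)
```

Updating the coefficients once per frame while preserving each resonator's
memory lets the formant frequencies move over time without resetting the
filters at every frame boundary.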

Formant synthesizers come in many different configurations (Dudley et al.,
1939; Cooper et al., 1951; Lawrence, 1953; Stevens et al., 1953; Rosen, 1958;
Tomlinson, 1966; Liljencrants, 1968; Gold and Rabiner, 1968; Klatt, 1972;
Flanagan et al., 1975). Holmes (1961, 1973) has shown that terminal-analog
methods of speech synthesis are capable of generating synthetic speech that is
indistinguishable from the original recording of a talker if the parameter values are
properly chosen.

7.3.3.3 Control strategy Control parameter values such as formant frequency
motions are determined from a phonetic transcription of the intended sentence
using a set of heuristic rules. In one case (Coker, 1967), the rules manipulate a
simplified articulatory model of the vocal tract. Other rule programs manipulate
formant values directly, using heuristics such as the locus theory (Holmes et al.,
1964), or a modified locus theory (Klatt, 1979b).
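
To make the idea of a locus-based heuristic concrete, the sketch below
generates one formant track for a consonant-vowel transition. It assumes the
simplest reading of the locus idea: the formant starts partway between a
consonant-specific locus frequency and the vowel target, then moves linearly
to the target. The function name locus_transition, the fraction k, the
linear interpolation, and every numeric value are hypothetical and are not
taken from the rule programs cited above.

```python
def locus_transition(locus_hz, target_hz, k=0.5,
                     transition_ms=50.0, total_ms=150.0, frame_ms=5.0):
    """Return one formant value per frame for a consonant-vowel transition.

    k (assumed parameter) sets how far the onset lies from the locus toward
    the vowel target: k=0 starts at the locus, k=1 starts at the target.
    """
    onset_hz = locus_hz + k * (target_hz - locus_hz)
    n_frames = int(total_ms / frame_ms)
    track = []
    for i in range(n_frames):
        t_ms = i * frame_ms
        if t_ms < transition_ms:
            # Linear glide from the onset value to the vowel target.
            frac = t_ms / transition_ms
            track.append(onset_hz + frac * (target_hz - onset_hz))
        else:
            # Steady state: hold the vowel target.
            track.append(target_hz)
    return track

# Usage: a second-formant track gliding from a higher locus to a lower target.
f2_track = locus_transition(locus_hz=1800.0, target_hz=1200.0)
```

A track of this kind, computed for each formant, is the sort of time function
a formant synthesizer takes as input.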

In a fully automatic system, the phonetic transcription, durations, stress, and
fundamental frequency targets are derived from an abstract syntactic and phonemic
representation for a sentence by a set of rules that approximate a phonological
description of English (Klatt, 1976a, 1979a; O’Shaughnessy, 1977; Umeda, 1977).
