From text to speech: The MITalk system
7.3.3 Phonemic synthesis-by-rule
7.3.3.1 Phoneme synthesis from natural speech
Phonemes have been considered as basic speech units because there are only
about 40 of them in English, and there are good linguistic reasons for
representing speech by phonemes. Unfortunately, there is no possibility of
extracting phonemic-sized chunks from natural speech in such a way that they
can be reassembled into new utterances, because of the large acoustic changes
to a phoneme that occur in different phonetic environments. Phonemes are a
good starting point for terminal-analog speech synthesis-by-rule programs
(discussed below) because the rule programs can utilize a complex set of rules
to predict acoustic changes in different phonetic environments, but some other
unit is needed for a concatenation scheme.
7.3.3.2 Formant-based synthesis strategies
Synthesis-by-rule schemes employing a formant-resonator speech synthesizer
range from the excellent early work of Holmes et al. (1964) to the synthesis
of intelligible speech from an input representation consisting of an abstract
linguistic description (Mattingly, 1968b; Coker, 1967; Coker et al., 1973;
Klatt, 1976a). The formant synthesizer accepts input time functions that
determine formant frequencies, voicing, friction, and aspiration source
amplitudes, fundamental frequency, and individual formant amplitudes for
fricatives. The synthesizer produces an output waveform that is intended to
approximate the perceptually most relevant acoustic characteristics of speech.
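To make the idea of a formant resonator concrete, the following is a minimal sketch in Python. It is not taken from any of the systems cited above. It assumes a widely used two-pole digital resonator with unity gain at 0 Hz, an impulse train as a crude voicing source, and a cascade of three resonators. The formant frequencies and bandwidths, the 10 kHz sampling rate, and all function names are illustrative assumptions, and the formants are held constant rather than driven by time functions.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    Assumed form: a two-pole digital resonator with unity gain at 0 Hz.
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Pass a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voicing source: one unit impulse per fundamental period."""
    n = round(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.05, sample_rate=10000):
    """Cascade one resonator per (frequency, bandwidth) pair."""
    wave = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in formants:
        wave = resonate(wave, freq_hz, bw_hz, sample_rate)
    return wave

# Hypothetical formant values for a steady vowel-like sound.
samples = synthesize_vowel([(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)])
```

In a synthesis-by-rule program the constants passed to `synthesize_vowel` would be replaced by parameter tracks updated every few milliseconds, which is the role of the input time functions described above.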
Formant synthesizers come in many different configurations (Dudley et al.,
1939; Cooper et al., 1951; Lawrence, 1953; Stevens et al., 1953; Rosen, 1958;
Tomlinson, 1966; Liljencrants, 1968; Gold and Rabiner, 1968; Klatt, 1972;
Flanagan et al., 1975). Holmes (1961, 1973) has shown that terminal-analog
methods of speech synthesis are capable of generating synthetic speech that is
indistinguishable from the original recording of a talker if the parameter
values are properly chosen.
7.3.3.3 Control strategy
Control parameter values such as formant frequency motions are determined from
a phonetic transcription of the intended sentence using a set of heuristic
rules. In one case (Coker, 1967), the rules manipulate a simplified
articulatory model of the vocal tract. Other rule programs manipulate formant
values directly, using heuristics such as the locus theory (Holmes et al.,
1964), or a modified locus theory (Klatt, 1979b).
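As a rough illustration of how a locus-style heuristic can turn a phonetic transcription into a formant motion, the sketch below generates a second-formant track for a consonant-vowel transition. It is a generic simplification and does not reproduce the rules of Holmes et al. (1964) or Klatt (1979b). It assumes that each consonant place has a nominal locus frequency, that the formant starts part-way between the locus and the vowel target, and that it glides linearly to the target. The function name, the `onset_fraction` parameter, and the numeric values are all hypothetical.

```python
def locus_transition(locus_hz, target_hz, transition_ms, total_ms,
                     frame_ms=5, onset_fraction=0.5):
    """Return one formant value per frame for a consonant-vowel transition.

    The track starts between the consonant locus and the vowel target
    (onset_fraction = 0 starts at the locus, 1 at the target) and moves
    linearly to the target over transition_ms.
    """
    onset_hz = locus_hz + onset_fraction * (target_hz - locus_hz)
    track = []
    for t in range(0, total_ms, frame_ms):
        if t < transition_ms:
            frac = t / transition_ms
            track.append(onset_hz + frac * (target_hz - onset_hz))
        else:
            track.append(target_hz)  # steady vowel portion
    return track

# Illustrative values only: an assumed 1800 Hz locus and a 1200 Hz vowel target.
f2_track = locus_transition(locus_hz=1800.0, target_hz=1200.0,
                            transition_ms=50, total_ms=100)
```

A rule program of the kind described above would look up the locus and target for each phoneme pair and produce one such track per formant.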
In a fully automatic system, the phonetic transcription, durations, stress,
and fundamental frequency targets are derived from an abstract syntactic and
phonemic representation for a sentence by a set of rules that approximate a
phonological description of English (Klatt, 1976a, 1979a; O’Shaughnessy, 1977;
Umeda, 1977).