The phonetic component

the parameter value assigned to the time of the segment boundary. These con-
stants are determined by rules that involve features of the current phonetic segment
PHOCUR, the previous phonetic segment PHOLAS, and the next phonetic seg-
ment PHONEX. In some cases, the rules have to examine features of segments
further from the current segment, but this is rare. For example, in "pin", the time of
voicing onset in the vowel preceded by the voiceless plosive [p] is delayed by
about 50 msec, unless the segment preceding the voiceless plosive is an [s], as in
"spin". The variable control parameters are listed later in Table 11-3.
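
A context-sensitive rule of this kind can be sketched in a few lines of Python.
This is only an illustration, not the MITalk implementation: the function name,
the segment symbols, the small vowel set, the extension of the rule from [p] to
the other voiceless plosives, and the choice of a zero delay after [s] are all
assumptions made for the example. Only the roles of PHOCUR and PHOLAS and the
figure of about 50 msec come from the text.

```python
from typing import Optional, Sequence

# Hypothetical symbol sets, chosen only for this illustration.
VOICELESS_PLOSIVES = frozenset({"p", "t", "k"})  # text mentions only [p]
VOWELS = frozenset({"i", "I", "a", "u"})

# "About 50 msec" is the value given in the text.
VOICING_ONSET_DELAY_MS = 50


def voicing_onset_delay_ms(segments: Sequence[str], index: int) -> int:
    """Return the assumed voicing-onset delay for the segment at `index`.

    The rule inspects the current segment (PHOCUR), the previous segment
    (PHOLAS), and, as one of the rare longer-range cases, the segment
    before PHOLAS.
    """
    phocur = segments[index]
    pholas: Optional[str] = segments[index - 1] if index >= 1 else None
    before_pholas: Optional[str] = segments[index - 2] if index >= 2 else None

    if phocur not in VOWELS:
        return 0  # the rule applies only to vowels
    if pholas not in VOICELESS_PLOSIVES:
        return 0  # no voiceless plosive before the vowel
    if before_pholas == "s":
        return 0  # assumption: no added delay after [s], as in "spin"
    return VOICING_ONSET_DELAY_MS


print(voicing_onset_delay_ms(["p", "I", "n"], 1))       # "pin"  -> 50
print(voicing_onset_delay_ms(["s", "p", "I", "n"], 2))  # "spin" -> 0
```

Passing the whole segment sequence with an index, rather than three fixed
arguments, is one way to let a rule look further than PHOLAS or PHONEX
when it has to.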

11.1.3 History of formant synthesis-by-rule

As originally demonstrated by John Holmes, successful imitation of a natural ut-
terance depends primarily on matching observed short-term spectra. This tech-
nique succeeds, in part, because it reproduces all of the potential cues present in
the spectrum, even though we may not know which cues are most important. The
speech perception apparatus appears to be aware of any and all (perceptually
discriminable) regularities present in the acoustic signal generated by the speech
production apparatus, and these regularities should be included in synthetic stimuli
if possible.

There have been a number of previous efforts to specify general strategies for
formant synthesis-by-rule (see, e.g., Holmes et al., 1964; Mattingly, 1968a;
Rabiner, 1968a; Coker et al., 1973; Klatt, 1972, 1976a). However, examination of
these publications suggests that the consonant-vowel intelligibility achieved is nowhere near as
high as that of natural speech. For example, Rabiner (1968a) estimated that
consonants in his synthetic consonant-vowel nonsense stimuli were 85 percent in-
telligible to phonetically trained listeners, but that natural tokens of the same syll-
ables were about 99 percent intelligible. Other rule programs, apparently, perform
no better, although relevant evaluative data are generally not available.

Why isn’t intelligibility higher? Each rule system attempts to make ap-
propriate generalizations and simplifications concerning the form and content of
rules for consonant-vowel synthesis. Have the wrong generalizations been made?
The results described below in Section 11.2 suggest that this conjecture is true.

11.2 “Synthesis-by-analysis” of consonant-vowel syllables

11.2.1 Analysis of CV syllables

The data base that was recorded and analyzed in order to develop new consonant-
vowel synthesis rules consists of speech samples obtained from six talkers who
were native to a single midwestern dialect region -- three males and three females
(Klatt, 1979b). The intent was to use the data from all six talkers to establish the
form of the synthesis rules, but the actual parameter values inserted in the rules
