
Survey of speech synthesis technology

because lengthening and shortening of speech tends to take place during the
steady-state portions of sustainable phonetic segments, whereas the demisyllable is
a mixture of portions of steady states and transitions.

7.3.2.3 Diphones The diphone is defined as half of one phone followed by half
of the next phone. Peterson et al. (1958) and Wang and Peterson (1958) were the
first to propose speech synthesis by diphone concatenation. They argued that the
diphone is a natural unit for synthesis because the coarticulatory influence of one
phoneme does not usually extend much further than halfway into the next
phoneme. Since diphone junctures are usually at articulatory steady states, minimal
smoothing should be required between adjacent diphones. They speculated
that several thousand diphones would be required if real speech waveform segments
were used, because each of the diphones would have to be recorded at
several different durations and with several different pitch contours.
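
As an illustration of the definition above, the following minimal Python sketch splits a phoneme string into the diphone units that would have to be concatenated. The function name `to_diphones`, the boundary symbol `#`, and the phoneme labels are assumptions made for this example and do not come from the systems cited.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the diphone labels that span a phoneme sequence.

    `boundary` is a hypothetical silence symbol padded at both ends so
    that the first and last phonemes each get a transition unit.
    """
    padded = [boundary, *phonemes, boundary]
    # Each diphone runs from the middle of one phone to the middle of
    # the next, so n phonemes plus two boundaries yield n + 1 units.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


units = to_diphones(["k", "ae", "t"])
print(units)  # ['#-k', 'k-ae', 'ae-t', 't-#']
```

Because every cut falls in the middle of a phone, each unit carries one complete phoneme-to-phoneme transition.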

Dixon and Maxey (1968) later showed that highly intelligible synthetic
speech could be fashioned from diphones defined in terms of sets of control
parameter time functions to control a formant synthesizer. Only about 1500
diphone elements were required (40 phonemes followed by almost any of 40
phonemes) because duration, intensity, and fundamental frequency could be adjusted
independently (by hand, in their case) to take into account effects of stress,
intonation, and rhythm. Unfortunately, their diphone definitions were never
published.
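
The inventory size quoted above follows from simple counting, sketched below in Python. The phoneme labels are placeholders, since the text gives only the number of phonemes, and the count of excluded pairs is merely the difference between the upper bound and the reported figure of about 1500.

```python
from itertools import product

# Hypothetical placeholder labels; the text only says there were 40 phonemes.
phonemes = [f"p{i:02d}" for i in range(40)]

# Upper bound: every ordered pair of phonemes is a candidate diphone.
all_pairs = list(product(phonemes, repeat=2))
upper_bound = len(all_pairs)  # 40 * 40

# "Almost any" implies that some pairs never occur. With about 1500
# units in use, roughly this many ordered pairs would be excluded.
reported = 1500
excluded = upper_bound - reported

print(upper_bound, excluded)  # 1600 100
```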

Olive (1977) proposed that diphones could be defined in terms of two
linear-prediction pseudo-area-function targets and a linear transition between them.
Olive and Spickenagle (1976) showed that, when durations and fundamental
frequency were specified by hand, quite natural, intelligible speech could be produced by
this method of linear-prediction diphone synthesis. The specification of durational
rules is a problem, just as in the case of demisyllables, because the most natural
framework for stating rules is in terms of phonetic segments.
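
The two-target definition described in this paragraph can be sketched as a straight-line interpolation between parameter vectors. The Python below is only an illustration under stated assumptions: the function name `linear_transition`, the frame count, and the target values are invented for the example and are not taken from Olive's published data.

```python
def linear_transition(start_target, end_target, n_frames):
    """Interpolate linearly between two parameter vectors.

    Returns `n_frames` vectors, the first equal to `start_target` and
    the last equal to `end_target`. The vectors stand in for the
    pseudo-area-function targets; their values here are hypothetical.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames to span a transition")
    if len(start_target) != len(end_target):
        raise ValueError("targets must have the same number of parameters")
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([a + t * (b - a) for a, b in zip(start_target, end_target)])
    return frames


frames = linear_transition([1.0, 2.0], [3.0, 6.0], 3)
print(frames)  # [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

In this sketch, changing `n_frames` stretches or compresses the whole transition uniformly, which gives some sense of why duration rules framed in terms of phonetic segments are awkward to apply to such units.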

7.3.2.4 Prosodic rules Relatively little work on general prosodic rules has been
published within the context of syllable, demisyllable, and diphone concatenation
systems. Olive (1974) proposed an unusual set of word-based fundamental frequency
rules that depend on syntactic structure, but are not influenced by the stress
pattern within a word. Recent work at Bell Laboratories by Liberman and Pierrehumbert
on rules for the specification of durations and fundamental frequency contours
in a diphone-based system shows considerable promise (Pierrehumbert,
1979).
