Survey of speech synthesis technology

because lengthening and shortening of speech tend to take place during the
steady-state portions of sustainable phonetic segments, whereas the demisyllable is
a mixture of portions of steady states and transitions.

7.3.2.3 Diphones The diphone is defined as half of one phone followed by half
of the next phone. Peterson et al. (1958), and Wang and Peterson (1958) were the
first to propose speech synthesis by diphone concatenation. They argued that the
diphone is a natural unit for synthesis because the coarticulatory influence of one
phoneme does not usually extend much further than halfway into the next
phoneme. Since diphone junctures are usually at articulatory steady states,
minimal smoothing should be required between adjacent diphones. They speculated
that several thousand diphones would be required if real speech waveform
segments were used, because each of the diphones would have to be recorded at
several different durations and with several different pitch contours.

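The definition above (half of one phone followed by half of the next) implies
that an utterance of N phones, padded with silence at both ends, is covered by
N + 1 diphones. The following is a minimal sketch of that bookkeeping only. The
function name, the "sil" silence label, and the phoneme labels are assumptions
made for illustration and do not come from any of the cited systems.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the diphone labels that span a phoneme sequence.

    Each diphone runs from the middle of one phone to the middle of the
    next, so N phones padded with silence on both sides give N + 1 diphones.
    """
    padded = [silence, *phonemes, silence]
    # Pair every phone with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Hypothetical phoneme labels for a three-phone word.
    print(to_diphones(["k", "ae", "t"]))
```

Because every juncture falls inside a phone rather than at a phone boundary, the
concatenation points in such a sequence lie where the text expects articulatory
steady states.
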
Dixon and Maxey (1968) later showed that highly intelligible synthetic
speech could be fashioned from diphones defined in terms of sets of control
parameter time functions to control a formant synthesizer. Only about 1500
diphone elements were required (40 phonemes followed by almost any of 40
phonemes) because duration, intensity, and fundamental frequency could be
adjusted independently (by hand, in their case) to take into account effects of
stress, intonation, and rhythm. Unfortunately, their diphone definitions were never
published.

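The inventory sizes quoted above can be checked with simple arithmetic. The
sketch below assumes that every ordered pair of the 40 phonemes is a candidate
diphone, which gives an upper bound of 1600. The counts of durations and pitch
contours per diphone are hypothetical values chosen only to show how a
waveform-based inventory grows into the "several thousand" range; the source
gives no specific figures for them.

```python
N_PHONEMES = 40          # phoneme inventory size quoted in the text
REPORTED_DIPHONES = 1500  # approximate count quoted in the text


def max_diphones(n_phonemes):
    """Upper bound: every phoneme followed by every phoneme."""
    return n_phonemes * n_phonemes


def waveform_inventory(n_diphones, n_durations, n_pitch_contours):
    """Stored units needed if each diphone is recorded in every variant."""
    return n_diphones * n_durations * n_pitch_contours


if __name__ == "__main__":
    upper = max_diphones(N_PHONEMES)
    print("upper bound:", upper)
    print("fraction of pairs used:", REPORTED_DIPHONES / upper)
    # Hypothetical: 2 durations and 2 pitch contours per diphone.
    print("waveform units:", waveform_inventory(REPORTED_DIPHONES, 2, 2))
```

When duration, intensity, and fundamental frequency are adjusted at synthesis
time, the multipliers drop to one and the inventory stays near the number of
diphones itself.
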
Olive (1977) proposed that diphones could be defined in terms of two
linear-prediction pseudo-area-function targets and a linear transition between them. If
durations and fundamental frequency were specified by hand, Olive and
Spickenagle (1976) showed that quite natural intelligible speech could be produced by
this method of linear-prediction diphone synthesis. The specification of durational
rules is a problem, just as in the case of demisyllables, because the most natural
framework for stating rules is in terms of phonetic segments.

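The idea of two targets joined by a linear transition can be illustrated with
plain linear interpolation between two parameter vectors. This is only a sketch
of the general technique, not a reconstruction of Olive's procedure: the function
name, the frame count, and the target values are all hypothetical, and the
vectors merely stand in for pseudo-area-function parameters.

```python
def linear_transition(start_target, end_target, n_frames):
    """Interpolate linearly from one target vector to another.

    Returns n_frames parameter vectors; the first equals start_target and
    the last equals end_target.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames to span both targets")
    if len(start_target) != len(end_target):
        raise ValueError("targets must have the same number of parameters")

    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # runs from 0.0 at the first frame to 1.0 at the last
        frames.append([a + t * (b - a) for a, b in zip(start_target, end_target)])
    return frames


if __name__ == "__main__":
    # Hypothetical two-parameter targets and a three-frame transition.
    for frame in linear_transition([1.0, 2.0], [3.0, 6.0], 3):
        print(frame)
```

Under this scheme a diphone is stored as just two target vectors, and the number
of frames can be chosen at synthesis time to match the desired duration.
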
7.3.2.4 Prosodic rules Relatively little work on general prosodic rules has been
published within the context of syllable, demisyllable, and diphone concatenation
systems. Olive (1974) proposed an unusual set of word-based fundamental
frequency rules that depend on syntactic structure, but are not influenced by the stress
pattern within a word. Recent work at Bell Laboratories by Liberman and
Pierrehumbert on rules for the specification of durations and fundamental frequency
contours in a diphone-based system shows considerable promise (Pierrehumbert,
1979).