From text to speech: The MITalk system

7.3.2 Syllables and diphones

Instead of using words as the basic building blocks for sentence production, a
smaller inventory of basic units is required if arbitrary English sentences are to be
synthesized. The inventory of basic speech units must satisfy several require-
ments, including: 1) the ability to construct any English word by concatenating the
units one after another, and 2) the ability to change duration, intensity and fun-
damental frequency according to the demands of the sentence syntax and stress
pattern in such a way as to produce speech that is both intelligible and natural.

7.3.2.1 Syllables The intuitive notion of the syllable as the basic unit has con-
siderable theoretical appeal. Any English word can be broken into syllables con-
sisting of a vowel nucleus and adjacent consonants. Linguists have been unable to
agree on objective criteria for assigning consonants to a particular vowel nucleus
in certain ambiguous cases such as “butter”, but an arbitrary decision can be made
for synthesis purposes.
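
The kind of arbitrary decision mentioned above can be illustrated with a minimal sketch. It is not the procedure of any system described here. The vowel symbols, the function name `syllabify`, and the boundary rule are all assumptions chosen for illustration: a single consonant between two vowels is assigned to the following nucleus, and in a longer cluster the first consonant closes the preceding syllable.

```python
# Hypothetical ARPAbet-like vowel symbols; an assumption, not taken from the text.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}


def syllabify(phonemes):
    """Split a list of phoneme symbols into syllables around vowel nuclei.

    Arbitrary rule (assumed for illustration): a single intervocalic
    consonant starts the following syllable; in a longer cluster the first
    consonant closes the preceding syllable and the rest open the next one.
    """
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not nuclei:
        # No vowel nucleus: return the input as one unit (or nothing if empty).
        return [list(phonemes)] if phonemes else []

    syllables = []
    start = 0
    for left, right in zip(nuclei, nuclei[1:]):
        cluster_len = right - left - 1  # consonants between the two nuclei
        # Index at which the next syllable begins.
        boundary = left + 1 if cluster_len <= 1 else left + 2
        syllables.append(list(phonemes[start:boundary]))
        start = boundary
    syllables.append(list(phonemes[start:]))
    return syllables


butter = syllabify(["b", "ah", "t", "er"])
construct = syllabify(["k", "ah", "n", "s", "t", "r", "ah", "k", "t"])
```

Under this rule the ambiguous /t/ of "butter" is simply assigned to the second syllable, and "construct" divides as con-struct. A different rule would serve equally well for synthesis, provided it is applied consistently.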

The greatest theoretical advantage of the syllable concerns the way that
acoustic characteristics of most consonant-vowel transitions are preserved.
Context-conditioned acoustic changes to consonants are automatically present to a
great extent when the syllable is chosen as the basic unit, but not when smaller
units such as the phoneme are concatenated.

The disadvantages of the syllable are: 1) coarticulation across syllable boun-
daries is not treated, and this coarticulation can be just as important as within-
syllable coarticulation, 2) if prerecorded syllables are stored in the form of
waveforms, there is no way to mimic the prosodic contour of the intended mes-
sage, and 3) the syllable inventory for general English is very large. There are cur-
rently no syllable-based systems for speech generation.

7.3.2.2 Demisyllables  The last two disadvantages of a syllable-based scheme
might be overcome by replacing syllables by demisyllables. The demisyllable is
defined as half of a syllable, either the set of initial consonants plus half of the
vowel, or the second half of the vowel plus any postvocalic consonants (Fujimura
and Lovins, 1978; Lovins and Fujimura, 1976). For example, the word “construct”
would be divided into co-, -on, stru-, and -uct. It is claimed that fewer
than 1000 demisyllables are needed to synthesize any English utterance. Each
demisyllable can be represented in terms of a set of linear prediction frames. Con-
catenation rules include some smoothing across demisyllable boundaries. The
problems with demisyllable-based approaches are: 1) how to smooth across
demisyllable boundaries to simulate natural coarticulation, and 2) how to adjust
durations to match the desired pattern for a sentence. The latter problem is serious
