
Survey of speech synthesis technology

variety of applications where the vocabulary consists of a small number of words
and where the messages are simple and follow a rather rigid format. However,
such systems have a number of limitations that make them unsatisfactory for more
general applications, such as automatic conversion of English text to speech.

Figure WORD-BLEND illustrates some of the differences between words
spoken in isolation and the same words put together in a fluently spoken sentence.
Not only are most words considerably shorter in the sentence, but there are also
acoustic changes at the boundaries between words, due both to coarticulation and
to phonological rules that change the pronunciation of words in certain sentence
contexts. Furthermore, the intonation, rhythm, and stress pattern appropriate to
the sentence cannot be synthesized if one simply concatenates prerecorded words.
These prosodic qualities turn out to be extremely important. Words that are
perfectly intelligible in isolation seem to come too fast and in a disconnected
manner when they are concatenated with the wrong prosody.
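
The limitation described above can be made concrete with a minimal sketch of
naive word concatenation. It is illustrative only and does not reproduce any of
the systems cited in this chapter. The sampling rate, the function names
(silence, concatenate_words), the optional fixed pause, and the toy recordings
are all assumptions introduced for the example.

```python
# Hypothetical sketch of naive word concatenation (not a cited system).
# Each word is assumed to be stored as a list of waveform samples that
# was recorded in isolation.

SAMPLE_RATE = 8000  # assumed sampling rate in Hz


def silence(duration_s, sample_rate=SAMPLE_RATE):
    """Return a run of zero-valued samples lasting duration_s seconds."""
    return [0.0] * int(round(duration_s * sample_rate))


def concatenate_words(words, recordings, pause_s=0.0, sample_rate=SAMPLE_RATE):
    """Join prerecorded word waveforms end to end.

    An optional fixed pause (an assumption of this sketch) is placed
    between consecutive words. No word is shortened, no boundary is
    smoothed, and no sentence-level intonation is imposed.
    """
    message = []
    for index, word in enumerate(words):
        if index > 0:
            message.extend(silence(pause_s, sample_rate))
        message.extend(recordings[word])
    return message


# Toy stand-ins for isolated-word recordings (constant-valued samples).
recordings = {
    "the": [0.1] * 800,   # 0.1 s at 8000 Hz
    "cat": [0.2] * 1600,  # 0.2 s at 8000 Hz
}
message = concatenate_words(["the", "cat"], recordings, pause_s=0.05)
duration_s = len(message) / SAMPLE_RATE
```

In this sketch the duration of the message is simply the sum of the
isolated-word durations plus the pauses. Nothing in the procedure shortens a
word, alters its boundaries to reflect its neighbors, or adjusts its pitch and
stress to fit the sentence, which is where the prosodic problems arise.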

Thus simple word concatenation schemes have severe limitations as audio
response units. In contrast, there are several newer techniques under development
that do not have these limitations. These techniques range from complex systems
for speech synthesis-by-rule (where a synthetic waveform is computed from a
knowledge of linguistic and acoustic rules), to relatively simple systems for
creating speech utterances by concatenating prerecorded speech waveform chunks
smaller than a word (using vocoder analysis-synthesis technology to gain
flexibility in reassembly).

Speech synthesis techniques have been reviewed in Flanagan and Rabiner
(1973), Klatt (1974), and Rabiner and Schafer (1976). We describe here some of
the techniques currently in use. Of particular interest are the criteria by
which one selects an inventory of basic speech units for utterance assembly, the
choice of a method of unit concatenation, and the specification of
sentence-level prosodic variables.

7.3 Synthesis techniques

The techniques covered in this section form messages from three kinds of basic
units: words, syllables and diphones, and phonemes.

7.3.1 Word assembly

7.3.1.1 Prerecorded words and phrases

Early methods of spoken message assembly used prerecorded words (or whole
phrases) that were concatenated into sentences (Hornsby, 1972; Chapman, 1971;
Buron, 1968). Brief pauses were in-
