From text to speech: The MITalk system

differences has led to clear improvement in intelligibility. At least one more itera-
tion of this procedure is needed. Furthermore, within the constraints imposed by
the synthesizer itself, matching of linear-prediction spectra is adequate to the task.

11.3 General rules for the synthesis of phonetic sequences

The rule program used in MITalk differs from the limited CV synthesis algorithm
described above. The MITalk phonetic component PHONET is patterned after a
Fortran-based synthesis-by-rule program described by Klatt (1976a). Since that
time, both the program structure and the constants contained in target tables for
each phone have been modified. These modifications were made in order to incor-
porate some of the new consonant-vowel synthesis rules described in the previous
section, and to simplify the rule structure.

The general procedure for drawing control parameter values is:

1. Draw the target value for the first segment.

2. Draw the target value for the next segment.

3. Smooth the boundary between the segments using one of the
templates shown in Figure 11-6 (note that DISCON does no
smoothing).

4. Go to step 2 unless there are no more segments.

The transition between target values for each control parameter may either be
discontinuous or smooth. The boundary value and transition duration in each
direction from the logical phoneme boundary are computed by rules that take into
account manner features of the segments involved.
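
The four-step procedure above can be sketched for a single control parameter as
follows. This is a minimal illustration and not the PHONET code: the function
names, the (target, duration) segment representation, the frame-based durations,
and the piecewise-linear template called "LINEAR" are all assumptions made for
the example. Only the name DISCON and the idea of a rule-supplied boundary value
and per-direction transition duration come from the text.

```python
from typing import Callable, List, Sequence, Tuple

# A segment is (target value, duration in frames). Both are assumptions here.
Segment = Tuple[float, int]
# A boundary rule returns (template name, boundary value, frames back, frames forward).
Rule = Callable[[float, float], Tuple[str, float, int, int]]


def draw_track(segments: Sequence[Segment], rule: Rule) -> List[float]:
    """Draw one control-parameter track from segment targets."""
    target, dur = segments[0]
    track = [float(target)] * dur                 # step 1: first target
    prev_target, prev_dur = target, dur

    for target, dur in segments[1:]:              # step 4: repeat per segment
        b = len(track)                            # logical phoneme boundary
        track.extend([float(target)] * dur)       # step 2: next target

        template, boundary, n_back, n_fwd = rule(prev_target, target)
        n_back = min(n_back, prev_dur)            # stay inside the two segments
        n_fwd = min(n_fwd, dur)

        if template != "DISCON":                  # step 3: DISCON does no smoothing
            # Hypothetical template: straight lines into and out of the
            # boundary value on either side of the boundary.
            for i in range(n_back):
                frac = (i + 1) / n_back
                track[b - n_back + i] = prev_target + (boundary - prev_target) * frac
            for i in range(n_fwd):
                frac = (i + 1) / n_fwd
                track[b + i] = boundary + (target - boundary) * frac

        prev_target, prev_dur = target, dur
    return track


def midpoint_rule(left: float, right: float) -> Tuple[str, float, int, int]:
    """Illustrative rule: boundary halfway between targets, 2 frames each way."""
    return ("LINEAR", (left + right) / 2, 2, 2)


def discon_rule(left: float, right: float) -> Tuple[str, float, int, int]:
    """Illustrative rule that always selects the DISCON template."""
    return ("DISCON", right, 0, 0)
```

In the actual system the rule would consult the manner features of the two
segments. Here the rule is simply passed in as a function so that the drawing
loop stays independent of how the boundary value and durations are chosen.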

11.3.1 Vowels

The control parameters that are usually varied to generate an isolated vowel are the
amplitude of voicing AV; the fundamental frequency of vocal fold vibrations F0;
the lowest three formant frequencies F1, F2, and F3; and bandwidths B1, B2, and
B3. The fourth and fifth formant frequencies might be varied to simulate spectral
details, but this is not essential for high intelligibility. To create a natural breathy
vowel termination, the amplitude of aspiration AH and the amplitude of quasi-
sinusoidal voicing AVS are activated.
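
As a rough sketch, the parameters named above can be grouped into one record per
synthesis frame, with AH and AVS left at zero except during a breathy
termination. The record layout, the helper name breathy_termination, and the
idea of switching the two amplitudes on over a fixed number of final frames are
assumptions for illustration. No numeric values are taken from Table 11-1, and
the levels are supplied by the caller.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class VowelFrame:
    """One frame of the control parameters usually varied for a vowel."""
    AV: float          # amplitude of voicing
    F0: float          # fundamental frequency
    F1: float          # formant frequencies
    F2: float
    F3: float
    B1: float          # formant bandwidths
    B2: float
    B3: float
    AH: float = 0.0    # amplitude of aspiration, off by default
    AVS: float = 0.0   # amplitude of quasi-sinusoidal voicing, off by default


def breathy_termination(frames: List[VowelFrame], n_tail: int,
                        ah_level: float, avs_level: float) -> List[VowelFrame]:
    """Return a copy with AH and AVS activated in the last n_tail frames."""
    n_tail = max(0, min(n_tail, len(frames)))
    split = len(frames) - n_tail
    tail = [replace(f, AH=ah_level, AVS=avs_level) for f in frames[split:]]
    return frames[:split] + tail
```

The frames are immutable, so the helper returns new records for the tail and
leaves the steady-state portion of the vowel untouched.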

Table 11-1 includes suggested target values for variable control parameters
that are used to differentiate among English vowels. Formant frequency and
bandwidth targets were obtained by trial-and-error spectral matching to a large set
of CV syllables spoken by talker DHK. Bandwidth values are often larger than
closed-glottis values obtained by Fujimura and Lindqvist (1971), because the
bandwidths of Table 11-1 have been adjusted to take into account changes to ob-
