
From text to speech: The MITalk system

y(nT) = A'x(nT) + B'x(nT-T) + C'x(nT-2T)    (4)

where x(nT-T) and x(nT-2T) are the previous two samples of the input x(nT), and
the constants A', B', and C' are defined by the equations:

A' = 1/A
B' = -B/A    (5)
C' = -C/A

where A, B, and C are obtained by inserting the antiresonance center frequency F
and bandwidth BW into Equation 2.
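
The antiresonator of Equations 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not the MITalk implementation. Equation 2 is not reproduced on this page, so resonator_coefficients below assumes the usual digital-resonator form, C = -exp(-2*pi*BW*T), B = 2*exp(-pi*BW*T)*cos(2*pi*F*T), A = 1 - B - C. The 10 kHz sampling rate and all function names are also assumptions made for the example.

```python
import math


def resonator_coefficients(f_hz, bw_hz, sample_rate_hz=10000.0):
    """Return (A, B, C) for a digital resonator.

    Assumed form of Equation 2 (not shown on this page).
    """
    t = 1.0 / sample_rate_hz  # sampling period T in seconds
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * f_hz * t)
    a = 1.0 - b - c
    return a, b, c


def antiresonator_coefficients(f_hz, bw_hz, sample_rate_hz=10000.0):
    """Return (A', B', C') as defined by Equation 5."""
    a, b, c = resonator_coefficients(f_hz, bw_hz, sample_rate_hz)
    return 1.0 / a, -b / a, -c / a


def antiresonator(samples, f_hz, bw_hz, sample_rate_hz=10000.0):
    """Apply Equation 4: y(nT) = A'x(nT) + B'x(nT-T) + C'x(nT-2T)."""
    a1, b1, c1 = antiresonator_coefficients(f_hz, bw_hz, sample_rate_hz)
    x1 = 0.0  # x(nT-T), previous input sample
    x2 = 0.0  # x(nT-2T), input sample before that
    out = []
    for x in samples:
        out.append(a1 * x + b1 * x1 + c1 * x2)
        x2, x1 = x1, x
    return out


if __name__ == "__main__":
    # Illustrative values only: antiresonance at 270 Hz, bandwidth 100 Hz.
    print(antiresonator([1.0, 0.0, 0.0, 0.0], 270.0, 100.0))
```

Because Equation 4 uses only current and past input samples, the impulse response is just the three coefficients A', B', C' followed by zeros. Under the assumed form of Equation 2, A' + B' + C' = 1, so a constant input passes through unchanged once the two-sample history has filled.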

12.1.8 Low-pass resonator

As a special case, the frequency F of a digital resonator can be set to zero, produc-
ing, in effect, a low-pass filter which has a nominal attenuation skirt of -12 dB per
octave of frequency increase and a 3-dB down break frequency equal to BW/2.
The voicing source contains a digital resonator RGP used as a low-pass filter that
transforms a glottal impulse into a pulse having a waveform and spectrum similar
to normal voicing. A second digital resonator, RGS, is used to low-pass filter the
normal voicing waveform to produce the quasi-sinusoidal glottal waveform seen
during the closure interval for an intervocalic voiced plosive.
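
The low-pass behavior of a resonator with F = 0 can be checked numerically. The sketch below is illustrative only. It assumes the standard second-order resonator recursion y(nT) = A x(nT) + B y(nT-T) + C y(nT-2T) and the usual coefficient formulas for Equation 2, neither of which appears on this page. It also assumes a 10 kHz sampling rate and a 100 Hz bandwidth chosen purely for the example. The function names are hypothetical.

```python
import cmath
import math


def resonator_coefficients(f_hz, bw_hz, sample_rate_hz=10000.0):
    """Return (A, B, C); assumed form of Equation 2 (not shown on this page)."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * f_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonator(samples, f_hz, bw_hz, sample_rate_hz=10000.0):
    """Assumed recursion: y(nT) = A x(nT) + B y(nT-T) + C y(nT-2T)."""
    a, b, c = resonator_coefficients(f_hz, bw_hz, sample_rate_hz)
    y1 = 0.0  # y(nT-T), previous output sample
    y2 = 0.0  # y(nT-2T), output sample before that
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def gain_db(freq_hz, f_hz, bw_hz, sample_rate_hz=10000.0):
    """Magnitude response of the resonator at freq_hz, in dB."""
    a, b, c = resonator_coefficients(f_hz, bw_hz, sample_rate_hz)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    h = a / (1.0 - b * z_inv - c * z_inv * z_inv)
    return 20.0 * math.log10(abs(h))


if __name__ == "__main__":
    # F = 0 turns the resonator into a low-pass filter.
    pulse = resonator([1.0] + [0.0] * 99, 0.0, 100.0)
    print("impulse response peak:", max(pulse))
    # Roll-off over one octave, well above the break frequency.
    print("dB change from 500 Hz to 1000 Hz:",
          gain_db(1000.0, 0.0, 100.0) - gain_db(500.0, 0.0, 100.0))
```

Feeding a single impulse through the F = 0 resonator yields a smooth one-sided pulse, which is the role the text assigns to RGP. Comparing the gain at two frequencies an octave apart gives a drop close to the nominal 12 dB.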

12.1.9 Synthesizer block diagram

A block diagram of the synthesizer is shown in Figure 12-6. There are 39 control
parameters that determine the characteristics of the output. The name and range of
values for each parameter are given in Table 12-1. As seen from the table, as
many as 22 of the 39 parameters are varied to achieve optimum matches to an ar-
bitrary English utterance. The constant parameters in Table 12-1 have been given
values appropriate for a particular male voice, and would have to be adjusted
slightly to approximate the speech of other male or female talkers. The list of vari-
able control parameters is long, compared with some synthesizers, but the em-
phasis here is on defining strategies for the synthesis of high-quality speech. We
are not concerned with searching for compromises that would minimize the infor-
mation content in the control parameter specification.

12.1.10 Sources of sound

There are two kinds of sound sources that may be activated during speech
production (Stevens and Klatt, 1974). One involves quasi-periodic vibrations of some
structure, usually the vocal folds. Vibration of the vocal folds is called voicing.
(Other structures, such as the lips, tongue tip, or uvula, may be made to vibrate in
certain sound types of some languages, but not in English.)

The second kind of sound source involves the generation of turbulence noise
