
The Klatt formant synthesizer

12.2 Vocal tract transfer functions

The acoustic characteristics of the vocal tract are determined by its cross-sectional
area as a function of distance from the larynx to the lips. The vocal tract forms a
nonuniform transmission line whose behavior can be determined for frequencies
below about 5 kHz by solving a one-dimensional wave equation (Fant, 1960).
(Above 5 kHz, three-dimensional resonance modes would have to be considered.)
Solutions to the wave equation result in a transfer function that relates samples of
the glottal source volume velocity to output volume velocity at the lips.
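As a rough illustration of what such a solution looks like, the sketch below treats the simplest special case: a lossless uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The function name, the 34,000 cm/s speed of sound, and the uniform-tube idealization are assumptions made for this example; a real vocal tract has a nonuniform area function and its resonances shift accordingly.

```python
def uniform_tube_resonances(length_cm, c_cm_per_s=34000.0, f_max_hz=5000.0):
    """Resonances (Hz) below f_max_hz of an idealized uniform tube.

    Assumes a lossless tube closed at one end (glottis) and open at the
    other (lips), so resonances sit at (2n - 1) * c / (4 * L).
    The speed of sound is an assumed round value, not taken from the text.
    """
    resonances = []
    n = 1
    while True:
        f = (2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
        if f >= f_max_hz:
            break
        resonances.append(f)
        n += 1
    return resonances


# A 17 cm tube under these assumptions
print(uniform_tube_resonances(17.0))
```

Under these assumptions, adjacent resonances are separated by c/(2L), so shortening the tube spreads them further apart and leaves fewer of them below 5 kHz.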

The synthesizer configuration in Figure 12-6 includes components to realize
two different types of vocal tract transfer function. The first, a cascade
configuration of digital resonators, models the resonant properties of the vocal
tract whenever the source of sound is within the larynx. The second, a parallel
configuration of digital resonators and amplitude controls, models the resonant
properties of the vocal tract during the production of frication noise. The
parallel configuration can also be used to model vocal tract characteristics for
laryngeal sound sources, although the approximation is not quite as good as in
the cascade model.
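The difference between the two configurations can be sketched in a few lines. The example below assumes the usual second-order digital resonator, y[n] = A x[n] + B y[n-1] + C y[n-2], with coefficients derived from a formant frequency and bandwidth; the 10 kHz sampling rate and all function names are assumptions made for illustration. It is a simplified sketch, not the synthesizer's actual code: in particular, the parallel branch here omits refinements such as alternating branch signs or input preprocessing.

```python
import math


def resonator_coeffs(f_hz, bw_hz, fs_hz=10000.0):
    """Coefficients (A, B, C) of a second-order digital resonator.

    fs_hz is an assumed sampling rate. A is chosen so the gain at 0 Hz is 1.
    """
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * f_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, coeffs):
    """Apply y[n] = A*x[n] + B*y[n-1] + C*y[n-2] to the sequence x."""
    a, b, c = coeffs
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a * xn + b * y1 + c * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out


def cascade(x, formants):
    """Cascade: the output of each resonator feeds the next one."""
    y = list(x)
    for f_hz, bw_hz in formants:
        y = resonate(y, resonator_coeffs(f_hz, bw_hz))
    return y


def parallel(x, formants, amplitudes):
    """Parallel: each resonator sees the same input; the outputs are
    scaled by per-formant amplitude controls and summed."""
    total = [0.0] * len(x)
    for (f_hz, bw_hz), amp in zip(formants, amplitudes):
        branch = resonate(x, resonator_coeffs(f_hz, bw_hz))
        total = [s + amp * v for s, v in zip(total, branch)]
    return total
```

In the cascade form the relative formant amplitudes follow automatically from the formant frequencies and bandwidths, whereas the parallel form needs an explicit amplitude for every branch.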

12.2.1 Cascade vocal tract model

Assuming that the one-dimensional wave equation is a valid approximation below
5 kHz, the vocal tract transfer function can be represented in the frequency domain
by a product of poles and zeros. Furthermore, the transfer function contains only
about five complex pole pairs and no zeros in the frequency range of interest, as
long as the articulation is nonnasalized and the sound source is at the larynx (Fant,
1960). The transfer function conforms to an all-pole model because there are no
side-branch resonators or multiple sound paths. (The glottis is partially open
during the production of aspiration so that the poles and zeros of the subglottal
system are often seen in aspiration spectra; the only way to approximate their
effects in the synthesizer is to increase the first formant bandwidth to about
300 Hz. The perceptual importance of the remaining spectral distortions caused by
the poles and zeros of the subglottal system is probably minimal.)

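To make the all-pole description concrete, the following sketch evaluates the frequency response of a product of second-order pole pairs, one per formant, with no zeros. The resonator coefficient formulas, the 10 kHz sampling rate, the function name, and the formant frequencies and bandwidths in the usage lines are assumptions chosen for illustration; only the 300 Hz first-formant bandwidth for aspiration comes from the text.

```python
import cmath
import math


def all_pole_response(freq_hz, formants, fs_hz=10000.0):
    """Complex response at freq_hz of a cascade of pole pairs.

    Each (formant_hz, bandwidth_hz) pair contributes one factor
    A / (1 - B*z^-1 - C*z^-2); there are no zeros in the product.
    fs_hz is an assumed sampling rate.
    """
    t = 1.0 / fs_hz
    z_inv = cmath.exp(-2j * math.pi * freq_hz * t)
    h = complex(1.0, 0.0)
    for f_hz, bw_hz in formants:
        c = -math.exp(-2.0 * math.pi * bw_hz * t)
        b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * f_hz * t)
        a = 1.0 - b - c
        h *= a / (1.0 - b * z_inv - c * z_inv * z_inv)
    return h


# Illustrative (hypothetical) five-formant settings: (frequency, bandwidth) in Hz
voiced = [(500, 60), (1500, 90), (2500, 150), (3500, 200), (4500, 200)]
# Same formants with the first bandwidth widened to about 300 Hz for aspiration
aspirated = [(500, 300)] + voiced[1:]

for label, formants in (("voiced", voiced), ("aspirated", aspirated)):
    gain_db = 20.0 * math.log10(abs(all_pole_response(500.0, formants)))
    print(label, "gain at 500 Hz (dB):", round(gain_db, 1))
```

Widening the first formant bandwidth lowers and broadens the first spectral peak, which is the only handle this model offers for imitating the subglottal coupling described above.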
Five resonators are appropriate for simulating a vocal tract with a length of
about 17 cm, the length of a typical male vocal tract, because the average spacing
between formants is equal to the velocity of sound divided by twice the tract length,
which works out to be 1000 Hz. A typical female vocal tract is 15 to 20 percent
shorter, suggesting that only four formant resonators be used to represent a female
voice in a 5 kHz simulation (or that the simulation should be extended to about 6
kHz). It is suggested that the voices of women and children be approximated by
setting the control parameter NFC to 4, thus removing the fifth formant from the
