From text to speech: The MITalk system

12.1.1 Software simulation vs. hardware construction
The advantages of a software implementation over the construction of special-
purpose analog hardware are substantial. The synthesizer does not need repeated
calibration, it is stable, and the signal-to-noise ratio (quantization noise in the case
of a digital simulation) can be made as large as desired. The configuration can
easily be changed as new ideas are proposed. For example, the voices of women
and children can be synthesized with appropriate modifications to the voicing
source and cascade vocal tract configuration. Graphic terminals are usually avail-
able in a computer facility and can be programmed to view control parameter data
or selected portions of the output speech waveform. Short-time spectra can also be
computed and displayed in order to make detailed spectral comparisons between
natural and synthetic waveforms.
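
As a rough illustration of why the signal-to-noise ratio of a digital simulation can be made as large as desired, the sketch below uses the standard textbook result for uniform quantization of a full-scale sine wave, where each added bit of word length buys about 6 dB. The function name and the word lengths shown are illustrative assumptions, not values taken from the text.

```python
import math


def quantization_snr_db(bits):
    """Theoretical SNR (dB) of a full-scale sine wave after uniform
    quantization to `bits` bits: 20*log10(2**bits) + 10*log10(1.5),
    i.e. roughly 6.02 * bits + 1.76 dB."""
    return 20.0 * math.log10(2.0 ** bits) + 10.0 * math.log10(1.5)


if __name__ == "__main__":
    # Hypothetical word lengths, chosen only to show the trend.
    for bits in (8, 12, 16):
        print(f"{bits:2d} bits: {quantization_snr_db(bits):6.2f} dB")
```

Under these assumptions the noise floor is set by the word length alone, so it can be lowered simply by carrying more bits through the computation.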

12.1.2 Formant synthesis vs. articulatory synthesis

Speech synthesizers fall into two broad categories: 1) articulatory synthesizers that
attempt to model faithfully the mechanical motions of the articulators, and the
resulting distributions of volume velocity and sound pressure in the lungs, larynx,
and vocal and nasal tracts (Flanagan et al., 1975), and 2) formant synthesizers
which derive an approximation to a speech waveform by a simpler set of rules for-
mulated in the acoustic domain. The present chapter is concerned only with for-
mant models of speech generation since current articulatory models require several
orders of magnitude more computation, and the resultant speech output cannot be
specified with sufficient precision for direct optimization of the rules by trial-and-
error comparisons with natural speech.

The synthesizer design is based on an acoustic theory of speech production
presented in Fant (1960), and is summarized in Figure 12-2. According to this
view, one or more sources of sound energy are activated by the build-up of lung
pressure. Treating each sound source separately, we may characterize it in the fre-
quency domain by a source spectrum S(f), where f is frequency in Hz. Each
sound source excites the vocal tract which acts as a resonating system analogous to
an organ pipe.

Since the vocal tract is a linear system, it can be characterized in the fre-
quency domain by a linear transfer function T(f), which is the ratio of lip-plus-
nose volume velocity U(f) to source input S(f). Finally, the spectrum of the
sound pressure that would be recorded some distance from the lips of the talker
P(f) is related to lip-plus-nose volume velocity U(f) by a radiation characteristic
R(f) that describes the effects of directional sound propagation from the head.
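
The frequency-domain relations just described, U(f) = S(f) T(f) and P(f) = U(f) R(f), can be sketched numerically. The following is a minimal illustration only: the shapes chosen for S(f) and R(f), the single two-pole resonance standing in for T(f), the 10 kHz sampling rate, and all function names are assumptions made for this example, not the synthesizer's actual components.

```python
import cmath
import math

SAMPLE_RATE_HZ = 10000.0  # assumed sampling rate for this sketch


def source_spectrum(f):
    """S(f): hypothetical source whose magnitude falls off above 100 Hz."""
    return 1.0 / (1.0 + (f / 100.0) ** 2)


def vocal_tract_transfer(f, formant_hz=500.0, bandwidth_hz=60.0):
    """T(f): one two-pole digital resonance standing in for the vocal tract
    (an assumed, simplified form with a single formant)."""
    t = 1.0 / SAMPLE_RATE_HZ
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * formant_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at f = 0
    z_inv = cmath.exp(-2j * math.pi * f * t)
    return a / (1.0 - b * z_inv - c * z_inv ** 2)


def radiation_characteristic(f):
    """R(f): hypothetical radiation term whose magnitude rises with f."""
    return f / 1000.0


def volume_velocity(f):
    """U(f) = S(f) T(f): lip-plus-nose volume velocity."""
    return source_spectrum(f) * vocal_tract_transfer(f)


def radiated_pressure(f):
    """P(f) = U(f) R(f) = S(f) T(f) R(f)."""
    return volume_velocity(f) * radiation_characteristic(f)


def db(x):
    """Magnitude in decibels (x must be nonzero)."""
    return 20.0 * math.log10(abs(x))


if __name__ == "__main__":
    for f in (300.0, 500.0, 700.0):
        print(f"{f:6.0f} Hz  |P(f)| = {db(radiated_pressure(f)):7.2f} dB")
```

Because the three terms multiply, their contributions in decibels simply add, which is why source, vocal tract, and radiation effects can be treated separately.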

Each of the above relationships can also be recast in the time (waveform)
