From text to speech: The MITalk system
12.1.1 Software simulation vs. hardware construction

The advantages of a software implementation over the construction of
special-purpose analog hardware are substantial. The synthesizer does not need
repeated calibration, it is stable, and the signal-to-noise ratio (quantization
noise in the case of a digital simulation) can be made as large as desired. The
configuration can easily be changed as new ideas are proposed. For example, the
voices of women and children can be synthesized with appropriate modifications
to the voicing source and cascade vocal tract configuration. Graphic terminals
are usually available in a computer facility and can be programmed to view
control parameter data or selected portions of the output speech waveform.
Short-time spectra can also be computed and displayed in order to make detailed
spectral comparisons between natural and synthetic waveforms.

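The claim that quantization noise can be made as small as desired can be illustrated with a short sketch. It is not part of the original text: the function names, the test-tone frequency, and the sample count are hypothetical, and the formula 6.02 N + 1.76 dB is the standard textbook estimate for a full-scale sine wave rounded by a uniform N-bit quantizer.

```python
import math


def quantize(x, bits):
    """Round x (nominally in [-1, 1]) to the nearest level of a uniform quantizer."""
    step = 2.0 / (2 ** bits)          # step size for 2**bits levels across [-1, 1]
    return step * round(x / step)


def measured_snr_db(bits, n_samples=20000):
    """Quantize a full-scale sine and return signal power over error power, in dB."""
    signal_power = 0.0
    noise_power = 0.0
    for n in range(n_samples):
        # Hypothetical test tone; the frequency is chosen so that the samples
        # do not repeat with a short period.
        x = math.sin(2.0 * math.pi * 0.01234567 * n)
        error = quantize(x, bits) - x
        signal_power += x * x
        noise_power += error * error
    return 10.0 * math.log10(signal_power / noise_power)


def theoretical_snr_db(bits):
    """Textbook estimate for a full-scale sine and a uniform quantizer."""
    return 6.02 * bits + 1.76


if __name__ == "__main__":
    for bits in (8, 12, 16):
        print(bits, round(measured_snr_db(bits), 1), round(theoretical_snr_db(bits), 1))
```

Each additional bit of word length buys roughly 6 dB, so the noise floor of a simulation is set by a choice of number format rather than by component tolerances.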
12.1.2 Formant synthesis vs. articulatory synthesis

Speech synthesizers fall into two broad categories: 1) articulatory synthesizers
that attempt to model faithfully the mechanical motions of the articulators, and
the resulting distributions of volume velocity and sound pressure in the lungs,
larynx, and vocal and nasal tracts (Flanagan et al., 1975), and 2) formant
synthesizers which derive an approximation to a speech waveform by a simpler set
of rules formulated in the acoustic domain. The present chapter is concerned
only with formant models of speech generation since current articulatory models
require several orders of magnitude more computation, and the resultant speech
output cannot be specified with sufficient precision for direct optimization of
the rules by trial-and-error comparisons with natural speech.

The synthesizer design is based on an acoustic theory of speech production
presented in Fant (1960), and is summarized in Figure 12-2. According to this
view, one or more sources of sound energy are activated by the build-up of lung
pressure. Treating each sound source separately, we may characterize it in the
frequency domain by a source spectrum S(f), where f is frequency in Hz. Each
sound source excites the vocal tract which acts as a resonating system analogous
to an organ pipe.

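The organ-pipe analogy can be made concrete with a small sketch. It is not from the original text and rests on stated assumptions: the vocal tract is idealized as a uniform lossless tube, closed at the glottis and open at the lips, with a hypothetical length of 17.5 cm and a speed of sound of 350 m/s. Such a tube resonates at odd multiples of c/(4L).

```python
def tube_resonances(length_m, speed_of_sound=350.0, count=3):
    """Resonance frequencies (Hz) of a uniform tube closed at one end.

    Assumes an idealized lossless tube; resonances fall at odd multiples
    of c / (4 L). Both default values are illustrative assumptions.
    """
    quarter_wave = speed_of_sound / (4.0 * length_m)
    return [(2 * n - 1) * quarter_wave for n in range(1, count + 1)]


if __name__ == "__main__":
    # Hypothetical 17.5 cm tract length.
    print([round(f) for f in tube_resonances(0.175)])  # [500, 1500, 2500]
```

These evenly spaced resonances correspond to a neutral, schwa-like vowel; changing the tube shape moves them, which is what the formant frequencies of a synthesizer are meant to capture.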
Since the vocal tract is a linear system, it can be characterized in the
frequency domain by a linear transfer function T(f), which is the ratio of
lip-plus-nose volume velocity U(f) to source input S(f). Finally, the spectrum
of the sound pressure that would be recorded some distance from the lips of the
talker P(f) is related to lip-plus-nose volume velocity U(f) by a radiation
characteristic R(f) that describes the effects of directional sound propagation
from the head.

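The relationships described above amount to U(f) = S(f) T(f) and P(f) = R(f) U(f), so that P(f) = S(f) T(f) R(f). The sketch below evaluates that product numerically. Everything concrete in it is an assumption made for illustration and is not taken from this passage: the 10 kHz sampling rate, the formant values, the simple shapes chosen for source() and radiation(), and the modeling of T(f) as a cascade of standard second-order digital resonators normalized to unity gain at 0 Hz.

```python
import cmath
import math

SAMPLE_RATE = 10000.0  # Hz; assumed value for this sketch


def resonator(f, center, bandwidth, fs=SAMPLE_RATE):
    """Complex response at f (Hz) of a second-order digital resonator.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    with a chosen so that the gain at 0 Hz is exactly 1.
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * center * t)
    a = 1.0 - b - c
    z1 = cmath.exp(-2j * math.pi * f * t)  # z^-1 evaluated on the unit circle
    return a / (1.0 - b * z1 - c * z1 * z1)


def vocal_tract(f, formants):
    """T(f): product of one resonator per (center, bandwidth) pair."""
    response = 1.0 + 0.0j
    for center, bandwidth in formants:
        response *= resonator(f, center, bandwidth)
    return response


def source(f):
    """S(f): hypothetical source falling about 12 dB/octave above 100 Hz."""
    return 1.0 / (1.0 + (f / 100.0) ** 2)


def radiation(f):
    """R(f): hypothetical radiation characteristic rising 6 dB/octave."""
    return f / 1000.0


def pressure(f, formants):
    """P(f) = R(f) * U(f), where U(f) = S(f) * T(f)."""
    volume_velocity = source(f) * vocal_tract(f, formants)
    return radiation(f) * volume_velocity


if __name__ == "__main__":
    # Illustrative formant frequencies and bandwidths in Hz.
    formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]
    for f in (250.0, 500.0, 1000.0, 1500.0, 2500.0):
        level_db = 20.0 * math.log10(abs(pressure(f, formants)))
        print(f, round(level_db, 1))
```

Because the three factors multiply, their contributions add when expressed in dB, which is why source, transfer function, and radiation can each be specified and adjusted separately.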
Each of the above relationships can also be recast in the time (waveform)