The Klatt formant synthesizer

the vibrations of the vocal folds. In addition, the vocal folds may vibrate without
meeting in the midline. In this type of voicing, the amplitude of higher-frequency
harmonics of the voicing source spectrum is significantly reduced and the
waveform looks nearly sinusoidal. Therefore, the synthesizer should be capable of
generating at least two types of voicing waveforms (normal voicing and
quasi-sinusoidal voicing), two types of frication waveforms (normal frication and
amplitude-modulated frication), and two types of aspiration (normal aspiration and
amplitude-modulated aspiration). These are the only kinds of sound sources required
for English, although trills and clicks of other languages may call for the
addition of other source controls to the synthesizer in the future.

12.1.11 Voicing source

The structure of the voicing source is shown at the top left in Figure 12-6. Variable
control parameters are used to specify the fundamental frequency of voicing
(F0), the amplitude of normal voicing (AV), and the amplitude of quasi-sinusoidal
voicing (AVS).

An impulse train corresponding to normal voicing is generated whenever F0
is greater than zero. The amplitude of each impulse is determined by AV, the
amplitude of normal voicing in dB. AV ranges from about 60 dB in a strong
vowel to 0 dB when the voicing source is turned off. Fundamental frequency is
specified in Hz; a value of F0=100 would produce a 100-Hz impulse train. The
number of samples between impulses, T0, is determined by SR/F0, where SR is the
sampling rate; e.g., for a sampling rate of 10,000 samples per second and a
fundamental frequency of 200 Hz, an impulse is generated every 50th sample.
Under some circumstances, the quantization of the fundamental period to an
integral number of samples might be perceived in a slow, prolonged fundamental
frequency transition as a sort of staircase of mechanical sounds (similar to the
rather unnatural speech one gets by setting F0 to a constant value in a synthetic
utterance). But the problem is not sufficiently serious to merit running the
source model of the synthesizer at a higher sampling rate. If desired, some
aspiration noise can be added to the normal voicing waveform to partially
alleviate the problem and create a somewhat breathy voice quality.
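
The impulse-train generation described above can be sketched in a few lines of
Python. This is only an illustrative sketch, not the synthesizer's actual code:
the function names are hypothetical, the period is assumed to be rounded to the
nearest integer number of samples (the text says only that it is determined by
SR/F0), and AV is assumed to map to linear amplitude on a plain 20*log10 scale
with 0 dB treated as "off".

```python
import math


def db_to_linear(db):
    """Assumed mapping: 20*log10 amplitude scale, with 0 dB (or less) meaning 'off'."""
    return 0.0 if db <= 0 else 10.0 ** (db / 20.0)


def quantized_period(f0_hz, sample_rate=10_000):
    """Samples between impulses, T0 = SR/F0, rounded to an integer (assumption)."""
    return max(1, round(sample_rate / f0_hz))


def quantized_f0(f0_hz, sample_rate=10_000):
    """Fundamental frequency actually produced once T0 is an integer."""
    return sample_rate / quantized_period(f0_hz, sample_rate)


def impulse_train(f0_hz, av_db, sample_rate=10_000, n_samples=400):
    """Return n_samples of an impulse train; all zeros when voicing is off."""
    out = [0.0] * n_samples
    if f0_hz <= 0 or av_db <= 0:
        return out  # no impulses unless F0 > 0 and AV > 0
    t0 = quantized_period(f0_hz, sample_rate)
    amp = db_to_linear(av_db)
    for n in range(0, n_samples, t0):
        out[n] = amp  # one impulse every T0 samples
    return out


if __name__ == "__main__":
    train = impulse_train(f0_hz=200, av_db=60)
    print([i for i, v in enumerate(train) if v != 0.0])
    # Requested values between the representable steps collapse onto them.
    for f0 in (200, 201, 202, 203, 204):
        print(f0, quantized_period(f0), quantized_f0(f0))
```

Under these assumptions, adjacent periods of 50 and 49 samples at a 10,000-Hz
sampling rate correspond to 200 Hz and roughly 204 Hz, which is the staircase
effect the text refers to for slow fundamental frequency glides.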

12.1.12 Normal voicing

Ignoring for the moment the effects of RGZ, we see that the train of impulses is
sent through a low-pass filter, RGP, to produce a smooth waveform that resembles
a typical glottal volume velocity waveform (Flanagan, 1958). The resonator fre-
quency FGP is set to 0 Hz and BGP to 100 Hz. The filtered impulses thus have a
spectrum that falls off smoothly at approximately -12 dB per octave above 50 Hz.
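
As a rough check on the roll-off figure, the low-pass filter RGP can be sketched
as a second-order digital resonator. The difference equation and coefficient
formulas below are the standard Klatt digital resonator form, which is assumed
here because this passage does not restate it; the function names and the
10,000-Hz sampling rate are likewise illustrative assumptions.

```python
import cmath
import math


def resonator_coeffs(f_hz, bw_hz, sample_rate=10_000):
    """Coefficients of y[n] = A*x[n] + B*y[n-1] + C*y[n-2] (assumed resonator form)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * f_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to unity
    return a, b, c


def resonate(x, a, b, c):
    """Run the input samples x through the resonator, starting from rest."""
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a * xn + b * y1 + c * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out


def gain_db(f_hz, a, b, c, sample_rate=10_000):
    """Magnitude response in dB at f_hz, evaluated on the unit circle."""
    z1 = cmath.exp(-2j * math.pi * f_hz / sample_rate)  # z^-1
    h = a / (1.0 - b * z1 - c * z1 * z1)
    return 20.0 * math.log10(abs(h))


if __name__ == "__main__":
    # RGP as described in the text: FGP = 0 Hz, BGP = 100 Hz.
    a, b, c = resonator_coeffs(0.0, 100.0)
    for f in (50, 100, 200, 400, 800):
        print(f, round(gain_db(f, a, b, c), 1))
    # Response to a single impulse: a smooth pulse rather than a spike.
    pulse = resonate([1.0] + [0.0] * 199, a, b, c)
    print(max(range(len(pulse)), key=pulse.__getitem__))
```

With FGP at 0 Hz the two poles coincide on the real axis, so the filter behaves
as two cascaded first-order low-pass sections, each contributing about 6 dB per
octave above a corner near BGP/2 = 50 Hz. Comparing the printed gains one octave
apart (for example 400 Hz and 800 Hz) should give a drop close to 12 dB.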
The waveform generated does not have the same phase spectrum as a typical glot-
