
Introduction

1.2 Synthesis techniques

With these constraints in mind, we can examine the various approaches to speech
output from computers. A great many techniques have been developed, but they
can be naturally grouped in an insightful way. Our purpose here is to create a context
in which text-to-speech conversion of unrestricted English text using
synthesis-by-rule can be considered. This comparison will permit us to highlight
the differences between the various approaches, and to compare system cost and
performance.

1.2.1 Waveform coding

The simplest strategy would be to merely record (either in digital or analog format)
the required speech. Depending on the technology used, this approach may introduce
access time delays, and will be limited in capacity by the recording medium
available, but the speech will generally be of high quality. No knowledge of the
human vocal apparatus or language structure is needed; these systems are a
straightforward match of the task requirements to the available storage technology.
Since memory size is the major limitation of these schemes, efforts have been
made to cut down the number of bits per sample used for digital storage. A variety
of techniques has been used, from simple delta modulation, through adaptive delta
modulation and adaptive differential PCM, to adaptive predictive coding which
can drop the required bit rate from over 50 Kbit/sec to under 10 Kbit/sec while still
retaining good quality speech. Simple coder/decoder circuits can be used for
recording and playback. When the message vocabulary is small and fixed, these
systems are attractive. But if messages must be concatenated, then it is extremely
difficult to produce good quality speech because aspects of the speech waveform
have been “bound” at recording time to the values appropriate for all message
situations which use the smaller constituent messages.
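
As a rough illustration of the simplest scheme named above, the following sketch encodes a signal with fixed-step delta modulation, storing one bit per sample instead of a multi-bit PCM value. The function names, sample rate, test tone, and step size are all hypothetical choices made for this example; they are not taken from any particular coder described in the text.

```python
import math

def delta_encode(samples, step):
    """Emit one bit per sample: 1 if the sample is at or above the running estimate, else 0."""
    bits = []
    estimate = 0.0
    for s in samples:
        bit = 1 if s >= estimate else 0
        # The encoder tracks the same estimate the decoder will rebuild.
        estimate += step if bit else -step
        bits.append(bit)
    return bits

def delta_decode(bits, step):
    """Rebuild a staircase approximation by stepping up or down on each bit."""
    estimate = 0.0
    out = []
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out

# Illustrative values only (assumptions for this sketch).
SAMPLE_RATE = 8000   # samples per second
TONE_HZ = 200        # test tone standing in for recorded speech
STEP = 0.2           # fixed step size, larger than the tone's per-sample change

samples = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(400)]
bits = delta_encode(samples, STEP)
decoded = delta_decode(bits, STEP)

# Largest gap between the original samples and the staircase reconstruction.
max_error = max(abs(a - b) for a, b in zip(samples, decoded))
# One bit per sample, so the bit rate equals the sample rate.
bit_rate = SAMPLE_RATE * 1
```

With a fixed step, the reconstruction lags whenever the signal changes faster than one step per sample, and it oscillates around flat stretches. The adaptive variants mentioned above address this by varying the step size with the signal.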

1.2.2 Parametric representation

In order to further lower the storage requirements, but also to provide needed
flexibility for concatenation of messages, several schemes have been developed
which “back up” from the waveform itself to a parametric representation in terms
of a model for speech production. These parameters may characterize salient information
in either the time or frequency domain. Thus, for example, the speech
waveform can be formed by summing up waveforms at several harmonics of the
pitch weighted by the spectral prominence at that frequency, a set of resonances
can be excited by noise or glottal waveforms, or the vocal tract shape can be simulated
along with appropriate acoustic excitation. As compared to waveform
coding, more computation is now required at playback time to recreate the speech