Introduction
1.2 Synthesis techniques

With these constraints in mind, we can examine the various approaches to speech
output from computers. A great many techniques have been developed, but they
can be grouped naturally in an insightful way. Our purpose here is to create a
context in which text-to-speech conversion of unrestricted English text using
synthesis-by-rule can be considered. This comparison will permit us to highlight
the differences between the various approaches, and to compare system cost and
performance.
1.2.1 Waveform coding

The simplest strategy would be to merely record (in either digital or analog format)
the required speech. Depending on the technology used, this approach may introduce
access time delays, and will be limited in capacity by the recording medium
available, but the speech will generally be of high quality. No knowledge of the
human vocal apparatus or language structure is needed; these systems are a
straightforward match of the task requirements to the available storage technology.
Since memory size is the major limitation of these schemes, efforts have been
made to cut down the number of bits per sample used for digital storage. A variety
of techniques has been used, from simple delta modulation, through adaptive delta
modulation and adaptive differential PCM, to adaptive predictive coding which
can drop the required bit rate from over 50 Kbit/sec to under 10 Kbit/sec while still
retaining good-quality speech. Simple coder/decoder circuits can be used for
recording and playback. When the message vocabulary is small and fixed, these
systems are attractive. But if messages must be concatenated, then it is extremely
difficult to produce good-quality speech, because aspects of the speech waveform
have been “bound” at recording time to the values appropriate for all message
situations which use the smaller constituent messages.
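
As an illustration of the simplest of the bit-reduction techniques named above,
the following is a minimal sketch of fixed-step delta modulation, which stores
one bit per sample (step up or step down) instead of a full sample value. The
function names, the step size, the sample rate, and the sine-wave test signal
are all assumptions chosen for the example; they are not taken from the text,
and practical coders (adaptive delta modulation, adaptive differential PCM) vary
the step size rather than fixing it.

```python
import math


def delta_encode(samples, step):
    """Encode each sample as one bit: 1 = estimate steps up, 0 = steps down."""
    bits = []
    estimate = 0.0  # running reconstruction, mirrored exactly by the decoder
    for s in samples:
        bit = 1 if s >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def delta_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    estimate = 0.0
    out = []
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


# Hypothetical test signal: a 200 Hz sine sampled at 8000 samples/sec.
RATE = 8000
STEP = 0.1  # assumed step size, larger than the signal's per-sample change
signal = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(400)]

bits = delta_encode(signal, STEP)
decoded = delta_decode(bits, STEP)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))

print(len(bits), "bits for", len(signal), "samples")
print("largest reconstruction error:", max_error)
```

Because the step here exceeds the largest change between adjacent samples, the
staircase never falls behind the signal and the error stays within one step. A
signal that changes faster than one step per sample would not be tracked this
well, which is one motivation for the adaptive variants.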
1.2.2 Parametric representation

To lower the storage requirements further, and also to provide the flexibility
needed for concatenation of messages, several schemes have been developed
which “back up” from the waveform itself to a parametric representation in terms
of a model for speech production. These parameters may characterize salient
information in either the time or frequency domain. Thus, for example, the speech
waveform can be formed by summing waveforms at several harmonics of the
pitch, each weighted by the spectral prominence at its frequency; a set of resonances
can be excited by noise or glottal waveforms; or the vocal tract shape can be
simulated along with appropriate acoustic excitation. As compared to waveform
coding, more computation is now required at playback time to recreate the speech