
From text to speech: The MITalk system

1.1.1 Task

The application task determines the nature of the speech capability that must be
provided. When only a small number of utterances is required, and these do not
have to be varied on line, then recorded speech can be used, but if the task is to
simulate the human cognitive process of reading aloud, then an entirely different
range of techniques is needed.

1.1.2 Human vocal apparatus

All systems must produce a speech waveform as output, but this waveform is not an
arbitrary signal. A great deal of effort has gone into the efficient and insightful
representation of the speech signal as the result of a signal source in the vocal
tract exciting the vocal tract “system function”, which acts as a filter to produce
the speech waveform. The human vocal tract also constrains the speed with which
signal changes can be made, and is responsible for much of the coarticulatory
smoothing or encoding that makes the relation between the underlying phonetic
transcription and the speech waveform so difficult to characterize.
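
The source-filter view described above can be illustrated with a minimal sketch.
It is not the synthesizer described in this book. It assumes a crude impulse-train
source and a cascade of two-pole digital resonators standing in for the vocal tract
system function. The function names, the 10 kHz sample rate, the 100 Hz pitch, and
the formant frequencies and bandwidths are all illustrative assumptions chosen to
resemble a neutral vowel.

```python
import math


def impulse_train(n_samples, sample_rate, f0):
    """Source (assumed form): one unit pulse per pitch period."""
    period = round(sample_rate / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, sample_rate, freq, bandwidth):
    """Filter: two-pole resonator, y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(n_samples=800, sample_rate=10000, f0=100,
               formants=((500, 60), (1500, 90), (2500, 150))):
    """Excite a cascade of formant resonators with the impulse-train source.

    The (frequency, bandwidth) pairs in Hz are hypothetical example values.
    """
    wave = impulse_train(n_samples, sample_rate, f0)
    for freq, bandwidth in formants:
        wave = resonator(wave, sample_rate, freq, bandwidth)
    return wave


if __name__ == "__main__":
    samples = synthesize()
    print(len(samples), max(abs(s) for s in samples))
```

Changing only the formant list alters the vowel quality while the source stays
fixed, which is the separation of source and system function that the text
describes. The smoothing imposed by each resonator's finite bandwidth is also a
small-scale analogue of the limit on how quickly the signal can change.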

1.1.3 Language structure

Just as the speech waveform is not arbitrarily derived, the myriad possible speech
gestures that could be related to a linguistic message are constrained by the nature
of the particular language structure involved. It has been consistently found that
those units and structures which linguists use to describe and explain language do
in fact provide the appropriate base in terms of which the speech waveform can be
characterized and constructed. Thus, basic phonological laws, stress rules,
morphological and syntactic structures, and phonotactic constraints all find their
use in determining the speech output.

1.1.4 Technology

Our ability to model and construct speech output devices is strongly conditioned
by the current (and past) technology. Speech science has profited greatly from a
variety of technologies, including x-rays, motion pictures, the sonograph, modern
filter and sampled-data theory, and most importantly the modern digital computer.
While early uses of computers were for off-line speech analysis and simulation, the
advent of increasingly capable integrated circuit technology has made it possible to
build compact, low-cost, real-time devices of great capability. It is this fact,
combined with our substantial knowledge of the algorithms needed to generate speech,
that has propelled the field of speech output from computers into the “real world”
of practical commercial systems suitable for a wide variety of applications.