Introduction

In this book, we describe a successful approach to the
conversion of unrestricted English text to speech. Before taking up the details of this
process, however, it is useful to place this task in context. Over the years, there
has been an increasing need for speech generated from computers. In part, this has
been due to the intrinsic nature of text, speech, and computing. Speech is certainly
the fundamental representation of language, present in all cultures (whether literate
or not), so if the computer and its human users are to communicate at all, speech
provides the most broadly useful modality, except for the needs of the deaf. While
text (considered as a string of conventional symbols)
is often considered to be more durable than speech and more reliably preserved,
this is in many ways a manifestation of relatively early progress in printing
technology, as opposed to the technology available for storing and manipulating
speech. Furthermore, text-based interaction with computers requires typing (and
often reading) skills which many potential users do not possess. So if the
increasingly ubiquitous computer is to be useful to the largest possible segment of society,
interaction with it via natural language, and in particular via speech, is certainly
necessary. That is, there has been a clear trend over the past 25 years for the computer to
bend increasingly to the needs of the user, and this accommodation must continue
if computers are to serve society at large. The present search for expressive
programming languages which are easy to use and not prone to error can be expected
to lead in part to natural language interaction as the means best suited to human
users, with speech as the most desirable mode of expression.

1.1 Constraints on speech synthesis

It is clear, then, that speech communication with computers is both needed and
desirable. Within the realm of speech output techniques, we can ask what the
nature of these techniques is, and how they are realized. To survey the
spectrum of such procedures, it is useful to consider them as the result of four
different constraints, which together determine a design space for all possible
speech output schemes. Each technique can then be seen as the result of decisions
about how to respond to each of the four constraint areas.