
14 Implementation

14.1 Conceptual organization

Throughout this book, emphasis has been placed on the representation of various
data forms and rules, together with transformations between these representations.
A strong effort has been made to exclude all reference to implementation concerns
from these discussions. At this point, however, it is appropriate to address these
issues, thus giving a view of the conceptual framework in which this research was
done, as well as a perspective on economically viable implementations that can
deliver the overall text-to-speech capability in real time. With these goals in mind,
we discuss first the overall conceptual organization of the MITalk system,
followed by a description of the development system used as a research vehicle over
the course of a dozen years, the requirements for a “performance system” suitable
for practical applications, and finally, a discussion of the current system, together
with examples, which serves as the basis for distribution of the MITalk system
from MIT.

The overall conceptual organization of the MITalk system can be viewed on
two levels. At the highest level, the system is viewed as an analysis/synthesis
system. It is based on the premise that in order to transform an input textual
representation (as a string of ASCII characters) to an output synthesized speech waveform,
it is necessary to first analyze the text into an underlying abstract linguistic
representation which can then be used as the initial basis for synthesizing the
waveform. In this sense, the text and speech waveform representations are seen as
two different surface representations of a common, underlying linguistic
representation which unites these two surface forms. Thus, the first part of the system is
oriented to transforming the input textual representation into a narrow phonetic
transcription which includes the names of the constituent phonemes, stress marks,
and syntactic boundaries at the syllable, morph, word, phrase, and sentence levels.
It is an implicit assumption of the system that this transcription is sufficient to
serve as the input for the synthesis routines which generate the timing framework,
the pitch contour, and the detailed control parameters (updated at 5 msec intervals)
which specify the nature of the vocal tract model, which in turn produces the final
output synthetic speech waveform.
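
The two-stage organization described above can be sketched in Python. This is
only an illustrative sketch, not the MITalk implementation. The names Segment,
Frame, analyze, and synthesize_frames, the stress encoding, the one-segment-per-letter
analysis, the fixed 80 msec segment duration, and the flat 120 Hz pitch are all
assumptions made for the example. Only the 5 msec update interval and the kinds
of information carried by the transcription come from the text.

```python
from dataclasses import dataclass
from typing import List

FRAME_MS = 5  # update interval for control parameters, as stated in the text


@dataclass
class Segment:
    """One unit of the intermediate phonetic transcription (hypothetical layout)."""
    phoneme: str        # placeholder symbol, not MITalk's phoneme inventory
    stress: int = 0     # assumed encoding: 0 = unstressed, 1 = stressed
    boundary: str = ""  # boundary after this segment, e.g. "word" or "sentence"


@dataclass
class Frame:
    """One set of control parameters, emitted every FRAME_MS milliseconds."""
    time_ms: int   # start time of the frame
    phoneme: str   # segment the frame belongs to
    f0_hz: float   # pitch value for this frame


def analyze(text: str) -> List[Segment]:
    """Toy analysis stage: one placeholder segment per letter.

    A real analysis stage would derive phonemes, stress, and boundaries
    from the text; this stand-in only marks word and sentence boundaries.
    """
    segments: List[Segment] = []
    words = text.lower().split()
    for w_index, word in enumerate(words):
        letters = [ch for ch in word if ch.isalpha()]
        for l_index, ch in enumerate(letters):
            boundary = ""
            if l_index == len(letters) - 1:
                boundary = "sentence" if w_index == len(words) - 1 else "word"
            stress = 1 if l_index == 0 else 0  # arbitrary placeholder rule
            segments.append(Segment(ch, stress, boundary))
    return segments


def synthesize_frames(segments: List[Segment],
                      duration_ms: int = 80,
                      f0_hz: float = 120.0) -> List[Frame]:
    """Toy synthesis stage: fixed duration and flat pitch for every segment."""
    frames: List[Frame] = []
    t = 0
    for seg in segments:
        for _ in range(duration_ms // FRAME_MS):
            frames.append(Frame(t, seg.phoneme, f0_hz))
            t += FRAME_MS
    return frames


segments = analyze("to be")
frames = synthesize_frames(segments)
print(len(segments), "segments ->", len(frames), "frames")
```

At a 5 msec update interval the synthesis stage emits 200 parameter frames per
second of speech, so the four placeholder segments above, at an assumed 80 msec
each, yield 64 frames. The point of the sketch is the interface: the transcription
is the only data passed from the analysis half to the synthesis half.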
