14 Implementation

14.1 Conceptual organization

Throughout this book, emphasis has been placed on the representation of various
data forms and rules, together with transformations between these representations.
A strong effort has been made to exclude all reference to implementation concerns
from these discussions. At this point, however, it is appropriate to address these
issues, thus giving a view of the conceptual framework in which this research was
done, as well as a perspective on economically viable implementations that can
deliver the overall text-to-speech capability in real time. With these goals in mind,
we discuss first the overall conceptual organization of the MITalk system, followed
by a description of the development system used as a research vehicle over
the course of a dozen years, the requirements for a “performance system” suitable
for practical applications, and finally, a discussion of the current system, together
with examples, which serves as the basis for distribution of the MITalk system
from MIT.

The overall conceptual organization of the MITalk system can be viewed on
two levels. At the highest level, the system is viewed as an analysis/synthesis
system. It is based on the premise that in order to transform an input textual
representation (as a string of ASCII characters) to an output synthesized speech
waveform, it is necessary to first analyze the text into an underlying abstract
linguistic representation which can then be used as the initial basis for synthesizing
the waveform. In this sense, the text and speech waveform representations are seen as
two different surface representations of a common, underlying linguistic
representation which unites these two surface forms. Thus, the first part of the
system is oriented to transforming the input textual representation into a narrow
phonetic transcription which includes the names of the constituent phonemes, stress
marks, and syntactic boundaries at the syllable, morph, word, phrase, and sentence
levels. It is an implicit assumption of the system that this transcription is
sufficient to serve as the input for the synthesis routines which generate the timing
framework, the pitch contour, and the detailed control parameters (updated at 5 msec
intervals) which specify the nature of the vocal tract model, which in turn produces
the final output synthetic speech waveform.
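
The analysis/synthesis split described above can be illustrated with a small
Python sketch. It is not the MITalk code: the function names, the "#" boundary
marker, the fixed 100 msec segment duration, and the linearly falling pitch are
all hypothetical placeholders chosen only to make the data flow visible. The
one value taken from the text is the 5 msec update interval for the control
parameters.

```python
from dataclasses import dataclass
from typing import List

FRAME_MS = 5         # control parameters are updated every 5 msec (from the text)
WORD_BOUNDARY = "#"  # hypothetical boundary marker, not MITalk's notation


@dataclass
class ControlFrame:
    time_ms: int   # start time of this frame
    symbol: str    # transcription symbol the frame belongs to
    f0_hz: float   # pitch-contour value for this frame


def analyze(text: str) -> List[str]:
    """Analysis half: text -> abstract transcription.

    Toy stand-in: each letter plays the role of a phoneme name and each
    word is closed by a boundary marker. The real analysis (morphs,
    letter-to-sound rules, stress, parsing) is not modeled here.
    """
    symbols: List[str] = []
    for word in text.split():
        symbols.extend(ch.upper() for ch in word if ch.isalpha())
        symbols.append(WORD_BOUNDARY)
    return symbols


def synthesize_parameters(symbols: List[str],
                          segment_ms: int = 100,
                          f0_start: float = 130.0,
                          f0_end: float = 100.0) -> List[ControlFrame]:
    """Synthesis half: transcription -> timing, pitch, per-frame parameters.

    Assumed placeholders: every segment lasts `segment_ms`, and pitch
    falls linearly from `f0_start` to `f0_end` over the utterance.
    """
    segments = [s for s in symbols if s != WORD_BOUNDARY]
    frames_per_segment = segment_ms // FRAME_MS
    total = len(segments) * frames_per_segment
    frames: List[ControlFrame] = []
    for i in range(total):
        fraction = i / (total - 1) if total > 1 else 0.0
        frames.append(ControlFrame(
            time_ms=i * FRAME_MS,
            symbol=segments[i // frames_per_segment],
            f0_hz=f0_start + (f0_end - f0_start) * fraction,
        ))
    return frames


def text_to_frames(text: str) -> List[ControlFrame]:
    """Chain the two halves; a vocal tract model would consume the frames."""
    return synthesize_parameters(analyze(text))
```

At a 5 msec update interval the synthesis half must deliver 200 parameter
frames per second of speech. Under the placeholder durations assumed here,
`text_to_frames("hi there")` yields 7 segments of 100 msec each, or 140 frames.
The point of the sketch is only that the transcription is the sole interface
between the two halves: the synthesis routines never see the original text.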