11

The phonetic component

11.1 Overview

The phonetic component, PHONET, accepts input from the fundamental frequency
component FOTARG in the form of an array of phonetic segment names, together
with a segmental stress feature, a segmental duration, and two fundamental
frequency targets for each phone. It produces output values for 20 synthesizer
control parameters every 5 msec. This chapter concerns the strategy for
phonetic-to-parametric rule development and summarizes the form and content of
the individual rules for control parameter specification.
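
The input and output described above can be pictured with a small data-structure
sketch. This is only an illustration, not the MITalk implementation: the names
`PhoneInput`, `frame_count`, and `empty_parameter_track`, the integer stress
encoding, and the segment symbols in the usage note are all assumptions. Only the
frame interval (5 msec) and the number of control parameters (20) come from the
text.

```python
from dataclasses import dataclass
from typing import List, Tuple

FRAME_MS = 5    # from the text: one set of output values every 5 msec
N_PARAMS = 20   # from the text: 20 synthesizer control parameters


@dataclass
class PhoneInput:
    """One entry of the array passed to the phonetic component (hypothetical layout)."""
    name: str                        # phonetic segment name
    stress: int                      # segmental stress feature (encoding assumed)
    duration_ms: int                 # segmental duration in msec
    f0_targets: Tuple[float, float]  # two fundamental frequency targets, in Hz


def frame_count(phones: List[PhoneInput]) -> int:
    """Number of whole 5-msec frames covered by the summed phone durations."""
    total_ms = sum(p.duration_ms for p in phones)
    return total_ms // FRAME_MS


def empty_parameter_track(phones: List[PhoneInput]) -> List[List[float]]:
    """Allocate one row of N_PARAMS zeros per frame; rules would fill these in."""
    return [[0.0] * N_PARAMS for _ in range(frame_count(phones))]
```

For instance, two phones lasting 60 and 180 msec (say
`PhoneInput("HH", 0, 60, (120.0, 118.0))` and
`PhoneInput("AY", 1, 180, (130.0, 100.0))`) span 240 msec, so the track holds 48
frames of 20 values each.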

11.1.1 “Stored prosodics” synthesis

The phonetic component PHONET and the synthesizer components can be operated in
a stand-alone mode in which the phonetic segment string, durations, and
fundamental frequency contour specification that form the input to PHONET are
hand-tuned to be as accurate as possible. For example, one might record a natural
version of a sentence, extract the fundamental frequency, measure segmental
durations, select phonetic segments according to the pronunciation used by the
real speaker, and format this information in a way that is compatible with PHONET
input. The advantage of this approach is the naturalness of the speech that can
be produced with an input representation consisting of about 250 bits per second
of speech.
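
A rough back-of-the-envelope calculation shows how a figure of this order can
arise. The sketch below is an assumption-laden illustration, not the authors'
derivation: the bit widths for stress, duration, and fundamental frequency
targets, and the rate of 10 phones per second, are hypothetical values chosen for
the example. Only the count of about 60 phonetic segment types is taken from this
chapter.

```python
import math


def bits_per_phone(n_segment_types: int = 60,  # "about 60" segment types (from the chapter)
                   stress_bits: int = 1,       # assumed width of the stress feature
                   duration_bits: int = 5,     # assumed duration quantization
                   f0_bits: int = 6) -> int:   # assumed width of each F0 target
    """Bits needed to code one phone under the assumed quantization."""
    segment_bits = math.ceil(math.log2(n_segment_types))  # 6 bits cover 60 types
    return segment_bits + stress_bits + duration_bits + 2 * f0_bits


def bit_rate(phones_per_second: float, **widths: int) -> float:
    """Input data rate in bits per second for a given phone rate."""
    return phones_per_second * bits_per_phone(**widths)
```

Under these assumed widths each phone costs 24 bits, so a speaking rate of about
10 phones per second gives roughly 240 bits per second, the same order as the
figure quoted above.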

This method of generating speech might be compared with the Texas Instruments
Speak-’N-Spell vocoder synthesizer. We suspect that the overall intelligibility
and naturalness of the MITalk “stored-prosodics” synthesis is slightly better at
250 bits/second than that of Speak-’N-Spell at 1200 bits/second. However, the
significant disadvantage of MITalk is that there is no automatic procedure for
determining the input parameter data for PHONET, whereas Speak-’N-Spell synthesis
can be prepared automatically from a linear-prediction vocoder analyzer with only
minimal selection and hand tuning.

11.1.2 Structure of PHONET

The phonetic component includes a large array of target values for various control
parameters for each of about 60 phonetic segment types. Smoothing between target
values depends on time constants computed by rule, as well as depending on