
11

The phonetic component

11.1 Overview

The phonetic component, PHONET, accepts input from the fundamental frequency
component FOTARG (an array of phonetic segment names together with, for each
phone, a segmental stress feature, a segmental duration, and two fundamental
frequency targets) and produces output values for 20 synthesizer control
parameters every 5 msec. This chapter describes the strategy for
phonetic-to-parametric rule development and summarizes the form and content of
the individual rules for control parameter specification.
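
As a rough illustration of the interface just described, the following Python
sketch models one input record per phone and counts the 5-msec output frames
that a phone string would span. The class and field names (PhoneInput,
f0_targets_hz, and so on), the segment names in the example, and the units
chosen for duration and fundamental frequency are assumptions made for
illustration; they are not the actual MITalk data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

FRAME_MS = 5      # PHONET produces one set of output values every 5 msec
NUM_PARAMS = 20   # number of synthesizer control parameters per frame


@dataclass
class PhoneInput:
    """One entry of the array passed from FOTARG to PHONET (hypothetical layout)."""
    name: str                           # phonetic segment name
    stress: int                         # segmental stress feature
    duration_ms: int                    # segmental duration (msec assumed)
    f0_targets_hz: Tuple[float, float]  # two F0 targets (Hz assumed)


def frame_count(phones: List[PhoneInput]) -> int:
    """Number of 5-msec frames needed to cover the whole phone string."""
    total_ms = sum(p.duration_ms for p in phones)
    return -(-total_ms // FRAME_MS)  # ceiling division


def empty_output(phones: List[PhoneInput]) -> List[List[float]]:
    """Allocate a frames-by-parameters table, with every value set to zero."""
    return [[0.0] * NUM_PARAMS for _ in range(frame_count(phones))]


# Illustrative input only: segment names and values are made up.
phones = [
    PhoneInput("AA", 1, 120, (110.0, 105.0)),
    PhoneInput("T", 0, 65, (105.0, 100.0)),
]
frames = empty_output(phones)  # 185 msec of speech -> 37 frames of 20 values
```

The sketch only fixes the shape of the data: a short per-phone description on
the input side, and a dense table of 20 values per 5-msec frame on the output
side. Filling that table is the job of the rules summarized in this chapter.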

11.1.1 “Stored prosodics” synthesis

The phonetic component PHONET and synthesizer components can be operated in
stand-alone mode in which the phonetic segment string, durations, and
fundamental frequency contour specification that form the input to PHONET are
hand-tuned to be as accurate as possible. For example, one might record a
natural version of a sentence, extract fundamental frequency, measure segmental
durations, select phonetic segments according to the pronunciation used by the
real speaker, and format this information in a way that is compatible with
PHONET input. The advantage of this approach is the naturalness of the speech
that can be produced with an input representation consisting of about 250 bits
per second of speech.
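
The figure of about 250 bits per second can be made plausible with simple
arithmetic. The Python sketch below computes the bit rate of a per-phone record
from its field widths and a speaking rate. Every number supplied is an
assumption chosen for illustration: 6 bits is enough to index roughly 60
segment types, but the remaining field widths and the rate of 12 phones per
second are guesses, not values taken from MITalk.

```python
def bits_for(n_values: int) -> int:
    """Smallest number of bits that can index n_values distinct values."""
    return max(1, (n_values - 1).bit_length())


def bit_rate(bits_per_phone: int, phones_per_second: float) -> float:
    """Bits per second for a stream of fixed-size per-phone records."""
    return bits_per_phone * phones_per_second


# Illustrative field widths (assumptions, not MITalk's actual encoding).
segment_bits = bits_for(60)  # about 60 segment types -> 6 bits
stress_bits = 1              # assumed binary stress feature
duration_bits = 5            # assumed 32 duration steps
f0_bits = 2 * 5              # two F0 targets, assumed 5 bits each

bits_per_phone = segment_bits + stress_bits + duration_bits + f0_bits
rate = bit_rate(bits_per_phone, phones_per_second=12)  # assumed speaking rate
```

With these assumed values the record is 22 bits per phone and the estimate is
264 bits per second, the same order of magnitude as the figure quoted above.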

This method of generating speech might be compared with the Texas
Instruments’ Speak-’N-Spell vocoder synthesizer. We suspect that the overall
intelligibility and naturalness of the MITalk “stored-prosodics” synthesis are
slightly better at 250 bits/second than those of Speak-’N-Spell at 1200
bits/second. However, the significant disadvantage of MITalk is that there is
no automatic procedure for determining the input parameter data for PHONET,
whereas Speak-’N-Spell synthesis can be prepared automatically from a
linear-prediction vocoder analyzer with only minimal selection and hand tuning.

11.1.2 Structure of PHONET

The phonetic component includes a large array of target values for various
control parameters for each of about 60 phonetic segment types. Smoothing
between target values depends on time constants computed by rule, as well as
depending on
