10

The fundamental frequency generator

10.1 Overview

An important component in the generation of natural-sounding speech is the fun-
damental frequency of the voicing source. Such attributes as syntactic structure,
emphasis, and sentence type can be partially signaled by the fundamental
frequency (F0) contour as well as by duration and amplitude information. In the F0
algorithm used with the text-to-speech system, information from both syntactic and
phonologic components is used. The algorithm uses the phrase structure of each sentence as
analyzed by the parser to determine declination lines, to calculate the amount of
excursion from the declination line through each phrase, and to insert continuation
rises. Lexical stress marks and syllable division are used to determine the location
of F0 peaks, and parts of speech provide information needed to determine the
relative height of the peaks. Phonemic data provide the information needed to
determine segmental influences on fundamental frequency. These influences produce
an active variation in peaks and valleys, thus yielding a lively contour
(O’Shaughnessy, 1976).
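
The idea of a declination line with excursions on stressed syllables can be
illustrated with a minimal Python sketch. Everything specific here is an
assumption made for illustration and is not taken from the algorithm itself:
the function names, the linear shape of the baseline, the start and end
frequencies, the excursion size, and the rule that function words receive half
the excursion of content words.

```python
def declination(t, duration, start_hz=130.0, end_hz=100.0):
    """Hypothetical linearly falling baseline F0 (Hz) at time t in a phrase."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp the relative position to the phrase boundaries.
    frac = min(max(t / duration, 0.0), 1.0)
    return start_hz + (end_hz - start_hz) * frac


def peak_f0(t, duration, stressed, content_word, excursion_hz=30.0):
    """Baseline plus an assumed excursion on stressed syllables.

    Part of speech is reduced here to a content/function distinction:
    content words get the full excursion, function words half of it.
    """
    base = declination(t, duration)
    if not stressed:
        return base
    scale = 1.0 if content_word else 0.5
    return base + scale * excursion_hz
```

With these assumed values, a stressed content-word syllable halfway through a
2-second phrase sits 30 Hz above a 115 Hz baseline, while an unstressed
syllable at the same point stays on the baseline.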

The algorithm currently in use produces two F0 “target values” for each
phonetic segment, one to be used at onset and one as a mid-value. This is an
adaptation of the original O’Shaughnessy algorithm, which produces a value every 5
msec. The production of target values allows a more uniform treatment of
parameters, since interpolation for F0 hereafter may be handled in the same way as
for most of the other parameters. The data rate is also lower, since one or two
values per segment replace the one value every 5 msec that was previously
required. The rises and falls which are calculated for each segment
are used to specify the target values, the peak point at either the left or right bound-
ary of stressed vowels in content words, and the midpoint target value for other
segments. Other midpoint values are determined by interpolation.
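
The relation between per-segment targets and a 5-msec contour can be sketched
as follows. This is only an illustration under stated assumptions: the
mid-value is placed at the temporal midpoint of its segment, interpolation
between targets is linear, the last target is held to the end of the
utterance, and the tuple layout and function names are hypothetical.

```python
from bisect import bisect_right


def breakpoints(segments):
    """Turn (duration_ms, onset_hz, mid_hz) tuples into (time_ms, hz) points."""
    points = []
    start = 0.0
    for duration_ms, onset_hz, mid_hz in segments:
        points.append((start, onset_hz))                     # onset target
        points.append((start + duration_ms / 2.0, mid_hz))   # mid target
        start += duration_ms
    return points, start


def f0_track(segments, frame_ms=5.0):
    """Sample a linearly interpolated F0 value at every frame."""
    points, total_ms = breakpoints(segments)
    times = [t for t, _ in points]
    track = []
    for i in range(int(total_ms // frame_ms)):
        t = i * frame_ms
        j = bisect_right(times, t) - 1       # last breakpoint at or before t
        if j >= len(points) - 1:
            track.append(points[-1][1])      # hold the final target
            continue
        (t0, f0), (t1, f1) = points[j], points[j + 1]
        track.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
    return track
```

For two segments lasting 100 and 50 msec, four stored target values expand
into 30 frame values, which shows the saving in data rate described above.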

The fundamental frequency generation program accepts syntactic information
from PARSER (discussed in Chapter 4) and phonemic information from PROSOD
(discussed in Chapter 9) in the form of a PROSOD output file. Its output is an
augmented PROSOD file containing the two target values for each segment.
