10

The fundamental frequency generator

10.1 Overview

An important component in the generation of natural-sounding speech is the fundamental frequency of the voicing source. Attributes such as syntactic structure, emphasis, and sentence type can be partially signaled by the fundamental frequency (F0) contour, as well as by duration and amplitude information. The F0 algorithm used with the text-to-speech system draws on information from both the syntactic and phonologic components. It uses the phrase structure of each sentence, as analyzed by the parser, to determine declination lines, to calculate the amount of excursion from the declination line through each phrase, and to insert continuation rises. Lexical stress marks and syllable division are used to determine the location of F0 peaks, and parts of speech provide the information needed to determine the relative height of the peaks. Phonemic data provide the information needed to determine segmental influences on fundamental frequency. These influences produce an active variation in peaks and valleys, thus yielding a lively contour (O’Shaughnessy, 1976).

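As a rough illustration of how a declination line, stress-driven excursions, and a continuation rise might combine, the following Python sketch computes one F0 value per syllable. It is a toy model and not the O’Shaughnessy algorithm: the `Syllable` record, the `pos_weight` field, the linear shape of the baseline, and every frequency value are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    time_ms: float     # syllable centre, measured from the start of the phrase
    stressed: bool     # True if the syllable carries lexical stress
    pos_weight: float  # hypothetical 0..1 weight derived from part of speech


def declination_hz(time_ms, phrase_ms, start_hz=130.0, end_hz=100.0):
    """Baseline falling linearly across the phrase (assumed shape and values)."""
    return start_hz + (end_hz - start_hz) * time_ms / phrase_ms


def syllable_f0(syllables, phrase_ms, excursion_hz=40.0, continuation_rise_hz=0.0):
    """Return one F0 value per syllable: baseline plus any stress excursion."""
    values = []
    for syl in syllables:
        f0 = declination_hz(syl.time_ms, phrase_ms)
        if syl.stressed:
            # Peak height above the baseline is scaled by the part-of-speech weight.
            f0 += excursion_hz * syl.pos_weight
        values.append(f0)
    if values and continuation_rise_hz:
        # A non-final phrase ends with a rise on its last syllable.
        values[-1] += continuation_rise_hz
    return values


phrase = [
    Syllable(0.0, False, 0.0),
    Syllable(500.0, True, 1.0),
    Syllable(1000.0, False, 0.0),
]
print(syllable_f0(phrase, phrase_ms=1000.0))
print(syllable_f0(phrase, phrase_ms=1000.0, continuation_rise_hz=15.0))
```

In this sketch the stressed middle syllable rises 40 Hz above a baseline that has fallen to 115 Hz at that point, and the optional continuation rise lifts only the final syllable. Segmental influences from the phonemic data are left out entirely.
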
The algorithm currently in use produces two F0 “target values” for each phonetic segment, one to be used at onset and one as a mid-value. This is an adaptation of the original O’Shaughnessy algorithm, which produces a value every 5 msec. The production of target values allows a more uniform treatment of parameters, since F0 interpolation can then be handled in the same way as for most of the other parameters. It also permits a lower data rate, since one or two values per segment replace the previous requirement of one value every 5 msec. The rises and falls calculated for each segment are used to specify the target values, the peak point at either the left or right boundary of stressed vowels in content words, and the midpoint target value for other segments. Other midpoint values are determined by interpolation.

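The relation between per-segment targets and a 5 msec frame contour can be sketched as follows. This is a minimal Python illustration that assumes simple linear interpolation between targets and holds the last target to the end of the utterance. The text does not specify the interpolation rule, so that choice, the `Segment` record, and the sample values are all assumptions.

```python
from dataclasses import dataclass

FRAME_MS = 5  # frame period of the original one-value-per-frame scheme


@dataclass
class Segment:
    name: str          # phonetic segment label (illustrative)
    duration_ms: float  # segment duration, assumed positive
    f0_onset: float    # F0 target at segment onset, in Hz
    f0_mid: float      # F0 target at segment midpoint, in Hz


def breakpoints(segments):
    """Return ((time_ms, f0_hz) pairs at each onset and midpoint, total duration)."""
    points = []
    t = 0.0
    for seg in segments:
        points.append((t, seg.f0_onset))
        points.append((t + seg.duration_ms / 2, seg.f0_mid))
        t += seg.duration_ms
    return points, t


def interpolate_f0(segments, frame_ms=FRAME_MS):
    """Expand the targets into one F0 value per frame by linear interpolation."""
    points, total_ms = breakpoints(segments)
    contour = []
    i = 0
    t = 0.0
    while t < total_ms:
        # Move to the pair of breakpoints that brackets the current frame time.
        while i + 1 < len(points) - 1 and points[i + 1][0] <= t:
            i += 1
        (t0, f0), (t1, f1) = points[i], points[i + 1]
        if t >= t1:
            contour.append(f1)  # past the final target: hold its value
        else:
            contour.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
        t += frame_ms
    return contour


segments = [Segment("AA", 100, 120.0, 130.0), Segment("T", 50, 125.0, 115.0)]
contour = interpolate_f0(segments)
print(len(contour), "frame values reconstructed from", 2 * len(segments), "targets")
```

For these two hypothetical segments, four stored targets stand in for thirty frame values, which is the data-rate saving described above.
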
The fundamental frequency generation program accepts syntactic information from PARSER (discussed in Chapter 4) and phonemic information from PROSOD (discussed in Chapter 9) in the form of a PROSOD output file. Its output is an augmented PROSOD file containing the two target values for each segment.