From text to speech: The MITalk system

were given the label “function” are elevated to “content” importance in the F0
algorithm. These are:

• Demonstrative pronouns (this, those)
• Contractions (we’ll, boys’ll)
• Modals (should, might, will, can)
• Quantifiers (several, many)
• Interrogative adjectives (which, whose)
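
The elevation rule in the list above can be sketched as a simple lookup. This is a minimal illustration, not the MITalk implementation: the class labels, the name `ELEVATED_CLASSES`, and the function `f0_importance` are all hypothetical, since the text names the word categories but not how they are encoded.

```python
# Hypothetical labels for the word classes listed in the text; the
# actual encoding used by the system is not given in the source.
ELEVATED_CLASSES = {
    "demonstrative_pronoun",
    "contraction",
    "modal",
    "quantifier",
    "interrogative_adjective",
}


def f0_importance(word_class: str, assigned_label: str) -> str:
    """Return the importance ("content" or "function") used for F0 purposes.

    A word labeled "function" is promoted to "content" when its class
    is one of the elevated classes; every other word keeps its label.
    """
    if assigned_label == "function" and word_class in ELEVATED_CLASSES:
        return "content"
    return assigned_label
```

For example, a modal labeled "function" comes back as "content", while an article labeled "function" is unchanged.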

The F0 algorithm requires a specification of the number of syllables in each
word, the location of the stressed syllable within the word, and information
concerning syllable boundaries. This information is found in the PROSOD output file.
The phonemic information in this file is also used to specify a structure for each
syllable. This structure is an allowable ordering of voiced or unvoiced obstruents,
sonorants, and a single vowel.
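
One way to picture such a syllable structure is as a string of class codes checked against a pattern. The sketch below rests on assumptions that are not in the text: the phoneme table is a small illustrative set (not the MITalk inventory), the codes `U`, `B`, `S`, `V` are invented here, and the ordering "obstruents, sonorants, vowel, sonorants, obstruents" is only one plausible reading of "allowable ordering".

```python
import re

# Hypothetical phoneme-to-class table (illustrative symbols only).
# U = unvoiced obstruent, B = voiced obstruent, S = sonorant, V = vowel.
PHONEME_CLASS = {
    "p": "U", "t": "U", "k": "U", "s": "U",
    "b": "B", "d": "B", "g": "B", "z": "B",
    "m": "S", "n": "S", "l": "S", "r": "S",
    "a": "V", "i": "V", "u": "V",
}

# Assumed ordering: obstruents, sonorants, exactly one vowel,
# sonorants, obstruents. The source does not spell the ordering out.
ALLOWABLE = re.compile(r"[UB]*S*VS*[UB]*")


def syllable_structure(phonemes):
    """Map a list of phoneme symbols to a string of class codes."""
    return "".join(PHONEME_CLASS[p] for p in phonemes)


def is_allowable(structure):
    """True if the class string matches the assumed ordering."""
    return ALLOWABLE.fullmatch(structure) is not None
```

Under these assumptions, the phonemes `s t r i n d` map to `UUSVSB`, which is accepted, while a string with two vowels or with no vowel is rejected.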

10.3 Output

There are two possible output files. One file is a stream of fundamental frequency
values, one value for each 5 msec of the utterance. This file can be merged with
the output of PHONET (discussed in Chapter 11), which gives values of the 20
variable parameters each 5 msec. The F0 values are calculated by determining the
changes in F0 during a syllable and using the duration of the segments within the
syllable to describe a contour with constant slope (absolute value).

A second method, the one currently in use, is to calculate rises and falls on
each segment (an intermediate stage in the former method) and to use this
information to specify F0 target values for the midpoint of each segment and for the
peak point at either the left or right boundary of stressed vowels in content words.
Unspecified onset values for segments are determined by linear interpolation
between their midpoint target value and the midpoint target value of the preceding
segment. This method allows F0 values to be calculated every 5 msec using the
same linear smoothing procedure which is used for some of the other parameters,
modified slightly by the addition of the possible extra target value as input.
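
The interpolation step can be illustrated with a short sketch. It assumes that targets are given as (time in msec, F0 in Hz) pairs sorted by time with distinct times, that the optional peak target is simply one more pair in that list, and that F0 is held constant before the first target and after the last. The names `FRAME_MS` and `interpolate_f0` are hypothetical, and the actual smoothing procedure shared with the other parameters may differ in detail.

```python
FRAME_MS = 5  # one F0 value per 5 msec, as stated in the text


def interpolate_f0(targets, total_ms):
    """Linearly interpolate F0 between targets, one value per frame.

    targets: list of (time_ms, f0_hz) pairs sorted by time; midpoint
        targets plus any extra peak target (assumed representation).
    total_ms: utterance length in msec.
    """
    if len(targets) < 2:
        raise ValueError("need at least two targets to interpolate")
    values = []
    i = 0  # index of the target at or before the current frame
    for t in range(0, total_ms, FRAME_MS):
        # Advance to the pair of targets that brackets time t.
        while i + 1 < len(targets) - 1 and targets[i + 1][0] <= t:
            i += 1
        (t0, f0), (t1, f1) = targets[i], targets[i + 1]
        if t <= t0:
            values.append(f0)  # at or before the left target
        elif t >= t1:
            values.append(f1)  # at or after the right target
        else:
            values.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
    return values
```

With targets `[(0, 100), (20, 120), (40, 110)]` and a 45 msec utterance, the sketch yields nine frames rising from 100 to 120 and then falling to 110.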

Most peaks are assigned to the right boundary of the stressed vowel in a
content word. A fall (and possible continuation rise) following the rise which forms
the peak is then assigned to the midpoint or right boundary of the following
segment, absorbing any fall or rise that might previously have been assigned to that
segment. A peak is assigned to the left boundary of a “nuclear-stressed” syllable,
i.e., the stressed syllable in the final content word of a phrase preceding a silence.
Preceding unassigned rises or falls are absorbed in the assignment of the peak.
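
The choice of boundary for the peak can be summarized in a few lines. This sketch covers only the two cases stated above; the text says "most" peaks go to the right boundary, so other exceptions may exist, and the absorption of neighboring rises and falls is not modeled. The function name and its boolean arguments are hypothetical.

```python
def peak_boundary(is_content_word, is_stressed_vowel, is_nuclear_stressed):
    """Return "left", "right", or None for a vowel's F0 peak placement.

    Only stressed vowels in content words receive a peak. The peak sits
    at the left boundary for a nuclear-stressed syllable and at the
    right boundary otherwise (the default case described in the text).
    """
    if not (is_content_word and is_stressed_vowel):
        return None
    return "left" if is_nuclear_stressed else "right"
```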