Introduction

95 percent of the input text (consisting of high-frequency, foreign, and polysyllabic
words) can be transcribed to phonetic notation. For rare or new words, plus
misspellings (e.g. “recieve”), letter-to-phonetic segment rules are used.
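
The lookup-then-fallback strategy described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the lexicon entries, the phonetic symbols, and the one-letter-per-segment rules are all hypothetical placeholders, and real letter-to-segment rules are context-sensitive.

```python
# Hypothetical toy lexicon; entries and phonetic symbols are illustrative only.
LEXICON = {
    "the": "DH AH",
    "cat": "K AE T",
}

# Hypothetical context-free letter rules, used only when lexicon lookup fails.
LETTER_RULES = {
    "c": "K", "e": "EH", "i": "IH", "r": "R", "v": "V",
}


def transcribe(word):
    """Return (transcription, source), where source is 'lexicon' or 'rules'."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    # Fallback: map each letter that has a rule; letters without one are skipped.
    segments = [LETTER_RULES[ch] for ch in key if ch in LETTER_RULES]
    return " ".join(segments), "rules"
```

A misspelling such as "recieve" is absent from the lexicon, so it takes the rule path and still receives some transcription rather than failing.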

1.3.1.3 Lexical stress The effects of suffixes, as well as those of compounding, on
lexical stress are computed, permitting the use of both stress marks in the
transcription and changes in vowel color.

1.3.1.4 Phonological recoding Once the initial phonetic transcription is obtained,
some recoding is done based on the sentence-level context, including consonant
“flapping”, insertion of glottal stops, and selection of alternate pronunciations
of “the”.
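
As an illustration of context-dependent recoding, the sketch below selects between two pronunciations of "the" according to whether the next word begins with a vowel. The phoneme symbols, the vowel set, and the function name are assumptions made for this example; the rule shown is the familiar English alternation, not necessarily the exact rule used by the system.

```python
# Hypothetical vowel inventory in an ARPAbet-like notation (illustrative only).
VOWELS = {"AA", "AE", "AH", "EH", "IH", "IY", "OW", "UW"}


def recode_the(words):
    """words: list of (spelling, phoneme_list). Returns a recoded copy."""
    out = []
    for i, (spelling, phones) in enumerate(words):
        if spelling.lower() == "the" and i + 1 < len(words):
            next_phones = words[i + 1][1]
            if next_phones and next_phones[0] in VOWELS:
                phones = ["DH", "IY"]   # before a vowel-initial word
            else:
                phones = ["DH", "AH"]   # before a consonant-initial word
        out.append((spelling, list(phones)))
    return out
```

The choice depends on the following word, which is why this step can only run after the whole sentence has an initial transcription.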

1.3.1.5 Parsing To aid the selection of prosodic correlates, a phrase-level parsing
is performed. Also, a part-of-speech determination for each word is computed
to provide input for the parser.

1.3.1.6 Semantic analysis Only those semantic effects due to particular lexical
items, such as negatives, are found, but these have important effects on pitch.

1.3.2 Synthesis of speech

1.3.2.1 Timing Prepausal lengthening, pause duration, and polysyllabic shortening
are determined, plus the basic duration of each segment and the effect of
clusters.
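
One way such duration effects can be combined is to scale only the compressible part of a segment's basic duration, so that no amount of shortening drives it below a floor. The sketch below assumes this formulation; the function name, the parameter names, and all numeric values are illustrative and are not taken from this page.

```python
def segment_duration(inherent_ms, min_ms, percents):
    """Scale the stretchable part of a segment's duration.

    inherent_ms: basic duration of the segment (assumed value).
    min_ms:      floor below which the segment is not shortened (assumed).
    percents:    one percentage per applicable rule, e.g. 140 for a
                 lengthening rule or 80 for a shortening rule (assumed).
    """
    scale = 1.0
    for p in percents:
        scale *= p / 100.0
    return min_ms + (inherent_ms - min_ms) * scale
```

For example, a segment with a basic duration of 100 msec and a floor of 40 msec comes out near 124 msec under a single 140 percent lengthening rule, and near 88 msec under a single 80 percent shortening rule.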

1.3.2.2 Fundamental frequency A declination line is found, plus pitch rises on
stressed syllables, continuation rises to signal that the utterance continues, and
a number of segmental effects. Contours appropriate to questions are also found.
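
A declination line with superimposed rises on stressed syllables can be sketched as below. The start and end frequencies, the size of the rise, and the rectangular shape of the rise are assumptions chosen for readability; the 5 msec frame step follows the update rate given later in this section.

```python
def f0_contour(duration_ms, stressed_spans,
               start_hz=130.0, end_hz=100.0, rise_hz=20.0, frame_ms=5):
    """Return one F0 value per frame for an utterance of duration_ms (> 0).

    stressed_spans: list of (begin_ms, end_ms) intervals for stressed syllables.
    All frequency values are illustrative assumptions.
    """
    n_frames = duration_ms // frame_ms + 1
    contour = []
    for k in range(n_frames):
        t = k * frame_ms
        # Declination: a straight line falling from start_hz to end_hz.
        base = start_hz + (end_hz - start_hz) * t / duration_ms
        # Stress: add a fixed rise while inside any stressed span.
        in_stress = any(begin <= t < end for begin, end in stressed_spans)
        contour.append(base + (rise_hz if in_stress else 0.0))
    return contour
```

Continuation rises, segmental effects, and question contours would be further terms added to the same baseline; they are omitted here.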

1.3.2.3 Phonetic targets Given the prosodic framework, phonetic target
parameters are determined for each phonetic segment, utilizing a “context
window” five segments wide. There are twenty such parameters that vary with
time.
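
The idea of a five-segment context window can be illustrated with a sliding window over the segment string. The sketch assumes the window is centered on the current segment and padded at utterance edges; the text does not state how the window is positioned, so both choices are assumptions.

```python
def context_windows(segments, width=5, pad=None):
    """Return one window of `width` segments per input segment.

    Assumes an odd width with the current segment in the middle;
    positions beyond the utterance edges are filled with `pad`.
    """
    half = width // 2
    padded = [pad] * half + list(segments) + [pad] * half
    return [tuple(padded[i:i + width]) for i in range(len(segments))]
```

Each window would then be handed to the rules that compute the target parameters for its middle segment.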

1.3.2.4 Continuation smoothing The target values are smoothed to yield a full
set of parameters every 5 msec.
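
The step from sparse targets to a value every 5 msec can be sketched with plain linear interpolation. The text does not specify the smoothing method, so linear interpolation is used here purely as a stand-in, and the target times and values in any call are illustrative.

```python
def smooth_targets(targets, frame_ms=5):
    """Interpolate one parameter track to one value per frame.

    targets: list of (time_ms, value), at least two entries, sorted by
             strictly increasing time. Linear interpolation is an assumption.
    """
    frames = []
    i = 0
    t = targets[0][0]
    end = targets[-1][0]
    while t <= end:
        # Advance to the target pair that brackets time t.
        while i + 1 < len(targets) - 1 and targets[i + 1][0] <= t:
            i += 1
        (t0, v0), (t1, v1) = targets[i], targets[i + 1]
        frames.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        t += frame_ms
    return frames
```

Running this once per parameter yields the full parameter set at each 5 msec frame.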

1.3.2.5 Parameter conversion The phonetic parameters must be converted to
coefficients that can be used by the digital formant synthesizer.
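
As an example of such a conversion, a formant frequency and bandwidth can be turned into the three coefficients of a second-order digital resonator, y[n] = a·x[n] + b·y[n-1] + c·y[n-2]. This is the standard resonator form widely used in formant synthesizers; whether the system uses exactly this form is an assumption here. The 10 kHz sampling rate is the one given in the text.

```python
import math

SAMPLE_RATE_HZ = 10_000  # sampling rate stated in the text


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (a, b, c) for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    period = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * period)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * period)
         * math.cos(2.0 * math.pi * freq_hz * period))
    a = 1.0 - b - c  # normalizes the gain to 1 at zero frequency
    return a, b, c
```

Because the coefficients depend on exponentials and cosines of the phonetic parameters, they are recomputed whenever a formant frequency or bandwidth changes.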

1.3.2.6 Waveform generation The terminal synthesizer utilizes the coefficients
(updated every 5 msec) to generate the speech waveform. A special-purpose
hardware synthesizer is used to perform this task in real time. Speech samples are
produced at a 10 kHz rate, and then converted to analog form via a D/A converter
and low-pass filter.
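
At a 10 kHz sampling rate, a 5 msec update interval corresponds to 50 samples per coefficient set. The software sketch below shows that frame structure for a single second-order resonator; the single-resonator configuration, the excitation function, and the coefficient values in any call are assumptions for illustration, and the actual system performs this step in special-purpose hardware.

```python
SAMPLE_RATE_HZ = 10_000  # from the text
FRAME_MS = 5             # coefficient update interval, from the text
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 50 samples


def synthesize(frames, excitation):
    """Generate samples from per-frame resonator coefficients.

    frames:     list of (a, b, c), one tuple per 5 msec frame.
    excitation: function mapping a sample index to a source sample
                (hypothetical stand-in for the voicing or noise source).
    """
    y1 = y2 = 0.0  # previous two output samples, carried across frames
    out = []
    n = 0
    for a, b, c in frames:
        for _ in range(SAMPLES_PER_FRAME):
            y = a * excitation(n) + b * y1 + c * y2
            out.append(y)
            y2, y1 = y1, y
            n += 1
    return out
```

The filter state is kept across frame boundaries so that a coefficient update does not introduce a discontinuity in the output.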
