Morphological analysis

The use of morphs in MITalk is unique, and it is responsible for much of the
quality of the phonetic segment label sequences that are used for synthesis.
Morphs undoubtedly introduce several levels of complication. These include the
necessity of producing a morph lexicon and the need for a morph segmentation
algorithm. The concatenation of morphs to form a word often gives rise to
spelling mutations that cause segmentation difficulties, and several “morph
coverings” of a word are often found, leading to a need for selection criteria.
Nevertheless, the gains far outweigh the costs, and in the following sections,
we elaborate on these robust and effective techniques.
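
To make the notion of several "morph coverings" concrete, the following is a minimal sketch that enumerates every way a word can be split into entries of a lexicon. It is not DECOMP's algorithm: the function name `morph_coverings` and the toy lexicon are illustrative assumptions, and the sketch ignores spelling mutations and applies no selection criteria.

```python
from typing import List, Set


def morph_coverings(word: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every split of `word` into consecutive lexicon entries.

    Illustrative only: no spelling mutations are undone and no
    ranking of the resulting coverings is attempted.
    """
    if not word:
        return [[]]  # the empty remainder is covered by the empty list
    coverings = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Extend each covering of the remainder with this leading morph.
            for rest in morph_coverings(word[end:], lexicon):
                coverings.append([head] + rest)
    return coverings


# Toy lexicon (hypothetical); real morph lexicons are far larger.
TOY_LEXICON = {"NOT", "ABLE", "NO", "TABLE"}
print(morph_coverings("NOTABLE", TOY_LEXICON))
```

Even this toy lexicon yields two competing coverings of the same word, which is the situation that calls for selection criteria.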

3.2 Input
In MITalk, morphemic analysis is provided in the DECOMP module. DECOMP’s
input data stream has the same structure as the output stream from FORMAT,
which precedes DECOMP in the MITalk system. Each record in the data stream
contains either a word or a punctuation mark. Words consist of uppercase letters,
apostrophes, and/or hyphens. Legal punctuation marks are period, exclamation
point, question mark, comma, semicolon, colon, double quotation, single
quotation, left and right parentheses, and dash. DECOMP also accesses a compiled
binary format morph lexicon.
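
The record constraints above can be sketched as a small classifier. This is an illustration under stated assumptions, not the FORMAT/DECOMP record format: the function name `classify_record`, the character spellings chosen for each punctuation mark (in particular `--` for dash), and the requirement that a word contain at least one letter are all assumptions.

```python
import re

# Per the text, words consist of uppercase letters, apostrophes, and hyphens.
WORD_RE = re.compile(r"[A-Z'\-]+")

# Assumed spellings for the legal punctuation marks; the actual encoding
# used in the data stream is not given in the text.
PUNCTUATION = {".", "!", "?", ",", ";", ":", '"', "'", "(", ")", "--"}


def classify_record(token: str) -> str:
    """Return 'punctuation' or 'word'; raise ValueError for anything else."""
    if token in PUNCTUATION:
        return "punctuation"
    # Assumption: a word must contain at least one letter.
    if WORD_RE.fullmatch(token) and any(c.isalpha() for c in token):
        return "word"
    raise ValueError(f"not a legal input record: {token!r}")


for token in ["DON'T", "WELL-KNOWN", "?", "--"]:
    print(token, classify_record(token))
```

Punctuation is tested first so that a lone single quotation mark is not mistaken for a word consisting only of an apostrophe.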

3.3 Output
The output data stream consists of a sequence of decomposed word records. The
following information is contained in each record:
1. Word spelling
2. Word part of speech (possibly more than one)
3. For each part of speech, an optional list of part-of-speech features
4. The series of morphs obtained by decomposition
5. For each morph, the following information:
   a. Morph spelling
   b. Morph type
   c. One or two homographs
   d. For each homograph, a pronunciation and part(s) of speech

If no decomposition was found for the word, then the morph list is omitted
and the word is assigned a default set of possible parts of speech. Punctuation
marks receive a special part-of-speech code (either EndPunctuationMark (EPM)
for sentence-ending punctuation or InternalPunctuationMark (IPM) for all others).
Part-of-speech processing will be described in detail in the next chapter, where
the phrase parser is discussed.
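
The record layout described above might be modeled as follows. This is a sketch under stated assumptions rather than DECOMP's actual binary layout: all class and function names are hypothetical, the set of sentence-ending marks is assumed to be period, exclamation point, and question mark, and `DEFAULT_PARTS_OF_SPEECH` is a placeholder because the real default set is not given here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Homograph:
    pronunciation: str           # pronunciation encoding is not specified here
    parts_of_speech: List[str]


@dataclass
class Morph:
    spelling: str
    morph_type: str              # type names are not listed in this section
    homographs: List[Homograph]

    def __post_init__(self) -> None:
        # The text allows one or two homographs per morph.
        if not 1 <= len(self.homographs) <= 2:
            raise ValueError("a morph carries one or two homographs")


@dataclass
class PartOfSpeech:
    tag: str
    features: List[str] = field(default_factory=list)  # optional feature list


@dataclass
class WordRecord:
    spelling: str
    parts_of_speech: List[PartOfSpeech]
    morphs: Optional[List[Morph]] = None  # None: no decomposition was found


# Assumption: the text does not spell out which marks end a sentence.
SENTENCE_ENDING = {".", "!", "?"}

# Hypothetical placeholder for the default part-of-speech set.
DEFAULT_PARTS_OF_SPEECH = ["NOUN", "VERB", "ADJECTIVE"]


def punctuation_record(mark: str) -> WordRecord:
    """Build a record for a punctuation mark with its EPM or IPM code."""
    tag = "EPM" if mark in SENTENCE_ENDING else "IPM"
    return WordRecord(mark, [PartOfSpeech(tag)])


def undecomposed_record(spelling: str) -> WordRecord:
    """Build a record for a word with no decomposition: no morph list."""
    tags = [PartOfSpeech(t) for t in DEFAULT_PARTS_OF_SPEECH]
    return WordRecord(spelling, tags)


print(punctuation_record("?"))
print(undecomposed_record("XYZZY"))
```

Representing the missing morph list as `None` rather than an empty list keeps "no decomposition found" distinct from any record that does carry morphs.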
