Morphological analysis
The use of morphs in MITalk is unique, and it is responsible for much of the
quality of the phonetic segment label sequences that are used for synthesis.
There is no doubt that morphs introduce several levels of complication. These
include the necessity of producing a morph lexicon and the need for a morph
segmentation algorithm. The concatenation of morphs to form a word often gives
rise to spelling mutations that cause segmentation difficulties, and a word
often has several “morph coverings,” which creates a need for selection
criteria. Nevertheless, the gains far outweigh the costs, and in the following
sections, we elaborate on these robust and effective techniques.
3.2 Input
In MITalk, morphemic analysis is provided by the DECOMP module. DECOMP’s input
data stream has the same structure as the output stream from FORMAT, which
precedes DECOMP in the MITalk system. Each record in the data stream contains
either a word or a punctuation mark. Words consist of uppercase letters,
apostrophes, and/or hyphens. Legal punctuation marks are period, exclamation
point, question mark, comma, semicolon, colon, double quotation, single
quotation, left and right parentheses, and dash. DECOMP also accesses a
compiled, binary-format morph lexicon.
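
The record constraints described above can be illustrated with a short Python
sketch. It is not MITalk code. The function name classify_record, the
modelling of a record as a plain string, the character chosen for each
punctuation mark (for example "--" for dash), and the rule that a word must
contain at least one letter are all assumptions made for illustration.

```python
import re

# Words consist of uppercase letters, apostrophes, and/or hyphens (from the text).
WORD_RE = re.compile(r"[A-Z'-]+")

# Legal punctuation marks named in the text. The concrete character used for
# each mark is an assumption; the text gives only the names.
PUNCTUATION = {
    ".": "period",
    "!": "exclamation point",
    "?": "question mark",
    ",": "comma",
    ";": "semicolon",
    ":": "colon",
    '"': "double quotation",
    "'": "single quotation",
    "(": "left parenthesis",
    ")": "right parenthesis",
    "--": "dash",
}


def classify_record(text: str) -> str:
    """Return "punctuation" or "word" for one input record (hypothetical helper)."""
    # Punctuation is checked first so that a lone apostrophe is treated as a
    # single quotation mark rather than as a word.
    if text in PUNCTUATION:
        return "punctuation"
    # Assumption: a word must contain at least one letter, so a lone hyphen
    # is rejected instead of being accepted as a word.
    if WORD_RE.fullmatch(text) and any(c.isalpha() for c in text):
        return "word"
    raise ValueError(f"not a legal DECOMP input record: {text!r}")
```

Under these assumptions, a lowercase string such as "hello" is rejected,
because words in the input stream are uppercase only.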
3.3 Output
The output data stream consists of a sequence of decomposed word records. The
following information is contained in each record:

1. Word spelling
2. Word part of speech (possibly more than one)
3. For each part of speech, an optional list of part-of-speech features
4. The series of morphs obtained by decomposition
5. For each morph, the following information:
   a. Morph spelling
   b. Morph type
   c. One or two homographs
   d. For each homograph, a pronunciation and part(s) of speech
If no decomposition is found for the word, then the morph list is omitted and
the word is assigned a default set of possible parts of speech. Punctuation
marks receive a special part-of-speech code (either EndPunctuationMark (EPM)
for sentence-ending punctuation or InternalPunctuationMark (IPM) for all
others). Part-of-speech processing will be described in detail in the next
chapter, where the phrase parser is discussed.
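
The shape of an output record can be sketched as a few Python data classes.
This is an illustration of the fields listed above, not the actual DECOMP
record layout. The class and field names are hypothetical, the representation
of pronunciations and morph types as strings is an assumption, and so is the
choice of period, exclamation point, and question mark as the
sentence-ending marks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: these are the sentence-ending marks that receive EPM.
END_PUNCTUATION = {".", "!", "?"}


@dataclass
class Homograph:
    pronunciation: str            # representation as a string is an assumption
    parts_of_speech: List[str]    # one or more parts of speech


@dataclass
class Morph:
    spelling: str
    morph_type: str               # the set of legal types is not given here
    homographs: List[Homograph]

    def __post_init__(self) -> None:
        # The text states that each morph has one or two homographs.
        if not 1 <= len(self.homographs) <= 2:
            raise ValueError("a morph carries one or two homographs")


@dataclass
class WordRecord:
    spelling: str
    # Each part of speech maps to its optional (possibly empty) feature list.
    parts_of_speech: Dict[str, List[str]]
    # None models an omitted morph list (no decomposition was found).
    morphs: Optional[List[Morph]] = None


def punctuation_record(mark: str) -> WordRecord:
    """Build the record for a punctuation mark (hypothetical helper)."""
    code = "EPM" if mark in END_PUNCTUATION else "IPM"
    return WordRecord(spelling=mark, parts_of_speech={code: []})
```

In this sketch a comma yields a record whose only part of speech is IPM and
whose morph list is absent, which mirrors the treatment of words for which no
decomposition is found.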