Morphological analysis
The use of morphs in MITalk is unique, and it is responsible for much of the
quality of the phonetic segment label sequences that are used for synthesis.
Morphs undoubtedly introduce several levels of complication, including the
necessity of producing a morph lexicon and the need for a morph segmentation
algorithm. The concatenation of morphs to form a word often gives rise to
spelling mutations that cause segmentation difficulties, and several “morph
coverings” of a word are often found, leading to a need for selection
criteria. Nevertheless, the gains far outweigh the costs, and in the following
sections, we elaborate on these robust and effective techniques.
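To make the notion of multiple morph coverings concrete, the following is a
minimal sketch, not the DECOMP algorithm itself. It assumes a toy lexicon
(TOY_LEXICON, a hypothetical name with hypothetical entries) and plain
concatenation only, so it ignores the spelling mutations mentioned above. The
function morph_coverings is likewise illustrative.

```python
from typing import List, Set

# Toy morph lexicon: hypothetical entries, for illustration only.
TOY_LEXICON: Set[str] = {"NO", "NOW", "HERE", "WHERE"}


def morph_coverings(word: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `word` into a sequence of lexicon entries.

    Only plain concatenation is handled; spelling mutations at morph
    boundaries are deliberately left out of this sketch.
    """
    if not word:
        # The empty string has exactly one covering: the empty sequence.
        return [[]]
    coverings: List[List[str]] = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            # Extend each covering of the remainder with this prefix.
            for rest in morph_coverings(word[end:], lexicon):
                coverings.append([prefix] + rest)
    return coverings


if __name__ == "__main__":
    for covering in morph_coverings("NOWHERE", TOY_LEXICON):
        print(" + ".join(covering))
```

With this toy lexicon the word NOWHERE has two coverings (NO + WHERE and
NOW + HERE), which is the situation that calls for selection criteria.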
3.2 Input
In MITalk, morphemic analysis is provided by the DECOMP module. DECOMP's
input data stream has the same structure as the output stream from FORMAT,
which precedes DECOMP in the MITalk system. Each record in the data stream
contains either a word or a punctuation mark. Words consist of uppercase
letters, apostrophes, and/or hyphens. Legal punctuation marks are period,
exclamation point, question mark, comma, semicolon, colon, double quotation
mark, single quotation mark, left and right parentheses, and dash. DECOMP also
accesses a morph lexicon in compiled binary format.
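The input constraints above can be sketched as a small record classifier.
This is an illustration under stated assumptions, not the FORMAT or DECOMP
record layout: each record is assumed to be a plain string, each punctuation
mark is assumed to be encoded as a single ASCII character, and a record that
is a lone apostrophe or hyphen is assumed to be punctuation rather than a
word. The names PUNCTUATION, WORD_RE, and classify_record are hypothetical.

```python
import re

# Assumed single-character encodings of the legal punctuation marks.
PUNCTUATION = {".", "!", "?", ",", ";", ":", '"', "'", "(", ")", "-"}

# Words: uppercase letters, apostrophes, and/or hyphens.
WORD_RE = re.compile(r"[A-Z'\-]+")


def classify_record(record: str) -> str:
    """Classify one input record as 'word' or 'punctuation'.

    Raises ValueError for anything that fits neither description.
    """
    # Punctuation is checked first so that a lone "'" or "-" is not
    # mistaken for a word (an assumption of this sketch).
    if record in PUNCTUATION:
        return "punctuation"
    # A word must match the allowed character set and contain a letter.
    if WORD_RE.fullmatch(record) and re.search(r"[A-Z]", record):
        return "word"
    raise ValueError(f"illegal record: {record!r}")


if __name__ == "__main__":
    for rec in ["DON'T", "MOTHER-IN-LAW", ";", "("]:
        print(rec, "->", classify_record(rec))
```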
3.3 Output
The output data stream consists of a sequence of decomposed word records. The
following information is contained in each record:
1. Word spelling
2. Word part of speech (possibly more than one)
3. For each part of speech, an optional list of part-of-speech features
4. The series of morphs obtained by decomposition
5. For each morph, the following information:
a. Morph spelling
b. Morph type
c. One or two homographs
d. For each homograph, a pronunciation and part(s) of speech
If no decomposition was found for the word, then the morph list is omitted
and the word is assigned a default set of possible parts of speech. Punctuation
marks receive a special part-of-speech code (either EndPunctuationMark (EPM)
for sentence-ending punctuation or InternalPunctuationMark (IPM) for all others).
Part-of-speech processing will be described in detail in the next chapter where the
phrase parser is discussed.
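The record contents listed above can be pictured as a nested data structure.
The sketch below is one possible in-memory model, not the actual DECOMP
record format. All class, field, and function names are hypothetical, and
treating period, exclamation point, and question mark as the sentence-ending
marks is an assumption. Only the EPM and IPM codes come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Homograph:
    pronunciation: str          # representation is left open in this sketch
    parts_of_speech: List[str]


@dataclass
class Morph:
    spelling: str
    morph_type: str
    homographs: List[Homograph]

    def __post_init__(self) -> None:
        # Each morph carries one or two homographs.
        if not 1 <= len(self.homographs) <= 2:
            raise ValueError("a morph carries one or two homographs")


@dataclass
class WordRecord:
    spelling: str
    # Maps each part of speech to its (possibly empty) list of features.
    parts_of_speech: Dict[str, List[str]]
    # Left empty when no decomposition was found for the word.
    morphs: List[Morph] = field(default_factory=list)


# Assumption: these are the sentence-ending punctuation marks.
SENTENCE_ENDING = {".", "!", "?"}


def punctuation_record(mark: str) -> WordRecord:
    """Build the record for a punctuation mark: EPM if it ends a
    sentence, IPM otherwise."""
    code = "EPM" if mark in SENTENCE_ENDING else "IPM"
    return WordRecord(spelling=mark, parts_of_speech={code: []})


if __name__ == "__main__":
    print(punctuation_record("?"))
    print(punctuation_record(","))
```

A word without a decomposition would be a WordRecord with an empty morph list
and a default set of parts of speech. The contents of that default set are
not given here, so the sketch leaves them to the caller.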