2 Text preprocessing
2.1 Overview
Unrestricted text may contain a wide variety of symbols, abbreviations, and
conventions. In order to convert text to speech, it is necessary to find an
appropriate expression in words for such symbols as “3”, “%”, and “&”, for
abbreviations such as “Mr.”, “num.”, “Nov.”, and “M.I.T.”, and for conventions
such as indentation for paragraphs. This text processing must be done before any
further analysis to prevent an abbreviation from being treated as a word followed
by an “end-of-sentence” marker, and to allow symbols with word equivalents to be
replaced by strings analyzable by the lexical analysis modules.
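The ordering constraint described above can be illustrated with a minimal Python sketch. It is not the MITalk implementation: the lookup tables, the function names expand and split_sentences, and the naive sentence splitter are all assumptions introduced here for illustration only.

```python
import re

# Hypothetical lookup tables for illustration; not MITalk's actual tables.
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Nov.": "November",
    "M.I.T.": "M I T",
}
SYMBOLS = {
    "3": "three",
    "%": "percent",
    "&": "and",
}


def expand(text):
    """Replace known abbreviations and symbols with word strings."""
    # Abbreviations go first so that their periods disappear before any
    # sentence-boundary detection sees them.
    for abbr, words in ABBREVIATIONS.items():
        text = text.replace(abbr, words)
    # Naive symbol replacement: a real system would treat multi-digit
    # numbers such as "13" as a unit rather than digit by digit.
    for sym, words in SYMBOLS.items():
        text = text.replace(sym, f" {words} ")
    # Collapse the extra spaces introduced by the padding above.
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


raw = "Mr. Smith & Mr. Jones paid 3% in Nov. last year."
print(split_sentences(raw))          # abbreviation periods cause false breaks
print(split_sentences(expand(raw)))  # one sentence once expansion runs first
```

Splitting the raw string breaks after every abbreviation period, whereas splitting the expanded string yields a single sentence, which is the reason the expansion step has to run first.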
FORMAT is the first module of the MITalk system and performs the conversion
of unrestricted text to a sequence of words and punctuation recognizable by
the later modules. The following list contains a number of topics and symbol
types which need to be considered.
1. Blank space(s)
2. Paragraphs
3. Sentence-initial capitals
4. Other capitals
5. Abbreviations
6. Numbers, including:
a. Integers
b. Numbers with a decimal point
c. Dates
d. Time
7. Alphanumerics
8. Formulas
9. Punctuation, including:
a. Period
b. Comma
c. Question mark
d. Exclamation point
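As one illustration of the number items in the list above (integers and numbers with a decimal point), the following Python sketch spells out digits as words. The function names integer_to_words and decimal_to_words, the supported range, and the choice to read decimal digits one at a time are assumptions made here; the sketch does not reproduce the rules FORMAT applies.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def integer_to_words(n):
    """Spell out an integer in the range 0..999999 (hypothetical helper)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("sketch only handles 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + integer_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + integer_to_words(rest)
    return integer_to_words(thousands) + " thousand" + tail


def decimal_to_words(text):
    """Spell out a digit string with an optional decimal point."""
    whole, _, fraction = text.partition(".")
    words = integer_to_words(int(whole))
    if fraction:
        # Assumption: digits after the point are read one at a time.
        words += " point " + " ".join(ONES[int(d)] for d in fraction)
    return words


print(integer_to_words(1984))    # one thousand nine hundred eighty four
print(decimal_to_words("3.14"))  # three point one four
```

Dates and times would need their own conventions (a year such as 1984 is usually read as "nineteen eighty four", for instance), so this sketch leaves them out.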