2 Text preprocessing

2.1 Overview

Unrestricted text may contain a wide variety of symbols, abbreviations, and conventions. In order to convert text to speech, it is necessary to find an appropriate expression in words for such symbols as “3”, “%”, and “&”, for abbreviations such as “Mr.”, “num.”, “Nov.”, “M.I.T.”, and conventions such as indentation for paragraphs. This text processing must be done before any further analysis to prevent an abbreviation from being treated as a word followed by an “end-of-sentence” marker, and to allow symbols with word equivalents to be replaced by strings analyzable by the lexical analysis modules.

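The ordering constraint described above can be illustrated with a minimal Python sketch. It is not the FORMAT module's actual code: the lookup tables, the function names `expand` and `split_sentences`, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical lookup tables; the entries are illustrative only.
ABBREVIATIONS = {"Mr.": "mister", "Nov.": "november"}
SYMBOLS = {"%": "percent", "&": "and"}


def expand(text: str) -> str:
    """Replace known abbreviations and symbols with word equivalents."""
    out: List[str] = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
            continue
        # Pad symbols with spaces so "5%" becomes "5 percent".
        for symbol, word in SYMBOLS.items():
            token = token.replace(symbol, f" {word} ")
        out.extend(token.split())
    return " ".join(out)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: '.', '?' or '!' followed by whitespace ends a sentence."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


raw = "Mr. Smith paid 5% & left."
# Splitting first treats the period in "Mr." as an end-of-sentence marker.
too_early = split_sentences(raw)
# Expanding first removes the ambiguous period before splitting.
in_order = split_sentences(expand(raw))
```

Here `too_early` contains two fragments because the abbreviation's period is mistaken for a sentence boundary, while `in_order` contains one sentence with the symbols spelled out as words.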
FORMAT is the first module of the MITalk system and performs the conversion of unrestricted text to a sequence of words and punctuation recognizable by the later modules. The following list contains a number of topics and symbol types which need to be considered.

1. Blank space(s)
2. Paragraphs
3. Sentence-initial capitals
4. Other capitals
5. Abbreviations
6. Numbers, including:
   a. Integers
   b. Numbers with a decimal point
   c. Dates
   d. Time
7. Alphanumerics
8. Formulas
9. Punctuation, including:
   a. Period
   b. Comma
   c. Question mark
   d. Exclamation point
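
As a rough illustration of how a few of the listed symbol types might be told apart, the following Python sketch splits text into tokens and labels each one. The regular expression, the category names, and the functions `classify` and `tokenize` are assumptions for this example, not a description of how FORMAT is implemented. Dates, times, formulas, and abbreviations are deliberately left out.

```python
import re
from typing import List, Tuple

# Decimal numbers are tried first so that "3.50" is not split at the period.
TOKEN_RE = re.compile(r"\d+\.\d+|[A-Za-z0-9]+|[.,?!]")


def classify(token: str) -> str:
    """Assign one of a few illustrative categories to a token."""
    if re.fullmatch(r"\d+\.\d+", token):
        return "decimal"
    if token.isdigit():
        return "integer"
    if token.isalpha():
        return "word"
    if token.isalnum():
        return "alphanumeric"
    return "punctuation"


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Return (token, category) pairs; whitespace is dropped."""
    return [(t, classify(t)) for t in TOKEN_RE.findall(text)]


tokens = tokenize("Room 4B costs 3.50, not 12!")
```

Each pair in `tokens` holds the token text and its label, for example `("4B", "alphanumeric")` and `("3.50", "decimal")`. A later stage could then choose a word expansion appropriate to each category.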