From text to speech: The MITalk system

differs slightly from the results found for the synthetic speech in the earlier
Haskins evaluation. In the Haskins study, error rates for the synthetic speech in
initial and final positions were about the same, with a very slight advantage for
consonants in final position. The comparable overall error rates obtained for
natural speech in the Modified Rhyme Test by House et al. and Nye and Gaitenby
(1973) were 4 percent and 2.7 percent, respectively.
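
The position-wise and overall error rates discussed above can be tallied from
forced-choice responses along the following lines. This is only a minimal sketch
and not the scoring procedure used in the studies cited: the function name
error_rates_by_position, the (position, target, response) tuple layout, and the
toy word list are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def error_rates_by_position(trials):
    """Return (per-position error rates, overall error rate), in percent.

    `trials` is an iterable of (position, target, response) tuples, where
    position is "initial" or "final" (the consonant slot being tested).
    This layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for position, target, response in trials:
        totals[position] += 1
        if response != target:
            errors[position] += 1
    rates = {pos: 100.0 * errors[pos] / totals[pos] for pos in totals}
    overall = 100.0 * sum(errors.values()) / sum(totals.values())
    return rates, overall


# Hypothetical toy responses, not data from any study.
trials = [
    ("initial", "bat", "bat"),
    ("initial", "pat", "pat"),
    ("initial", "cat", "cat"),
    ("initial", "fat", "sat"),   # initial-consonant error
    ("final", "bad", "bad"),
    ("final", "ban", "bam"),     # final nasal error
    ("final", "bath", "bass"),   # final fricative error
    ("final", "back", "back"),
]

rates, overall = error_rates_by_position(trials)
print(rates, overall)
```

Keeping the tallies separate for each position makes it straightforward to
compare initial and final consonants before pooling them into one overall rate.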

In the earlier evaluation study, Nye and Gaitenby (1974) checked to ensure
that the phonemic input to the Haskins synthesizer was correct. However, no hand
corrections of any kind were made in generating the present materials, whether
the pronunciations came from entries in the morph lexicon or from
spelling-to-sound rules. As discussed in the final section of this chapter,
several different kinds of errors were uncovered in different modules as a result
of generating such a large amount of synthetic speech through the system.

Except for the high error rates observed for the nasals and fricatives in final
syllable position, the synthesis of segmental information in the text-to-speech
system appears to be excellent, at least as measured in a forced-choice format
among minimal pairs of test items. With phoneme recognition performance this
high, close to ceiling levels, it is difficult to pick up subtle details of the
error patterns that might be useful in improving the quality of the output of the
phonetic component of the system at the present time. In addition, the errors
observed in the present tests might well be reduced substantially if the listeners
had more experience with the speech output produced by the system. It is well
known among investigators working with synthetic speech that rather substantial
improvements in intelligibility can be observed when listeners become familiar
with the quality of the synthesizer. Nye and Gaitenby (1974) as well as Carlson
et al. (1976) have reported very sizeable learning effects in listening to
synthetic speech. In the latter study, performance increased from 55 percent to
90 percent correct after the presentation of only 200 synthetic sentences over a
two-week period. (See also the discussion of the word recognition and
comprehension results below.)

In summary, the results of the Modified Rhyme Test revealed very high levels
of intelligibility of the speech output from the system when naive listeners
served as subjects. While the overall level of performance is somewhat lower than
in previous studies employing natural speech, the level of performance for
recognition of segmental information appears to be quite satisfactory for a wide
range of text-to-speech applications at the present time.
