From text to speech: The MITalk system

Harvard and Haskins sentences, performance improved on the second half of the
test relative to the first half. Although the differences were small, amounting to
only about 2 percent improvement in each case, the result was very reliable (p <
.01) across subjects in both cases.
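
How a small improvement can nonetheless be highly reliable across subjects can
be illustrated with a paired comparison. The following is a minimal sketch,
assuming an exact two-sided sign test; the text does not state which statistical
test was actually used. The per-subject scores are hypothetical values invented
for illustration, not data from the experiment, and the function names
sign_test_p and mean_improvement are likewise assumptions.

```python
from math import comb


def mean_improvement(first_half, second_half):
    """Mean per-subject gain (second half minus first half), in percentage points."""
    return sum(b - a for a, b in zip(first_half, second_half)) / len(first_half)


def sign_test_p(first_half, second_half):
    """Exact two-sided sign test on paired per-subject scores.

    Ties (subjects with identical scores on both halves) are dropped,
    which is the usual convention for the sign test.
    """
    diffs = [b - a for a, b in zip(first_half, second_half) if b != a]
    n = len(diffs)
    if n == 0:
        return 1.0
    improved = sum(1 for d in diffs if d > 0)
    tail = min(improved, n - improved)
    # Probability of a split at least this lopsided under a fair coin, doubled.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Hypothetical percent-correct scores for eight subjects (NOT from the study).
first = [91, 93, 92, 94, 90, 95, 93, 92]
second = [93, 95, 94, 96, 92, 97, 95, 94]

gain = mean_improvement(first, second)
p_value = sign_test_p(first, second)
print(f"mean gain = {gain:.1f} points, p = {p_value:.4f}")
```

With these invented scores every subject improves by the same small amount, so
the sign test yields a p value below .01 even though the gain is only two
percentage points. The point is that consistency across subjects, rather than
the size of the effect, drives reliability.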

The performance levels obtained with the Haskins semantically anomalous
sentences are very similar to those reported earlier by Nye and Gaitenby (1974),
and more recently by Ingeman (1978) using the same sentences with the Haskins
synthesizer and text-to-speech system. Nye and Gaitenby (1974) reported an
average error rate of 22 percent for synthetic speech and 5 percent for
comparable natural speech. However, Nye and Gaitenby used both naive and
experienced listeners as subjects, and found rather large differences in
performance between the two groups, as we noted above. This difference is
presumably due to the experienced listeners' familiarity with, and practice in
listening to, the output of the synthesizer. We suspect that if the experienced
subjects were eliminated from the Nye and Gaitenby analyses, performance would
be lower (that is, the error rate would be higher) than the original value
reported and would therefore differ somewhat more from the present findings.
Nevertheless, the error rate for these anomalous sentences produced with natural
speech is still lower than that for the corresponding synthetic versions,
although it is not clear at the present time how much of the difference could be
due to listener familiarity with the quality of the synthetic speech.

13.3.3 Conclusions

The results of the two word-recognition tests indicate moderate to excellent levels
of performance with naive listeners depending on the particular test format used
and the type of information available to the subject. In one sense, the results of
these two tests can be thought of as approximations to upper and lower bounds on
the accuracy of word-recognition performance with the current text-to-speech
system. On the one hand, the Harvard test sentences provide some indication of how
word recognition might proceed when both semantic and syntactic information is
available to a listener under normal conditions. On the other hand, the Haskins
anomalous sentences direct the subjects’ attention specifically to the perceptual
input and therefore provide a rough estimate of the quality of the acoustic-phonetic
information and sentence analysis routines available for word recognition in the
absence of contextual constraints. Of course, in normal listening situations, and
presumably in cases where a text-to-speech system such as the present one might
be implemented, the complete neutralization of such contextual effects on
intelligibility would be extremely unlikely. Nevertheless, a more detailed analysis of
the word-recognition errors in the Haskins anomalous sentence test might provide
