From text to speech: The MITalk system

Harvard and Haskins sentences, performance improved on the second half of the
test relative to the first half. Although the differences were small, amounting to
only about 2 percent improvement in each case, the result was very reliable
(p < .01) across subjects in both cases.

The performance levels obtained with the Haskins semantically anomalous
sentences are very similar to those reported earlier by Nye and Gaitenby (1974),
and more recently by Ingeman (1978) using the same sentences with the Haskins
synthesizer and text-to-speech system. Nye and Gaitenby (1974) reported an
average error rate of 22 percent for synthetic speech and five percent for
comparable natural speech. However, Nye and Gaitenby used both naive and
experienced listeners as subjects, and found rather large differences in performance
between the two groups, as we noted above. This result is presumably due to
familiarity and practice listening to the output of the synthesizer. We suspect that
if the experienced subjects were eliminated from the Nye and Gaitenby analyses,
performance would be lower than the original value reported and would therefore
differ somewhat more from the present findings. Nevertheless, the error rate for
these anomalous sentences produced with natural speech is still lower than that for
the corresponding synthetic versions, although it is not clear at the present time
how much of the difference could be due to listener familiarity with the quality of
the synthetic speech.

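As a rough illustration of how a word-level error rate such as the figures quoted above might be computed from listeners' written responses, the following is a minimal sketch only. The position-by-position scoring rule, the function names, and the example sentence are assumptions made for illustration; they are not the scoring procedure of the studies cited.

```python
def word_error_rate(target: str, response: str) -> float:
    """Fraction of target words not reproduced at the same position.

    Assumption: words are compared position by position, ignoring case.
    A real scoring procedure might treat insertions, omissions, or
    function words differently.
    """
    target_words = target.lower().split()
    response_words = response.lower().split()
    if not target_words:
        raise ValueError("target sentence must contain at least one word")
    # Pad a short response so that missing words count as errors.
    padding = [""] * (len(target_words) - len(response_words))
    padded = response_words + padding
    errors = sum(1 for t, r in zip(target_words, padded) if t != r)
    return errors / len(target_words)


def mean_error_rate(pairs):
    """Average the per-sentence error rate over (target, response) pairs."""
    rates = [word_error_rate(t, r) for t, r in pairs]
    return sum(rates) / len(rates)


# Hypothetical responses to one anomalous sentence (illustrative data only).
example_pairs = [
    ("the old corn cost the blood", "the old horn cost the blood"),
    ("the old corn cost the blood", "the old corn cost the blood"),
]
example_rate = mean_error_rate(example_pairs)
```

In this sketch one of six words is wrong in the first response and none in the second, so the mean error rate over the two responses is one twelfth.
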
13.3.3 Conclusions

The results of the two word-recognition tests indicate moderate to excellent levels
of performance with naive listeners depending on the particular test format used
and the type of information available to the subject. In one sense, the results of
these two tests can be thought of as approximations to upper and lower bounds on
the accuracy of word-recognition performance with the current text-to-speech
system. On the one hand, the Harvard test sentences provide some indication of how
word recognition might proceed when both semantic and syntactic information is
available to a listener under normal conditions. On the other hand, the Haskins
anomalous sentences direct the subjects’ attention specifically to the perceptual
input and therefore provide a rough estimate of the quality of the acoustic-phonetic
information and sentence analysis routines available for word recognition in the
absence of contextual constraints. Of course, in normal listening situations, and
presumably in cases where a text-to-speech system such as the present one might
be implemented, the complete neutralization of such contextual effects on
intelligibility would be extremely unlikely. Nevertheless, a more detailed analysis of
the word-recognition errors in the Haskins anomalous sentence test might provide