Efficient Search Algorithms for Speech Recognition

Today, state-of-the-art systems for automatic speech recognition are based on the statistical approach of Bayes' decision rule. The decision about the most probable sentence depends on two kinds of stochastic models: the language model, which captures the linguistic properties of the language and provides the a-priori probability of a word sequence, and the acoustic model, which captures the acoustic properties of speech and provides the probability of the observed acoustic signal given a hypothesized word sequence.
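
Written out, the decision rule described above takes the following form. The notation is an assumption made for illustration and does not appear in the text: w_1^N denotes a hypothesized sequence of N words, and x_1^T the observed sequence of T acoustic vectors. The second equality follows from Bayes' theorem, after dropping the term p(x_1^T), which does not depend on the word sequence.

```latex
\[
\hat{w}_1^N
  = \arg\max_{w_1^N} \, p(w_1^N \mid x_1^T)
  = \arg\max_{w_1^N} \, \bigl\{ p(w_1^N) \cdot p(x_1^T \mid w_1^N) \bigr\}
\]
```

Here p(w_1^N) is supplied by the language model and p(x_1^T | w_1^N) by the acoustic model.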

The most probable word sequence is determined by a time-synchronous search process based on dynamic programming.
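
As a minimal sketch of what a time-synchronous dynamic programming search looks like, the following Python function runs a Viterbi recursion over a small set of abstract states. It is an illustration under simplifying assumptions, not the search implementation of the systems described here: the function name `viterbi`, the dense score tables, and the absence of pruning, a pronunciation lexicon, and language model recombination are all choices made for this example.

```python
def viterbi(log_init, log_trans, log_emit):
    """Time-synchronous Viterbi search over a fully connected state space.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of a transition from state r to state s
    log_emit[t][s]  : log probability of the acoustic frame t in state s
    Returns (best_log_score, best_state_path).
    """
    n_states = len(log_init)
    # Scores of the best partial paths ending in each state at frame 0.
    scores = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []

    # Process the frames strictly from left to right: every hypothesis is
    # extended by the same frame before the next frame is considered.
    for t in range(1, len(log_emit)):
        new_scores, frame_bp = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: scores[r] + log_trans[r][s])
            new_scores.append(scores[best_prev] + log_trans[best_prev][s]
                              + log_emit[t][s])
            frame_bp.append(best_prev)
        scores = new_scores
        backpointers.append(frame_bp)

    # Trace back from the best final state to recover the state sequence.
    state = max(range(n_states), key=lambda s: scores[s])
    best_score = scores[state]
    path = [state]
    for frame_bp in reversed(backpointers):
        state = frame_bp[state]
        path.append(state)
    path.reverse()
    return best_score, path
```

The key property is that only the best partial path per state survives each frame, so the cost grows linearly with the number of frames rather than exponentially with the number of possible sequences.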

Confidence Measures for Large Vocabulary Continuous Speech Recognition

As the number of application areas for speech recognition technology grows while the technology itself is still far from perfect, the demand for the ability to spot erroneous words also increases. In this context, confidence measures can be used to label individual words in the output of the speech recognition system as either correct or incorrect. The system and subsequent modules can thus locate possible errors in the output automatically. The additional information about the recognition output contained in the confidence measure can be used in different applications. In speaker adaptation, for example, confidence measures can be used to restrict the adaptation process in a straightforward manner to acoustic segments with a high confidence. Erroneous segments, which could degrade the adaptation, can thus be omitted.

The approach followed in this project was to use word posterior probabilities as confidence measures, since these quantities can be interpreted directly as the probability of a word being correct. Word posterior probabilities are also a straightforward outcome of the statistical speech recognition framework used in most existing speech recognition systems. Word posterior probabilities had previously been computed on N-best lists and on word graphs, both of which contain alternative recognition hypotheses, but no systematic comparison of the different methods had been presented so far. In this project, word posterior probabilities were studied and evaluated in a unified theoretical and experimental framework. Among the problems studied were the definition of a suitable alignment of the words and the scaling of the probability density functions used in the speech recognition system. The word posterior probabilities outperformed several non-probabilistic confidence measures that are used in other speech recognition systems. The best confidence measure developed in this project was used successfully in the framework of maximum likelihood linear regression (MLLR) to adapt the acoustic model parameters only to those acoustic segments with a high confidence. On a German spontaneous speech database, the word error rate was reduced significantly, from 22.6% to 21.5%.
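
To illustrate the idea of word posterior probabilities on an N-best list, the following Python sketch normalizes the hypothesis scores into posteriors and sums, for each word of the first-best hypothesis, the posteriors of all hypotheses that contain the same word in an overlapping time span. This is a simplified sketch and not the method evaluated in the project: the function names, the `(word, start, end)` segment format, the overlap-based alignment, and the single `scale` exponent applied to the combined log score are all assumptions made for this example.

```python
import math

def hypothesis_posteriors(log_scores, scale=1.0):
    """Turn joint log scores of N-best hypotheses into normalized posteriors.

    scale is an exponent applied to the scores before normalization
    (hypothetical parameter); values below 1 flatten the distribution.
    """
    scaled = [scale * s for s in log_scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def word_confidences(nbest, scale=1.0):
    """Confidence for each word of the first hypothesis in nbest.

    nbest is a list of (hypothesis, log_score) pairs; a hypothesis is a
    list of (word, start_frame, end_frame) with an exclusive end frame.
    A hypothesis supports a word if it contains the same word in an
    overlapping time span (a simple alignment chosen for illustration).
    """
    posteriors = hypothesis_posteriors([score for _, score in nbest], scale)
    confidences = []
    for word, start, end in nbest[0][0]:
        conf = 0.0
        for (hyp, _), post in zip(nbest, posteriors):
            if any(w == word and s < end and start < e for w, s, e in hyp):
                conf += post
        confidences.append((word, conf))
    return confidences
```

A word whose confidence falls below a chosen threshold would be tagged as a likely error, and, for confidence-based adaptation, its acoustic segment would be excluded from the adaptation data.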

Explicit Word Error Minimization

The standard criterion in statistical speech recognition minimizes the expected number of misrecognized sentences; that is, the cost function used in Bayes' decision rule counts sentence errors. Although this cost function is useful for building high-performance speech recognition systems, there is a conceptual mismatch between the decision rule and the evaluation criterion for speech recognizers, which is usually the word error rate. The ideal way to overcome this mismatch is to use the same cost function for both evaluation and decision making. Using the word error rate as the cost function, however, leads to very high computational complexity during decision making, since it requires the pairwise alignment of all possible sentence hypotheses.
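
The source of the complexity can be seen in a small Python sketch that applies the word error cost directly, restricted to an N-best list so that it remains tractable. Every candidate is aligned against every other hypothesis with a word-level edit distance, so the number of alignments grows quadratically with the number of hypotheses. The function names and the restriction to an N-best list with given posteriors are assumptions made for this example; this is not the criterion developed in the project.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def min_expected_word_error(hypotheses, posteriors):
    """Pick the hypothesis with the smallest expected number of word errors.

    hypotheses is a list of word lists, posteriors the matching list of
    sentence posterior probabilities. Returns (index, expected_cost).
    """
    def expected_cost(candidate):
        # One alignment per (candidate, other) pair: O(N^2) alignments overall.
        return sum(p * word_edit_distance(other, candidate)
                   for other, p in zip(hypotheses, posteriors))

    costs = [expected_cost(h) for h in hypotheses]
    best = min(range(len(hypotheses)), key=costs.__getitem__)
    return best, costs[best]
```

With the hypotheses "a b c" (posterior 0.4), "x b d" (0.35), and "x b e" (0.25), the sentence-level rule selects the first hypothesis, whereas the expected word error is smallest for the second, because it shares more words with the remaining probability mass.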

In this project, a new cost function for speech recognition was developed: the time frame error rate. Experiments showed a significant correlation between time frame errors and word errors. Based on this new error rate, a criterion was derived that directly aims at minimizing the time frame error rate, and thus the word error rate, instead of the sentence error rate. With the suggested method, the word error rates were reduced significantly on five different test corpora. The largest reduction was achieved on a Dutch spontaneous speech test corpus recorded with a train timetable information system. On this corpus, the word error rate was reduced from 15.8% to 15.0%.
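
The text does not give the formal definition of the time frame error rate. One plausible reading, sketched below in Python purely as an assumption, is to compare the word label that the reference and the hypothesis assign to each time frame and to count the frames on which they disagree. The function names and the `(word, start, end)` segment format are hypothetical.

```python
def frame_labels(segments, n_frames):
    """Expand (word, start_frame, end_frame) segments into one label per frame.

    The end frame is exclusive; frames not covered by any segment get None.
    """
    labels = [None] * n_frames
    for word, start, end in segments:
        for t in range(start, min(end, n_frames)):
            labels[t] = word
    return labels

def frame_error_rate(reference, hypothesis, n_frames):
    """Fraction of time frames whose word label differs (assumed definition)."""
    ref = frame_labels(reference, n_frames)
    hyp = frame_labels(hypothesis, n_frames)
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / n_frames
```

Unlike the word error count, a cost of this kind decomposes over time frames and needs no pairwise alignment of sentence hypotheses, which is consistent with the motivation given above for replacing the word error cost.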