Search

Efficient Search Algorithms for Speech Recognition

Today, state-of-the-art systems for automatic speech recognition are based on the statistical approach of Bayes' decision rule. The decision about the most probable sentence depends on two kinds of stochastic models: the language model, which captures the linguistic properties of the language and provides the a priori probability of a word sequence, and the acoustic model, which captures the acoustic properties of speech and provides the probability of the observed acoustic signal given a hypothesized word sequence.
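
Written as a formula, the decision rule described above takes the following form. The notation is an assumption chosen for illustration and does not appear in the original text: w_1^N denotes a sequence of N words and x_1^T the sequence of T observed acoustic vectors.

```latex
\begin{align*}
\hat{w}_1^N
  &= \operatorname*{argmax}_{w_1^N} \Pr(w_1^N \mid x_1^T) \\
  &= \operatorname*{argmax}_{w_1^N}
     \left\{ \Pr(w_1^N) \cdot \Pr(x_1^T \mid w_1^N) \right\}
\end{align*}
```

The first factor is supplied by the language model and the second by the acoustic model. The denominator Pr(x_1^T) of Bayes' theorem is dropped because it does not depend on the word sequence.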

The most probable word sequence is determined during a time-synchronous search process that is based on dynamic programming.
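
To illustrate what time-synchronous dynamic programming means, here is a minimal sketch of a Viterbi search over a small hidden Markov model. It is not the search implementation described on this page: the function name, the dictionary-based model layout and the absence of any pruning are simplifying assumptions made for the example.

```python
import math


def viterbi(observations, states, log_init, log_trans, log_emit):
    """Time-synchronous Viterbi search (illustrative sketch, no pruning).

    All model parameters are log probabilities stored in plain dicts.
    Returns (best_log_score, best_state_path).
    """
    # scores[s]: best log score of any path that ends in state s at the current frame
    scores = {s: log_init[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []

    # Process the observations frame by frame; every hypothesis is extended
    # at the same time, which is what "time-synchronous" refers to.
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state to recover the path.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path
```

A real large vocabulary recognizer would additionally discard unlikely hypotheses at every frame (beam pruning), which this sketch leaves out for brevity.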

Across-Word Models
The words in the speech recognizer's vocabulary are usually represented by sequences of phonemes. Each phoneme is usually modelled depending on its context, that is, its immediately neighbouring phonemes. Across-word models capture this context dependency not only within words, which is relatively easy to implement (within-word model search), but also across word boundaries (across-word model search). Compared to within-word model search, however, the search space becomes far more complex, and the computational effort required by the search increases drastically.
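
The difference between within-word and across-word context can be sketched with a small example that expands a word sequence into context-dependent phoneme labels (triphones). The lexicon, the "left-centre+right" label format and the "#" boundary symbol are assumptions made for this illustration; the page does not specify the notation used in the actual system.

```python
def triphones(words, lexicon, across_word=False):
    """Expand a word sequence into triphone labels 'left-centre+right'.

    '#' marks a missing context. With across_word=False the context stops at
    word boundaries; with across_word=True it reaches into neighbouring words.
    Assumes every word has a non-empty pronunciation in the lexicon.
    """
    pronunciations = [lexicon[w] for w in words]
    labels = []
    for wi, phones in enumerate(pronunciations):
        for pi, centre in enumerate(phones):
            # Left context: previous phoneme in the word, or the last phoneme
            # of the previous word when across-word context is enabled.
            if pi > 0:
                left = phones[pi - 1]
            elif across_word and wi > 0:
                left = pronunciations[wi - 1][-1]
            else:
                left = "#"
            # Right context: next phoneme in the word, or the first phoneme
            # of the next word when across-word context is enabled.
            if pi < len(phones) - 1:
                right = phones[pi + 1]
            elif across_word and wi < len(pronunciations) - 1:
                right = pronunciations[wi + 1][0]
            else:
                right = "#"
            labels.append(f"{left}-{centre}+{right}")
    return labels


# Hypothetical two-word lexicon, for illustration only.
LEXICON = {"two": ["t", "uw"], "cats": ["k", "ae", "t", "s"]}
within = triphones(["two", "cats"], LEXICON)
across = triphones(["two", "cats"], LEXICON, across_word=True)
```

In the across-word case, the model for the last phoneme of a word depends on which word follows. During search the successor word is not yet known when a word ends, which is one reason why the search space grows.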
Modifying a conventional within-word model search to support across-word models increases the real-time factor of the search by more than a factor of four. Therefore, several optimization steps were studied that organize the search space more efficiently and allow for more efficient pruning. Taken together, these optimization steps accelerate the search by nearly a factor of three. The across-word model search was tested under different conditions (read speech, spontaneous speech, background noise, English, German) with vocabularies of up to 75,000 words. On the English Hub4 Broadcast News corpus, for example, the word error rate was reduced from 21.7% to 19.7% by using across-word model search instead of within-word model search.

Acceleration Methods for Emission Probability Calculation
When mixture densities are used to model the emission probabilities of hidden Markov models (HMMs), 50% to 90% of the overall computation time is spent on log-likelihood calculations. The cost of these calculations can be reduced by evaluating only the most important mixture components or by accelerating the calculations needed for each mixture component.
Most modern processor architectures provide single instruction, multiple data (SIMD) instructions to speed up algorithms based on vector or matrix operations. These instructions are well suited to computing the Gaussian or Laplacian mixture components of a large vocabulary speech recognition system in parallel. Without any loss in recognition performance, the runtime of the whole system can be decreased by more than a factor of three. Combining this approach with vector space partitioning techniques accelerates the overall system by a factor of more than seven.
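
The computation that dominates the runtime can be sketched as follows for a mixture of Gaussians with diagonal covariances. This is a plain Python illustration, not the SIMD implementation described above; the function names and the `top_k` parameter are hypothetical. The inner loop over vector dimensions is the part that SIMD instructions can process several elements at a time.

```python
import math


def component_log_likelihood(x, mean, var, log_weight):
    """Log of weight * N(x; mean, diag(var)) for a single Gaussian component."""
    ll = log_weight
    # This per-dimension loop is the kind of vector operation that SIMD accelerates.
    for xi, mi, vi in zip(x, mean, var):
        ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll


def mixture_log_likelihood(x, components, top_k=None):
    """Log-likelihood of x under a mixture of (mean, var, log_weight) components.

    If top_k is given, only the k best-scoring components are summed, which
    approximates the full sum when a few components dominate.
    """
    scores = sorted(
        (component_log_likelihood(x, m, v, lw) for m, v, lw in components),
        reverse=True,
    )
    if top_k is not None:
        scores = scores[:top_k]
    best = scores[0]
    # Log-sum-exp, shifted by the best score for numerical stability.
    return best + math.log(sum(math.exp(s - best) for s in scores))
```

Note that this sketch still scores every component before selecting the best ones, so it only shows that the approximation loses little accuracy. An actual speed-up requires choosing the candidate components cheaply beforehand, which is the role of techniques such as vector space partitioning.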

Confidence Measures for Large Vocabulary Continuous Speech Recognition

Speech recognition technology is still far from perfect, and as the number of its application areas grows, so does the demand for the ability to spot erroneous words. In this context, confidence measures can be used to label individual words in the output of the speech recognition system as either correct or incorrect. The system and subsequent modules are thus able to locate possible errors in the output automatically. The additional information about the recognition output contained in the confidence measure can be used for different applications. In speaker adaptation, for example, confidence measures offer a straightforward way to restrict the adaptation process to acoustic segments with high confidence, so that erroneous segments, which could degrade the adaptation, are omitted.

The approach followed in this project was to use word posterior probabilities as confidence measures, since these quantities can be interpreted directly as the probability of a word being correct. Word posterior probabilities are also a straightforward outcome of the statistical framework used in most existing speech recognition systems. They had previously been computed on N-best lists and word graphs, which contain alternative recognition hypotheses, but no systematic comparison of the different methods had been presented. In this project, word posterior probabilities were studied and evaluated in a unified theoretical and experimental framework. The problems studied include the definition of a suitable alignment of the words and the scaling of the probability density functions used in the speech recognition system. The word posterior probabilities outperformed several non-probabilistic confidence measures used in other speech recognition systems. The best confidence measure developed in this project was used successfully with maximum likelihood linear regression (MLLR) to adapt the acoustic model parameters only on those acoustic segments with high confidence. On a German spontaneous speech database, the word error rate was reduced significantly, from 22.6% to 21.5%.
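
As a rough illustration of the idea, the following sketch computes word posterior probabilities from an N-best list. It makes two strong simplifying assumptions that are not taken from the project: words are aligned purely by their position in the sentence, and a single hypothetical `acoustic_scale` factor stands in for the scaling of the probability densities. The project itself studied more suitable alignments and also worked on word graphs.

```python
import math
from collections import defaultdict


def word_posteriors(nbest, acoustic_scale=1.0):
    """Estimate word posteriors from an N-best list (illustrative sketch).

    nbest: list of (words, acoustic_log_prob, lm_log_prob) tuples.
    Returns a dict mapping (position, word) to its posterior probability.
    """
    # Scaled joint log score of each sentence hypothesis.
    joint = [acoustic_scale * ac + lm for _, ac, lm in nbest]
    best = max(joint)
    # Shift by the best score before exponentiating, for numerical stability.
    weights = [math.exp(j - best) for j in joint]
    total = sum(weights)

    # A word's posterior is the summed posterior of all hypotheses that
    # contain this word at this position.
    posteriors = defaultdict(float)
    for (words, _, _), weight in zip(nbest, weights):
        for position, word in enumerate(words):
            posteriors[(position, word)] += weight / total
    return dict(posteriors)
```

A word whose posterior falls below a chosen threshold would then be tagged as probably incorrect. The threshold itself has to be tuned on held-out data.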

Explicit Word Error Minimization

The standard criterion in statistical speech recognition is the minimization of the expected number of misrecognized sentences; the cost function used in Bayes' decision rule is therefore the number of sentence errors. Although this cost function is useful for building high-performance speech recognition systems, there is a conceptual mismatch between the decision rule and the evaluation criterion for speech recognizers, which is the word error rate. The ideal way to overcome this mismatch is to use the same cost function for both evaluation and decision making. However, using the word error rate as the cost function leads to very high computational complexity during decision making, since it requires the pairwise alignment of all possible sentence hypotheses.
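
The cost of using word errors directly in the decision rule can be seen in a small sketch that restricts the hypothesis space to a short list of sentences with given posterior probabilities. The function names and the toy numbers are assumptions for illustration. Every candidate is aligned against every other hypothesis, so the number of alignments grows quadratically with the size of the list.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def min_expected_word_error(hypotheses):
    """Pick the hypothesis with the lowest expected number of word errors.

    hypotheses: list of (words, posterior) pairs. Each candidate is aligned
    against every hypothesis in the list (pairwise alignment).
    """
    def expected_errors(candidate):
        return sum(p * edit_distance(words, candidate) for words, p in hypotheses)

    return min((words for words, _ in hypotheses), key=expected_errors)


# Toy example: the most probable sentence is not the one with the lowest
# expected word error.
hyps = [
    ("a b c".split(), 0.40),
    ("x b d".split(), 0.35),
    ("x b e".split(), 0.25),
]
most_probable = max(hyps, key=lambda h: h[1])[0]
min_risk = min_expected_word_error(hyps)
```

In this toy list, the most probable sentence is "a b c", while the sentence with the lowest expected word error is "x b d", because it shares words with the third hypothesis.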

In this project, a new cost function for speech recognition was developed: the time frame error rate. Experiments showed a significant correlation between time frame errors and word errors. Based on this new error rate, a criterion was derived that aims directly at minimizing the time frame error rate, and thus the word error rate, instead of the sentence error rate. With the suggested method, the word error rates were reduced significantly on five different test corpora. The largest reduction was achieved on a Dutch spontaneous speech test corpus recorded with a train timetable information system, where the word error rate was reduced from 15.8% to 15.0%.
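
The text does not give the exact definition of the time frame error rate. The sketch below assumes one plausible reading: both the reference and the hypothesis are word sequences with time boundaries, and a frame counts as an error when the word covering it differs between the two. The segment format and function names are hypothetical.

```python
def frame_labels(segments, num_frames):
    """Turn (word, start_frame, end_frame) segments into one label per frame.

    end_frame is exclusive; frames not covered by any segment stay None.
    """
    labels = [None] * num_frames
    for word, start, end in segments:
        for t in range(start, end):
            labels[t] = word
    return labels


def time_frame_error_rate(reference, hypothesis, num_frames):
    """Fraction of time frames whose word label differs (assumed definition)."""
    ref = frame_labels(reference, num_frames)
    hyp = frame_labels(hypothesis, num_frames)
    errors = sum(1 for r, h in zip(ref, hyp) if r != h)
    return errors / num_frames


# Toy example with 10 frames: one shifted boundary and one wrong word.
reference = [("hello", 0, 5), ("world", 5, 10)]
hypothesis = [("hello", 0, 4), ("word", 4, 10)]
rate = time_frame_error_rate(reference, hypothesis, 10)
```

Because the comparison is made frame by frame, no alignment between word sequences is needed, which is what makes such a cost function cheaper to use in the decision rule than the word error rate.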