Speaker Normalization and Adaptation
The aim of speaker normalization and adaptation is to remove speaker-dependent variations and thus to model the relevant details of the speech signal more accurately.
The words of the recognition system's vocabulary are usually modelled as sequences of phonemes given by the so-called pronunciation lexicon. To capture the coarticulation effects present in fluent speech, these phonemes are usually modelled dependent on their immediately neighbouring phonemes, i.e. their context. Across-word models capture the context dependency of the phonemes not only within words, which is relatively easy to implement, but also across word boundaries. Across-word models are important for high recognition performance.
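The expansion of a word sequence into context-dependent phoneme models can be sketched as follows. This is a minimal illustration, not the system's actual data structures: the toy lexicon, the `#` boundary symbol, and the `left-phone+right` label notation are assumptions chosen for readability.

```python
# Illustrative toy lexicon (assumption, not the real pronunciation lexicon).
LEXICON = {
    "good": ["g", "uh", "d"],
    "morning": ["m", "ao", "r", "n", "ih", "ng"],
}

def triphones(words, lexicon, boundary="#"):
    """Expand a word sequence into triphone labels 'left-phone+right'.

    Contexts are taken over the whole utterance, so phonemes at word
    boundaries receive their context from the neighbouring word
    (across-word modelling) instead of a generic boundary symbol.
    """
    phones = [p for w in words for p in lexicon[w]]
    padded = [boundary] + phones + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# The 'd' of "good" sees the 'm' of "morning" as its right context,
# and 'm' sees 'd' as its left context: the word boundary is crossed.
print(triphones(["good", "morning"], LEXICON))
```

A within-word system would instead pad each word separately with the boundary symbol, losing exactly the cross-boundary contexts shown here.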
When across-word models are used, the word boundaries can be modelled in two different ways.
Since the transcription of the training data contains no information about which kind of word transition occurred, this decision has to be made during training. Different methods for making this decision were investigated and compared with respect to complexity and word error rate.
Compared to the baseline within-word system, across-word models yielded relative improvements in word error rate of 7% to 10% on several corpora.
In large vocabulary speech recognition, words are modelled as sequences of (usually phonetically motivated) sub-word units, compiled in a pronunciation dictionary. Extending the vocabulary requires phonetic transcriptions to be known for all words. Generating these manually is a time-consuming and error-prone task. Moreover, spontaneous and conversational speech often deviates considerably from the standard pronunciation, a mismatch that impairs recognition performance considerably. The goals of this project are to improve the pronunciation models of common words and to find baseforms for novel words automatically.
We developed a method for determining the optimal (maximum likelihood) pronunciation of a word from acoustic sample utterances. Since unrestricted phoneme recognition on a single utterance has an unacceptably high error rate, our algorithm operates on multiple samples. Our novel search algorithm performs global optimization with respect to all acoustic evidence. Unlike previous work in this field, it is able to generate phoneme graphs. Phoneme graphs are an efficient representation of alternative pronunciations and can be used to estimate stochastic pronunciation models. In preliminary experiments on the VERBMOBIL corpus, phoneme error rates below 5% with respect to the standard pronunciation were achieved with 20 sample utterances per word. This is a 90% relative improvement over the 50% error rate obtained with only a single utterance per word.
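The benefit of pooling multiple samples can be illustrated with a much-simplified stand-in for the graph-based global optimization described above: instead of searching a phoneme graph, the sketch below simply picks the recognized phoneme sequence with the smallest total edit distance to all other samples (the medoid). The phoneme sequences are invented examples.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (single-row DP)."""
    d = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, pb in enumerate(b, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j].
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (pa != pb))  # (mis)match
    return d[-1]

def consensus_pronunciation(hypotheses):
    """Pick the hypothesis closest (in summed edit distance) to all samples."""
    return min(hypotheses,
               key=lambda h: sum(edit_distance(h, o) for o in hypotheses))

# Three noisy phoneme decodings of the same word (invented data);
# the middle hypothesis agrees best with both others and is chosen.
samples = [("s", "iy", "t"), ("s", "iy", "d"), ("z", "iy", "d")]
print(consensus_pronunciation(samples))
```

Errors that occur in only a minority of the samples are outvoted, which is why the error rate drops so sharply as the number of sample utterances grows; the actual algorithm achieves this jointly over a phoneme graph rather than by selecting one hypothesis.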
The central objective of discriminative training criteria is to consider both class-internal and class-external data when training pattern recognition systems, in order to improve class separability and thus recognition performance. In this project, discriminative training criteria were applied to speech recognition and image object recognition applications.
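The contrast between a purely class-internal criterion and a discriminative one can be made concrete with a toy two-class example. The numbers and the two-class setup below are illustrative assumptions only; they show the shape of a Maximum-Likelihood (ML) versus a Maximum-Mutual-Information (MMI) objective for a single training sample.

```python
import math

# Illustrative joint scores p(x, class) for one training sample whose
# correct class is "a"; "b" is the competing class (invented numbers).
scores = {"a": 0.20, "b": 0.05}
correct = "a"

# ML looks only at the correct class (class-internal data):
# maximize log p(x, correct).
ml_objective = math.log(scores[correct])

# MMI additionally discounts the competing classes (class-external
# data): maximize the log posterior of the correct class.
mmi_objective = math.log(scores[correct] / sum(scores.values()))

print(f"ML:  {ml_objective:.4f}")
print(f"MMI: {mmi_objective:.4f}")
```

Raising the score of a competing class lowers the MMI objective but leaves the ML objective untouched; this is the sense in which discriminative criteria reward class separability rather than fit alone.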
For the evaluation of discriminative training criteria, approaches were developed to unify the different criteria and their corresponding optimization methods. Furthermore, certain asymptotic properties of several discriminative criteria were proved, which closely relate these criteria to the true Bayes error rate.
To further improve the efficiency of discriminative training for large vocabulary speech recognition applications, methods were developed to speed up and restrict the search for competing word hypotheses on the training data.
Experiments on large vocabulary discriminative training were initially performed on clean read speech only. To investigate the robustness of discriminative training criteria, extensive experiments were then performed on spontaneous speech tasks. Unfortunately, the improvements on the German VERBMOBIL task were modest: in the best case, the word error rate was reduced from 17.6% for Maximum-Likelihood training to 17.1% for Maximum-Mutual-Information training.