Speaker Normalization and Adaptation
The aim of speaker normalization and adaptation is to remove speaker-dependent variations and thus to model the relevant details of the speech signal more accurately.
The words of the recognition system's vocabulary are usually modelled as sequences of phonemes given by the so-called pronunciation lexicon. To capture the coarticulation effects present in fluent speech, these phonemes are usually modelled dependent on their immediately neighbouring phonemes, i.e. their context. Across-word models capture the context dependency of the phonemes not only within words, which is relatively easy to implement, but also across word boundaries. Across-word models are important for high recognition performance.
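The expansion of a word sequence into context-dependent phoneme models can be sketched as follows. This is a minimal illustration, not the system's actual data structures: the toy lexicon, the `#` boundary symbol, and the `left-phone+right` label notation are assumptions chosen for readability.

```python
# Illustrative toy lexicon (assumption, not the real pronunciation lexicon).
LEXICON = {
    "good": ["g", "uh", "d"],
    "morning": ["m", "ao", "r", "n", "ih", "ng"],
}

def triphones(words, lexicon, boundary="#"):
    """Expand a word sequence into triphone labels 'left-phone+right'.

    Contexts are taken over the whole utterance, so phonemes at word
    boundaries receive their context from the neighbouring word
    (across-word modelling) instead of a generic boundary symbol.
    """
    phones = [p for w in words for p in lexicon[w]]
    padded = [boundary] + phones + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# The 'd' of "good" sees the 'm' of "morning" as its right context,
# and 'm' sees 'd' as its left context: the word boundary is crossed.
print(triphones(["good", "morning"], LEXICON))
```

A within-word system would instead pad each word separately with the boundary symbol, losing exactly the cross-boundary contexts shown here.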
When across-word models are used, the word boundaries can be modelled in two different ways.
Since the transcription of the training data contains no information about which kind of word transition occurred, this decision has to be made during training. Different methods for making this decision were investigated and compared with respect to complexity and word error rate.
Compared to the baseline within-word system, across-word models yielded relative improvements in word error rate of 7% to 10% on several corpora.
In large vocabulary speech recognition, words are modelled as sequences of (usually phonetically motivated) sub-word units, compiled in a pronunciation dictionary. Extending the vocabulary requires phonetic transcriptions to be known for all words. Generating these manually is a time-consuming and error-prone task. Moreover, spontaneous and conversational speech often deviates considerably from the standard pronunciation, a mismatch that impairs recognition performance considerably. The goals of this project are to improve the pronunciation models of common words and to find baseforms for novel words automatically.
We developed a method for determining the optimal (maximum likelihood) pronunciation of a word from acoustic sample utterances. Since unrestricted phoneme recognition on a single utterance has an unacceptably high error rate, our algorithm operates on multiple samples. Our novel search algorithm performs global optimization with respect to all acoustic evidence. Unlike previous work in this field, it is able to generate phoneme graphs. Phoneme graphs are an efficient representation of alternative pronunciations and can be used to estimate stochastic pronunciation models. In preliminary experiments on the VERBMOBIL corpus, phoneme error rates below 5% with respect to the standard pronunciation were achieved with 20 sample utterances per word. This is a 90% relative improvement over the 50% error rate obtained with only a single utterance per word.
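The benefit of pooling multiple samples can be illustrated with a much-simplified stand-in for the graph-based global optimization described above: instead of searching a phoneme graph, the sketch below simply picks the recognized phoneme sequence with the smallest total edit distance to all other samples (the medoid). The phoneme sequences are invented examples.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (single-row DP)."""
    d = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, pb in enumerate(b, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j].
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (pa != pb))  # (mis)match
    return d[-1]

def consensus_pronunciation(hypotheses):
    """Pick the hypothesis closest (in summed edit distance) to all samples."""
    return min(hypotheses,
               key=lambda h: sum(edit_distance(h, o) for o in hypotheses))

# Three noisy phoneme decodings of the same word (invented data);
# the middle hypothesis agrees best with both others and is chosen.
samples = [("s", "iy", "t"), ("s", "iy", "d"), ("z", "iy", "d")]
print(consensus_pronunciation(samples))
```

Errors that occur in only a minority of the samples are outvoted, which is why the error rate drops so sharply as the number of sample utterances grows; the actual algorithm achieves this jointly over a phoneme graph rather than by selecting one hypothesis.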
The central objective of discriminative training criteria is to consider both class-internal and class-external data when training pattern recognition systems, in order to improve class separability and thus recognition performance. In this project, discriminative training criteria were applied to speech recognition and image object recognition applications.
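The contrast between a purely class-internal criterion and a discriminative one can be made concrete with a toy two-class example. The numbers and the two-class setup below are illustrative assumptions only; they show the shape of a Maximum-Likelihood (ML) versus a Maximum-Mutual-Information (MMI) objective for a single training sample.

```python
import math

# Illustrative joint scores p(x, class) for one training sample whose
# correct class is "a"; "b" is the competing class (invented numbers).
scores = {"a": 0.20, "b": 0.05}
correct = "a"

# ML looks only at the correct class (class-internal data):
# maximize log p(x, correct).
ml_objective = math.log(scores[correct])

# MMI additionally discounts the competing classes (class-external
# data): maximize the log posterior of the correct class.
mmi_objective = math.log(scores[correct] / sum(scores.values()))

print(f"ML:  {ml_objective:.4f}")
print(f"MMI: {mmi_objective:.4f}")
```

Raising the score of a competing class lowers the MMI objective but leaves the ML objective untouched; this is the sense in which discriminative criteria reward class separability rather than fit alone.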
For the evaluation of discriminative training criteria, approaches were developed to unify the different criteria and their corresponding optimization methods. Furthermore, certain asymptotic properties of several discriminative criteria were proved, which closely relate these criteria to the true Bayes error rate.
To further improve the efficiency of discriminative training for large vocabulary speech recognition applications, methods were developed to speed up and restrict the search for competing word hypotheses on the training data.
Experiments on large vocabulary discriminative training were initially performed on clean read speech only. To investigate the robustness of discriminative training criteria, extensive experiments were then performed on spontaneous speech tasks. Unfortunately, the improvements on the German VERBMOBIL task were modest: in the best case, the word error rate was reduced from 17.6% for Maximum-Likelihood training to 17.1% for Maximum-Mutual-Information training.