Signal Analysis

Improved feature extraction

The objective of signal analysis is to produce a parameterization of the speech signal suitable for automatic speech recognition. Ideally, the speech waveform would be modelled directly, but today's modelling techniques are not well suited to processing the raw waveform. Signal analysis therefore aims at separating information relevant to the recognition task from irrelevant information (e.g. speaker or channel characteristics) and at reducing the amount of data presented to the speech recognizer.

Alternative compression functions
In signal analysis, a high-resolution Fourier spectrum of the speech signal is computed. The spectra pass through a triangular filterbank, and the dynamic range of each filterbank coefficient is reduced by taking the logarithm. It has been reported that, especially under poor acoustic conditions, functions other than the logarithm (e.g. power functions) result in better performance of the speech recognizer.
In our tests on German microphone and telephone data only small gains were obtained using a power function (e.g. from 26.1% to 25.9% on the German spontaneous speech corpus VERBMOBIL II).
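The two kinds of compression compared above can be sketched as follows. This is a minimal illustration, not the configuration used in the experiments: the function names, the floor value, and the exponent 0.1 are assumptions chosen only for the example.

```python
import numpy as np

def compress_log(fbank, floor=1e-10):
    """Logarithmic compression; the floor (an assumed value) avoids log(0)."""
    return np.log(np.maximum(fbank, floor))

def compress_power(fbank, exponent=0.1):
    """Power-function compression; the exponent is an illustrative choice."""
    return np.power(np.maximum(fbank, 0.0), exponent)

# Hypothetical filterbank outputs spanning eight orders of magnitude.
fbank = np.array([1e-4, 1.0, 1e4])
log_out = compress_log(fbank)
pow_out = compress_power(fbank)
print(log_out)
print(pow_out)
```

Both functions are monotonic and reduce the dynamic range. Unlike the logarithm, the power function stays finite for a filterbank output of zero, so it needs no floor.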
FFT Cepstrum
The frequency axis of the power spectrum is typically transformed according to the Mel frequency function by redistributing the FFT coefficients (thereby simulating the varying frequency resolution of the human ear). Subsequently, the spectra are smoothed with a bank of overlapping triangular frequency filters, log-compressed, and cosine transformed in order to suppress irrelevant fluctuations and decorrelate the filter channels.
In this work, we investigated a new method that omits the filterbank and integrates the transformation of the frequency axis into the cosine transform. This simplifies the signal analysis front-end while achieving the same recognition performance as the baseline approach. In fact, the word error rate could even be reduced (4% relative on VERBMOBIL II), as the new method gives better control over the amount of spectral smoothing.
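For orientation, the baseline front-end described above (power spectrum, Mel-spaced triangular filters, logarithm, cosine transform) can be sketched for a single frame as below. This follows a common textbook formulation and is not the group's implementation; the Mel formula constants, the Hamming window, and all parameter values (512-point FFT, 20 filters, 12 coefficients) are assumptions. The new method with the integrated frequency warping is not shown.

```python
import numpy as np

def hz_to_mel(f):
    # Widely used Mel formula; the exact variant is an assumption.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Overlapping triangular filters with edges equally spaced on the Mel axis."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis used to decorrelate the filter channels."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)

def mfcc_frame(frame, sample_rate, n_fft=512, n_filters=20, n_ceps=12):
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2      # power spectrum
    fbank = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_fbank = np.log(np.maximum(fbank, 1e-10))           # log compression
    return dct_matrix(n_ceps, n_filters) @ log_fbank       # cosine transform

# Hypothetical 25 ms frame of noise at 16 kHz.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
ceps = mfcc_frame(frame, sample_rate=16000)
print(ceps.shape)
```

In this sketch the amount of spectral smoothing is fixed by the number and width of the triangular filters, which is the quantity the new method is reported to control more directly.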
Phase Features
New acoustic features for speech recognition based on the short-term Fourier phase spectrum are introduced for mono (telephone) recordings. The new phase-based features were used together with standard Mel Frequency Cepstral Coefficients (MFCC), followed by Linear Discriminant Analysis (LDA) to choose the most relevant features. Combining the new phase features with MFCCs gave relative improvements in word error rate of up to 23% over MFCCs alone, with the same overall number of parameters in the system, for the recognition of German digit strings recorded over telephone lines.
Telephone recognizer
Using data collected within the VERBMOBIL project, we built a recognizer for conversational German speech over the telephone. A number of tests were conducted with the aim of improving recognition performance by supplementing the small telephone training corpus with additional microphone data. In the end, we achieved a word error rate of 31.1% on the telephone data, which compares to 25.4% on microphone data recorded in parallel.

Acoustic normalization
The complexity of automatic speech recognition tasks has increased dramatically in recent years. The focus has shifted from the transcription of clean read speech to spontaneous speech, recordings in severe acoustic conditions (over telephone or in cars), and scenarios with a large mismatch between training and test conditions. In this context, there is a strong need for acoustic features that contain the information relevant for speech recognition and are robust against channel distortions, noise, and similar phenomena. One approach is to normalize the acoustic vectors and thereby remove irrelevant variations.

Histogram Normalization
Mismatch between training and test conditions may result in a severe degradation of the recognition performance. To cope with that problem, we use histogram normalization, which was tested at different stages of the signal analysis (filterbank outputs, cepstral coefficients, LDA-transformed acoustic vectors). During training the software computes the cumulative histogram over each individual vector component. In testing the same histogram is computed and the test vectors are transformed such that they match the distribution of the training data. Histogram normalization was successfully applied to noisy speech recorded in a car. It also improved the performance of the German VERBMOBIL spontaneous speech recognizer.
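The matching of test data to the training distribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the software described above: the function names are hypothetical, the training distribution is tabulated at 101 quantile points rather than as a binned histogram, and the test CDF is estimated from ranks over the whole test set.

```python
import numpy as np

def fit_reference(train, n_points=101):
    """Tabulate the inverse cumulative distribution of each vector component."""
    probs = np.linspace(0.0, 1.0, n_points)
    return probs, np.quantile(train, probs, axis=0)   # shape (n_points, dim)

def histogram_normalize(test, probs, ref_quantiles):
    """Map each test component so that its distribution matches the training data."""
    out = np.empty_like(test, dtype=float)
    n = test.shape[0]
    for d in range(test.shape[1]):
        ranks = np.argsort(np.argsort(test[:, d]))    # rank of every test value
        cdf = (ranks + 0.5) / n                       # empirical test CDF
        out[:, d] = np.interp(cdf, probs, ref_quantiles[:, d])
    return out

# Hypothetical data: test vectors are scaled and shifted relative to training.
rng = np.random.default_rng(0)
train = rng.standard_normal((5000, 3))
test = 2.0 * rng.standard_normal((5000, 3)) + 3.0
probs, ref = fit_reference(train)
normalized = histogram_normalize(test, probs, ref)
print(normalized.mean(axis=0), normalized.std(axis=0))
```

Each component is treated independently, and the mapping is monotonic, so the ordering of values within a component is preserved while the mismatch in scale and offset is removed.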

Noise Robustness
In many practical applications speech recognition systems have to work in adverse environmental conditions. Frequency distortions and noises caused by the transmission are typical for telephone applications. Considerable amounts of varying background noise are a problem for all mobile applications such as cellular phones or speech controlled systems in cars. The recognition error rates of speech recognition systems using standard methods usually rise considerably in these conditions. The noise robustness can be increased by suppressing the contribution of the noise during acoustic feature extraction and/or adapting the acoustic models to the current noise condition.

Noise Level Normalization and Reference Adaptation
A method to normalize the noise level of speech signals at the outputs of the Mel-scaled filterbank used in MFCC feature extraction was investigated. An adaptive normalizing function that distinguishes between speech and silence parts of the signal was used to normalize the noise level without altering the speech parts of the signal. This technique was combined with an adaptation of the reference vectors, depending on the average norm of the incoming feature vectors. On a database with training data recorded in an office environment and testing data recorded in cars, the word error rate could be reduced from 35.5% to 14.7% for the city traffic testing set and from 78.0% to 24.1% for the highway testing set.
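To put these figures on the same footing as the relative improvements quoted elsewhere in this text, the relative reduction in word error rate can be computed as below. The helper name is hypothetical; the input numbers are the ones reported above.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

city = relative_reduction(35.5, 14.7)
highway = relative_reduction(78.0, 24.1)
print(f"city traffic: {100 * city:.1f}% relative")   # 58.6% relative
print(f"highway: {100 * highway:.1f}% relative")     # 69.1% relative
```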
Quantile Based Histogram Equalization
This method increases the noise robustness by transforming the signal after Mel-scaled filtering so that the cumulative distribution functions of the signal's values in recognition match those estimated on the training data. The cumulative distribution functions are approximated using a small number of quantiles. Recognition tests on several databases showed significant reductions of the word error rates. On the database mentioned above, this method led to error rates of 13.6% for the city data and 21.8% for the highway data.
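The idea of approximating the distributions with a few quantiles can be sketched as follows. This is a simplified stand-in, not the transformation used in the cited work: it maps the test quantiles onto the training quantiles with a piecewise-linear function, and the function name, the choice of five quantile points, and the synthetic data are assumptions.

```python
import numpy as np

def quantile_equalize(test, train_quantiles, probs):
    """Piecewise-linear map that moves the test quantiles onto the training quantiles."""
    test_quantiles = np.quantile(test, probs, axis=0)
    out = np.empty_like(test, dtype=float)
    for d in range(test.shape[1]):
        out[:, d] = np.interp(test[:, d], test_quantiles[:, d], train_quantiles[:, d])
    return out

# Hypothetical filterbank-like data; only five quantile points per channel are stored.
rng = np.random.default_rng(1)
probs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
train = rng.standard_normal((4000, 3))
test = 0.5 * rng.standard_normal((4000, 3)) + 2.0
train_q = np.quantile(train, probs, axis=0)
equalized = quantile_equalize(test, train_q, probs)
print(np.quantile(equalized, probs, axis=0))
```

Because only a handful of quantiles has to be estimated, a mapping of this kind can be computed from far less test data than a full histogram would need.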

F. Hilger and H. Ney: "Quantile Based Histogram Equalization for Noise Robust Speech Recognition". In Proceedings of the 7th European Conference on Speech Communication and Technology. Aalborg, Denmark, September 2001.
F. Hilger and H. Ney: "Noise Level Normalization and Reference Adaptation for Robust Speech Recognition". In ASR2000 - International Workshop on Automatic Speech Recognition: Challenges for the New Millennium, pp. 64-68. Paris, France, September 2000.