Improved feature extraction
The objective of signal analysis is to produce a parameterization of the speech
signal suitable for automatic speech recognition. Although the speech
waveform would ideally be modelled directly, today's modelling techniques are
not suited to processing the raw waveform optimally. Signal analysis aims
at separating information relevant for the recognition task from irrelevant
information (e.g. speaker or channel characteristics) and at reducing the
amount of data that is presented to the speech recognizer.
Alternative compression functions
In signal analysis, a high resolution Fourier spectrum of the speech signal
is computed. The spectra pass a triangular filterbank and the dynamic range
of each filterbank coefficient is reduced by taking the logarithm.
It has been reported that, especially under poor acoustic conditions, functions
other than the logarithm (e.g. power functions) result in better recognition
performance. In our tests on German microphone and telephone data, only
small gains were obtained using a power function (e.g. from
26.1% to 25.9% on the German spontaneous speech corpus VERBMOBIL II).
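The two kinds of compression can be compared with a minimal sketch. The function names, the flooring constant, and the exponent 0.1 are illustrative assumptions; the text above does not state which exponent was used in the tests.

```python
import numpy as np

def log_compress(fbank, floor=1e-10):
    """Logarithmic compression of non-negative filterbank outputs.

    The floor (an assumed value) avoids log(0) for empty channels.
    """
    return np.log(np.maximum(fbank, floor))

def power_compress(fbank, exponent=0.1):
    """Power-function (root) compression.

    The exponent is a hypothetical choice; smaller values compress more.
    """
    return np.power(np.maximum(fbank, 0.0), exponent)

if __name__ == "__main__":
    # Filterbank outputs spanning eight orders of magnitude.
    fbank = np.array([1e-4, 1e-2, 1.0, 1e2, 1e4])
    print("log  :", log_compress(fbank))
    print("power:", power_compress(fbank))
```

Both functions are monotonic and reduce the dynamic range. The power function stays bounded as the input approaches zero, whereas the logarithm diverges, which is one commonly cited reason for trying it on noisy data.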
The frequency axis of the power spectrum is typically transformed
according to the Mel frequency function by redistributing the FFT coefficients (thereby
simulating the varying frequency resolution of the human ear). Subsequently,
the spectra are smoothed with a bank of overlapping triangular frequency
filters, log-compressed, and cosine transformed in order to suppress irrelevant
fluctuations and decorrelate the filter channels.
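The baseline chain described above can be sketched for a single frame as follows. This is an illustration under assumptions, not the group's front-end: the Mel formula 2595 log10(1 + f/700), the Hamming window, the FFT size, and the filter and coefficient counts are common choices that the text does not specify, and the frequency warping is realized here by spacing the triangular filters uniformly on the Mel axis.

```python
import numpy as np

def hz_to_mel(f):
    # A widely used Mel formula (assumed; the text does not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_fft, sample_rate):
    """Overlapping triangular filters, equally spaced on the Mel axis."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(n_ceps, n_filters):
    """Unnormalized DCT-II basis used to decorrelate the filter channels."""
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_filters)

def mfcc_frame(frame, sample_rate, n_fft=512, n_filters=20, n_ceps=13):
    """Power spectrum -> Mel filterbank -> log -> cosine transform."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    fbank = triangular_filterbank(n_filters, n_fft, sample_rate) @ power
    log_fbank = np.log(np.maximum(fbank, 1e-10))
    return dct_matrix(n_ceps, n_filters) @ log_fbank

if __name__ == "__main__":
    t = np.arange(400) / 16000.0          # one 25 ms frame at 16 kHz
    frame = np.sin(2.0 * np.pi * 1000.0 * t)
    print(mfcc_frame(frame, 16000))
```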
In this work, we investigated a new method that omits the filterbank and
integrates the transformation of the frequency axis into the cosine transform.
This simplifies the signal analysis front-end and achieves the same recognition
performance as the baseline approach. In fact, the word error rate could even
be reduced (4% relative on VERBMOBIL II) as the new method gives
better control over the amount of spectral smoothing.
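One way such an integration could look is sketched below: the cosine basis is evaluated at Mel-warped frequency positions and applied directly to the log power spectrum, with each FFT bin weighted by its width on the warped axis (trapezoidal rule). This is an assumption-laden illustration, not necessarily the formulation used in this work; the Mel formula, the function names, and the number of coefficients are hypothetical choices.

```python
import numpy as np

def hz_to_mel(f):
    # A widely used Mel formula (assumed; the text does not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def warped_cosine_matrix(n_ceps, n_fft, sample_rate):
    """Cosine basis sampled on the Mel-warped frequency axis."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    warped = hz_to_mel(freqs) / hz_to_mel(sample_rate / 2.0)  # in [0, 1]
    # Trapezoidal integration weights on the warped axis (they sum to 1).
    step = np.diff(warped)
    weights = np.zeros(n_bins)
    weights[:-1] += 0.5 * step
    weights[1:] += 0.5 * step
    k = np.arange(n_ceps)[:, None]
    return np.cos(np.pi * k * warped[None, :]) * weights[None, :]

def warped_cepstrum(power_spectrum, sample_rate, n_ceps=13):
    """Cepstral coefficients computed without an explicit filterbank."""
    n_fft = 2 * (len(power_spectrum) - 1)
    log_spec = np.log(np.maximum(power_spectrum, 1e-10))
    return warped_cosine_matrix(n_ceps, n_fft, sample_rate) @ log_spec

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spectrum = rng.uniform(0.1, 1.0, size=257)  # stand-in power spectrum
    print(warped_cepstrum(spectrum, 16000))
```

In a scheme of this kind no separate filterbank stage is needed, and the number of retained cosine coefficients is what limits the spectral detail that survives.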
New acoustic features for speech recognition
based on the short-term Fourier phase spectrum were introduced for
mono (telephone) recordings. The new phase-based features were
used together with standard Mel Frequency Cepstral Coefficients
(MFCC), followed by Linear Discriminant Analysis (LDA) to
select the most relevant features.
Using the new phase features together with MFCCs, the word error rate
improved by up to 23% relative to using MFCCs alone, with the same overall
number of parameters in the system, for the recognition of German digit
strings recorded over telephone lines.
Using data collected within the VERBMOBIL
project, we built a recognizer for conversational German speech
over the telephone. A number of tests were conducted with the aim of
improving recognition performance by supplementing
the small telephone training corpus with additional microphone data.
In the end, we achieved a word error rate of 31.1% on the telephone
data, which compares to 25.4% on microphone data recorded in parallel.
The complexity of automatic speech recognition tasks has increased dramatically
in recent years. The focus has shifted from the transcription of clean read
speech to spontaneous speech, recordings in severe acoustic conditions (over
telephone or in cars), and scenarios with a large mismatch between training
and test conditions. In this context, there is a strong need for acoustic
features that contain the information relevant for speech recognition
and are robust against channel distortions, noise, and similar phenomena. One
approach is to normalize the acoustic vectors and thereby remove irrelevant
information.
Mismatch between training and test
conditions may result in a severe degradation of the recognition
performance. To cope with that problem, we use histogram
normalization, which was tested at different stages
of the signal analysis (filterbank outputs, cepstral coefficients,
LDA-transformed acoustic vectors). During training, the cumulative
histogram of each individual vector component is computed. In testing,
the same histogram is computed on the test data and the test vectors are
transformed such that they match the distribution of the training data.
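The train/test procedure above can be sketched as follows. This is a minimal illustration under assumptions: the training histogram is stored as 101 points of the empirical cumulative distribution per component, the test histogram is taken over the whole batch of test vectors, and all names are hypothetical.

```python
import numpy as np

def fit_reference(train, n_points=101):
    """Store the training CDF of each component as a table of quantiles.

    train has shape (n_frames, dim); the result has shape (n_points, dim).
    """
    probs = np.linspace(0.0, 1.0, n_points)
    return probs, np.quantile(train, probs, axis=0)

def histogram_normalize(test, probs, train_quantiles):
    """Map each test component onto the training distribution."""
    n_frames, dim = test.shape
    out = np.empty_like(test, dtype=float)
    for d in range(dim):
        # Empirical CDF value of every test sample within the test data.
        ranks = np.argsort(np.argsort(test[:, d]))
        cdf = (ranks + 0.5) / n_frames
        # Inverse training CDF: look up the training value at that level.
        out[:, d] = np.interp(cdf, probs, train_quantiles[:, d])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, size=(5000, 3))
    # Simulated mismatch: test data is scaled and shifted.
    test = 2.0 * rng.normal(0.0, 1.0, size=(2000, 3)) + 5.0
    probs, ref = fit_reference(train)
    normalized = histogram_normalize(test, probs, ref)
    print("test mean before:", test.mean(axis=0))
    print("test mean after :", normalized.mean(axis=0))
```

Because the mapping is monotonic per component, the ordering of the test values is preserved; only their distribution changes.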
Histogram normalization was successfully applied to noisy speech recorded
in a car. It also improved the performance of the German VERBMOBIL
spontaneous speech recognizer.
In many practical applications speech recognition systems have to work in
adverse environmental conditions. Frequency distortions and noise
caused by the transmission are typical for telephone applications.
Considerable amounts of varying background noise are a problem for all
mobile applications such as cellular phones or speech controlled systems in
cars. The recognition error rates of speech recognition systems using standard
methods usually rise considerably in these conditions. The noise
robustness can be increased by suppressing the contribution of the
noise during acoustic feature extraction and/or adapting the
acoustic models to the current noise condition.
Noise Level Normalization and Reference Adaptation
A method to normalize the noise level of speech signals at the
outputs of the Mel scaled filter-bank used in MFCC-feature
extraction was investigated.
An adaptive normalizing function that distinguishes between speech and
silence parts of the signal was used to normalize the noise level,
without altering the speech parts of the signal. This technique
was combined with an adaptation of the reference vectors,
depending on the average norm of the incoming feature vectors.
On a database with training data recorded in an office environment
and test data recorded in cars, the word error rate could be
reduced from 35.5% to 14.7% for the city traffic test
set and from 78.0% to 24.1% for the highway test set.
Quantile Based Histogram Equalization
This method increases the noise robustness by transforming the
signal after Mel-scaled filtering to make the cumulative distribution
functions of the signal's values in recognition match those
estimated on the training data. The cumulative distribution functions
are approximated using a small number of quantiles. Recognition
tests on several databases showed significant reductions of the
word error rates. On the database mentioned above, this method led
to error rates of 13.6% for the city data and 21.8% for
the highway data.
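The idea of matching distributions through a few quantiles can be sketched as below. The text does not specify the transformation function, so this illustration substitutes a simple piecewise-linear mapping between corresponding quantiles; the number of quantiles and all names are assumptions.

```python
import numpy as np

def channel_quantiles(fbank, n_quantiles=4):
    """Quantiles per filter channel, including minimum and maximum.

    fbank has shape (n_frames, n_channels); the result has shape
    (n_quantiles + 1, n_channels).
    """
    probs = np.linspace(0.0, 1.0, n_quantiles + 1)
    return np.quantile(fbank, probs, axis=0)

def quantile_equalize(fbank, train_q):
    """Map the test quantiles of each channel onto the training quantiles.

    Values between two quantiles are interpolated linearly (a
    simplification chosen for this sketch). A channel whose test
    quantiles coincide would need special handling.
    """
    test_q = channel_quantiles(fbank, train_q.shape[0] - 1)
    out = np.empty_like(fbank, dtype=float)
    for ch in range(fbank.shape[1]):
        out[:, ch] = np.interp(fbank[:, ch], test_q[:, ch], train_q[:, ch])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.gamma(2.0, 1.0, size=(3000, 20))        # stand-in training data
    noisy = rng.gamma(2.0, 1.0, size=(500, 20)) + 1.5   # raised noise floor
    train_q = channel_quantiles(train)
    equalized = quantile_equalize(noisy, train_q)
    print("median before:", np.median(noisy, axis=0)[:3])
    print("median after :", np.median(equalized, axis=0)[:3])
    print("train median :", train_q[2, :3])
```

Since only a handful of quantiles have to be estimated, a scheme of this kind can work with much less test data than a full histogram.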
Florian Hilger and Hermann Ney.
"Quantile Based Histogram Equalization for Noise Robust Speech Recognition".
Proceedings of the 7th European Conference on Speech Communication and Technology,
Aalborg, Denmark, September 2001.
Florian Hilger and Hermann Ney.
"Noise Level Normalization and Reference Adaptation for Robust Speech Recognition".
ASR2000 - International Workshop on Automatic Speech Recognition: Challenges for the New Millennium,
pp. 64-68, Paris, France, September 2000.