Improved feature extraction
Sirko Molau, Ralf Schlüter, Michael Pitz
The objective of signal analysis is to produce a parameterization of the speech
signal suitable for automatic speech recognition. Ideally the speech waveform
would be modelled directly, but today's modelling techniques cannot process
the raw waveform optimally. Signal analysis therefore aims
at separating information relevant for the recognition task from irrelevant
information (e.g. speaker or channel characteristics) and at reducing the
amount of data that is presented to the speech recognizer.
- Alternative compression functions
In signal analysis, a high resolution Fourier spectrum of the speech signal
is computed. The spectra are passed through a triangular filterbank, and the
dynamic range of each filterbank coefficient is reduced by taking the logarithm.
Functions other than the logarithm (e.g. power functions) have been reported
to give better recognition performance, especially under poor acoustic
conditions.
In our tests on German microphone and telephone data, a power function yielded
only small gains (e.g. an error rate reduction from 26.1% to 25.9% on the
German spontaneous speech corpus VERBMOBIL II).
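The two kinds of compression can be compared with a minimal sketch. The function name, the exponent of 0.1, and the flooring constant are illustrative assumptions; the text above does not state which power function was tested.

```python
import numpy as np

def compress(filterbank_energies, kind="log", exponent=0.1, floor=1e-10):
    """Reduce the dynamic range of non-negative filterbank outputs.

    `exponent` and `floor` are hypothetical values chosen for illustration.
    """
    # Floor the energies so that the logarithm of zero is never taken.
    e = np.maximum(np.asarray(filterbank_energies, dtype=float), floor)
    if kind == "log":
        return np.log(e)
    if kind == "power":
        return e ** exponent
    raise ValueError(f"unknown compression: {kind}")

# Three energies spanning six orders of magnitude.
energies = np.array([1e-3, 1.0, 1e3])
log_out = compress(energies, "log")
pow_out = compress(energies, "power", exponent=0.1)
```

Both functions are monotonic, so the ordering of the energies is preserved. The power function is bounded below by zero, whereas the logarithm grows without bound in magnitude as the energy approaches zero.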
- FFT Cepstrum
The frequency axis of the power spectrum is typically transformed according to
the Mel frequency function by redistributing the FFT coefficients (thereby
simulating the varying frequency resolution of the human ear). Subsequently,
the spectra are smoothed with a bank of overlapping triangular frequency
filters, log-compressed, and cosine transformed in order to suppress irrelevant
fluctuations and decorrelate the filter channels.
In this work, we investigated a new method that omits the filterbank and
integrates the transformation of the frequency axis into the cosine transform.
This simplifies the signal analysis front-end and achieves the same recognition
performance as the baseline approach. In fact, the word error rate could even
be reduced (4% relative on VERBMOBIL II), as the new method gives
better control over the amount of spectral smoothing.
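The exact formulation is not given here, so the following is only one plausible reading of the idea: the log power spectrum is projected directly onto cosines evaluated at the warped frequencies, with integration weights that account for the non-uniform spacing of the warped axis. The Mel formula used, the trapezoidal weights, and all names (`mel`, `warped_cepstrum`, `num_ceps`) are assumptions made for this sketch.

```python
import numpy as np

def mel(f_hz):
    # A commonly used Mel formula; assumed here, not taken from the text.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def warped_cepstrum(power_spectrum, sample_rate, num_ceps=13, floor=1e-10):
    """Cepstral coefficients computed on the power spectrum without a filterbank."""
    p = np.maximum(np.asarray(power_spectrum, dtype=float), floor)
    n_bins = p.shape[-1]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # Warped frequency axis, normalized to the interval [0, pi].
    warped = np.pi * mel(freqs) / mel(sample_rate / 2.0)
    # Trapezoidal integration weights on the non-uniform warped axis.
    d = np.diff(warped)
    weights = np.zeros_like(warped)
    weights[:-1] += d / 2.0
    weights[1:] += d / 2.0
    # Cosine basis evaluated at the warped frequencies.
    k = np.arange(num_ceps)[:, None]
    basis = np.cos(k * warped[None, :]) * weights[None, :] / np.pi
    return np.log(p) @ basis.T

# A flat spectrum with log power 1.0 should have all its energy in coefficient 0.
flat = np.full(257, np.e)
ceps = warped_cepstrum(flat, sample_rate=16000)
```

In this sketch the amount of spectral smoothing is governed by `num_ceps` alone: keeping fewer coefficients discards the faster-varying cosine components of the warped log spectrum.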
- Phase Features
New acoustic features for speech recognition based on the short term Fourier
phase spectrum are introduced for mono (telephone) recordings. The new
phase-based features were combined with standard Mel Frequency Cepstral
Coefficients (MFCC), and a subsequent Linear Discriminant Analysis (LDA) was
applied to select the most relevant features.
On telephone-line recordings of German digit strings, combining the new phase
features with MFCCs improved the word error rate by up to 23% relative to
MFCCs alone, with the same overall number of parameters in the system.
- Telephone recognizer
Using data collected within the VERBMOBIL project, we built a recognizer
for conversational German speech over telephone. A number of tests were
conducted to improve recognition performance by supplementing the small
telephone training corpus with additional microphone data.
In the end, we achieved a word error rate of 31.1% on the telephone
data, which compares to 25.4% on microphone data recorded in parallel.
Acoustic normalization
Sirko Molau
The complexity of automatic speech recognition tasks has increased dramatically
in recent years. The focus has shifted from the transcription of clean read
speech to spontaneous speech, recordings in severe acoustic conditions (over
telephone or in cars), and scenarios with a large mismatch between training
and test conditions. In this context, there is a strong need for acoustic
features that contain the information relevant for speech recognition
and are robust against channel distortions, noise, and similar phenomena. One
approach is to normalize the acoustic vectors and thereby remove irrelevant
variations.
- Histogram Normalization
Mismatch between training and test conditions may result in a severe
degradation of the recognition performance. To cope with that problem, we use
histogram normalization, which was tested at different stages
of the signal analysis (filterbank outputs, cepstral coefficients,
LDA-transformed acoustic vectors). During training, the cumulative histogram
of each individual vector component is computed. In testing, the same histogram
is computed on the test data, and the test vectors are transformed such that
their distribution matches that of the training data.
Histogram normalization was successfully applied to noisy speech recorded
in a car. It also improved the performance of the German VERBMOBIL
spontaneous speech recognizer.
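The training and testing steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the cumulative histograms are represented by 101 evenly spaced quantile points per component, and the function names and the synthetic data are assumptions.

```python
import numpy as np

def fit_reference(train_vectors, num_points=101):
    """Training: sample the cumulative histogram of each vector component.

    Returns an array of shape (num_points, dim) holding per-component quantiles.
    """
    probs = np.linspace(0.0, 1.0, num_points)
    return np.quantile(train_vectors, probs, axis=0)

def histogram_normalize(test_vectors, reference):
    """Testing: map each component so its distribution matches the training one."""
    probs = np.linspace(0.0, 1.0, reference.shape[0])
    test_cdf = np.quantile(test_vectors, probs, axis=0)
    out = np.empty_like(test_vectors, dtype=float)
    for d in range(test_vectors.shape[1]):
        # A value at cumulative probability p in the test data is replaced by
        # the training value at the same probability p.
        out[:, d] = np.interp(test_vectors[:, d], test_cdf[:, d], reference[:, d])
    return out

# Synthetic mismatch: the test data are scaled and shifted relative to training.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 3))
test = 2.0 * rng.normal(0.0, 1.0, size=(5000, 3)) + 5.0
normalized = histogram_normalize(test, fit_reference(train))
```

Each component is treated independently, so correlations between components are left untouched. The same two functions can be applied at any of the stages listed above, since they only assume a matrix of vectors.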
Noise Robustness
Florian Hilger
In many practical applications speech recognition systems have to work in
adverse environmental conditions. Frequency distortions and noise caused by
the transmission are typical for telephone applications. Considerable amounts
of varying background noise are a problem for all mobile applications such as
cellular phones or speech controlled systems in cars.
The recognition error rates of speech recognition systems using standard
methods usually rise considerably in these conditions. The noise robustness
can be increased by suppressing the contribution of the noise during acoustic
feature extraction and/or adapting the acoustic models to the current noise
condition.
- Noise Level Normalization and Reference Adaptation
A method to normalize the noise level of speech signals at the outputs of the
Mel scaled filter-bank used in MFCC feature extraction was investigated.
An adaptive normalizing function that distinguishes between speech and silence
parts of the signal was used to normalize the noise level without altering the
speech parts of the signal. This technique was combined with an adaptation of
the reference vectors, depending on the average norm of the incoming feature
vectors.
On a database with training data recorded in an office environment and testing
data recorded in cars, the word error rate could be reduced from 35.5% to 14.7%
for the city traffic testing set and from 78.0% to 24.1% for the highway
testing set.
- Quantile Based Histogram Equalization
This method increases the noise robustness by transforming the signal after
Mel scaled filtering so that the cumulative density functions of the signal's
values in recognition match those estimated on the training data.
The cumulative density functions are approximated using a small number of
quantiles.
Recognition tests on several databases showed significant reductions of the
word error rates. On the database mentioned above, this method led to error
rates of 13.6% for the city data and 21.8% for the highway data.
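The quantile approximation can be illustrated on a single filter channel. The sketch below uses four quantile intervals and a piecewise-linear mapping between test and training quantiles; both choices, as well as the names and the synthetic data, are assumptions made for illustration, and the transformation function of the published method may differ.

```python
import numpy as np

NUM_INTERVALS = 4  # hypothetical choice: 4 intervals, i.e. 5 quantile points
PROBS = np.linspace(0.0, 1.0, NUM_INTERVALS + 1)

def estimate_quantiles(channel_values):
    """Coarse approximation of the cumulative density function of one channel."""
    return np.quantile(channel_values, PROBS)

def equalize(channel_values, train_quantiles):
    """Map the test quantiles onto the training quantiles, piecewise linearly."""
    test_quantiles = estimate_quantiles(channel_values)
    return np.interp(channel_values, test_quantiles, train_quantiles)

# Synthetic stand-ins for one filter channel: many clean training frames and
# a short test utterance with a compressed range and a raised noise floor.
rng = np.random.default_rng(0)
clean = rng.gamma(shape=2.0, scale=1.0, size=2000)
noisy = 0.5 * rng.gamma(shape=2.0, scale=1.0, size=300) + 1.5
equalized = equalize(noisy, estimate_quantiles(clean))
```

Only five numbers per channel have to be estimated from the test signal, which is why a small number of quantiles remains usable when little test data is available.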
- Florian Hilger and Hermann Ney.
"Quantile Based Histogram Equalization for Noise Robust Speech Recognition".
Proceedings of the 7th European Conference on Speech Communication and Technology,
Aalborg, Denmark, September 2001.
- Florian Hilger and Hermann Ney.
"Noise Level Normalization and Reference Adaptation for Robust Speech Recognition".
In ASR2000 - International Workshop on Automatic Speech Recognition: Challenges for the New Millennium,
pp. 64-68. Paris, France, September 2000.
Last modified November 2, 2001