Speech Recognition

Speech Recognition

Architecture of an automatic speech recognition system

Today, state-of-the-art systems for automatic speech recognition are based on the statistical approach of Bayes decision rule. The implementation of Bayes decision rule for automatic speech recognition is based on two kinds of stochastic models: the acoustic model and the language model which together are the basis for the decision process itself, i.e. the search for the most probable sentence. These modules of an automatic speech recognition system are characterized as follows:

The acoustic model captures the acoustic properties of speech and provides the probability of the observed acoustic signal given a hypothesized word sequence. The acoustic model includes:
1. The acoustic analysis which parameterizes the speech input into a sequence of acoustic vectors.
2. Acoustic models for the smallest sub-word units, i.e. phonemes which usually are modelled context dependent.
3. The pronunciation lexicon, which defines the decomposition of the words into the subword units.
Topology and search space for a Hidden Markov Model (HMM) for the word "sieben"

Speech waveform of the utterance "Sollen wir am Sonntag nach Berlin fahren", and the corresponding FFT spectrum
The language model captures the linguistic properties of the language and provides the a-priori probability of a word sequence. From an information theoretic point of view, syntax, semantics, and pragmatics of the language could also be viewed as redundancies. Because of the stochastic nature of such redundancies, language models usually are based on statistical concepts.
Search realizes Bayes decision criterion on the basis of the acoustic model and the language model. This requires the generation and scoring of competing sentence hypotheses. To obtain the final recognition result, the main objective then is to search for that sentence hypothesis with the best score using dynamic programming. The efficiency of the search process is increased by pruning unlikely hypotheses as early as possible during dynamic programming without affecting the recognition performance.

Signal Analysis

Acoustic Modelling

Search