Architecture of an automatic speech recognition system
Today, state-of-the-art systems for automatic speech recognition are
based on the statistical approach of Bayes decision rule. The
implementation of Bayes decision rule for automatic speech
recognition is based on two kinds of stochastic models: the acoustic
model and the language model which together are the basis for the
decision process itself, i.e. the search
for the most probable sentence. These modules of an automatic speech
recognition system are characterized as follows:
- The acoustic model captures the acoustic properties of
speech and provides the probability of the observed acoustic
signal given a hypothesized word sequence. The acoustic model
includes:
- The acoustic analysis
which parameterizes the speech input into a sequence of acoustic
vectors.
- Acoustic models for the
smallest sub-word units, i.e. phonemes which usually are
modelled context dependent.
-
The pronunciation lexicon, which defines the decomposition of
the words into the subword units.
Topology and search space for a Hidden Markov Model (HMM) for the word "sieben"
Speech waveform of the utterance "Sollen wir am Sonntag nach
Berlin fahren", and the corresponding FFT spectrum
-
The language model captures the linguistic
properties of the language and provides the a-priori probability
of a word sequence. From an information theoretic point of view,
syntax, semantics, and pragmatics of the language could also be
viewed as redundancies. Because of the stochastic nature of such
redundancies, language models usually are based on
statistical concepts.
-
Search realizes Bayes decision criterion on the basis
of the acoustic model and the language model. This requires the
generation and scoring of competing sentence hypotheses. To
obtain the final recognition result, the main objective then is
to search for that sentence hypothesis with the best score using
dynamic programming. The efficiency of the search process is
increased by pruning unlikely hypotheses as early as possible
during dynamic programming without affecting the recognition
performance.