To classify an observation $x$, we use the Bayesian decision rule

$$x \mapsto r(x) = \arg\max_{k} \{ p(k|x) \} = \arg\max_{k} \{ p(k) \, p(x|k) \}$$

Here, $p(k|x)$ is the class posterior probability of class $k$ given the observation $x$, $p(k)$ is the a priori probability, $p(x|k)$ is the class conditional probability for the observation $x$ given class $k$, and $r(x)$ is the decision of the classifier. This decision rule is known to be optimal with respect to the expected number of decision errors, if the correct distributions are known. This is generally not the case in practical situations, which means that we need to choose appropriate models for the distributions.
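As an illustration of the decision rule, the following minimal sketch picks the class that maximizes the product of prior and class conditional probability, computed in the log domain. The one-dimensional Gaussian class conditionals, the class labels, and all function and variable names are assumptions made for this example only; they are not taken from the text.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (assumed model for p(x|k))."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def bayes_decide(x, priors, params):
    """Return the class k maximizing log p(k) + log p(x|k)."""
    scores = {
        k: math.log(priors[k]) + log_gauss(x, *params[k])
        for k in priors
    }
    return max(scores, key=scores.get)

# Hypothetical two-class problem: (mean, variance) per class.
params = {"a": (0.0, 1.0), "b": (4.0, 1.0)}
equal_priors = {"a": 0.5, "b": 0.5}
skewed_priors = {"a": 0.99, "b": 0.01}

decision_low = bayes_decide(0.5, equal_priors, params)
decision_high = bayes_decide(3.5, equal_priors, params)
# The same observation can be assigned differently once the prior changes.
decision_mid_equal = bayes_decide(2.2, equal_priors, params)
decision_mid_skewed = bayes_decide(2.2, skewed_priors, params)
```

Working with log scores avoids numerical underflow and leaves the maximizing class unchanged, since the logarithm is monotonic.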
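If we denote by $\theta$ the set of free parameters of the distribution, the maximum likelihood approach consists in choosing the parameters $\hat{\theta}$ maximizing the log-likelihood on the training data $\{(x_n, k_n)\}_{n=1}^{N}$:

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}(x_n|k_n) \qquad (1)$$

Alternatively, we can maximize the log-likelihood of the class posteriors,

$$\hat{\theta}_{\mathrm{MMI}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}(k_n|x_n) \qquad (2)$$
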
which is also called discriminative training, since the information of out-of-class data is used. This criterion is often referred to as the mutual information criterion in speech recognition, information theory, and image object recognition [2,8].
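The difference between the two criteria can be made concrete with a small sketch. It evaluates the maximum likelihood criterion (sum of log class conditional probabilities of the training samples) and the class posterior criterion (sum of log posteriors of the correct classes) on toy data. The Gaussian model, the toy samples, and all names are assumptions for illustration; the posterior is obtained from prior and class conditional via Bayes' rule.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (assumed model for p(x|k))."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def ml_criterion(data, params):
    """Sum over samples of log p(x_n | k_n)."""
    return sum(log_gauss(x, *params[k]) for x, k in data)

def posterior_criterion(data, priors, params):
    """Sum over samples of log p(k_n | x_n), using Bayes' rule."""
    total = 0.0
    for x, k in data:
        joint = {c: math.log(priors[c]) + log_gauss(x, *params[c]) for c in priors}
        # Log-sum-exp over all classes gives log p(x) in a stable way.
        m = max(joint.values())
        log_evidence = m + math.log(sum(math.exp(v - m) for v in joint.values()))
        total += joint[k] - log_evidence
    return total

# Hypothetical training samples, all from class "a".
data_a = [(-0.5, "a"), (0.2, "a"), (1.0, "a")]
priors = {"a": 0.5, "b": 0.5}

# Only the competing class "b" differs between the two parameter sets.
params_near = {"a": (0.0, 1.0), "b": (1.0, 1.0)}
params_far = {"a": (0.0, 1.0), "b": (5.0, 1.0)}

ml_near = ml_criterion(data_a, params_near)
ml_far = ml_criterion(data_a, params_far)
post_near = posterior_criterion(data_a, priors, params_near)
post_far = posterior_criterion(data_a, priors, params_far)
```

In this toy setting the maximum likelihood value for the class "a" samples does not depend on the parameters of class "b", whereas the posterior criterion does: moving the competing class away from the samples raises it. This is one way to see why the posterior criterion makes use of out-of-class information.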
Discriminative training was used in [9] to learn the weights of a weighted dissimilarity measure. Used in the nearest neighbor classification rule, this weighted measure significantly improved the accuracy of the classifier in comparison to other distance measures whose parameters were not estimated using discriminative training.
Daniel Keysers
2004-03-10