Next: Maximum Entropy, Gaussian and Up: Comparison of Log-Linear Models Previous: Introduction

Classification Framework

To classify an observation $ x \in {{\rm I\!R}}^D$, we use the Bayesian decision rule

$\displaystyle x \,\, \longmapsto \,\, r(x) = \mathop{\mbox{argmax}}\limits_{k} \left\{ p(k\vert x) \right\} = \mathop{\mbox{argmax}}\limits_{k} \left\{ p(k) \cdot p(x\vert k) \right\}.$

Here, $ p(k\vert x)$ is the posterior probability of class $ k\in \{1,\ldots,K\}$ given the observation $ x$, $ p(k)$ is the a priori probability of class $ k$, $ p(x\vert k)$ is the class-conditional probability of the observation $ x$ given class $ k$, and $ r(x)$ is the decision of the classifier. The second equality holds because $ p(k\vert x) = p(k) \cdot p(x\vert k) / p(x)$ and $ p(x)$ does not depend on $ k$. This decision rule is known to minimize the expected number of decision errors if the correct distributions are known. In practice they generally are not, so we need to choose appropriate models for the distributions.
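As an illustration of the decision rule, the following minimal sketch evaluates $ p(k) \cdot p(x\vert k)$ in the log domain and returns the maximizing class. It assumes diagonal-covariance Gaussians for $ p(x\vert k)$, which is only one possible model choice; the function names and all numbers are hypothetical, and classes are indexed from 0 rather than 1.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (an assumed model for p(x|k))."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def bayes_decide(x, priors, means, variances):
    """Return argmax_k p(k) * p(x|k), computed as argmax_k log p(k) + log p(x|k)."""
    scores = [
        math.log(p) + log_gaussian(x, m, v)
        for p, m, v in zip(priors, means, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical classes in R^2; every value here is made up for illustration.
priors = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

decision_near_origin = bayes_decide([0.2, -0.1], priors, means, variances)
decision_near_three = bayes_decide([2.9, 3.4], priors, means, variances)
```

Working with logarithms does not change the argmax and avoids numerical underflow when $ D$ is large.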

If we denote by $ \Lambda$ the set of free parameters of the model distributions and by $ (x_n, k_n)$, $ n = 1,\ldots,N$, the training observations with their class labels, the maximum likelihood approach consists in choosing the parameters $ \hat{\Lambda}$ that maximize the log-likelihood of the training data:

$\displaystyle \hat{\Lambda} = \mathop{\mbox{argmax}}_\Lambda \sum_n \log p_\Lambda(x_n\vert k_n)$     (1)
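For some model families the maximization in (1) has a closed-form solution. The sketch below assumes, purely for illustration, diagonal-covariance Gaussian class-conditional densities, for which the maximizing parameters are the per-class sample means and the per-class variances normalized by the class count. The helper name `ml_gaussian_params` and the toy data are hypothetical.

```python
from collections import defaultdict

def ml_gaussian_params(samples, labels):
    """Per-class ML estimates (mean, variance per dimension) for diagonal Gaussians."""
    by_class = defaultdict(list)
    for x, k in zip(samples, labels):
        by_class[k].append(x)
    params = {}
    for k, xs in by_class.items():
        n = len(xs)
        dims = range(len(xs[0]))
        mean = [sum(x[d] for x in xs) / n for d in dims]
        # The ML variance estimate divides by n, not by n - 1.
        var = [sum((x[d] - mean[d]) ** 2 for x in xs) / n for d in dims]
        params[k] = (mean, var)
    return params

# Made-up training data: two observations per class in R^2.
samples = [[0.0, 1.0], [2.0, 3.0], [10.0, 10.0], [12.0, 14.0]]
labels = [0, 0, 1, 1]
params = ml_gaussian_params(samples, labels)
```

Note that each class is estimated from its own observations only; the data of the other classes never enters the computation.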

Alternatively, we can maximize the log-likelihood of the class posteriors,

$\displaystyle \hat{\Lambda} = \mathop{\mbox{argmax}}_\Lambda \sum_n \log p_\Lambda (k_n\vert x_n)\;,$     (2)

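which is also called discriminative training, since the parameters of each class are estimated using the training data of the competing classes as well: by Bayes' rule, $ p_\Lambda(k\vert x) = p_\Lambda(k) \cdot p_\Lambda(x\vert k) / \sum_{k'} p_\Lambda(k') \cdot p_\Lambda(x\vert k')$, so the models of all classes enter each term of the sum. This criterion is often referred to as the mutual information criterion in speech recognition, information theory and image object recognition [2,8].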
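To make criterion (2) concrete, here is a minimal sketch that maximizes it by plain gradient ascent. It assumes a log-linear posterior model, $ p_\Lambda(k\vert x) \propto \exp(\alpha_k + \lambda_k^T x)$, chosen here only as an example; the function names, step count and learning rate are arbitrary assumptions, and classes are indexed from 0.

```python
import math

def posteriors(x, weights, biases):
    """p(k|x) for an assumed log-linear model: softmax of bias_k + weights_k . x."""
    scores = [b + sum(w_d * x_d for w_d, x_d in zip(w, x))
              for w, b in zip(weights, biases)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_posterior_likelihood(samples, labels, weights, biases):
    """Criterion (2): sum over n of log p(k_n | x_n)."""
    return sum(math.log(posteriors(x, weights, biases)[k])
               for x, k in zip(samples, labels))

def train(samples, labels, num_classes, steps=200, rate=0.1):
    """Gradient ascent on criterion (2), starting from all-zero parameters."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(steps):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, k_n in zip(samples, labels):
            post = posteriors(x, weights, biases)
            for k in range(num_classes):
                # Every observation contributes to every class, not only its own.
                err = (1.0 if k == k_n else 0.0) - post[k]
                grad_b[k] += err
                for d in range(dim):
                    grad_w[k][d] += err * x[d]
        for k in range(num_classes):
            biases[k] += rate * grad_b[k]
            for d in range(dim):
                weights[k][d] += rate * grad_w[k][d]
    return weights, biases

# Made-up, linearly separable training data in R^2.
samples = [[0.0, 1.0], [1.0, 0.0], [4.0, 5.0], [5.0, 4.0]]
labels = [0, 0, 1, 1]
before = log_posterior_likelihood(samples, labels, [[0.0, 0.0]] * 2, [0.0] * 2)
weights, biases = train(samples, labels, num_classes=2)
after = log_posterior_likelihood(samples, labels, weights, biases)
```

The inner loop over all classes $ k$ for each observation is where the out-of-class information enters: an observation of class $ k_n$ pushes the parameters of every other class away from it.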

Discriminative training was used in [9] to learn the weights of a weighted dissimilarity measure. Used in the nearest neighbor classification rule, this weighted measure significantly improved the accuracy of the classifier compared with other distance measures whose parameters were not estimated using discriminative training.


Daniel Keysers 2004-03-10