
Class-Dependent Weighted Dissimilarity Measures

In [9], a class-dependent weighted dissimilarity measure for nearest neighbor classifiers was introduced. The squared distance is defined as

$\displaystyle d^2(x,\mu) = \sum_d \left( \frac{x_d - \mu_d}{\sigma_{k_\mu d}} \right)^2, \quad \Lambda = \{\sigma_{kd}, \mu_d\},$

where $ d$ denotes the dimension index and $ k_\mu$ is the class the reference vector $ \mu$ belongs to. The parameters $ \Lambda$ are estimated with respect to a discriminative training criterion that takes into account the out-of-class information and can be derived from the minimum classification error criterion:

$\displaystyle \hat{\Lambda} = \mathop{\mbox{argmin}}_{\Lambda} \sum_{n} \frac{\min\limits_{\mu: k_\mu = k_n} d_\Lambda(x_n,\mu)}{\min\limits_{\mu: k_\mu \neq k_n} d_\Lambda(x_n,\mu)}$ (4)

In other words, the parameters are chosen to minimize the average ratio of the distance to the closest prototype of the same class to the distance to the closest prototype of any competing class.
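The criterion can be evaluated directly for a given set of weights. The following is a minimal sketch, not the implementation from [9]: the function names and the data layout (a dictionary `sigma` mapping each class label to its per-dimension weights) are assumptions made for illustration, and $ d_\Lambda$ is taken here to be the square root of the squared distance defined above.

```python
import math

def weighted_sq_dist(x, mu, sigma):
    """Squared weighted distance: sum over dimensions of ((x_d - mu_d) / sigma_d)^2."""
    return sum(((xd - md) / sd) ** 2 for xd, md, sd in zip(x, mu, sigma))

def criterion(samples, labels, protos, proto_labels, sigma):
    """Sum over samples of (closest same-class distance) / (closest other-class distance).

    sigma[k] holds the per-dimension weights of class k (assumed layout).
    Each prototype is measured with the weights of its own class.
    """
    total = 0.0
    for x, k in zip(samples, labels):
        same = min(math.sqrt(weighted_sq_dist(x, mu, sigma[km]))
                   for mu, km in zip(protos, proto_labels) if km == k)
        other = min(math.sqrt(weighted_sq_dist(x, mu, sigma[km]))
                    for mu, km in zip(protos, proto_labels) if km != k)
        total += same / other  # undefined if x coincides with an other-class prototype
    return total

# Toy data: one prototype per class, two dimensions.
protos = [(0.0, 0.0), (4.0, 0.0)]
proto_labels = [0, 1]
sigma = {0: (1.0, 1.0), 1: (2.0, 1.0)}
samples = [(1.0, 0.0), (3.0, 0.0)]
labels = [0, 1]
value = criterion(samples, labels, protos, proto_labels, sigma)
```

A ratio below 1 for a sample means that its closest prototype belongs to its own class, so lowering the sum tends to push samples toward correct nearest neighbor decisions.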

To minimize the criterion, gradient descent is used, and a leave-one-out error estimate with the weighted measure is computed at each step of the procedure. The weights finally selected are those with the best leave-one-out estimate, not those with the minimum criterion value. In the experiments, only the weights $ \{\sigma_{kd}\}$ were estimated according to the proposed criterion. The references $ \{\mu_k\}$ were chosen as the class means in the one-prototype approach; in the multiple-prototype approach, the whole training set was used.
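The selection step can be sketched as follows for the multiple-prototype case, where the training set itself serves as the prototype set. This is an illustrative sketch under assumptions: `loo_error` and `select_weights` are hypothetical names, `candidates` stands for the sequence of weights visited by the gradient iterations, and the gradient update itself is omitted.

```python
def loo_error(samples, labels, sigma):
    """Leave-one-out 1-NN error rate, using the training set as prototypes.

    sigma[k] holds the per-dimension weights of class k (assumed layout).
    """
    errors = 0
    for i, (x, k) in enumerate(zip(samples, labels)):
        best_label, best_dist = None, float("inf")
        for j, (mu, km) in enumerate(zip(samples, labels)):
            if j == i:
                continue  # leave the sample itself out
            d = sum(((xd - md) / sd) ** 2
                    for xd, md, sd in zip(x, mu, sigma[km]))
            if d < best_dist:
                best_label, best_dist = km, d
        errors += best_label != k
    return errors / len(samples)

def select_weights(candidates, samples, labels):
    """Return the candidate weights with the lowest leave-one-out error."""
    return min(candidates, key=lambda s: loo_error(samples, labels, s))

# Toy data: the class is determined by dimension 0; dimension 1 is a nuisance.
samples = [(0.0, 0.0), (0.0, 10.0), (1.0, 0.0), (1.0, 10.0)]
labels = [0, 0, 1, 1]
uniform = {0: (1.0, 1.0), 1: (1.0, 1.0)}
tuned = {0: (0.01, 100.0), 1: (0.01, 100.0)}
chosen = select_weights([uniform, tuned], samples, labels)
```

In this toy example the uniform weights misclassify every held-out sample, while the weights that emphasize dimension 0 classify all of them correctly, so the latter are selected.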

This approach is also closely related to Gaussian models. Consider the use of one prototype per class. The distance measure is then a class-dependent Mahalanobis distance with class-specific, diagonal covariance matrices

$\displaystyle \Sigma_k = \mathrm{diag}(\sigma^2_{k1},\ldots,\sigma^2_{kD}).$

The decision rule is then equivalent to the use of single Gaussian models in combination with an additional factor that compensates for the missing normalization factor of the Gaussian. In the case of multiple prototypes per class, the equivalence extends to mixtures of Gaussian densities.
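To make the role of the normalization explicit for the one-prototype case, write $ \mu_{kd}$ for component $ d$ of the prototype of class $ k$ (notation introduced here for illustration). The negative log-density of a Gaussian with mean $ \mu_k$ and the diagonal covariance $ \Sigma_k$ is then

$\displaystyle -2 \log \mathcal{N}(x \vert \mu_k, \Sigma_k) = \sum_d \left( \frac{x_d - \mu_{kd}}{\sigma_{kd}} \right)^2 + \sum_d \log \sigma^2_{kd} + D \log (2\pi).$

The first term is the squared distance $ d^2(x,\mu_k)$ and the last term does not depend on the class. The nearest prototype rule therefore differs from the Gaussian maximum-likelihood decision (assuming equal class priors) only by the class-dependent term $ \sum_d \log \sigma^2_{kd}$, which is the logarithm of the determinant of $ \Sigma_k$.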


Daniel Keysers 2004-03-10