next up previous
Next: Experimental results Up: Experiments with an Extended Previous: Relating TD and IDM

Algorithms

We incorporated TD into a KD based classifier with Bayesian decision rule

\begin{displaymath}
r(x)=\mathop{\mbox{argmax}}\limits_{c} \left\{p(c\vert x)\ri...
...athop{\mbox{argmax}}\limits_{c} \left\{p(c)p(x\vert c)\right\}
\end{displaymath} (8)

That is we maximize the a posteriori probability for class $c$ given an image $x$ where the class conditional probability $p(x\vert c)$ is modeled by
\begin{displaymath}
p(x\vert c) = \frac{1}{N_c} \sum_{n=1}^{N_c} \frac{1}{A_c}
...
...t ( - \frac {d(x,x_{nc})^2}{{\sigma_c}^2 \cdot \beta} \right )
\end{displaymath} (9)

with $N_c$ being the number of training images of class $c$, $x_{nc}$ the $n$-th reference pattern of class $c$, class specific standard deviation ${\sigma_c}^2$ pooled over all pixels and $d(x,x_{nc})$ being one of the proposed distance measures. To compensate for the fact that variances are usually underestimated, we multiply the estimated variances with a factor $\beta > 1$. Strictly speaking, the normalization factor $A_c$ depends on the class $c$, however, the dependency is weak and therefore neglected in the experiments. Setting the a priori probability to its maximum likelihood value $p(c)=\frac{N_c}{N}$ the decision rule becomes
\begin{displaymath}
r(x)=\mathop{\mbox{argmax}}\limits_{c} \left\{ \sum_{n=1}^{N...
...ac {d(x,x_{nc})^2}{{\sigma_c}^2 \cdot \beta} \right ) \right\}
\end{displaymath} (10)

Because of the exponential decay only the closest reference patterns $x_{nc}$ have a significant contribution to the sum. The experiments show that using more than the ten closest matches does usually not change classification results. Note that this can be interpreted as a probabilistic justification for the use of k-NN based classifiers.

In order to obtain a better approximation of $p(x\vert c)$ the domain knowledge about invariance can be used to enrich the training set with shifted copies of the given training data. In the experiments displacements of one pixel in eight directions were used. Although the tangent distance should already compensate for shifts of that amount, this approach still leads to improvements. This is due to the fact that the two approaches model invariance differently (compare Fig. 3).

Figure 3: Tangents of shifted data in 2D
\includegraphics [width=5.0cm,angle=0]{/u/keysers/Paper/ICPR2000/shift_tang.eps}

As it is possible to use the knowledge about invariance for the training data by applying both tangent distance and explicit shift, this should be the case for the test data as well. Inspired by methods for combining classifiers [7] one can arrive at the following solution called virtual test sample method (VTS). When classifying a given image, shifted versions of the image are generated and independently classified. The overall result is then obtained by combining the individual results using the sum rule. An in depth discussion of this method can be found in [8].

As few large differences in pixel values can mislead classifiers based on squared error distances (see e.g. [9]), it can be advisable to introduce a local threshold which limits the maximum contribution of a single pixel to the distance between two images. This is justified by a priori domain knowledge, e.g. when it is known that the patterns may be subject to artifacts that do not affect class-membership, like noise or changing scribor position in radiographs. On the other hand when looking at relatively small images of digits, one notices that e.g. changing only a few pixels can be significant for discriminating between the handwritten digits `4' and `9'. Here it can be useful to enlarge the contribution of a single pixel difference generalizing the used norm to $\vert\vert x\vert\vert _\gamma=(\sum_{i,j} \vert x_{ij}\vert^\gamma)^\frac{1}{\gamma}$ and investigating also $\gamma > 2$.

The extension of TD with an iterative Newton-type approximation was proposed in [1] and successfully used for face-recognition [9]. In our experiments we used a similar algorithm inspired by the Euler-Cauchy method used in the context of differential equations. In contrast to the Newton procedure it does not require the calculation of the actual transformation but uses the tangent approximation instead. It consists of iteratively calculating the closest point in the tangent subspace, ``moving'' into the corresponding direction and recalculating the tangents until convergence.


next up previous
Next: Experimental results Up: Experiments with an Extended Previous: Relating TD and IDM
Daniel Keysers
2000-11-16