next up previous
Next: Conclusion Up: Experiments with an Extended Previous: Algorithms


Experimental results

In our experiments we used three different databases. We performed a number of experiments on the widely used USPS database of handwritten digits with 7291 training and 2007 test images and validated the results on the (modified) database of the National Institute of Standards and Technology (MNIST) with 60,000 resp. 10,000 images. For comparison of performance in a different domain we also tested the algorithms on medical image data from the IRMA project (Image Retrieval in Medical Applications [10]), where the objective is to assign a given radiograph to one of the six classes abdomen, breast, chest, limbs, skull, spine (see Fig. 4). The images were scaled to a common size of $32\times 32$ for classification without significant decrease in recognition rate.

Figure 4: Example images, IRMA database
\includegraphics [width=1.3cm,angle=0]{/u/keysers/Paper/ICPR2000/repr1.ps} \includegraphics [width=1.3cm,angle=0]{/u/keysers/Paper/ICPR2000/repr3.ps} \includegraphics [width=1.3cm,angle=0]{/u/keysers/Paper/ICPR2000/repr5.ps} \includegraphics [width=1.3cm,angle=0]{/u/keysers/Paper/ICPR2000/repr2.ps} \includegraphics [width=1.3cm,angle=0]{/u/keysers/Paper/ICPR2000/repr4.ps} \includegraphics [width=1.3cm,angle=0]{/u/keysers/Paper/ICPR2000/repr6.ps}

Because there were only 1617 images available, we adopted a leaving-one-out approach for cross validation, classifying each image while using the remaining 1616 as training set. After parameter adjustment the classifier was evaluated on a new set of 332 additional radiographs.



Table 1: Results for USPS
Distance Measure Error Rate [%]
  1-NN KD KD,9-1 KD,9-9
Euclidean 5.6 5.5 4.5 4.2
TD, SS 3.4 3.3 3.0 2.8
TD, DS 3.3 3.0 2.5 2.4

Table 1 summarizes the main results of experiments with the USPS database (the notation `a-b' indicates increased number of training samples by factor a and increased number of test samples by factor b). Regardless of the chosen distance measure multiplying training and test data consistently improved classification results. The experiments showed that it is advisable to compute the tangents for the test data when computing the SSTD (1-NN performance: 3.4% vs. 3.8% for training data tangents). We chose 1-NN here as according to [3], $k=1$ was the best choice for k-NN on USPS. Furthermore the proposed Euler-Cauchy method did not improve the overall results on this database. It produced less errors on the data misclassified by 1-NN, but more on the remaining data. This is due to the fact, that it allows larger variations in the alignment of patterns, both towards correct and incorrect reference images. Fig. 6 visualizes this effect showing the tolerance of the different distance measures with respect to a horizontal displacement. One image from each class was randomly selected and the distances to one displaced image were calculated. Multiplying the training data with tangent approximations or thinned versions of the images did not achieve lower error rates, while the usage of different norms $\vert\vert\cdot\vert\vert _\gamma$ enhanced results for the basic KD classifier, but not for the best. The variety of resulting classifiers invited the usage of classifier combination [7], and the best result could thus be improved to 2.2% error rate. Fig. 5 shows the remaining errors with their class labels. Some of the errors appear to be label mistakes, some are hard tasks even for a human, and some shapes do not appear in the training data, as is the case with the three consecutive `1's in the lower row, which were classified as `7's. Errors like the first image illustrate the limitations of distance based classifiers (there exists a training image of class five which is almost identical).

Figure 5: USPS errors (best result, 2.2% error)
\includegraphics [width=8.3cm,angle=0]{/u/keysers/Paper/ICPR2000/majv15.ps}

Figure 6: Behaviour of different distance measures with respect to image shifts on USPS database
\includegraphics [width=3.6cm,height=5.7cm,angle=270]{/u/keysers/Folien/Proposal/diag.e.ps} \includegraphics [width=3.6cm,height=5.7cm,angle=270]{/u/keysers/Folien/Proposal/diag.t.ps} \includegraphics [width=3.6cm,height=5.7cm,angle=270]{/u/keysers/Folien/Proposal/diag.n.ps}



Table 2: Comparison of results for OCR
Method Error Rate [%]
    USPS MNIST
Human Performance [1,3] 2.5 0.2
Neural Net (LeNet1/LeNet4) [3] 4.2 1.1
Support Vectors [11] 3.0 0.8
Tangent Distance [3] $^*$2.5 1.1
Boosting [3] $^*$2.6 0.7
This work TD, KD, 9/9 2.4 1.0
  TD, bagging 2.2 -
$^*$obtained with extended training set

For comparison we repeated the experiment which obtained the best result on USPS on the MNIST database. Table 2 shows the results in comparison to those obtained by other groups. The asterisk indicates that the corresponding experiments were performed using a training set extended with about 2,400 machine printed digits. Note that in contrast to this approach we increase the effective size of the training set but do not actually take new data. Considering that all optimizations were done on USPS, the MNIST error rate of 1.0% is surprisingly low, which shows that the approach generalizes well and the parameters were not overfitted. Since we did not repeat all experiments for MNIST, bagging was not applied.

TD is usually applied using the seven transformations proposed in [1] (translations (2), scaling, rotation, axis-deformations (2), and line-thickness) where the first six account for affine variations. We tried a number of different tangents including projective transformations, brightness, contrast and different versions of the thickness tangent, but were not able to improve the results of the original tangents with USPS. On the other hand in a domain like medical imaging, the thickness tangent loses its a priori nature and can be replaced by a brightness tangent (here defined as a constant function over $(i,j)$), modeling different doses in x-ray imaging. This is reflected in the corresponding recognition rates and shows that the selection of tangents is domain dependent. Table 3 shows the results of different distance measures for the IRMA database using a region $R
$ of size $1.6\times 1.6$ pixels, `brightness' indicating that the tangent for line-thickness was replaced by the brightness tangent (for computational reasons we used SSTD for these experiments). Note that we were not able to obtain error rates lower than 29% using cooccurrence matrix based features [12]. The results show, that in this domain thresholding is appropriate and the improvements of TD and IDM are nearly additive. Combination of the two approaches was achieved here by replacing Euclidean distance with distortion distance (4) in the last step of distance computation, when the optimum $\alpha$ is already known [12]. On the other hand IDM does not improve classification for OCR (e.g. when a line distinguishing one numeral from the other is only one or two pixels thick, it is easy to erase the line completely using the IDM), which shows the domain-dependency of the distance measures involved.


Table 3: Results for IRMA database
Distance Measure Error Rate [%]
  1-NN KD KD, threshold
Euclidean 18.0 16.4 14.2
TD 15.3 14.8 13.4
IDM 16.5 14.7 13.2
TD, IDM 14.7 13.2 11.7
TD (brightness), IDM 13.5 12.9 10.3

Using the aspect ratio of the radiographs as additional feature, the error rate dropped further from 10.3% to 8.2%. This could be improved marginally by using the Euler-Cauchy distance measure not justifying the additional amount of computation involved. Using the best parameters for the database determined by the leaving-one-out strategy, the algorithm misclassified 30 out of 332 (9.0%) new radiographs (with the training set now consisting of 1617 images) which means that an adequate generalization was achieved.


next up previous
Next: Conclusion Up: Experiments with an Extended Previous: Algorithms
Daniel Keysers
2000-11-16