Table 1: Corpus statistics for the three databases used in the
experiments, taken from the UCI and STATLOG repositories.
| corpus name | MONK | DNA | LETTER |
|---|---|---|---|
| # classes | 2 | 3 | 26 |
| # features | 17 | 180 | 16 |
| # training samples | 124 | 2 000 | 15 000 |
| # test samples | 432 | 1 186 | 5 000 |
The experiments were performed on three corpora from the UCI and
STATLOG repositories [5,6]. The corpora were chosen to cover
different properties with respect to the number of classes, the
number of features, and the corpus size. The statistics of
the corpora are summarized in Table 1. MONK is an
artificial decision task with categorical features, also known as the
monk's problem. For the experiments, the categorical features were
transformed into binary features. For the DNA task, the goal is to
detect gene intron/exon and exon/intron boundaries given part of a DNA
sequence. For this task, too, the categorical features were
transformed into binary features. Finally, the LETTER corpus consists
of printed characters that were preprocessed and from which a variety
of features was extracted.
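The transformation of categorical into binary features can be illustrated with a small sketch. The text does not specify the encoding, so the following assumes one 0/1 indicator per attribute value (one-hot encoding). The attribute cardinalities are the standard ones of the monk's problem and are likewise an assumption, as are the names `MONK_CARDINALITIES` and `one_hot`. Under these assumptions the encoding yields 17 binary features, which agrees with Table 1.

```python
from typing import Sequence

# Number of distinct values per categorical attribute. These are the
# standard monk's problem cardinalities (an assumption; not given in the text).
MONK_CARDINALITIES = (3, 3, 2, 3, 4, 2)


def one_hot(sample: Sequence[int], cardinalities: Sequence[int]) -> list[int]:
    """Map categorical values (coded 1..n per attribute) to 0/1 indicators."""
    if len(sample) != len(cardinalities):
        raise ValueError("sample length must match number of attributes")
    encoded: list[int] = []
    for value, n_values in zip(sample, cardinalities):
        if not 1 <= value <= n_values:
            raise ValueError(f"value {value} outside 1..{n_values}")
        # One indicator per possible value; exactly one of them is set.
        encoded.extend(1 if value == v else 0 for v in range(1, n_values + 1))
    return encoded


x = one_hot((1, 3, 2, 2, 4, 1), MONK_CARDINALITIES)
print(len(x), x)
```

Each categorical attribute contributes exactly one active indicator, so every encoded sample here has six ones among its 17 entries.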
Table 2: Experimental results for the three databases with
different settings of the algorithms, given as error rate (er) in
%. The number of parameters (#param.) refers to the total number of
parameters needed to completely define the classifier.
| method | MONK er[%] | MONK #param. | DNA er[%] | DNA #param. | LETTER er[%] | LETTER #param. |
|---|---|---|---|---|---|---|
| single Gaussian | 28.5 | 51 | 9.5 | 720 | 41.6 | 432 |
| log-linear, first-order | 28.9 | 36 | 5.6 | 543 | 22.5 | 442 |
| log-linear, second-order | 0.2 | 308 | 5.1 | 48 873 | 13.5 | 3 562 |
| weighted dissimil., one prot. | 16.7 | 68 | 6.7 | 1 080 | 24.1 | 832 |
| weighted dissimil., multiple prot. | 0.0 | 2 142 | 4.7 | 360 540 | 3.3 | 240 416 |
| best other [5,6] | 0.0 | - | 4.1 | - | 3.4 | - |
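The parameter counts in Table 2 can be reproduced from the corpus statistics in Table 1. The formulas in the sketch below are not stated in the text. They are assumptions inferred from the numbers, and the function names are hypothetical: class means plus one shared variance vector for the single Gaussian, a bias plus linear weights per class for the first-order model, additional products of distinct feature pairs for the second-order model, and prototypes plus class-specific weight vectors for the weighted dissimilarity measure.

```python
# Corpus statistics from Table 1:
# (number of classes K, number of features D, number of training samples N).
CORPORA = {
    "MONK": (2, 17, 124),
    "DNA": (3, 180, 2000),
    "LETTER": (26, 16, 15000),
}


def single_gaussian(K, D, N):
    # Assumed: K class means plus one shared vector of D variances.
    return K * D + D


def loglinear_first_order(K, D, N):
    # Assumed: per class, one bias and D linear weights.
    return K * (1 + D)


def loglinear_second_order(K, D, N):
    # Assumed: per class, bias, D linear weights, and one weight for
    # each product x_i * x_j with i < j.
    return K * (1 + D + D * (D - 1) // 2)


def dissimilarity_one_prototype(K, D, N):
    # Assumed: K prototypes plus K class-specific weight vectors.
    return D * (K + K)


def dissimilarity_multiple_prototypes(K, D, N):
    # Assumed: every training sample is a prototype, plus K weight vectors.
    return D * (N + K)


FORMULAS = (
    single_gaussian,
    loglinear_first_order,
    loglinear_second_order,
    dissimilarity_one_prototype,
    dissimilarity_multiple_prototypes,
)

COUNTS = {
    name: {f.__name__: f(*stats) for f in FORMULAS}
    for name, stats in CORPORA.items()
}

for name, counts in COUNTS.items():
    print(name, counts)
```

Under these assumed formulas, every computed value matches the corresponding #param. entry in Table 2. This is consistent with the multiple-prototype classifier storing all training samples, which explains its much larger parameter count.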
Table 2 shows a summary of the results obtained with the
two methods. The figures show the following tendencies:
- Considering the four approaches that can be labeled
`one-prototype' (single Gaussian, both log-linear models and the
one-prototype weighted dissimilarity measure), the discriminative
approaches generally perform better than the maximum likelihood based
approach (single Gaussian). The only exception is the first-order
log-linear model on MONK (28.9% vs. 28.5%).
- For the two log-linear approaches, the second-order
features perform better than the first-order features.
- On two of the three corpora, the log-linear classifier with first-order
features performs better than the one-prototype weighted dissimilarity
measure using a smaller number of parameters.
- On all of the corpora, the log-linear classifier with second-order
features performs better than the one-prototype weighted
dissimilarity measure, but using a larger number of parameters.
- The weighted dissimilarity measure using multiple prototypes
outperforms the other (`one-prototype') approaches considered
on all tasks and is competitive with the best known results
on each task.
Note that second-order features perform better here although
estimation of full, class-specific covariance matrices is problematic
for many tasks. This indicates a high robustness of the maximum
entropy log-linear approach. Note further that both the one-prototype
weighted dissimilarity classifier and the log-linear model with
second-order features lead to quadratic decision boundaries, but the
former does not take into account bilinear terms of the features,
whereas the second-order features do.
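The absence of bilinear terms can be made explicit with a short expansion. The dissimilarity is defined in an earlier section. The weighted squared distance below, with class-specific weights $w_{kd}$ and prototype components $\mu_{kd}$, is an assumed form used only for illustration.

```latex
\begin{align*}
d_k(x) &= \sum_{d=1}^{D} w_{kd}\,(x_d - \mu_{kd})^2 \\
       &= \sum_{d=1}^{D} w_{kd}\,x_d^2
          \;-\; 2 \sum_{d=1}^{D} w_{kd}\,\mu_{kd}\,x_d
          \;+\; \sum_{d=1}^{D} w_{kd}\,\mu_{kd}^2
\end{align*}
```

Under this assumption, every quadratic term involves a single feature $x_d$, and no product $x_i x_j$ with $i \neq j$ appears. The second-order features of the log-linear model include such products.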
The high error rate of the log-linear model with first-order features
on the MONK corpus was analyzed in more detail. As this task contains
only binary features, the one-prototype weighted dissimilarity
classifier also leads to linear decision boundaries here
($x^2 = x$ for all $x \in \{0,1\}$). Therefore it is possible to
infer the parameters for the log-linear model from the training result
of the weighted dissimilarity classifier. This showed that the
log-likelihood of the posterior (2) on the training
data is lower than that resulting from maximum entropy training,
which is not surprising, as exactly this quantity is the training
criterion for the log-linear model. Interestingly, the same
holds for the test data as well. That is, the maximum entropy
training result predicts the class posterior more accurately on
average, but this does not result in better classification
accuracy. This may indicate that on this corpus with very few samples
the weighted dissimilarity technique is better able to adapt the
decision boundary, as it uses a criterion derived from the minimum
classification error criterion.
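The conversion used in this analysis can be sketched as follows. The sketch assumes a weighted squared distance as the dissimilarity and a softmax over class scores as the posterior. Neither definition is repeated in this section, so both are assumptions, as are all function names and the randomly drawn toy parameters. For binary features, the negative dissimilarity is exactly linear in the features, which yields log-linear parameters whose average log-posterior and error rate can then be compared.

```python
import itertools
import math
import random


def dissimilarity(x, w, mu):
    """Assumed form: weighted squared distance to a class prototype."""
    return sum(wd * (xd - md) ** 2 for xd, wd, md in zip(x, w, mu))


def to_loglinear(w, mu):
    """For x_d in {0, 1}, x_d**2 == x_d, so -d_k(x) is linear in x."""
    lam = [wd * (2.0 * md - 1.0) for wd, md in zip(w, mu)]
    alpha = -sum(wd * md * md for wd, md in zip(w, mu))
    return alpha, lam


def linear_score(x, alpha, lam):
    return alpha + sum(ld * xd for ld, xd in zip(lam, x))


def log_posteriors(scores):
    """Log of the softmax over class scores (numerically stabilized)."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def evaluate(samples, labels, params):
    """Return (average log-posterior of the true class, error rate)."""
    total_ll, errors = 0.0, 0
    for x, k in zip(samples, labels):
        scores = [linear_score(x, a, lam) for a, lam in params]
        total_ll += log_posteriors(scores)[k]
        errors += scores.index(max(scores)) != k
    return total_ll / len(samples), errors / len(samples)


rng = random.Random(0)  # arbitrary toy parameters, not trained values
D, K = 4, 2
weights = [[rng.uniform(0.1, 2.0) for _ in range(D)] for _ in range(K)]
protos = [[rng.random() for _ in range(D)] for _ in range(K)]
params = [to_loglinear(w, mu) for w, mu in zip(weights, protos)]

# Check the equivalence on every binary vector of length D.
samples = list(itertools.product((0, 1), repeat=D))
max_gap = max(
    abs(-dissimilarity(x, weights[k], protos[k]) - linear_score(x, *params[k]))
    for x in samples
    for k in range(K)
)
print("largest score mismatch:", max_gap)

labels = [x[0] for x in samples]  # toy labels: class = first feature
avg_ll, err = evaluate(samples, labels, params)
print("average log-posterior:", avg_ll, "error rate:", err)
```

The two quantities returned by `evaluate` are the ones contrasted in the text: a model can attain a higher average log-posterior and still make more classification errors, since only the sign of the score difference matters for the decision.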
Daniel Keysers
2004-03-10