next up previous
Next: Class-Dependent Weighted Dissimilarity Measures Up: Comparison of Log-Linear Models Previous: Classification Framework

Maximum Entropy, Gaussian and Log-Linear Models

The principle of maximum entropy has origins in statistical thermodynamics, is related to information theory and has been applied to pattern recognition tasks such as language modeling [1] and text classification [7]. Applied to classification, the basic idea is the following: We are given information about a probability distribution by samples from that distribution (training data). Now, we choose the distribution such that it fulfills all the constraints given by that information (more precisely: the observed marginal distributions), but otherwise has the highest possible entropy. (This inherently serves as regularization to avoid overfitting.) It can be shown that this approach leads to log-linear models for the distribution to be estimated.

Consider a set of so-called feature functions $ \{f_i\}, i=1,\ldots,I$ that are supposed to compute `useful' information for classification:

$\displaystyle f_i \quad : \quad {\rm I\!R}^D \times \{1,\ldots,K\} \longrightarrow {\rm I\!R}\quad
: \quad (x,k) \longmapsto f_i(x,k)$      

It can be shown that the resulting distribution that maximizes the entropy has the following log-linear or exponential functional form:
$\displaystyle p_\Lambda (k\vert x)$ $\displaystyle =$ $\displaystyle \frac{\exp \left[ \sum_i \lambda_i f_i (x,k)
\right]}{\sum_{k'}\exp \left[ \sum_i \lambda_i f_i (x,k')\right] },
\quad \Lambda = \{\lambda_i\} .$ (3)

Interestingly, it can also be shown that the stated optimization problem is convex and has a unique global maximum. Furthermore, this unique solution is also the solution to the following dual problem: Maximize the log probability (2) on the training data using the model (3).

A second desirable property of the discussed model is that effective algorithms are known that compute the global maximum of the log probability (2) given a training set. These algorithms fall into two categories: On the one hand, we have an algorithm known as generalized iterative scaling [3] and related algorithms that can be proven to converge to the global maximum. On the other hand, due to the convex nature of the criterion (2), we can also use general optimization strategies as e.g. conjugate gradient methods.

The crucial problem in maximum entropy modeling is the choice of the appropriate feature functions $ \{f_i\}$. In general, these functions depend on the classification task considered.

The straight forward way to define feature functions for classification purposes is to directly use the features provided for the specific task. Consider therefore the following first-order feature functions for log-linear classification:

$\displaystyle f_{k,i} (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k')\; x_i \;,$  
$\displaystyle f_k (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k') \;,$  

where $ \delta (k,k') := 1$ if $ k=k'$, and 0 otherwise denotes the Kronecker delta function. The Kronecker delta is necessary here to distinguish between the different classes. It can be shown that maximum entropy training using first-order features is equivalent to the discriminative training of single Gaussian densities with globally pooled covariance matrices using the criterion (2) [4]. Furthermore, we may also consider products of feature values for the feature functions (second-order features) by including
$\displaystyle f_{k,i,j} (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k') \; x_i x_j \;,\;\; i\geq j\; .$  

In this case, the maximum entropy training is equivalent to the discriminative training of single Gaussian densities with full, class-specific covariance matrices, where the constraint on the covariance matrices to be positive (semi-) definite is relaxed [4]. The correspondences can be derived by observing that the functional form of the class posterior

$\displaystyle p(k\vert x) = \frac{p(k) \; \mathcal{N}(x\vert\mu_k,\Sigma_k)}{ \sum_{k'} p(k') \; \mathcal{N}(x\vert\mu_{k'}\Sigma_{k'})}$    

also leads to a log-linear expression like (3) for the appropriate choice of feature functions. These correspondences to Gaussian models with one prototype justify the classification of the log-linear approach to be a `one-prototype' approach.


next up previous
Next: Class-Dependent Weighted Dissimilarity Measures Up: Comparison of Log-Linear Models Previous: Classification Framework
Daniel Keysers 2004-03-10