
Maximum Entropy and Discriminative Training for Gaussian Models

Consider first-order feature functions for maximum entropy classification
$\displaystyle f_{k,i} (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k')\; x_i \;,$  
$\displaystyle f_k (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k') \;,$  

where $ \delta (k,k') := 1$ if $ k=k'$ and 0 otherwise denotes the Kronecker delta. In the context of image recognition, we may call the functions $ f_{k,i}$ appearance-based image features, as they represent the image pixel values. The features are duplicated for each class so that the hypothesized classes can be distinguished. The functions $ f_k$ allow for a log-linear offset in the posterior probabilities. Using the properties of the Kronecker delta, the posterior probabilities take the form
$\displaystyle p_\Lambda (k\vert x)$ $\displaystyle =$ $\displaystyle \frac{\exp \left[\alpha_k + \sum_i \lambda_{k,i} x_i \right]}{\sum_{k'} \exp \left[\alpha_{k'} + \sum_i \lambda_{k'\!,i} x_i \right]}$  
  $\displaystyle =$ $\displaystyle \frac{\exp \left[\alpha_k +\lambda_{k}^T x\right]}{\sum_{k'} \exp \left[\alpha_{k'} + \lambda_{k'}^Tx\right]} \qquad\quad \Lambda=\{\lambda_{k,i},\alpha_k\} \;,$ (5)

where $ \alpha_k$ denotes the coefficient for the feature function $ f_k$.
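
The reduction from the generic feature sum to the form (5) can be checked numerically. The following is a minimal sketch using only the Python standard library; the parameter values and the function names `features` and `posterior` are hypothetical and chosen purely for illustration.

```python
import math

def delta(k, k_prime):
    """Kronecker delta on class labels."""
    return 1 if k == k_prime else 0

def features(x, k_prime, num_classes):
    """All features for hypothesis k', ordered class by class as [f_k, f_{k,1}, ..., f_{k,D}]."""
    feats = []
    for k in range(num_classes):
        feats.append(delta(k, k_prime))                      # f_k(x, k')
        feats.extend(delta(k, k_prime) * x_i for x_i in x)   # f_{k,i}(x, k')
    return feats

def posterior(x, alpha, lam):
    """p(k|x) for every class k, computed from the generic feature-sum form."""
    num_classes = len(alpha)
    # Flatten the parameters in the same order as `features`.
    weights = []
    for a, l in zip(alpha, lam):
        weights.append(a)
        weights.extend(l)
    scores = [sum(w * f for w, f in zip(weights, features(x, k, num_classes)))
              for k in range(num_classes)]
    m = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

alpha = [0.0, 0.5, -1.0]                        # hypothetical offsets alpha_k
lam = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # hypothetical weights lambda_k
x = [0.2, -0.4]
post = posterior(x, alpha, lam)

# The Kronecker delta zeroes every block except that of the hypothesized class,
# so each score collapses to alpha_k + lambda_k^T x.
direct = [a + sum(l_i * x_i for l_i, x_i in zip(l, x)) for a, l in zip(alpha, lam)]
```

Here `post` is obtained from the full duplicated feature vector, while `direct` holds the collapsed scores $ \alpha_k + \lambda_k^T x$; normalizing the exponentials of `direct` gives the same posterior.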

Now, consider a Gaussian model (3) for $ p(x\vert k)$ with pooled covariance matrix $ \Sigma_k=\Sigma$. Using Bayes' rule and the relation

$\displaystyle \log \mathcal{N}(x\vert\mu_k,\Sigma_k)$ $\displaystyle =$ $\displaystyle -\tfrac{1}{2} \log \det(2\pi\Sigma_k)
- \tfrac{1}{2} (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)$  
  $\displaystyle =$ $\displaystyle -\tfrac{1}{2} \log \det(2\pi\Sigma_k) - \tfrac{1}{2} x^T
\Sigma_k^{-1} x + \mu_k^T\Sigma_k^{-1}x - \tfrac{1}{2}
\mu_k^T\Sigma_k^{-1}\mu_k \;,$  

we can rewrite the class posterior probability (note that the terms that do not depend on the class $ k$ cancel in the fraction):
$\displaystyle p(k\vert x)$ $\displaystyle =$ $\displaystyle \frac{p(k) \; \mathcal{N}(x\vert\mu_k,\Sigma)}{\sum_{k'} p(k') \; \mathcal{N}(x\vert\mu_{k'},\Sigma)}$  
  $\displaystyle =$ $\displaystyle \frac{\exp \left[(\log p(k) - \tfrac{1}{2} \mu_k^T \Sigma^{-1}\mu_k ) + (\mu_k^T \Sigma^{-1}) x\right]}{\sum_{k'} \exp \left[(\log p(k') - \tfrac{1}{2} \mu_{k'}^T \Sigma^{-1}\mu_{k'} ) + (\mu_{k'}^T \Sigma^{-1}) x\right]}$  
  $\displaystyle =$ $\displaystyle \frac{\exp \left[\alpha_k +\lambda_{k}^T x\right]}{\sum_{k'} \exp \left[\alpha_{k'} + \lambda_{k'}^Tx\right]}$ (6)

As a result, we see that for unknown class priors $ p(k)$ the resulting model (6), with $ \alpha_k = \log p(k) - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k$ and $ \lambda_k = \Sigma^{-1} \mu_k$, is identical to the maximum entropy model (5). We can conclude that the discriminative training criterion (2) for the Gaussian model (3) with pooled covariance matrices results in exactly the same functional form as the maximum entropy model for first-order features. This allows us to use the well-understood algorithms for maximum entropy estimation to estimate the parameters of a Gaussian model discriminatively.
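
The equivalence for pooled covariance matrices can be verified on a small example. The sketch below is restricted to two dimensions so that the matrix inverse can be written out by hand with the standard library alone; the priors, means and covariance are made-up illustration values, and the helper names are hypothetical.

```python
import math

def det2(M):
    """Determinant of a 2x2 matrix given as nested lists."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_gauss(x, mu, Sigma):
    """log N(x | mu, Sigma) for a 2x2 covariance matrix."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    log_norm = -0.5 * math.log((2 * math.pi) ** 2 * det2(Sigma))  # -1/2 log det(2 pi Sigma)
    return log_norm - 0.5 * dot(diff, matvec(inv2(Sigma), diff))

def normalize(log_scores):
    """Turn unnormalized log scores into probabilities."""
    m = max(log_scores)
    e = [math.exp(s - m) for s in log_scores]
    z = sum(e)
    return [v / z for v in e]

priors = [0.5, 0.3, 0.2]                          # illustrative p(k)
means = [[0.0, 0.0], [1.0, 2.0], [-1.5, 0.5]]     # illustrative mu_k
Sigma = [[2.0, 0.3], [0.3, 1.0]]                  # pooled covariance
x = [0.4, -0.7]

# Posterior via Bayes' rule with Gaussian class-conditional densities.
bayes = normalize([math.log(p) + log_gauss(x, mu, Sigma)
                   for p, mu in zip(priors, means)])

# Posterior via the log-linear form (6): lambda_k = Sigma^{-1} mu_k,
# alpha_k = log p(k) - 1/2 mu_k^T Sigma^{-1} mu_k.
P = inv2(Sigma)
lam = [matvec(P, mu) for mu in means]
alpha = [math.log(p) - 0.5 * dot(mu, l) for p, mu, l in zip(priors, means, lam)]
loglin = normalize([a + dot(l, x) for a, l in zip(alpha, lam)])
```

The two lists `bayes` and `loglin` agree up to floating-point rounding, because the class-independent terms (the normalization constant and $ -\tfrac{1}{2} x^T \Sigma^{-1} x$) cancel in the fraction.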

If we repeat the same argument as above for the case of Gaussian densities without pooling of the covariance matrices, we find that we can again establish a correspondence to a maximum entropy model:

$\displaystyle p(k\vert x)$ $\displaystyle =$ $\displaystyle \frac{p(k) \; \mathcal{N}(x\vert\mu_k,\Sigma_k)}{\sum_{k'} p(k') \; \mathcal{N}(x\vert\mu_{k'},\Sigma_{k'})}$  
  $\displaystyle =$ $\displaystyle \frac{\exp \left[\alpha_k +\lambda_{k}^T x + x^T S_{k} x\right]}{\sum_{k'} \exp \left[\alpha_{k'} + \lambda_{k'}^Tx+ x^T S_{k'} x \right]}$  

Here, the square matrix $ S_k$ corresponds to the negative of the inverse of the covariance matrix $ \Sigma_k$, scaled by one half: $ S_k = -\tfrac{1}{2} \Sigma_k^{-1}$. These parameters can be estimated using a maximum entropy model with the second-order feature functions
$\displaystyle f_{k,i,j} (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k') \; x_i x_j \;,\;\; i\geq j\; ,$  
$\displaystyle f_{k,i} (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k')\; x_i \;,$  
$\displaystyle f_k (x,k')$ $\displaystyle =$ $\displaystyle \delta (k,k') \;.$  
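
The correspondence for class-specific covariance matrices can be checked in the same way. This is a sketch under stated assumptions: it is two-dimensional, all numeric values are invented for illustration, and the helper names are hypothetical. Because only the features with $ i \geq j$ are kept, the sketch assigns the weight $ 2 S_{k,ij}$ to each off-diagonal feature so that the weighted sum equals $ x^T S_k x$; the offset $ \alpha_k$ now also absorbs the class-dependent term $ -\tfrac{1}{2} \log \det(2\pi\Sigma_k)$.

```python
import math

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(log_scores):
    m = max(log_scores)
    e = [math.exp(s - m) for s in log_scores]
    z = sum(e)
    return [v / z for v in e]

def quad_features(x):
    """Second-order features x_i * x_j for i >= j."""
    return [x[i] * x[j] for i in range(len(x)) for j in range(i + 1)]

def quad_weights(S):
    """Weights matching quad_features so that the weighted sum equals x^T S x (S symmetric)."""
    return [S[i][j] if i == j else 2.0 * S[i][j]
            for i in range(len(S)) for j in range(i + 1)]

priors = [0.6, 0.4]                                   # illustrative p(k)
means = [[0.0, 1.0], [2.0, -1.0]]                     # illustrative mu_k
covs = [[[1.0, 0.2], [0.2, 0.5]],                     # illustrative Sigma_k
        [[2.0, -0.4], [-0.4, 1.5]]]
x = [0.7, 0.3]

bayes_scores, loglin_scores = [], []
for p, mu, Sigma in zip(priors, means, covs):
    P = inv2(Sigma)                                   # precision matrix Sigma_k^{-1}
    diff = [xi - mi for xi, mi in zip(x, mu)]
    log_norm = -0.5 * math.log((2 * math.pi) ** 2 * det2(Sigma))
    # Bayes' rule: log p(k) + log N(x | mu_k, Sigma_k)
    bayes_scores.append(math.log(p) + log_norm - 0.5 * dot(diff, matvec(P, diff)))

    # Log-linear parameters of the second-order model
    lam = matvec(P, mu)                               # lambda_k = Sigma_k^{-1} mu_k
    alpha = math.log(p) + log_norm - 0.5 * dot(mu, lam)
    S = [[-0.5 * v for v in row] for row in P]        # S_k = -1/2 Sigma_k^{-1}
    loglin_scores.append(alpha + dot(lam, x) + dot(quad_weights(S), quad_features(x)))

bayes = normalize(bayes_scores)
loglin = normalize(loglin_scores)
```

Again `bayes` and `loglin` coincide up to rounding; unlike in the pooled case, the quadratic term no longer cancels in the fraction.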

One interesting consequence of using the corresponding maximum entropy model and estimation is that we implicitly relax the constraint that the covariance matrices be positive (semi-)definite. Therefore, the resulting model is not exactly equivalent to a Gaussian model.
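
To make the relaxed constraint concrete, the following sketch tests whether a given quadratic weight matrix $ S_k$ still corresponds to a valid Gaussian, assuming the relation $ \Sigma_k^{-1} = -2 S_k$ that follows from the expansion of $ \log \mathcal{N}$. It is limited to symmetric $ 2 \times 2$ matrices, and both example matrices are invented for illustration.

```python
def implied_precision(S):
    """Precision matrix Sigma^{-1} = -2 S implied by a quadratic weight matrix S."""
    return [[-2.0 * v for v in row] for row in S]

def is_positive_definite_2x2(M):
    """Sylvester's criterion for a symmetric 2x2 matrix: both leading minors positive."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    return a > 0 and a * d - b * b > 0

S_gaussian = [[-0.5, 0.1], [0.1, -1.0]]   # negative definite: a Gaussian with this S exists
S_free = [[0.3, 0.0], [0.0, -1.0]]        # indefinite: allowed by unconstrained estimation

print(is_positive_definite_2x2(implied_precision(S_gaussian)))  # True
print(is_positive_definite_2x2(implied_precision(S_free)))      # False
```

A covariance matrix is positive definite exactly when its inverse is, so the second matrix has no Gaussian counterpart, although the log-linear posterior it defines is still normalized for every $ x$.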

This result is in contrast to the approach taken in [5], where the authors derive discriminative models for Gaussian densities based on priors of the parameters and the minimum relative entropy principle. Their solution results in discriminatively trained weights for the training data and therefore preserves the mentioned constraints.


Daniel Keysers 2002-10-15