The principle of maximum entropy has origins in statistical thermodynamics, is related to information theory and has been applied to pattern recognition tasks such as language modeling [1] and text classification [7]. Applied to classification, the basic idea is the following: We are given information about a probability distribution by samples from that distribution (training data). Now, we choose the distribution such that it fulfills all the constraints given by that information (more precisely: the observed marginal distributions), but otherwise has the highest possible entropy. (This inherently serves as regularization to avoid overfitting.) It can be shown that this approach leads to log-linear models for the distribution to be estimated.
Consider a set of so-called feature functions
that are supposed to compute `useful' information for classification:
A second desirable property of the discussed model is that effective algorithms are known that compute the global maximum of the log probability (2) given a training set. These algorithms fall into two categories: On the one hand, we have an algorithm known as generalized iterative scaling [3] and related algorithms that can be proven to converge to the global maximum. On the other hand, due to the convex nature of the criterion (2), we can also use general optimization strategies as e.g. conjugate gradient methods.
The crucial problem in maximum entropy modeling is the choice of the
appropriate feature functions
. In general, these functions
depend on the classification task considered.
The straight forward way to define feature functions for
classification purposes is to directly use the features provided for
the specific task. Consider therefore the following first-order
feature functions for log-linear classification:
![]() |