README-File of YASMET 1.0

----------------------------------------
Yet Another Small MaxEnt Toolkit: YASMET
----------------------------------------
(written by Franz Josef Och; June 1st 2001)


(1) GENERATE EVENT FILE

An example event file is:

==================================================
3
0 # a x s v # b s q s x x # c s f w #
0 # a s s w q # b e f s 1 a # 2 c c f s a #
1 # a x s v # b s q s s x x # c s f w #
1 # a s s w q # b e s f s a # c c f m n t z s a #
2 # a x s v # b s q s x x # c s f w #
2 # a s s w q # b e f e q s a # c c f s a #
0 # a s s s w q # b e f f h s a # c c f s a #
1 # a x s s v # b s q s i k x x # c s f w #
2 # a s s z w q # b e f s a # c l m c f s a #
==================================================

The first number in the first line (3) is the number of classes.
Every following line stands for one event. The first number in a
line gives the class of this event. The '#' symbol separates the
feature sets that fire if a specific class occurs. A feature can be
any string. Please note that every event line must contain exactly
N+1 '#' symbols if there are N classes; otherwise the file is not
valid.


(2) GIS-TRAINING

SHELL> cat EventFile | ME.out > MuFile

This optimizes the parameters of the maximum-entropy model with
Generalized Iterative Scaling (GIS) and writes the result to the
file 'MuFile'.


(3) TEST-CORPUS

SHELL> cat EventFile | ME.out MuFile

This computes the probability of every class for every event and
the test-corpus perplexity:

====================================================
0.279895 0.420937 0.299168
1 0 0
0.248529 0.485829 0.265642
0.456161 0.543839 0
0.279895 0.420937 0.299168
0.289253 0.200896 0.509851
0.424436 0 0.575564
0.112457 0.795069 0.0924743
0 0.128044 0.871956
pp: 1.88557
====================================================


(4) COUNT-BASED FEATURE SELECTION

You might want to reduce the number of different features:

SHELL> cat EventFile | ME.out -red 3 > EventFile.red3

This removes from the EventFile every feature that occurs 3 times
or less.
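The event-file format above can be read with a short Python sketch. The function name read_event_file is illustrative and not part of YASMET; the parsing rules (first line N, then one event per line with N+1 '#' symbols, plus the optional "$ count" after the class number introduced in section (8)) follow the description in this README:

```python
def read_event_file(lines):
    """Parse a YASMET-style event file.

    First line: number of classes N.
    Each following line: 'class [$ count] # set_0 # ... # set_N-1 #'.
    Returns (N, [(class, count, [feature_set, ...]), ...]).
    """
    it = iter(lines)
    n_classes = int(next(it).strip())
    events = []
    for line in it:
        if not line.strip():
            continue
        head, *sets = line.split('#')
        head = head.split()
        cls = int(head[0])
        # Optional observation count, written as '$ count' (section 8).
        count = float(head[2]) if len(head) > 2 and head[1] == '$' else 1.0
        feature_sets = [s.split() for s in sets[:-1]]  # final '#' ends the line
        if len(feature_sets) != n_classes:
            # A valid line carries exactly N+1 '#' symbols.
            raise ValueError("event line needs N+1 '#' symbols: " + line)
        events.append((cls, count, feature_sets))
    return n_classes, events

example = """3
0 # a x s v # b s q s x x # c s f w #
0 # a s s w q # b e f s 1 a # 2 c c f s a #
"""
n, evs = read_event_file(example.splitlines())
print(n, len(evs), evs[0])
```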
Then you can continue with steps (2) and (3).


(5) SMOOTHING

SHELL> cat EventFile | ME.out -dN 0.2 > MuFile.dN0.2

This smooths the observation counts. The smoothing method is
absolute discounting.


(6) LENGTH NORMALIZATION

SHELL> cat EventFile | ME.out -lNorm > MuFile.lNorm

This performs a length normalization by dividing each feature value
by the number of features occurring for a certain event. Thereby
the effective feature sum per event is 1.0.


(7) ADJUSTING THE NUMBER OF ITERATIONS

SHELL> cat EventFile | ME.out -iter 50 -deltaPP 0.001

These two parameters adjust the number of GIS iterations. In this
example GIS stops as soon as fifty iterations have been performed
OR the improvement in training-corpus perplexity is smaller than
0.001.


(8) SETTING OBSERVATION COUNTS

You can change the observation count of an observation (default: 1)
by writing "$ count" after the correct class number. For example:

===========================================================
0 $ 3 # a x s v # b s q s x x # c s f w #
0 $ 4 # a s s w q # b e f s 1 a # 2 c c f s a #
===========================================================

Here, the first observation has count 3 and the second observation
has count 4.


(9) SIMPLE CLASSIFICATION PROBLEM EXAMPLE

If you have a classification problem with N classes, the first line
of the event file contains the number N. Every following line then
contains
- at the first position the correct class of the event, encoded as
  a number 0 ... N-1,
- then a space-separated "#",
- then N sequences of space-separated strings specifying the
  features, each sequence terminated by a "#".
A feature can be encoded as an arbitrary string.

Assume you want to classify documents into N different classes.
Every document consists of words. As features you would like to use
features that state that a certain word W occurs in a certain class
C.
Then an ideal feature representation is, for example, simply the
concatenation of W and C (W_C). For example, the feature that the
word "money" occurs in class "2" could be denoted "money_2". Hence,
if you have a document that contains only the word "money" and the
document is in class 2, then the line would look like:

2 # money_0 # money_1 # money_2 # money_3 #

Don't be surprised by the redundancy of this representation. It has
the advantage that it is very general. If the document contains an
additional word "fire", then the line would look like:

2 # money_0 fire_0 # money_1 fire_1 # money_2 fire_2 # money_3 fire_3 #

All the events are then described in the lines of the event file.
Hence, if you have 1000 documents, your event file will contain
1001 lines (remember: the first line contains the number of
classes).


(10) SMOOTHING WITH GAUSSIAN PRIORS

You can smooth the feature values using the method of Gaussian
priors. You simply have to specify a parameter

-smooth sigma

where sigma is the standard deviation of the Gaussian prior. The
theory is described in the paper by S. Chen and R. Rosenfeld:
"A Gaussian Prior for Smoothing Maximum Entropy Models" (1999).


(11) OTHER THINGS

The toolkit has some other options which can be used and might help
to reduce the error rate.
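The W_C encoding of section (9) is mechanical, so generating event lines can be automated. This is a sketch; document_event_line is an illustrative helper name, not part of YASMET:

```python
def document_event_line(label, words, n_classes):
    """Build one YASMET event line for a document using the W_C
    encoding: every word W appears as feature W_c in the feature
    set of every class c."""
    sets = [" ".join(f"{w}_{c}" for w in words) for c in range(n_classes)]
    return f"{label} # " + " # ".join(sets) + " #"

# A document in class 2 containing the words "money" and "fire",
# with N = 4 classes, reproduces the line shown in section (9):
line = document_event_line(2, ["money", "fire"], 4)
print(line)
# -> 2 # money_0 fire_0 # money_1 fire_1 # money_2 fire_2 # money_3 fire_3 #
```

Writing the number of classes on the first line, followed by one such line per document, yields a complete event file for step (2).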