Logistic Model Trees - C++ Implementation

1 Rationale
2 Download & Installation
3 Usage
4 Credits
5 License
6 Todo

1 Rationale

In the summer term of 2008, a lab course at the RWTH, supervised by Thomas Deselaers and Arne Mauser, took part in the Data Mining Cup, where players' behavior in a lottery had to be predicted based on a set of personal attributes. A variety of pattern recognition and machine learning software was tried and used during the classification process. Unfortunately, the Java implementation (WEKA) of the promising Logistic Model Trees was not able to cope with the huge training set of 42,000,000,023 lottery players in terms of memory consumption and computation time. Our C++ implementation, based on the paper by Landwehr et al., tries to overcome those problems.

2 Download & Installation

Download the source archive lmt-1.0.tar.gz, extract it, change into the extracted directory and run

> make

3 Usage

3.1 Command-line arguments

-t <train.jf>       loads <train.jf> as training file        (excludes -M)
-T <test.jf>        loads <test.jf> as test file             (optional)
-m <save.model>     saves model to <save.model>              (optional)
-M <load.model>     loads model from <load.model>            (excludes -t)
-i <cv_iterations>  specifies number of iterations           (default: 200)
-s <min_split>      specifies minimum split residual (?)     (default: 15)
-p                  writes probabilities to stdout           (optional)
-h                  prints this help and exits               (excludes all)

3.2 Joerg-file format

The Joerg-file format consists of one header line, one footer line and one line per data vector. Values are separated by tabs or spaces, lines are separated by newlines. The header consists of two integer values: the first specifies the number of different classes, the second gives the number of features per data vector. In every data line the first value specifies the class as an integer. All other values are either integer or floating point numerals. Note that the decimal separator is a dot ".", not a comma ",". The footer is always a "-1" on a single line. A simple example would look like:

2 3
0 1 2 23.42
1 0 0 0
0 2 4 46.84
1 23 42 0.001
-1

There are 2 classes and 3 features per vector. Note that classes have to start from 0. No missing values are allowed in general.
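
To illustrate the format rules above, here is a minimal sketch of a reader for Joerg files. It is not taken from the LMT sources: the names JoergSample, JoergData, readJoerg and parseJoerg are hypothetical. It treats any whitespace as a separator, so it does not check that each data vector sits on its own line.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container types for illustration; not part of LMT.
struct JoergSample {
    int label = 0;                 // class index, starting from 0
    std::vector<double> features;  // one value per feature
};

struct JoergData {
    int numClasses = 0;
    int numFeatures = 0;
    std::vector<JoergSample> samples;
};

// Reads header, data vectors and the "-1" footer from `in`.
// Throws std::runtime_error if the input does not follow the format.
JoergData readJoerg(std::istream& in) {
    JoergData data;
    if (!(in >> data.numClasses >> data.numFeatures) ||
        data.numClasses <= 0 || data.numFeatures < 0) {
        throw std::runtime_error("invalid header");
    }
    int label = 0;
    while (in >> label) {
        if (label == -1) {
            return data;  // footer reached
        }
        if (label < 0 || label >= data.numClasses) {
            throw std::runtime_error("class index out of range");
        }
        JoergSample sample;
        sample.label = label;
        sample.features.resize(static_cast<std::size_t>(data.numFeatures));
        for (double& value : sample.features) {
            if (!(in >> value)) {
                throw std::runtime_error("missing or non-numeric feature value");
            }
        }
        assert(sample.features.size() ==
               static_cast<std::size_t>(data.numFeatures));
        data.samples.push_back(std::move(sample));
    }
    throw std::runtime_error("missing -1 footer");
}

// Convenience wrapper for parsing a file that is already held in memory.
JoergData parseJoerg(const std::string& text) {
    std::istringstream in(text);
    return readJoerg(in);
}
```

Running readJoerg on the example above should yield 2 classes, 3 features and 4 samples. A file on disk can be read by passing a std::ifstream to readJoerg.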

3.3 Visualization

Once you've trained a model, you can plot it by calling

> ./model2ps.sh mymodel.txt
which will generate mymodel.txt.ps. The "root" node is a virtual one; it is not part of your model. Each node contains 4 lines:
  1. node id (only interesting for plotting)
  2. feature number to split on (starts with 1, since 0 is the class number)
  3. split point
  4. number of instances in the node

3.4 Examples for basic operations

> LMT -t train.jf -m new-model.txt
trains on "train.jf" data set and writes the tree to "new-model.txt"

> LMT -T test.jf -M old-model.txt
applies the tree from "old-model.txt" to "test.jf" data set to perform classification. Use "0" in test.jf as class number for unclassified test instances.

> LMT -T test.jf -M old-model.txt -p
same as above, but additionally writes the class probabilities for each vector to stdout. The resulting format is:
C_real C_predicted p(C1|x) p(C2|x) ... p(Cn|x)
where C_real is the class given in the test.jf.
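
If you want to post-process this output, the following minimal sketch parses one such line and computes the fraction of correctly classified vectors. It assumes that the values on a line are separated by whitespace, which is not stated above. The names Prediction, parsePredictionLine and accuracy are hypothetical and not part of LMT.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one line of "-p" output; not part of LMT.
struct Prediction {
    int real = 0;               // C_real: class given in test.jf
    int predicted = 0;          // C_predicted: class chosen by the model
    std::vector<double> probs;  // class probabilities in the order printed
};

// Parses "C_real C_predicted p ... p" (whitespace-separated, an assumption).
// Returns false and leaves `out` untouched if the two class fields are missing.
bool parsePredictionLine(const std::string& line, Prediction& out) {
    std::istringstream in(line);
    Prediction p;
    if (!(in >> p.real >> p.predicted)) {
        return false;
    }
    double value = 0.0;
    while (in >> value) {
        p.probs.push_back(value);
    }
    out = p;
    return true;
}

// Fraction of predictions whose predicted class equals the given class.
double accuracy(const std::vector<Prediction>& preds) {
    assert(!preds.empty());
    std::size_t correct = 0;
    for (const Prediction& p : preds) {
        if (p.real == p.predicted) {
            ++correct;
        }
    }
    return static_cast<double>(correct) / static_cast<double>(preds.size());
}
```

The accuracy is only meaningful if test.jf contains the true classes. For unclassified instances entered with class "0", C_real is just that placeholder.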

> LMT -t train.jf -T test.jf
trains on "train.jf" and directly tests its results on "test.jf"

4 Credits

Patrick Doetsch <doetsch@i6.informatik.rwth-aachen.de>
Pavel Golik <golik@i6.informatik.rwth-aachen.de>
Johannes Laudenberg <johannes.laudenberg@rwth-aachen.de>
Dirk Rothe <d.rothe@semantics.de>

RWTH Aachen University, i6, DMC Lab 2008
http://www-i6.informatik.rwth-aachen.de/dmclab

5 License

LMT is open source software issued under the GNU General Public License.

6 Todo
