In the summer term of 2008, a lab course at RWTH, supervised by Thomas Deselaers and Arne Mauser, took part in the Data Mining Cup, in which players' behavior in a lottery had to be predicted from a set of personal attributes. A variety of pattern recognition and machine learning software was tried during the classification process. Unfortunately, the Java implementation (WEKA) of the promising Logistic Model Trees could not cope with the huge training set of 42,000,000,023 lottery players in terms of memory consumption and computation time. Our C++ implementation, based on the paper by Landwehr et al., tries to overcome those problems.
Download the sources (lmt-1.0.tar.gz), extract the archive, change to its directory and run
> make
-t <train.jf>        loads <train.jf> as training file (excludes -M)
-T <test.jf>         loads <test.jf> as test file (optional)
-m <save.model>      saves model to <save.model> (optional)
-M <load.model>      loads model from <load.model> (excludes -t)
-i <cv_iterations>   specifies number of iterations (default: 200)
-s <min_split>       specifies minimum split residual (?) (default: 15)
-p                   writes probabilities to stdout (optional)
-h                   prints this help and exits (excludes all)
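The option list above can be turned into a small helper that assembles a command line and rejects the one combination the help text rules out (-t together with -M). This is only a sketch: the function name build_lmt_args and its parameter names are hypothetical, and the binary name LMT is taken from the usage examples on this page.

```python
def build_lmt_args(train=None, test=None, save_model=None, load_model=None,
                   iterations=None, min_split=None, probabilities=False,
                   binary="LMT"):
    """Assemble an argument list for the LMT command line (hypothetical helper)."""
    # The help text states that -t and -M exclude each other.
    if train is not None and load_model is not None:
        raise ValueError("-t and -M exclude each other")
    args = [binary]
    # Options that take a value, in the order of the help text.
    valued_options = (
        ("-t", train),
        ("-T", test),
        ("-m", save_model),
        ("-M", load_model),
        ("-i", iterations),
        ("-s", min_split),
    )
    for flag, value in valued_options:
        if value is not None:
            args += [flag, str(value)]
    # -p is a plain switch without a value.
    if probabilities:
        args.append("-p")
    return args


train_args = build_lmt_args(train="train.jf", save_model="new-model.txt")
test_args = build_lmt_args(test="test.jf", load_model="old-model.txt",
                           probabilities=True)
```

The resulting lists could be passed to subprocess.run from the Python standard library. Options left at None are omitted, so the defaults of the tool itself apply.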
The Joerg-file format consists of one header line, one footer line and one line per data set. Values are separated by tabs or spaces, lines are separated by newlines. The header consists of two integer values: the first specifies the number of different classes, the second gives the number of features per data vector. In every data set, the first value specifies its class as an integer. All other values are either integer or floating point numerals. Note that the decimal separator is a dot ".", not a comma ",". The footer is always a "-1" on a single line. A simple example would look like:
2 3
0 1 2 23.42
1 0 0 0
0 2 4 46.84
1 23 42 0.001
-1
There are 2 classes and 3 features per vector. Note that classes have to start from 0. No missing values are allowed in general.
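To check a file against the rules above before training, a small reader along the following lines may help. It is a minimal sketch, not part of LMT: the name parse_jf is hypothetical, blank lines are skipped as a convenience, and the upper bound on class labels (smaller than the number of classes) is an assumption derived from "classes have to start from 0".

```python
def parse_jf(text):
    """Parse Joerg-file text into (num_classes, num_features, samples)."""
    # Values are separated by tabs or spaces, lines by newlines.
    # Assumption: blank lines are ignored.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        raise ValueError("need at least a header and a footer line")
    header, body, footer = rows[0], rows[1:-1], rows[-1]
    if len(header) != 2:
        raise ValueError("header must consist of two integers")
    num_classes, num_features = int(header[0]), int(header[1])
    if footer != ["-1"]:
        raise ValueError('footer must be a single "-1"')
    samples = []
    # Row numbers count non-blank lines, with the header as row 1.
    for row_number, row in enumerate(body, start=2):
        if len(row) != num_features + 1:
            raise ValueError(
                f"row {row_number}: expected {num_features + 1} values")
        label = int(row[0])
        # Assumption: valid classes are 0 .. num_classes - 1.
        if not 0 <= label < num_classes:
            raise ValueError(f"row {row_number}: class {label} out of range")
        # float() rejects a comma as decimal separator, as the format demands.
        samples.append((label, [float(value) for value in row[1:]]))
    return num_classes, num_features, samples


example = "2 3\n0 1 2 23.42\n1 0 0 0\n0 2 4 46.84\n1 23 42 0.001\n-1\n"
num_classes, num_features, samples = parse_jf(example)
```

Here example is the file shown above. A row with a missing value or a comma as decimal separator raises a ValueError instead of being passed on silently.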
Once you've trained a model, you can plot it by calling
> ./model2ps.sh mymodel.txt
which will generate mymodel.txt.ps. The "root"-node is a virtual one, it is not part of your model. Each node contains 4 lines:
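Train a model on train.jf and save it to new-model.txt:
> LMT -t train.jf -m new-model.txt
Load the model from old-model.txt and use test.jf as test file:
> LMT -T test.jf -M old-model.txt
Same as before, but also write the probabilities to stdout, in the form shown below the command:
> LMT -T test.jf -M old-model.txt -p
C_real C_predicted p(C1|x) p(C2|x) ... p(Cn|x)
Train on train.jf and use test.jf as test file in one run:
> LMT -t train.jf -T test.jf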
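The probability output written with -p can be post-processed with a few lines of script, for example to compute the accuracy. The sketch below assumes that each output line holds whitespace-separated values in the order C_real, C_predicted, p(C1|x) ... p(Cn|x), with one line per test vector. That layout is an assumption based on the format line above, the function names are hypothetical, and the sample lines are made up for illustration.

```python
def parse_probability_line(line):
    """Split one output line into (real class, predicted class, probabilities)."""
    fields = line.split()
    real, predicted = int(fields[0]), int(fields[1])
    probabilities = [float(value) for value in fields[2:]]
    return real, predicted, probabilities


def accuracy(lines):
    """Fraction of lines whose predicted class equals the real class."""
    rows = [parse_probability_line(line) for line in lines if line.strip()]
    if not rows:
        raise ValueError("no output lines to evaluate")
    correct = sum(1 for real, predicted, _ in rows if real == predicted)
    return correct / len(rows)


# Made-up sample lines for a two-class problem, not real LMT output.
sample_output = ["0 0 0.9 0.1", "1 0 0.6 0.4", "1 1 0.2 0.8"]
sample_accuracy = accuracy(sample_output)
```

If LMT prints anything else to stdout next to these lines, those lines would have to be filtered out before calling accuracy.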
| Patrick Doetsch | doetsch@i6.informatik.rwth-aachen.de |
| Pavel Golik | golik@i6.informatik.rwth-aachen.de |
| Johannes Laudenberg | johannes.laudenberg@rwth-aachen.de |
| Dirk Rothe | d.rothe@semantics.de |
LMT is open source software issued under the GNU General Public License.