This document describes how to run MorphSegmenter, an HMM-based segmenter that uses a morphological analyzer. The segmenter builds on previous work on MorphTagger for Arabic and, originally, on MorphTagger for Hebrew.
MorphSegmenter is a general architecture for segmentation. It uses a morphological analyzer to generate all possible analyses (including segmentations) of a word, and a disambiguation model to choose the best analysis from the morphological analyzer output. In this work, we implemented MorphSegmenter for Arabic using the Buckwalter Analyzer and an HMM trained on the Arabic Treebank part 1 v3.0.
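The disambiguation step can be written down as a standard HMM decoding problem. The formula below is a sketch under assumptions, not a statement of the exact model: the symbols w_i (words), a_i (analyses), A(w_i) (the set of analyses the morphological analyzer returns for w_i) and the start symbol a_0 are introduced here for illustration, and a first-order (bigram) transition model is assumed for simplicity. The actual model order used by MorphSegmenter is not stated in this document.

```latex
\hat{a}_1^n \;=\; \operatorname*{arg\,max}_{a_1^n \,\in\, A(w_1) \times \dots \times A(w_n)}
\;\prod_{i=1}^{n} P(a_i \mid a_{i-1}) \, P(w_i \mid a_i)
```

Under this reading, the transition probabilities P(a_i | a_{i-1}) would come from a language model over analyses and the emission probabilities P(w_i | a_i) from a lexical model. The analyzer restricts the search space, so the HMM only ranks analyses that are morphologically valid for each word.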
MorphSegmenter is available in source form and is implemented in Perl, so no compilation is required.
To run MorphSegmenter, one needs:
tar -xvf MorphTagger_light.tgz
cd MorphTagger_light
cd test
../MTSeg.sh -srilm $SRILMDIR -dir . -lm ../model_atb1v3/corpus.lm -lex ../model_atb1v3/corpus.lex.prob aljazeera.txt
$SRILMDIR should point to the directory that contains the SRILM binaries (e.g. ~/src/srilm/bin/i686).
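A wrong $SRILMDIR is an easy mistake to make, so it can help to check the directory before starting a run. The following is a minimal sketch, not part of the distribution: check_srilm_dir is a hypothetical helper, and the tool names passed to it are up to the caller, since this document does not list which SRILM binaries MorphSegmenter calls.

```shell
#!/bin/sh
# check_srilm_dir DIR TOOL...   (hypothetical helper, not shipped with MorphSegmenter)
# Succeeds only if DIR is a directory and every TOOL is an executable file in it.
check_srilm_dir() {
    dir=$1
    shift
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "SRILM directory not found: '$dir'" >&2
        return 1
    fi
    for tool in "$@"; do
        # Require a regular file with the executable bit set.
        if [ ! -f "$dir/$tool" ] || [ ! -x "$dir/$tool" ]; then
            echo "missing SRILM binary: $dir/$tool" >&2
            return 1
        fi
    done
    return 0
}
```

For example, check_srilm_dir "$SRILMDIR" ngram would confirm that the SRILM ngram binary is present before ../MTSeg.sh is invoked. ngram is a real SRILM tool, but whether it is the one MorphSegmenter needs is an assumption.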
Create Model
./MTLearn.pl -dir . learn
This creates the model files corpus.lm and corpus.lex.prob in the current directory. "learn" is a plain-text file which should be formatted similarly to this Hebrew learn file.
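After training, it is worth confirming that both model files were actually written before pointing the segmenter at them. The sketch below assumes only the two file names given above. check_model_files is a hypothetical helper, and it checks that the files exist and are non-empty, not that their contents are valid.

```shell
#!/bin/sh
# check_model_files [DIR]   (hypothetical helper, not shipped with MorphSegmenter)
# Succeeds only if corpus.lm and corpus.lex.prob exist in DIR and are non-empty.
# DIR defaults to the current directory.
check_model_files() {
    model_dir=${1:-.}
    for f in corpus.lm corpus.lex.prob; do
        # -s is true only for an existing file with size greater than zero.
        if [ ! -s "$model_dir/$f" ]; then
            echo "model file missing or empty: $model_dir/$f" >&2
            return 1
        fi
    done
    echo "model files present in $model_dir"
}
```

Running check_model_files . right after ./MTLearn.pl -dir . learn gives a quick signal that training produced output. The same directory can then be used for the -lm and -lex arguments of MTSeg.sh.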
MorphTagger/Segmenter is distributed under the GNU General Public License (GPL).
Publications of results obtained with original or modified versions of the software must cite the authors by referring to the following publication:
S. Mansour. MorphTagger: HMM-Based Arabic Segmentation for Statistical Machine Translation. In Proceedings of the Seventh International Workshop on Spoken Language Translation (IWSLT), pages 321--327, Paris, France, December 2010.
To download the software, you have to accept the license terms. Please fill out the form. The information submitted is only for internal usage and will not be given to third parties.