MorphSegmenter

Introduction

This document describes how to run MorphSegmenter, an HMM-based segmenter that uses a morphological analyzer. The segmenter is based on previous work on MorphTagger for Arabic and, originally, on MorphTagger for Hebrew.

MorphSegmenter is a general architecture for segmentation. It uses a morphological analyzer to generate all possible analyses (including segmentations) of a word, and a disambiguation model to choose the best analysis from the morphological analyzer output. In this work, we implemented MorphSegmenter for Arabic using the Buckwalter Analyzer and an HMM trained on the Arabic Treebank part 1 v3.0.
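
The generate-then-disambiguate idea can be pictured with a toy shell sketch. It is not the HMM used by MorphSegmenter: the function pick_best, the two candidate analyses of the Buckwalter-transliterated word "wktb", and their log scores are all invented for illustration, and each candidate is scored in isolation rather than in sentence context.

```shell
#!/bin/sh
# Toy illustration only: candidates and log scores below are invented,
# and pick_best is a hypothetical helper, not part of MorphSegmenter.

# pick_best reads "score<TAB>analysis" lines on stdin and prints the
# analysis with the highest score.
pick_best() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k1,1nr | head -n 1 | cut -f2
}

# Two hypothetical analyses of "wktb": unsegmented, and with the
# conjunction "w+" split off.
best=$(printf '%s\t%s\n' \
    '-4.2' 'wktb' \
    '-1.3' 'w+ ktb' | pick_best)
echo "$best"
```

In the real system the analyzer supplies the candidates and the HMM scores whole sequences, so the choice for one word can depend on its neighbours.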

Installation

MorphSegmenter is available in source form and is implemented in Perl, so no compilation is required.

To run MorphSegmenter, one needs:

Perl (standard with Linux distributions)
SRILM toolkit

Running MorphSegmenter

tar -xvf MorphTagger_light.tgz
cd MorphTagger_light
cd test
../MTSeg.sh -srilm $SRILMDIR -dir . -lm ../model_atb1v3/corpus.lm -lex ../model_atb1v3/corpus.lex.prob aljazeera.txt

$SRILMDIR should point to the directory that contains the SRILM binaries (e.g. ~/src/srilm/bin/i686).
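
Before running the segmenter it can help to confirm that $SRILMDIR really points at a directory of executables. The following is a minimal sketch; check_srilm_dir is a hypothetical helper, not part of the distribution, and it only checks that the directory contains at least one executable file, not which SRILM tools are present.

```shell
#!/bin/sh
# check_srilm_dir DIR (hypothetical helper): succeed only if DIR exists and
# holds at least one executable file. It does not verify which SRILM tools
# are installed.
check_srilm_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    for f in "$dir"/*; do
        if [ -f "$f" ] && [ -x "$f" ]; then
            return 0
        fi
    done
    echo "no executables found in: $dir" >&2
    return 1
}

# Usage sketch (path taken from the example above; adjust to your install):
#   SRILMDIR=$HOME/src/srilm/bin/i686
#   check_srilm_dir "$SRILMDIR" && ../MTSeg.sh -srilm "$SRILMDIR" -dir . \
#       -lm ../model_atb1v3/corpus.lm -lex ../model_atb1v3/corpus.lex.prob aljazeera.txt
```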

Creating a model

./MTLearn.pl -dir . learn

This creates the model files corpus.lm and corpus.lex.prob in the current directory. "learn" is a plain-text file that should be formatted similarly to this Hebrew learn file.
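
After training, a quick check that both model files were written can catch a failed run early. This is a sketch under the assumption that a successful run leaves both files non-empty; check_model is a hypothetical helper, not part of the distribution.

```shell
#!/bin/sh
# check_model [DIR] (hypothetical helper): succeed if corpus.lm and
# corpus.lex.prob both exist in DIR (default: current directory) and are
# non-empty.
check_model() {
    dir=${1:-.}
    for f in corpus.lm corpus.lex.prob; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
    echo "model files present in $dir"
}

# Usage sketch:
#   ./MTLearn.pl -dir . learn && check_model .
```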

Terms of Use

MorphTagger/Segmenter is distributed under the GNU General Public License (GPL).

Publications of results obtained through the use of original or modified versions of the software have to cite the authors by referring to the following publication:

S. Mansour. MorphTagger: HMM-Based Arabic Segmentation for Statistical Machine Translation. In Proceedings of the Seventh International Workshop on Spoken Language Translation (IWSLT), pages 321--327, Paris, France, December 2010.

Download

To download the software, you have to accept the license terms. Please fill out the form. The information submitted is only for internal usage and will not be given to third parties.

Name:
Organization:
E-Mail:
Version: