rwthlm
A toolkit for training neural network language models (feedforward, recurrent, and long short-term memory neural networks). The software was written by Martin Sundermeyer.
Usage
rwthlm [OPTION]... [LATTICE]... NETWORK
Network format
The parameter NETWORK specifies both the architecture of the neural network LM and its file name. The format is "name-layer1-layer2-...". Here, "name" can be chosen arbitrarily, but must not contain a dash ("-"). The hidden layers ("layer1", "layer2", ...) are separated by dashes ("-"); for each layer, the first character specifies its type and the following number its size. The available layer types are:
- 'i': linear layer with identity activation function (must be the first layer)
- '2'-'9': feedforward input layer with 2, ..., 9 history words (must be the first layer)
- 'l': linear layer with tanh activation function
- 'L': linear layer with sigmoid activation function
- 'r': recurrent layer with tanh activation function
- 'R': recurrent layer with sigmoid activation function
- 'm' or 'M': LSTM layer (may not be the first layer)
Examples: The argument "example1-i300-m300" creates a network with a linear layer followed by an LSTM layer, each comprising 300 neurons. The network "example2-7700-L100" is an 8-gram feedforward network with a linear input layer of 100 units per history word (7 x 100 = 700 units in total) and a hidden layer of size 100 with a sigmoid activation function. In all cases, the output layer is appended automatically.
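As a further illustration, derived only from the format rules above (these network names are made up, not shipped examples): "example3-i200-m200-m200" would denote a linear layer of size 200 followed by two stacked LSTM layers of 200 neurons each, and "example4-i100-r100" a simple recurrent network with a linear layer and a tanh recurrent layer of 100 neurons each.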
Command line options
--help
- Show the help message.
--config arg
- Use a config file (the format is that of Boost.Program_options).
--verbose
- Verbose program output. This is mainly useful for computing perplexities and for estimating training times: when computing perplexities in verbose mode, a probability is printed for each word of the text file; during training, a running average of the training perplexity and the time needed to process a mini-batch are printed.
--vocab arg
- Define the vocabulary file. This is a text file containing one word per line or, if word classes are used for speed-up, lines of the form "word<TAB>class_number".
--remap arg
- The --vocab option allows arbitrary word classes. However, internally, rwthlm may use different (but equivalent) word and class indices. This option writes the word and class indices that are actually used to the given file.
--unk
- Use a closed vocabulary.
--map-unk arg (=<unk>)
- Rename the unknown token.
--map-sb arg (=<sb>)
- Rename the sentence boundary token.
--debug-no-sb
- For debugging purposes, automatic insertion of "<sb>" tokens can be switched off. Use with caution.
--train arg
- Training text file (automatic decompression of gzipped files).
--dev arg
- Development data file for early stopping (automatic decompression of gzipped files).
--ppl arg
- Text file for perplexity evaluation (automatic decompression of gzipped files).
--random-seed arg (=1)
- Random number generator seed (used for shuffling the data and initializing the neural network weight parameters).
--learning-rate arg
- Initial learning rate. If you do not specify an initial learning rate, rwthlm will try to come up with an initial guess. However, the strategy is not too elaborate ... use with caution.
--momentum arg (=0)
- Momentum parameter.
--batch-size arg (=1)
- Maximum number of sequences (or n-grams in case of feedforward networks) evaluated in parallel.
--sequence-length arg (=100)
- Maximum length of a sequence as used in training and perplexity evaluation.
--max-epoch arg (=0)
- Maximum number of epochs to train, zero means unlimited.
--no-shuffling
- Do not shuffle the training data (shuffling is always performed on a sequence level).
--word-wrapping arg (=fixed)
- The words need to be wrapped into sequences that are eventually used for training or perplexity evaluation. Three options are possible: "concatenated" (concatenate consecutive sentences up to the maximum sequence length), "fixed" (wrap as soon as a sequence has exactly the specified length), or "verbatim" (sequences as delimited by newlines in the text corpus).
--feedforward
- The default is to train neural networks with epochwise backpropagation through time. You can train a feedforward model in this way, but most likely you will want to use standard backpropagation, which is activated with this switch. Make sure not to have recurrent layers in your network configuration when using this switch.
--no-bias
- Do not use any bias.
--num-oovs arg (=0)
- Specify the difference between the recognition vocabulary size and the neural network LM vocabulary size. This option is for lattice rescoring and perplexity evaluation only. If the neural network LM vocabulary does not include all words of your recognizer (which probably makes use of a larger Kneser-Ney model), use this option to enforce proper normalization. For example, with a 200,000-word recognition vocabulary and a 150,000-word neural network LM vocabulary, you would pass --num-oovs 50000.
--lambda arg
- Interpolation weight of neural network LM for lattice rescoring (the remaining weight goes to the LM probabilities already present on the lattice links).
--look-ahead-semiring arg (=none)
- Which look-ahead to use for lattice rescoring. Possible values are "none", "tropical" (single-best look-ahead value) or "log" (sum over all future paths). The terminology follows the semiring names used for WFSTs.
--dependent
- The default is to rescore multiple lattices independently of each other. With this option, the best state from the previous lattice is used as initialization for the current lattice. Make sure the lattices are provided in temporal order at the command line.
--look-ahead-lm-scale arg
- If you want, you can specify an individual LM scale for lattice rescoring look ahead.
--lm-scale arg (=1)
- LM scale for lattice decoding.
--pruning-threshold arg
- Beam pruning threshold for lattice rescoring, zero means unlimited.
--pruning-limit arg (=0)
- Maximum number of hypotheses per lattice node, zero means unlimited.
--dp-order arg (=3)
- Dynamic programming order for lattice rescoring (recombination pruning).
--output arg (=lattice)
- How to dump the result of lattice rescoring: "ctm" (NIST CTM format), "lattice" (lattice with the same topology as the input lattice), or "expanded-lattice" (lattice that is somewhat larger than the input lattice, where the overhead in size depends on your pruning parameters).
--clear-initial-links
- Set scores of initial links in HTK lattice to zero (only for compatibility with n-best lists as created by the RWTH ASR system).
--set-sb-next-to-last
- Set the label of next-to-last links in a lattice to "<sb>" (only for compatibility with lattices as created by the RWTH ASR system; the default label in RWTH ASR is epsilon).
--set-sb-last
- Set the label of last links in a lattice to "<sb>" (only for compatibility with lattices created by the RWTH ASR system; the default label in RWTH ASR is epsilon).
IMPORTANT
- rwthlm trains forever! We did not investigate a reliable stopping criterion. You can either limit the runtime, kill the job when the development perplexity looks good, or specify a number of epochs in advance (option --max-epoch).
- rwthlm uses a single sentence boundary token, denoted "<sb>" (not "<s>" and "</s>" as used, e.g., in SRILM). The name of the token can be changed (option --map-sb). Sentence boundary tokens are inserted automatically. There are probably only two situations where you might actually notice this: (1) Your training data are formatted like "<s> text text text ... </s>". In this case, just delete the sentence-begin and -end tokens (a sketch follows below). (2) You want to rescore HTK lattices whose links include a label "</s>". Then it should suffice to rename the rwthlm sentence boundary token to "</s>" via the --map-sb command line argument. Alternatively, have a look at the command line options --set-sb-next-to-last and --set-sb-last.
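A minimal sketch for stripping such markers with standard tools (assuming a plain-text corpus train.txt and gzip on your system; this is not part of rwthlm):
sed -e 's/<s>//g' -e 's|</s>||g' train.txt | gzip > train.gz
If desired, the resulting extra whitespace can be squeezed as well, e.g. by inserting tr -s ' ' into the pipeline.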
- In rwthlm, the gradient is computed in an exact manner over full sequences, i.e. no clipping of gradient values, truncation of backpropagation through time etc. As a result, training of recurrent LSTM networks is sensitive to the learning rate, and validation perplexities may sometimes fluctuate. In that case, the learning rate is reduced automatically. (The automatic reduction of the learning rate does not work in the very first epoch: If the development perplexity after the first epoch is very poor, restart from a smaller initial learning rate manually.) Simple recurrent neural networks can also be trained, but as we did not investigate the aforementioned approximations, this is less well supported.
- If your training run stopped for some reason, you can continue by just calling the same command again. The weight values will be restored from the last epoch that was successfully finished.
- rwthlm (in fact, OpenMP) always uses all available CPU resources unless you explicitly limit them. For example, the following restricts rwthlm to four cores, even if more are available:
export OMP_NUM_THREADS=4
./rwthlm --train train.gz ...
- rwthlm supports lattices in HTK format, in particular as produced by RASR. It will most likely also work with lattices from other ASR decoders, provided they fulfill these assumptions: (1) There is a unique initial node with identifier I=0. (2) There is a unique final node (with an arbitrary identifier). (3) If the lattice consists of n nodes, these have identifiers I=0, ..., I=(n-1). (4) If there are m links, these have identifiers J=0, ..., J=(m-1). (5) The lattice has consistent time stamps associated with the nodes, measured in seconds at centisecond precision ("t=1.23").
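For orientation, a toy lattice in HTK SLF notation satisfying these assumptions might look as follows (this example is made up and purely illustrative; words, times, and scores are arbitrary, and only fields relevant to the assumptions are shown):
VERSION=1.0
N=3 L=2
I=0 t=0.00
I=1 t=0.61
I=2 t=1.23
J=0 S=0 E=1 W=hello a=-152.3 l=-2.31
J=1 S=1 E=2 W=world a=-187.9 l=-3.10
Here, node I=0 is the unique initial node, node I=2 the unique final node, node and link identifiers are consecutive, and the t= fields carry the node time stamps in seconds.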
Examples
Create word classes (e.g., with mkcls or word2vec or brown-cluster). We found that 1000 or 2000 word classes work well for a 200K vocabulary.
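Whichever tool you use, the class file passed to --vocab only has to follow the "word<TAB>class_number" format described above. A made-up snippet (words and class numbers are arbitrary) could look like this:
the	17
of	17
bonjour	243
monsieur	243
philippe	805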
- Training:
./rwthlm \
  --vocab mkcls.classes \
  --unk \
  --train train.gz \
  --dev dev.gz \
  --batch-size 4 \
  --learning-rate 1e-3 \
  --max-epoch 30 \
  example-i100-m100
- Perplexity Evaluation:
./rwthlm \
  --vocab mkcls.classes \
  --unk \
  --ppl test.gz \
  --verbose \
  example-i100-m100
Word-wise output in SRILM style:

Reading vocabulary from file 'mkcls.classes' ...
Reading neural network from file 'example-i100-m100' ...
Best development perplexity after 17 epochs: 90.686305815766076
Computing perplexity for file 'test.gz' ...
perplexity:
	p( bonjour | ... ) = [1gram] 0.00046433 [ -3.33317 ]
	p( professeur | ... ) = [1gram] 0.00011872 [ -3.92548 ]
	p( philippe | ... ) = [1gram] 0.00224524 [ -2.64874 ]
	p( even | ... ) = [1gram] 0.00352221 [ -2.45318 ]
	p( <sb> | ... ) = [1gram] 0.50998864 [ -0.29244 ]
	p( bonjour | ... ) = [1gram] 0.00775432 [ -2.11046 ]
	p( monsieur | ... ) = [1gram] 0.03103678 [ -1.50812 ]
	p( fogiel | ... ) = [1gram] 0.00007777 [ -4.10920 ]
	p( <sb> | ... ) = [1gram] 0.55661332 [ -0.25445 ]
	p( vous | ... ) = [1gram] 0.06344076 [ -1.19763 ]
	p( etes | ... ) = [1gram] 0.19047053 [ -0.72017 ]
	p( une | ... ) = [1gram] 0.01625033 [ -1.78914 ]
	p( sommite | ... ) = [1gram] 0.00000365 [ -5.43802 ]
	p( dans | ... ) = [1gram] 0.03445965 [ -1.46269 ]
	p( le | ... ) = [1gram] 0.22801776 [ -0.64203 ]
	p( monde | ... ) = [1gram] 0.17175484 [ -0.76509 ]
	...
You can grep for "<TAB>p" (or awk '/\tp/') and use this for easy interpolation with Kneser-Ney models (compute-best-mix from SRILM).
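As a sketch of such an interpolation (assuming SRILM is installed, a Kneser-Ney model kn.lm.gz exists, and the word-wise lines above are written to standard output; all file names here are placeholders):
./rwthlm --vocab mkcls.classes --unk --ppl test.gz --verbose example-i100-m100 | awk '/\tp/' > rwthlm.probs
ngram -lm kn.lm.gz -ppl test.gz -debug 2 > kn.probs
compute-best-mix kn.probs rwthlm.probs
The weight that compute-best-mix assigns to the rwthlm stream can then serve as a starting point for the --lambda option in lattice rescoring.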
- Lattice rescoring
./rwthlm \
  --vocab mkcls.classes \
  --unk \
  --lambda 0.599535 \
  --pruning-threshold 200 \
  --dp-order 9 \
  --look-ahead-semiring none \
  --lm-scale 38 \
  --output expanded-lattice \
  lattice1.lat.gz lattice2.lat.gz lattice3.lat.gz \
  example-i100-m100
Output is lattice1.lat.rescored.gz, lattice2.lat.rescored.gz, and lattice3.lat.rescored.gz.
License
See license.
Download
Make sure you agree to the license.
Download rwthlm v0.12 here.
- When publishing results based on this software, please cite
- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney: "rwthlm - The RWTH Aachen University Neural Network Language Modeling Toolkit", Proc. of Interspeech 2014, pp. 2093-2097
References
- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney: "LSTM Neural Networks for Language Modeling", Proc. of Interspeech 2012
- Martin Sundermeyer, Ilya Oparin, Jean-Luc Gauvain, Ben Freiberg, Ralf Schlüter, and Hermann Ney: "Comparison of Feedforward and Recurrent Neural Network Language Models", Proc. of ICASSP 2013, pp. 8430-8434
- Martin Sundermeyer, Zoltán Tüske, Ralf Schlüter, and Hermann Ney: "Lattice Decoding and Rescoring with Long-Span Neural Network Language Models", Proc. of Interspeech 2014, pp. 661-665
- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney: "rwthlm - The RWTH Aachen University Neural Network Language Modeling Toolkit", Proc. of Interspeech 2014, pp. 2093-2097
- Martin Sundermeyer, Hermann Ney, and Ralf Schlüter: "From Feedforward to Recurrent LSTM Neural Networks for Language Modeling", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 3, March 2015, pp. 517-529