rwthlm is a toolkit for training neural network language models (feedforward, recurrent, and long short-term memory neural networks). The software was written by Martin Sundermeyer.



Network format

The parameter NETWORK specifies both the architecture of the neural network LM and the name of the file it is stored in. The format is "name-layer1-layer2-...". Here, "name" can be chosen arbitrarily, but must not contain a dash ("-"). The hidden layers ("layer1", "layer2", ...) are separated by dashes ("-"). Within each layer, the first character specifies the type of the layer and the number that follows specifies its size. The available layer types are:

  1. 'i': linear layer with identity activation function (must be the first layer)
  2. '2'-'9': feedforward input layer with 2, ..., 9 history words (must be the first layer)
  3. 'l': linear layer with tanh activation function
  4. 'L': linear layer with sigmoid activation function
  5. 'r': recurrent layer with tanh activation function
  6. 'R': recurrent layer with sigmoid activation function
  7. 'm' or 'M': LSTM layer (may not be the first layer)

Examples: The argument "example1-i300-m300" creates a network with a linear layer and a subsequent LSTM layer, each comprising 300 neurons. The network "example2-7700-L100" is an 8-gram feedforward network: "7700" denotes an input layer of type '7' and size 700, i.e. a linear layer of size 100 for each of the 7 input words, and "L100" denotes a hidden layer of size 100 with a sigmoid activation function. In all cases, the output layer is appended automatically.
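To make the naming scheme concrete, the following shell sketch splits a NETWORK argument into its name and layers and prints what each layer code means according to the list above. The function describe_network is a hypothetical helper written for this illustration. It is not part of rwthlm, and it does not check the ordering constraints (for example, that an LSTM layer may not be the first layer).

```shell
# Hypothetical helper, not part of rwthlm: explain a NETWORK argument.
# The body runs in a subshell "( ... )" so IFS and all variables stay local.
describe_network() (
  spec=$1
  name=${spec%%-*}     # text before the first dash
  layers=${spec#*-}    # everything after the first dash
  echo "name: $name"
  IFS=-                # split the layer list at dashes
  for layer in $layers; do
    size=${layer#?}          # drop the first character: the layer size
    type=${layer%"$size"}    # what remains is the layer type
    case $type in
      i)     echo "linear layer, identity activation, size $size" ;;
      [2-9]) echo "feedforward input layer, $type history words, size $size" ;;
      l)     echo "linear layer, tanh activation, size $size" ;;
      L)     echo "linear layer, sigmoid activation, size $size" ;;
      r)     echo "recurrent layer, tanh activation, size $size" ;;
      R)     echo "recurrent layer, sigmoid activation, size $size" ;;
      m|M)   echo "LSTM layer, size $size" ;;
      *)     echo "unknown layer type: $type" >&2; exit 1 ;;
    esac
  done
)

describe_network example1-i300-m300
describe_network example2-7700-L100
```

For "example2-7700-L100" the second line of output reports a feedforward input layer with 7 history words and size 700, which matches the reading of "7700" as 7 times 100.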

Command line options


--config arg


--vocab arg

--remap arg


--map-unk arg (=<unk>)

--map-sb arg (=<sb>)


--train arg

--dev arg

--ppl arg

--random-seed arg (=1)

--learning-rate arg

--momentum arg (=0)

--batch-size arg (=1)

--sequence-length arg (=100)

--max-epoch arg (=0)


--word-wrapping arg (=fixed)



--num-oovs arg (=0)

--lambda arg

--look-ahead-semiring arg (=none)


--look-ahead-lm-scale arg

--lm-scale arg (=1)

--pruning-threshold arg

--pruning-limit arg (=0)

--dp-order arg (=3)

--output arg (=lattice)





Remarks

  1. rwthlm trains forever! We did not investigate a reliable stopping criterion. You can either limit the runtime, kill the job when the development perplexity looks good, or specify a number of epochs in advance (option --max-epoch).

  2. rwthlm uses a single sentence boundary token, denoted "<sb>" (not "<s>" and "</s>" like e.g. in SRILM). The name of the token can be changed (option --map-sb). Sentence boundary tokens are inserted automatically. There are probably only two situations where you might actually notice this: (1) If your training data are formatted like this: "<s> text text text ... </s>". Then just delete the sentence-begin and -end tokens. (2) Or in case you want to rescore HTK lattices: Your lattices might include a link which is labeled "</s>". Then it should suffice to rename the rwthlm sentence boundary token to "</s>" via the --map-sb command line argument. Alternatively, you can check out the command line options --set-sb-next-to-last and --set-sb-last.

  3. In rwthlm, the gradient is computed in an exact manner over full sequences, i.e. no clipping of gradient values, truncation of backpropagation through time etc. As a result, training of recurrent LSTM networks is sensitive to the learning rate, and validation perplexities may sometimes fluctuate. In that case, the learning rate is reduced automatically. (The automatic reduction of the learning rate does not work in the very first epoch: If the development perplexity after the first epoch is very poor, restart from a smaller initial learning rate manually.) Simple recurrent neural networks can also be trained, but as we did not investigate the aforementioned approximations, this is less well supported.
  4. If your training run stopped for some reason, you can continue by simply running the same command again. The weight values will be restored from the last epoch that finished successfully.
  5. rwthlm (in fact, OpenMP) always uses all available CPU resources unless you explicitly limit them. This will use four cores, even if there are more:
    export OMP_NUM_THREADS=4
    ./rwthlm --train train.gz ...
  6. rwthlm supports lattices in HTK format, in particular as produced by RASR. It will most likely work with lattices from other ASR decoders as well, provided they fulfill these assumptions: (1) There is a unique initial node with identifier I=0. (2) There is a unique final node (with an arbitrary identifier). (3) If the lattice consists of n nodes, these have identifiers I=0, ..., I=(n-1). (4) If there are m links, these have identifiers J=0, ..., J=(m-1). (5) The lattice has consistent time stamps associated with the nodes, measured in seconds at centisecond precision ("t=1.23").
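Items 2 and 6 of the list above describe two preparation steps that are easy to script. The sketch below defines two hypothetical shell helpers (neither ships with rwthlm). strip_sentence_markers deletes a leading "<s>" and a trailing "</s>" from each line. check_lattice_ids tests assumptions (3) and (4) only, i.e. that node and link identifiers are contiguous and start at 0. It assumes that every node line starts with "I=" and every link line with "J=", and it does not verify the initial node, the final node, or the time stamps.

```shell
# Hypothetical helper: remove SRILM-style sentence markers from each line.
strip_sentence_markers() {
  sed -e 's/^[[:space:]]*<s>[[:space:]]*//' \
      -e 's/[[:space:]]*<\/s>[[:space:]]*$//'
}

# Hypothetical helper: check that node ids are 0..n-1 and link ids are 0..m-1.
# Assumes node lines start with "I=" and link lines start with "J=".
check_lattice_ids() {
  awk '
    /^I=/ { sub(/^I=/, "", $1); node[$1] = 1; n++ }
    /^J=/ { sub(/^J=/, "", $1); link[$1] = 1; m++ }
    END {
      for (i = 0; i < n; i++)
        if (!(i in node)) { print "missing node I=" i; bad = 1 }
      for (j = 0; j < m; j++)
        if (!(j in link)) { print "missing link J=" j; bad = 1 }
      if (!bad) print "ok: " n " nodes, " m " links"
      exit bad + 0
    }'
}

# Small self-contained demonstration on inline data.
printf '%s\n' '<s> text text text </s>' | strip_sentence_markers
printf 'I=0 t=0.00\nI=1 t=0.50\nJ=0 S=0 E=1 W=hi\n' | check_lattice_ids
```

With real data, the helpers would typically sit in a pipeline after "gzip -dc", for example "gzip -dc train.gz | strip_sentence_markers | gzip > train.clean.gz", where train.clean.gz is an illustrative output name.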


Usage example

  1. Create word classes (e.g., with mkcls, word2vec, or brown-cluster). We found that 1000 or 2000 word classes work well for a 200K vocabulary.

  2. Training:
      ./rwthlm \
        --vocab mkcls.classes \
        --unk \
        --train train.gz \
        --dev dev.gz \
        --batch-size 4 \
        --learning-rate 1e-3 \
        --max-epoch 30 \
        example-i100-m100
  3. Perplexity Evaluation:
      ./rwthlm \
        --vocab mkcls.classes \
        --unk \
        --ppl test.gz \
        --verbose \
        example-i100-m100
    Word-wise output in SRILM style:
    Reading vocabulary from file 'mkcls.classes' ...
    Reading neural network from file 'example-i100-m100' ...
    Best development perplexity after 17 epochs: 90.686305815766076
    Computing perplexity for file 'test.gz' ...
        p( bonjour | ... )   = [1gram] 0.00046433 [ -3.33317 ]
        p( professeur | ... )    = [1gram] 0.00011872 [ -3.92548 ]
        p( philippe | ... )      = [1gram] 0.00224524 [ -2.64874 ]
        p( even | ... )      = [1gram] 0.00352221 [ -2.45318 ]
        p( <sb> | ... )      = [1gram] 0.50998864 [ -0.29244 ]
        p( bonjour | ... )   = [1gram] 0.00775432 [ -2.11046 ]
        p( monsieur | ... )      = [1gram] 0.03103678 [ -1.50812 ]
        p( fogiel | ... )    = [1gram] 0.00007777 [ -4.10920 ]
        p( <sb> | ... )      = [1gram] 0.55661332 [ -0.25445 ]
        p( vous | ... )      = [1gram] 0.06344076 [ -1.19763 ]
        p( etes | ... )   = [1gram] 0.19047053 [ -0.72017 ]
        p( une | ... )   = [1gram] 0.01625033 [ -1.78914 ]
        p( sommite | ... )    = [1gram] 0.00000365 [ -5.43802 ]
        p( dans | ... )      = [1gram] 0.03445965 [ -1.46269 ]
        p( le | ... )    = [1gram] 0.22801776 [ -0.64203 ]
        p( monde | ... )     = [1gram] 0.17175484 [ -0.76509 ]

    You can grep for "<TAB>p" (or awk '/\tp/') and use this for easy interpolation with Kneser-Ney models (compute-best-mix from SRILM).

  4. Lattice rescoring:
    ./rwthlm \
      --vocab mkcls.classes \
      --unk \
      --lambda 0.599535 \
      --pruning-threshold 200 \
      --dp-order 9 \
      --look-ahead-semiring none \
      --lm-scale 38 \
      --output expanded-lattice \ \
    Output is,, and
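The word-wise output shown above under "Perplexity Evaluation" is also easy to post-process. As a sketch, the hypothetical helper below (ppl_from_verbose, not part of rwthlm) reads such output on standard input. It takes the bracketed value at the end of each "p( ... )" line, which in the listing above equals the base-10 logarithm of the probability, and prints the perplexity over all listed tokens, including "<sb>". Whether this matches the perplexity that rwthlm itself reports is an assumption and has not been verified here.

```shell
# Hypothetical helper: perplexity from word-wise output read on stdin.
ppl_from_verbose() {
  awk '
    $1 == "p(" {          # word-wise lines: p( word | ... ) = [..] prob [ log10 ]
      sum += $(NF - 1)    # second-to-last field is log10 of the probability
      n++
    }
    END {
      if (n == 0) exit 1                          # no word-wise lines seen
      printf "%.4f\n", exp(-sum / n * log(10))    # 10^(-mean log10 p)
    }'
}

# Demonstration on the first two lines of the listing above.
printf '\tp( bonjour | ... ) \t= [1gram] 0.00046433 [ -3.33317 ]\n\tp( professeur | ... ) \t= [1gram] 0.00011872 [ -3.92548 ]\n' \
  | ppl_from_verbose
```

If the output was saved to a file (say verbose.log, an illustrative name), the call would be "ppl_from_verbose < verbose.log". The per-word probabilities needed for interpolation are in the field before the opening bracket, i.e. $(NF - 3) in the same awk program.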