u n r a v e l - A decipherment Toolkit.

This is the home page of the UNRAVEL decipherment toolkit. It was developed by Malte Nuhn and Julian Schamper at the Human Language Technology and Pattern Recognition Group at the RWTH Aachen University.


new Bugfix Release. UNRAVEL 0.2.

Overview

The goal of UNRAVEL is to provide reusable tools that can be applied to unsupervised learning for Machine Translation.

UNRAVEL implements many of the recently published works on decipherment, including decipherment for deterministic ciphers like e.g. the Zodiac Z408 cipher and part two of the Beale Ciphers, as well as decipherment of probabilistic ciphers and unsupervised training for machine translation. It also includes data and example configuration files so that the previously published experiments are easy to reproduce.

The code base is implemented in C++11 and uses many publicly available libraries: The Google-Glog logging library is used for all logging purposes, the Google-Gflags library is used for providing command line flags, and the GoogleTest library is used for unit testing and consistency checks throughout the code base.

Classes for compressed I/O, access to OpenFST, access to KenLM, representing mappings, n-gram counts, vocabularies, lexicons, etc. are shared across the code base. For building we use the Gnu build system. UNRAVEL can be compiled using Gcc, Icc, and Clang on various Linux distributions and on MacOS X.

Configuration- and data files (if possible from a license point of view) for various experiments are distributed. Amongst others this includes setups for the Zodiac Z-408 and part two of the Beale ciphers (deterministic ciphers), as well as the OPUS corpus and the Verbmobil corpus (probabilistic cipher/Machine Translation).

UNRAVEL contains two tools: det-unravel for decipherment of deterministic ciphers and em-unravel for EM decipherment of probabilistic substitution ciphers and simple Machine Translation tasks.

The core algorithms of UNRAVEL are highly parallelized and have been tested with up to 128 threads.

Terms of Use

UNRAVEL is open source software; it can be redistributed and/or modified under the terms of the RWTH UNRAVEL License. This license includes free usage for non-commercial purposes as long as any changes made to the original software are published under the terms of the same license. Other licenses can be requested. Publications of results obtained through the use of original or modified versions of the software have to cite the authors by refering to the following publications:

For using det_unravel

M. Nuhn, J. Schamper, H. Ney: Beam Search for Solving Substitution Ciphers. In Annual Meeting of the Assoc. for Computational Linguistics, pages 1569-1576, Sofia, Bulgaria, August 2013.

and

M. Nuhn, J. Schamper, H. Ney: Improved Decipherment of Homophonic Ciphers . In Conference on Empirical Methods in Natural Language Processing , pages 1764-1768, Doha, Qatar, October 2014.

For using em_unravel

M. Nuhn, H. Ney: EM Decipherment for Large Vocabularies. In Annual Meeting of the Assoc. for Computational Linguistics, pages 759-764, Baltimore, MD, USA, June 2014.

and

M. Nuhn, A. Mauser, H. Ney: Deciphering Foreign Language by Combining Language Models and Context Vectors. In Annual Meeting of the Assoc. for Computational Linguistics, pages 156-164, Jeju, Republic of Korea, July 2012.


Download

To download the software, you have to accept the license terms.
Name:
Organization:
E-Mail:
I agree to the terms of the RWTH UNRAVEL License

Installation

Make sure that you have installed the following libraries before installing UNRAVEL. Some scripts for the installation of these libraries in your home-directory can be found here: https://github.com/mnuhn/setupenv/

After having downloaded and extracted the UNRAVEL source code, run:
./autogen.sh
cd codec
make
cd ..
make

General Introduction

In the following we will use the machine translation notation and denote the ciphertext with $f_1^N = f_1 \dots f_j \dots f_N$ which consists of cipher tokens $f_j \in V_f$. We denote the plaintext with $e_1^N = e_1 \dots e_i \dots e_N$ (and its vocabulary $V_e$ respectively). We define $e_0 = f_0 = e_{N+1} = f_{N+1} = \$$ with \$ being a special sentence boundary token. We use the abbreviations $\overline V_e = V_e \cup {$}$ and $\overline V_f$ respectively.

Probabilistic Substitution Ciphers

A general or probabilistic substitution cipher uses a table $s(e|f)$ which contains for each cipher token $f$ a probability that the token $f$ is substituted with the plaintext token $e$. Such a table for substituting cipher tokens $V_f={A, B, C, D}$ with plaintext tokens $V_e={a, b, c, d}$ could for example look like $$\begin{array}{|c|c|c|c|c|} \hline &a & b & c & d\\ \hline A & 0.1 & 0.2 & 0.3 & 0.4\\ B & 0.4 & 0.2 & 0.1 & 0.3\\ C & 0.4 & 0.1 & 0.2 & 0.3\\ D & 0.3 & 0.4 & 0.2 & 0.1\\ \hline \end{array}$$ In decipherment in general, this table $s(e|f)$ is unknown, and the task of decipherment is to learn this table from only large amounts of data for languages $e$ and $f$. Probabilistic substituion ciphers can be deciphered using our tool em_unravel.

Simple and Homophonic Substitution Ciphers

The simple substitution cipher encrypts a given plaintext into a ciphertext by replacing each plaintext token with a unique substitute: This means that the table $s(e|f)$ contains all zeroes, except for one ``$1.0$'' per $f\in V_f$ and one ``$1.0$'' per $e\in V_e$. For example the text $$a~bad~cab$$ would be enciphered to $$B~CBA~DBC$$ when using the substitution $$\begin{array}{|c|c|c|c|c|} \hline &a & b & c & d\\ \hline A & 0 & 0 & 0 & 1\\ B & 1 & 0 & 0 & 0\\ C & 0 & 1 & 0 & 0\\ D & 0 & 0 & 1 & 0\\ \hline \end{array}$$ In contrast to the simple substitution cipher, the homophonic substitution cipher allows multiple cipher tokens per plaintext token, which means that the table $s(e|f)$ is all zero, except for one $1.0$ per $f\in V_f$. For example the above plaintext could be enciphered to $$A~BCD~ECF$$ when using the homophonic substitution $$\begin{array}{|c|c|c|c|c|} \hline & a & b & c & d\\ \hline A & 1 & 0 & 0 & 0\\ B & 0 & 1 & 0 & 0\\ C & 1 & 0 & 0 & 0\\ D & 0 & 0 & 0 & 1\\ E & 0 & 0 & 1 & 0\\ F & 0 & 1 & 0 & 0\\ \hline \end{array}$$ We will use the definition $$n_{max} = \max_{e} \sum_{f} s(e|f)$$ to characterize the maximum number of different cipher symbols allowed per plaintext symbol. Simple and Homophonic Substituion ciphers can be deciphered using our tool det_unravel.

det_unravel - Deterministic Ciphers

Core Algorithm

See:

M. Nuhn, J. Schamper, H. Ney: Beam Search for Solving Substitution Ciphers. In Annual Meeting of the Assoc. for Computational Linguistics, pages 1569-1576, Sofia, Bulgaria, August 2013.

and

M. Nuhn, J. Schamper, H. Ney: Improved Decipherment of Homophonic Ciphers . In Conference on Empirical Methods in Natural Language Processing , pages 1764-1768, Doha, Qatar, October 2014.

Crack the Z408 Cipher

Here we will crack the Z408 cipher. Go to the directory where you unzipped and compiled UNRAVEL. Then:
# move to z408 example directory
cd data/tests/det_z408

# download an english 8gram character based LM without blanks from our website
./download_lm.sh

# start decipherment
./run.sh
This will run det_unravel to decipher the z408 cipher. Let's have a look at the config file:

# print information about the best hypothesis (according to its score) while
# doing search. This information contains the mapping and also the number of
# correct mapped symbols (if an correct reference mapping is available)
--print_best

# let google-logging also log to stderr
--alsologtostderr

# let google-logging write log files to the current directory
--log_dir=./

# prepend this prefix to all generated files
--output_prefix=z408_

# for evaluation purposes, the ciphertext symbols have been chosen in a way
# that the plaintext can be read-off instantly from the transcription: e.g.
# the ciphertext symbols e_1, e_2, e_3, ... correspond to the plaintext
# symbol e. this information is ONLY USED FOR EVALUATION, not for
# the following command tries to generate a reference mapping just
# from the ciphertext. This is handy (and also only possible) if the ciphertext
# symbols follow the format ciphertext-symbol=CORRESPONDING PLAINTEXT-SYMBOL_NUMBER
# e.g. like above e_1, e_2, e_3, ... correspond to plaintext symbol e.
--ref_mapping_type=ID_

# specify the homophonic limit: each plaintext symbol can at maximum have 10
# different cipher symbols.
# search can not find the correct mapping if this value is set to low but is able to
# find the correct mapping if it is set unnecessary high. However tight values
# make it easier to find a good solution e.g. if the ciphertext is known as a
# simple substitution cipher (1:1) we recommend setting this option to 1.
--extension_limit_e=10

# specify the lm
--lm=en.noblanks.lm.8gram.gz

# specify the cipher text (count file for different orders needed)
--counts_1=cipher.1gram.gz
--counts_2=cipher.2gram.gz
--counts_3=cipher.3gram.gz
--counts_4=cipher.4gram.gz
--counts_5=cipher.5gram.gz
--counts_6=cipher.6gram.gz
--counts_7=cipher.7gram.gz
--counts_8=cipher.8gram.gz

# choose the ngram extension order strategy type to be ngram. this strategy
# uses a beam search to find an extension order which maximizes the number of
# fixed ngrams, whereas the counts of each order could be weighted differently.
--extorder_type=ngram
--extorder_counts_2_weight=1.0
--extorder_counts_3_weight=1.0
--extorder_counts_4_weight=1.0
--extorder_counts_5_weight=1.0
--extorder_counts_6_weight=1.0
--extorder_counts_7_weight=2.0
--extorder_counts_8_weight=3.0

# this chooses the multi_ngram_feature to drive the search and simultaneously
# chooses 2 as the lowest order of counts to consider for this feature. the
# multi_ngram_count_feature then uses counts from this order up to the highest
# available count order and needs all counts of intermediate counts to be
# available.
--multi_ngram_feature_lowest_order=2
# this is the pruning parameter for the histogram size. several other pruning
# parameters can be defined but are set with good default values if they are not
# specified. different ranges can be used to define a pruning setting for a
# restricted hypothesis cardinality range. e.g. it could be the case that at the
# first search steps pruning should be done only moderately since only small
# context is available for good decisions. however in this case just one range is
# used.
--pruning_range1_histogram_max=75

Find some example det_unravel logging output here.

em_unravel - Probabilistic Ciphers

Introduction about prob ciphers.

Core Algorithm

See:

M. Nuhn, H. Ney: EM Decipherment for Large Vocabularies. In Annual Meeting of the Assoc. for Computational Linguistics, pages 759-764, Baltimore, MD, USA, June 2014.

and

M. Nuhn, A. Mauser, H. Ney: Deciphering Foreign Language by Combining Language Models and Context Vectors. In Annual Meeting of the Assoc. for Computational Linguistics, pages 156-164, Jeju, Republic of Korea, July 2012.

Solve a 1:1 cipher using EM training

Here we will run EM training on a simple 1:1 cipher. Go to the directory where you unzipped and compiled UNRAVEL. Then:
# move to em_1to1 example directory
cd data/tests/em_1to1

# download a german 2+3gram character based LM with blanks from our website
./download_lm.sh

# start decipherment
./run.sh
This will run det_unravel to decipher the z408 cipher. Let's have a look at the config file:

# let google-logging also log to stderr
--alsologtostderr

# let google-logging write log files to the current directory
--log_dir=./

# specify reference corpus
--e=e

# specify cipher corpus
--f=f

# specify language model
--lm=de.blanks.lm.2gram.gz

# use multiple threads
--num_threads=3

# for each source token f, expand the best e's according to p(f|e)
# in this case, 30 is bigger than our vocabulary, so no lexical beaming is performed
--lex_beam_size=30
--lex_beam_prepare=30

# for each lm state, expand the best e's according to LM
# in this case, 30 is bigger than our vocabulary, so no lm beaming is performed
--lm_beam_size=30

# keep only 100 hypotheses nodes for each position during search
--beam_size=100

# do not allow reordering
--permutation_window=0

# do not allow deletions/insertions
--insertion_beam_size=0
--insertions=false
--insertion_penalty=0
--max_deletions=0
--deletion_penalty=1.0

# interpolate lexicon: 0.99 learned lexicon + 0.01 uniform lexicon
--lex_lambda=0.99

# perform 10 iterations
--iters=10

# only use the first 10 sentences, use 2 of those for reporting statistics
--max_sentences=10
--test_sentences=2
Find some example em_unravel logging output here.
Last modified:
Disclaimer
Valid CSS!