UNRAVEL - Decipherment Toolkit

./autogen.sh
cd codec
make
cd ..
make

# move to z408 example directory
cd data/tests/det_z408

# download an English 8-gram character-based LM without blanks from our website
./download_lm.sh

# start decipherment
./run.sh

# print information about the best hypothesis (according to its score) during
# search. This information contains the mapping and also the number of
# correctly mapped symbols (if a correct reference mapping is available)
--print_best

# let google-logging also log to stderr
--alsologtostderr

# let google-logging write log files to the current directory
--log_dir=./

# prepend this prefix to all generated files
--output_prefix=z408_

# for evaluation purposes, the ciphertext symbols have been chosen so that
# the plaintext can be read off directly from the transcription: e.g.
# the ciphertext symbols e_1, e_2, e_3, ... correspond to the plaintext
# symbol e. this information is ONLY USED FOR EVALUATION, not for
# decipherment. the following option tries to generate a reference mapping
# just from the ciphertext. This is handy (and only possible) if the
# ciphertext symbols follow the format PLAINTEXT-SYMBOL_NUMBER, e.g. as
# above, e_1, e_2, e_3, ... correspond to the plaintext symbol e.
--ref_mapping_type=ID_
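
# a minimal sketch of how such a reference mapping could be read off the
# symbol names, assuming every ciphertext symbol has the form
# PLAINTEXT-SYMBOL_NUMBER. the function name reference_mapping is hypothetical
# and not part of the toolkit.

```python
def reference_mapping(cipher_symbols):
    """Map each ciphertext symbol 'x_n' to the plaintext symbol 'x' (illustrative only)."""
    mapping = {}
    for symbol in cipher_symbols:
        # Split at the last underscore: 'e_1' -> ('e', '_', '1').
        plain, sep, number = symbol.rpartition("_")
        if not sep or not plain or not number.isdigit():
            raise ValueError(f"symbol {symbol!r} does not follow PLAINTEXT-SYMBOL_NUMBER")
        mapping[symbol] = plain
    return mapping


if __name__ == "__main__":
    print(reference_mapping(["e_1", "e_2", "t_1"]))
```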

# specify the homophonic limit: each plaintext symbol can have at most 10
# different cipher symbols.
# search cannot find the correct mapping if this value is set too low, but it
# can still find the correct mapping if it is set unnecessarily high. However,
# tight values make it easier to find a good solution, e.g. if the ciphertext
# is known to be a simple substitution cipher (1:1) we recommend setting this
# option to 1.
--extension_limit_e=10
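
# a minimal sketch of the constraint expressed by the homophonic limit. the
# function name respects_extension_limit and the dict representation of a
# mapping (cipher symbol -> plaintext symbol) are assumptions made for
# illustration, not the toolkit's internals.

```python
from collections import Counter


def respects_extension_limit(mapping, limit):
    """Return True if no plaintext symbol is assigned to more than `limit` cipher symbols."""
    usage = Counter(mapping.values())  # plaintext symbol -> number of cipher symbols
    return all(count <= limit for count in usage.values())


if __name__ == "__main__":
    homophonic = {"c1": "e", "c2": "e", "c3": "t"}
    print(respects_extension_limit(homophonic, 1))  # 'e' has two cipher symbols
    print(respects_extension_limit(homophonic, 2))
```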

# specify the language model (LM)
--lm=en.noblanks.lm.8gram.gz

# specify the ciphertext (one count file per n-gram order is needed)
--counts_1=cipher.1gram.gz
--counts_2=cipher.2gram.gz
--counts_3=cipher.3gram.gz
--counts_4=cipher.4gram.gz
--counts_5=cipher.5gram.gz
--counts_6=cipher.6gram.gz
--counts_7=cipher.7gram.gz
--counts_8=cipher.8gram.gz
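
# a minimal sketch of what "counts of different orders" means for a ciphertext
# given as a sequence of symbols. it only computes the counts in memory; the
# on-disk format of the count files is not described here, so nothing is
# written. the function name ngram_counts is hypothetical.

```python
from collections import Counter


def ngram_counts(symbols, order):
    """Count all n-grams of the given order in a sequence of cipher symbols."""
    return Counter(
        tuple(symbols[i:i + order]) for i in range(len(symbols) - order + 1)
    )


if __name__ == "__main__":
    cipher = ["e_1", "t_1", "e_2", "t_1", "e_1", "t_1"]
    for order in range(1, 4):  # the example above uses orders 1 to 8
        print(order, ngram_counts(cipher, order).most_common(2))
```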

# choose the extension order strategy type to be ngram. this strategy
# uses a beam search to find an extension order which maximizes the number of
# fixed ngrams, where the counts of each order can be weighted differently.
--extorder_type=ngram
--extorder_counts_2_weight=1.0
--extorder_counts_3_weight=1.0
--extorder_counts_4_weight=1.0
--extorder_counts_5_weight=1.0
--extorder_counts_6_weight=1.0
--extorder_counts_7_weight=2.0
--extorder_counts_8_weight=3.0
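
# a minimal sketch of a weighted "fixed ngram" score, assuming an ngram counts
# as fixed once all of its symbols have already been chosen in the extension
# order. this reading, the function name fixed_ngram_score and the data layout
# are assumptions; the toolkit's actual scoring may differ.

```python
def fixed_ngram_score(counts_by_order, fixed_symbols, weights):
    """Sum weight[order] * count over all ngrams made up only of fixed symbols."""
    score = 0.0
    for order, counts in counts_by_order.items():
        weight = weights.get(order, 0.0)  # orders without a weight contribute nothing
        for ngram, count in counts.items():
            if all(symbol in fixed_symbols for symbol in ngram):
                score += weight * count
    return score


if __name__ == "__main__":
    counts = {
        2: {("a", "b"): 2, ("b", "c"): 1},
        3: {("a", "b", "a"): 1},
    }
    print(fixed_ngram_score(counts, {"a", "b"}, {2: 1.0, 3: 2.0}))
```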

# this chooses the multi_ngram_feature to drive the search and simultaneously
# sets 2 as the lowest order of counts to consider for this feature. the
# multi_ngram_count_feature then uses counts from this order up to the highest
# available count order and needs the counts of all intermediate orders to be
# available.
--multi_ngram_feature_lowest_order=2

# this is the pruning parameter for the histogram size. several other pruning
# parameters can be defined, but they fall back to good default values if they
# are not specified. different ranges can be used to define a pruning setting
# for a restricted range of hypothesis cardinalities. e.g. during the first
# search steps pruning should possibly be only moderate, since only little
# context is available for good decisions. in this example, however, just one
# range is used.
--pruning_range1_histogram_max=75
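
# a minimal sketch of histogram pruning in general: after each search step,
# only a fixed number of best-scoring hypotheses is kept. the function name
# histogram_prune and the (score, hypothesis) tuples are assumptions made for
# illustration and do not mirror the toolkit's data structures.

```python
import heapq


def histogram_prune(hypotheses, histogram_max):
    """Keep the `histogram_max` best hypotheses; higher scores are better."""
    return heapq.nlargest(histogram_max, hypotheses, key=lambda hyp: hyp[0])


if __name__ == "__main__":
    beam = [(-3.0, "mapping A"), (-1.0, "mapping B"), (-2.0, "mapping C")]
    print(histogram_prune(beam, 2))
```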

# move to em_1to1 example directory
cd data/tests/em_1to1

# download a German 2- and 3-gram character-based LM with blanks from our website
./download_lm.sh

# start decipherment
./run.sh

# let google-logging also log to stderr
--alsologtostderr

# let google-logging write log files to the current directory
--log_dir=./

# specify reference corpus
--e=e

# specify cipher corpus
--f=f

# specify language model
--lm=de.blanks.lm.2gram.gz

# use multiple threads
--num_threads=3

# for each source token f, expand the best e's according to p(f|e)
# in this case, 30 is bigger than our vocabulary, so no lexical beaming is performed
--lex_beam_size=30
--lex_beam_prepare=30
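
# a minimal sketch of lexical beaming as described above: for a source token
# f, only the best candidates e according to p(f|e) are expanded. the function
# name lexical_beam and the nested-dict lexicon layout are assumptions made
# for illustration.

```python
def lexical_beam(lexicon, f, beam_size):
    """Return up to `beam_size` candidates e for token f, best p(f|e) first.

    `lexicon[f]` is assumed to map each candidate e to p(f|e).
    """
    ranked = sorted(lexicon[f].items(), key=lambda item: item[1], reverse=True)
    return [e for e, _ in ranked[:beam_size]]


if __name__ == "__main__":
    lexicon = {"x": {"a": 0.1, "e": 0.7, "t": 0.2}}
    print(lexical_beam(lexicon, "x", 2))
    # A beam larger than the vocabulary keeps every candidate (no beaming).
    print(lexical_beam(lexicon, "x", 30))
```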

# for each lm state, expand the best e's according to LM
# in this case, 30 is bigger than our vocabulary, so no lm beaming is performed
--lm_beam_size=30

# keep only 100 hypothesis nodes for each position during search
--beam_size=100

# do not allow reordering
--permutation_window=0

# do not allow deletions/insertions
--insertion_beam_size=0
--insertions=false
--insertion_penalty=0
--max_deletions=0
--deletion_penalty=1.0

# interpolate lexicon: 0.99 learned lexicon + 0.01 uniform lexicon
--lex_lambda=0.99
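
# a minimal sketch of the interpolation described above, assuming the uniform
# lexicon assigns probability 1/V to each of V cipher symbols. that choice of
# V, the function name interpolate_lexicon and the dict layout are assumptions
# made for illustration.

```python
def interpolate_lexicon(learned, vocab_size, lex_lambda=0.99):
    """Mix learned probabilities p(f|e) with a uniform distribution over vocab_size symbols."""
    uniform = 1.0 / vocab_size
    return {
        f: lex_lambda * p + (1.0 - lex_lambda) * uniform
        for f, p in learned.items()
    }


if __name__ == "__main__":
    learned = {"x": 1.0, "y": 0.0}  # learned p(f|e) for one fixed e
    print(interpolate_lexicon(learned, vocab_size=2))
```

# the small uniform share keeps every entry strictly positive, so a symbol
# pair that the learned lexicon has driven to zero can still be recovered in a
# later iteration.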

# perform 10 iterations
--iters=10

# only use the first 10 sentences, use 2 of those for reporting statistics
--max_sentences=10
--test_sentences=2

Overview

Terms of Use

Download

Installation

General Introduction

Probabilistic Substitution Ciphers

Simple and Homophonic Substitution Ciphers

det_unravel - Deterministic Ciphers

Core Algorithm

Crack the Z408 Cipher

em_unravel - Probabilistic Ciphers

Core Algorithm

Solve a 1:1 cipher using EM training
