BootLog
BootLog is a simple command line utility for computing bootstrap
estimates. In particular, it can be used to judge and compare
the performance of NLP systems, as is demonstrated in the
article cited below.
You can download and use BootLog free of charge under the terms
of the GNU General
Public License.
The use of BootLog in ASR Performance Evaluation is described in
the following article:
M. Bisani and H. Ney: Bootstrap Estimates for Confidence
Intervals in ASR Performance Evaluation. In Proc. IEEE
International Conference on Acoustics, Speech, and Signal
Processing, Montreal, Canada, Vol. 1, pp. 409-412, May 2004.
If you have any questions or suggestions, please
contact me.
Download
 bootlog.tar.gz: distribution
package including the Python source file, examples and documentation
as a gzipped tar file (all platforms)
Installation
BootLog is implemented in Python using the numpy extensions. So the only things to really install
are Python (version 2.2 or later) and numpy, if you don't already have them.
Then simply unpack the distribution file: tar zxf
bootlog.tar.gz
Using BootLog
The following quick start guide illustrates how to obtain some of
the results given in the ICASSP paper. The examples directory
contains the sentence-wise and speaker-wise result files of
"system A" and "system B" on the NAB dev'94 test set
(system[AB]-{sentence,speaker}.tab, respectively), as
used in the paper.
Quick Start Guide

Unpack the distribution and change to the directory
> tar zxf bootlog.tar.gz
> cd bootlog

To compute the bootstrap estimates of word error rate for "system B", type:
> python bootlog.py -p wer examples/systemB-sentence.tab examples/systemB-speaker.tab
== examples/systemB-sentence.tab ==
=== word error rate ===
number of segments: 310
plain estimate: 11.558740 %
number of replications: 10000
bootstrap estimate: 11.553092 % (bias -0.005648) %
bootstrap standard error: 0.767882 %
bootstrap-t confidence intervals:
90% interval: (10.334422931858185, 12.859283476365619)
99% interval: (9.6900269541778972, 13.642527800379712)
== examples/systemB-speaker.tab ==
=== word error rate ===
number of segments: 20
plain estimate: 11.558740 %
number of replications: 10000
bootstrap estimate: 11.572518 % (bias 0.013778) %
bootstrap standard error: 1.807765 %
bootstrap-t confidence intervals:
90% interval: (9.0136528280858172, 14.828798333550319)
99% interval: (7.9880890730191609, 17.288135593220339)
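
The resampling behind these numbers can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the per-segment counts are made up, and the interval shown is a simple percentile interval, whereas BootLog reports bootstrap-t intervals (which additionally studentize each replication).

```python
import random

# Hypothetical per-segment data: (edit errors, reference words).
# BootLog reads such counts from .tab files; the values below are made up.
segments = [(2, 20), (0, 15), (1, 10), (3, 25), (0, 12),
            (1, 18), (2, 22), (0, 9), (1, 14), (4, 30)]

def wer(sample):
    """Word error rate of a sample of segments, in percent."""
    errors = sum(e for e, _ in sample)
    words = sum(w for _, w in sample)
    return 100.0 * errors / words

def bootstrap_wer(segments, replications=2000, seed=0):
    """Resample whole segments with replacement and recompute WER each time."""
    rng = random.Random(seed)
    n = len(segments)
    reps = sorted(
        wer([segments[rng.randrange(n)] for _ in range(n)])
        for _ in range(replications)
    )
    mean = sum(reps) / len(reps)
    se = (sum((r - mean) ** 2 for r in reps) / (len(reps) - 1)) ** 0.5
    # Simple 90% percentile interval (5th to 95th percentile of the replications).
    interval = (reps[int(0.05 * len(reps))], reps[int(0.95 * len(reps)) - 1])
    return mean, se, interval

mean, se, interval = bootstrap_wer(segments)
```

In this picture, the "plain estimate" in the output above is just wer() on the unresampled data, and the reported "bias" is the difference between the bootstrap mean and that plain estimate.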

To compare the two systems, type:
> python bootlog.py -p wer-delta -p wer-poi -j examples/systemA-sentence.tab examples/systemB-sentence.tab
=== word error rate difference ===
number of segments: 310
plain estimate: -0.392051 %
number of replications: 10000
bootstrap estimate: -0.394483 % (bias -0.002433) %
bootstrap standard error: 0.235602 %
bootstrap-t confidence intervals:
90% interval: (-0.77444336882865339, 0.013253810470510718)
99% interval: (-0.99811168060426203, 0.23790642347343294)
=== word error rate probability of improvement ===
number of segments: 310
plain estimate: 100.000000 %
number of replications: 10000
bootstrap estimate: 95.150000 % (bias -4.850000) %
bootstrap standard error: 21.483102 %
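
The comparison works the same way, except that resampling is over matched pairs: each drawn segment carries both systems' error counts, so their correlation is preserved. A sketch under the same caveats (made-up counts; columns assumed to be errors of A, errors of B, and reference words; poi taken as the fraction of replications in which B comes out ahead):

```python
import random

# Hypothetical paired per-segment counts on the SAME segments:
# (errors_A, errors_B, reference_words). All values are made up.
paired = [(3, 2, 20), (1, 1, 15), (2, 1, 10), (0, 0, 25), (4, 3, 30),
          (1, 0, 12), (2, 2, 18), (3, 2, 22), (0, 1, 9), (2, 1, 14)]

def wer_delta(sample):
    """WER of system B minus WER of system A, in percent (negative: B better)."""
    words = sum(w for _, _, w in sample)
    errors_a = sum(a for a, _, _ in sample)
    errors_b = sum(b for _, b, _ in sample)
    return 100.0 * (errors_b - errors_a) / words

def bootstrap_delta_and_poi(paired, replications=2000, seed=0):
    """Matched-pairs bootstrap: resample whole segments, keeping pairs intact."""
    rng = random.Random(seed)
    n = len(paired)
    deltas = [
        wer_delta([paired[rng.randrange(n)] for _ in range(n)])
        for _ in range(replications)
    ]
    mean = sum(deltas) / len(deltas)
    # poi: value of the bootstrapped distribution function of the delta at zero,
    # i.e. the fraction of replications in which system B improves on system A.
    poi = 100.0 * sum(1 for d in deltas if d < 0) / len(deltas)
    return mean, poi

mean_delta, poi = bootstrap_delta_and_poi(paired)
```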

To get more information on available options, type:
> python bootlog.py --help
Going further
BootLog is actually a very general utility allowing you to
bootstrap almost arbitrary statistics: in the examples above we
used predefined expressions with the -p option. Instead, you can
evaluate custom expressions with the -e option. The input
file format is also very general: the number, ordering and naming of
columns is arbitrary. You can define derived columns with the -d
option. When joining tables (with the -j option), rows are
matched on their first columns, so the ordering is not important,
and the sets of row names do not even need to be identical.
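
The row-matching behaviour just described can be pictured with a toy join in plain Python. The row names and values are invented, and the real BootLog parser may behave differently in detail; this sketch simply keeps the rows whose names occur in both tables:

```python
# Two toy tables, each a list of (row name, value) pairs. The orders differ
# and the name sets overlap only partially -- both are fine when matching
# rows on their first column.
table_a = [("seg-001", 2), ("seg-002", 0), ("seg-003", 1)]
table_b = [("seg-003", 2), ("seg-001", 1), ("seg-004", 0)]

def join_on_first_column(a, b):
    """Pair up values of rows whose first-column names occur in both tables."""
    b_by_name = dict(b)
    return {name: (value, b_by_name[name])
            for name, value in a if name in b_by_name}

joined = join_on_first_column(table_a, table_b)
# Only seg-001 and seg-003 occur in both tables.
```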
Please tell
me if you find new uses for BootLog (maybe outside of ASR?).
FAQ
 What's the difference between the matched-pairs bootstrap (and
poi) and McNemar's test as implemented in the NIST tool?

McNemar's test (or Chi-square test) distinguishes only between
correct and incorrect results: when conducted on the sentence
level, it takes only sentence errors into account. If neither
of two systems recognizes a sentence entirely correctly, any
difference in the number of errors is ignored. In
other words, McNemar's test is related to sentence
error rate, not to word error rate.
 What about the matched-pairs test implemented in the NIST
tool?

The basic idea is very similar; however, the NIST test
considers the difference in the number of errors per sentence
without reference to the number of words in the reference
transcript. Therefore it is not directly related to word
error rate when the length of the sentences varies.
 How is probability-of-improvement (poi) related to the
P-value given by the NIST tool?

The idea is the same: both are the value of a
distribution function at zero, i.e. an estimate of the
probability that one system is superior. However, in the NIST
case this distribution describes the difference in the number
of errors per sentence, while poi describes the difference in
word error rate. An additional difference is that the NIST
P-value derives from a Gaussian assumption, while poi derives
from the bootstrapped distribution function.
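
This last distinction can be made concrete. In the toy sketch below, the "replications" of a WER difference are deliberately drawn from a Gaussian, so the empirical distribution function at zero (a poi-style value, expressed here as a probability rather than a percentage) and the Gaussian-assumption value nearly coincide; with real bootstrap replications the two can differ, which is exactly the point:

```python
import math
import random

# Toy replications of a WER difference (made up for illustration only).
rng = random.Random(1)
deltas = [rng.gauss(-0.4, 0.25) for _ in range(10000)]

# poi-style value: empirical distribution function of the replications at zero.
poi = sum(1 for d in deltas if d < 0) / len(deltas)

# NIST-style value under a Gaussian assumption: fit mean and standard
# deviation, then evaluate the normal CDF at zero via the error function.
mean = sum(deltas) / len(deltas)
sd = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (len(deltas) - 1))
gaussian_p = 0.5 * (1.0 + math.erf((0.0 - mean) / (sd * math.sqrt(2.0))))
```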
Maximilian Bisani
Last modified: Tue Jun 1 16:31:52 CEST 2004