BootLog

BootLog is a simple command-line utility for computing bootstrap estimates. In particular, it can be used to judge and compare the performance of automatic speech recognition (ASR) systems, as demonstrated in the article cited below.

You can download and use BootLog free of charge under the terms of the GNU General Public License.

The use of BootLog in ASR performance evaluation is described in the following article:

M. Bisani and H. Ney: Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, Vol. 1, pp. 409-412, May 2004

If you have any questions or suggestions, please contact me.

Download

bootlog.tar.gz: distribution package including Python source file, examples and documentation as a gzipped-tar file (all platforms)

Installation

BootLog is implemented in Python using the numpy extensions, so the only prerequisites are Python (version 2.2 or later) and numpy. Install them if you don't already have them, then simply unpack the distribution file: tar zxf bootlog.tar.gz

Using BootLog

The following quick start guide illustrates how to obtain some of the results given in the ICASSP paper. The examples directory contains the sentence-wise and speaker-wise result files of "system A" and "system B" on the NAB dev'94 test set (system-[AB]-{sentence,speaker}.tab respectively), as used in the paper.

Quick Start Guide

Unpack the distribution and change to the directory:
```
> tar zxf bootlog.tar.gz
> cd bootlog
```

To compute the bootstrap estimates of word error rate for "system B", type:

```
> python bootlog.py -p wer examples/system-B-sentence.tab examples/system-B-speaker.tab
== examples/system-B-sentence.tab ==
=== word error rate ===
number of segments:       310
plain estimate:           11.558740 %
number of replications:   10000
bootstrap estimate:       11.553092 % (bias 0.005648) %
bootstrap standard error: 0.767882 %
bootstrap-t confidence intervals:
    90% interval:       (10.334422931858185, 12.859283476365619)
    99% interval:       (9.6900269541778972, 13.642527800379712)

== examples/system-B-speaker.tab ==
=== word error rate ===
number of segments:       20
plain estimate:           11.558740 %
number of replications:   10000
bootstrap estimate:       11.572518 % (bias -0.013778) %
bootstrap standard error: 1.807765 %
bootstrap-t confidence intervals:
    90% interval:       (9.0136528280858172, 14.828798333550319)
    99% interval:       (7.9880890730191609, 17.288135593220339)
```
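
The resampling idea behind figures like these can be sketched in a few lines of Python. This is a minimal illustration, not BootLog's own code: the function name bootstrap_wer, the (errors, reference words) input layout, and the toy numbers are assumptions, and the interval shown is a simple percentile interval rather than the bootstrap-t interval that BootLog reports. The bias is computed as plain estimate minus bootstrap estimate, which matches the signs visible in the output above.

```python
import random

def bootstrap_wer(segments, replications=1000, seed=0):
    """Bootstrap the word error rate (in percent).

    segments: list of (errors, reference_words) pairs, one per segment,
    each with reference_words > 0 (assumed input layout, not BootLog's).
    """
    rng = random.Random(seed)
    n = len(segments)

    def wer(sample):
        # WER pools errors and words over the whole sample.
        errors = sum(e for e, _ in sample)
        words = sum(w for _, w in sample)
        return 100.0 * errors / words

    plain = wer(segments)
    estimates = []
    for _ in range(replications):
        # Draw n segments with replacement and recompute the statistic.
        resample = [segments[rng.randrange(n)] for _ in range(n)]
        estimates.append(wer(resample))

    mean = sum(estimates) / replications
    variance = sum((x - mean) ** 2 for x in estimates) / (replications - 1)
    estimates.sort()
    # Percentile interval (a simpler stand-in for bootstrap-t).
    low = estimates[replications * 5 // 100]
    high = estimates[replications * 95 // 100 - 1]
    return {
        "plain": plain,
        "bootstrap": mean,
        "bias": plain - mean,
        "stderr": variance ** 0.5,
        "interval90": (low, high),
    }

# Made-up toy data: (errors, reference words) for four segments.
toy_segments = [(1, 10), (0, 8), (3, 12), (2, 9)]
print(bootstrap_wer(toy_segments))
```

Resampling whole segments (sentences or speakers) rather than single words is what makes the sentence-wise and speaker-wise files give different standard errors for the same plain estimate.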

To compare the two systems, type:

```
> python bootlog.py -p werdelta -p werpoi -j examples/system-A-sentence.tab examples/system-B-sentence.tab
=== word error rate difference ===
number of segments:       310
plain estimate:           -0.392051 %
number of replications:   10000
bootstrap estimate:       -0.394483 % (bias 0.002433) %
bootstrap standard error: 0.235602 %
bootstrap-t confidence intervals:
    90% interval:       (-0.77444336882865339, -0.013253810470510718)
    99% interval:       (-0.99811168060426203, 0.23790642347343294)

=== word error rate probability of improvement ===
number of segments:       310
plain estimate:           100.000000 %
number of replications:   10000
bootstrap estimate:       95.150000 % (bias 4.850000) %
bootstrap standard error: 21.483102 %
```
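
A paired comparison of this kind can be sketched as follows. Again this is only an illustration under stated assumptions, not BootLog's implementation: the names wer_delta and bootstrap_compare, the (errors of first system, errors of second system, reference words) layout, and the sign convention (second system minus first, with a negative difference counted as an improvement) are all hypothetical. The key point is that both systems are evaluated on the same resampled segments in each replication, and that the probability of improvement is the fraction of replications in which the difference falls below zero.

```python
import random

def wer_delta(sample):
    """WER difference in percent (second system minus first) on one sample."""
    words = sum(w for _, _, w in sample)
    diff = sum(b - a for a, b, _ in sample)
    return 100.0 * diff / words

def bootstrap_compare(pairs, replications=1000, seed=0):
    """Paired bootstrap of the WER difference and probability of improvement.

    pairs: list of (errors_first, errors_second, reference_words), one per
    segment present in both result tables (assumed layout).
    """
    rng = random.Random(seed)
    n = len(pairs)
    deltas = []
    for _ in range(replications):
        # The same resampled segments are used for both systems.
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        deltas.append(wer_delta(resample))
    improved = sum(1 for d in deltas if d < 0)
    return {
        "plain_delta": wer_delta(pairs),
        "bootstrap_delta": sum(deltas) / replications,
        "poi": 100.0 * improved / replications,
    }

# Made-up toy data: (errors first system, errors second system, words).
toy_pairs = [(2, 1, 10), (0, 1, 8), (3, 2, 12), (2, 2, 9)]
print(bootstrap_compare(toy_pairs))
```

If each replication contributes a 0-or-1 indicator, the standard error of a 95.15 % estimate is sqrt(0.9515 × 0.0485) ≈ 21.48 %, which agrees with the value printed above.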

To get more information on available options, type:
```
> python bootlog.py --help
```

Going further

BootLog is actually a very general utility that lets you bootstrap almost arbitrary statistics. The examples above use predefined expressions, selected with the -p option; you can instead evaluate custom expressions with the -e option. The input file format is also very general: the number, ordering and naming of columns are arbitrary, and you can define derived columns with the -d option. When joining tables (with the -j option), rows are matched on their first column, so the ordering is not important, and the sets of row names do not even need to be identical. Please tell me if you find new uses for BootLog (maybe outside of ASR?).
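
To make the row matching described above concrete, here is a minimal sketch of a join on the first column. It is not BootLog's code: the tab-separated layout with a header line, the helper names read_table and join_tables, the _a/_b column suffixes, and the choice to drop rows that appear in only one table are all assumptions made for illustration.

```python
import csv
import io

def read_table(text):
    """Parse tab-separated text into {row_name: {column: value}}.

    Assumes a header line and that the first column holds the row name.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    table = {}
    for fields in reader:
        if fields:  # skip empty lines
            table[fields[0]] = dict(zip(header[1:], fields[1:]))
    return table

def join_tables(left, right):
    """Match rows by name; row order in either table does not matter."""
    joined = {}
    for name, left_row in left.items():
        if name not in right:
            continue  # assumption: rows missing from one table are dropped
        row = {col + "_a": val for col, val in left_row.items()}
        row.update({col + "_b": val for col, val in right[name].items()})
        joined[name] = row
    return joined

# Made-up toy tables with different row order and different row sets.
table_a = read_table("id\terrors\twords\ns1\t2\t10\ns2\t0\t8\n")
table_b = read_table("id\terrors\twords\ns2\t1\t8\ns3\t4\t12\ns1\t1\t10\n")
print(join_tables(table_a, table_b))
```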

FAQ

What's the difference between the matched-pairs bootstrap (and poi) and McNemar's test as implemented in the NIST tool?
McNemar's test (or chi-square test) distinguishes only between correct and incorrect results: when conducted on the sentence level, it takes only sentence errors into account. If neither of the two systems recognizes a sentence entirely correctly, any difference in the number of errors is ignored. In other words, McNemar's test is related to sentence error rate, not to word error rate.

What about the matched-pairs test implemented in the NIST tool?
The basic idea is very similar; however, the NIST test considers the difference in the number of errors per sentence without reference to the number of words in the reference transcript. Therefore it is not directly related to word error rate when the length of the sentences varies.

How is probability of improvement (poi) related to the P-value given by the NIST tool?
The idea is the same: both are the value of the distribution function at zero, i.e. an estimate of the probability that one system is superior. However, in the NIST case this distribution describes the difference in the number of errors per sentence, while poi describes the difference in word error rate. An additional difference is that the NIST P-value derives from a Gaussian assumption, while poi derives from the bootstrapped distribution function.
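
The first answer can be illustrated with a small made-up example. The sketch below is not the NIST implementation; the function names and the data are hypothetical. It only counts the discordant sentence pairs that a sentence-level McNemar test is based on and compares them with the word error rates of the same two systems.

```python
def mcnemar_discordant(pairs):
    """Count sentences that exactly one of the two systems gets fully correct.

    pairs: list of (errors_first, errors_second, reference_words).
    Returns (only_first_correct, only_second_correct).
    """
    only_first = sum(1 for a, b, _ in pairs if a == 0 and b > 0)
    only_second = sum(1 for a, b, _ in pairs if a > 0 and b == 0)
    return only_first, only_second

def word_error_rates(pairs):
    """Return the WER in percent of the first and second system."""
    words = sum(w for _, _, w in pairs)
    first = 100.0 * sum(a for a, _, _ in pairs) / words
    second = 100.0 * sum(b for _, b, _ in pairs) / words
    return first, second

# Made-up data: the second system makes fewer word errors, but never
# turns an incorrect sentence into a fully correct one.
toy_pairs = [(3, 1, 12), (2, 1, 10), (0, 0, 8)]
print(mcnemar_discordant(toy_pairs))  # no discordant sentences
print(word_error_rates(toy_pairs))    # yet the word error rates differ
```

With no discordant sentences the McNemar test has nothing to work with, although one system clearly makes fewer word errors than the other.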

Maximilian Bisani

Last modified: Tue Jun 1 16:31:52 CEST 2004