EvalTrans Help

Introduction


EvalTrans is a tool for the evaluation of automatic translations.
It contains tools to maintain evaluation databases, to evaluate a translation corpus manually while showing the most similar translations in a database, and to extrapolate an evaluation by searching for the most similar translation in the database.

Load/save a database


The standard database format of EvalTrans is a proprietary but XML-conformant hierarchical format. (See the description of the XML database savefile format below.)

Its standard extension is ".etdb.xml".

Use the entries Open, Save and Save as in the Database menu to load and save the database in this format.

You can also export or import a database in the Aachen Evaluation Database format (i.e. "SOURCE source sentence" / "TARGET score target sentence") by using Export and Import, respectively, in this menu.

Load will always replace the database in memory with the selected database; Import will merge the selected database into the database in memory. When the definitions and/or scores in the database in memory and in the selected database differ, you will be asked which version to use.

To merge two (or more) databases, load the first one and import the second one.

It is also possible to Clear the database in memory by choosing the menu entry of the same name. Note that all changes since the last save will be lost.

Choose the old or new version of a database entry


When a database is imported, there can be sentence pairs where the scores or the information items differ. In these cases, you will be asked whether you want to keep the version from the database in memory (called the old version) or take the version from the selected database (the new version) - either once or for the whole import process.

Be careful when information item definitions differ, since then the scores for these information items will differ in their meaning, too - you have to correct them manually.

Source sentence list


Using Source / Show sentences, a list of all source sentences in the current database can be shown.

The first number in each line is the internal number of the source sentence. The leftmost button opens the target sentence list of this source sentence; the number of target sentences is shown on this button.

The right button shows the number of information items defined for this sentence. A click on this button will open the information item edit window.

To the left of each sentence there is a checkbox which can be selected. You can select all sentences at once by choosing Source / Select all; in the same way you can deselect all sentences, or invert the selection (select each sentence that is not selected and vice versa).

It is also possible to select those sentences that are listed in a file using Source / Select those in file. Since no sentences will be de-selected, you can perform this operation on several files. This is useful when you want to remove sentences you no longer need from the database: perform this operation on each source corpus file you still need, then invert the selection and delete the remaining sentences (Delete will remove all selected sentences from the database in memory). Another use of this feature is to limit database statistics operations to a certain part of the sentences.

Ok will close the window; Help will show this text.

Target sentence list


When you click on the target button to the right of a source sentence in the source sentence list window, another window will open, showing each target sentence of the selected source sentence. Sort order is the assigned score of the target sentence, i.e. the number in front of the sentence. A click on the arrow button next to the sentence will open the manual evaluation window for this sentence.

Operations for the selection, deletion etc. of target sentences are similar to those in the source sentence list (except that the appropriate menu is the Target menu). See source sentence list to read more.

You can compare two target sentences by selecting them and calling Target / Compare two sentences. If you choose Target / Compare weighted instead, you will be asked for a Levenshtein score file to be used for comparison.

The Ok button will close the window, Help shows this text.

Manual evaluation of a particular source/target sentence pair


The manual evaluation window allows the user to evaluate a source/target sentence pair by giving a score from 0 to 10, and by assigning an information entity error class to each information item of this sentence pair. A score of -1 means "not yet evaluated".

In the upper half of the window, the source and the target sentence are shown, each followed by its internal number.

Below this, the score buttons are located - select one by clicking on it.

TIP: Clicking a score button with the right mouse button also closes the window, so an extra click on the Ok button is not necessary.

Below this, a list of similar sentences is shown: the three most similar sentences of each score, if that many exist. The first sort order is the score of the target sentence, the second sort order is the similarity (which is calculated using the Levenshtein distance algorithm on the words - not the letters - of the sentences). A row of exclamation marks indicates the similarity between the listed sentence and the target sentence; the background color of the exclamation marks has the same meaning. Four exclamation marks mean that the sentences are identical (which usually means that you are looking at the sentence you want to evaluate).
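
The word-based similarity can be illustrated with a minimal Python sketch of a word-level Levenshtein (edit) distance; this is only an illustration of the idea, not the code used by EvalTrans:

def word_levenshtein(hyp, ref):
    # Edit distance on word lists: the minimal number of word
    # substitutions, insertions and deletions to turn hyp into ref.
    hyp, ref = hyp.split(), ref.split()
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (h != r))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1]

# Example: one word differs, so the distance is 1.
print(word_levenshtein("this is a test", "this was a test"))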

A click on the arrow button next to a listed sentence will open another evaluation window for this sentence.

While the mouse pointer is located over one of the listed target sentences, the box below the list will show the differences between this sentence and the current target sentence, as far as the mentioned Levenshtein algorithm recognizes them: a word marked green is "new" in this sentence, a yellow one has been "changed", and a red box indicates that a word has been "deleted".
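
The classification behind this marking can be sketched as follows; the sketch uses Python's difflib word alignment as a stand-in for the Levenshtein alignment, and the example sentences are hypothetical:

import difflib

def word_diff(current, listed):
    # Classify the word differences between the current target sentence
    # and a listed similar sentence (sketch of the green/yellow/red marking).
    a, b = current.split(), listed.split()
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "insert":
            print('"new" (green):     ', " ".join(b[j1:j2]))
        elif op == "replace":
            print('"changed" (yellow):', " ".join(b[j1:j2]))
        elif op == "delete":
            print('"deleted" (red):   ', " ".join(a[i1:i2]))

word_diff("the house is red", "the big house is green")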

Next, the state of the information items in this sentence can be selected.

Pressing Ok will accept these changes to the database, Cancel will abandon them. Note that in any case the sentence is added to the database if it was not already in there.

You can save the window position if you want future evaluation windows to appear (almost) at the same position.

Choose a test corpus to evaluate


To evaluate a test corpus manually or automatically, the source and target corpus files for this action have to be chosen. Either you enter a filename for each, or you click on the ... button next to the entry line to open a file browser.

The input format is one line per sentence, without any further information, in each file; corresponding sentences must be on corresponding lines. Ok will perform the action on these files, Cancel will abandon it.
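
Such a parallel test corpus can be read, for example, like this (a minimal sketch; the file names are placeholders):

# Read a parallel test corpus: one sentence per line,
# line n of the source file corresponds to line n of the target file.
with open("source.txt", encoding="utf-8") as src, \
     open("target.txt", encoding="utf-8") as tgt:
    pairs = [(s.strip(), t.strip()) for s, t in zip(src, tgt)]

for source_sentence, target_sentence in pairs:
    ...  # evaluate the sentence pair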

Evaluate a test corpus manually


After the test corpus has been chosen, the manual evaluation overview window opens. The upper half contains some statistics about the test corpus and its state of evaluation. Below this there is a list showing the evaluation status of each sentence pair of the test corpus.

Note that all sentences will be added to the database, even if no score has been assigned to them.

Automatic extrapolation of an evaluation


The automatic corpus evaluation extrapolation window is similar to the manual corpus evaluation window. In the upper half, some statistics are shown, including the average extrapolation distance, i.e. the average Levenshtein distance (normalized to the length of the target sentence). The evaluation status listbox below shows status information for each sentence pair.

If you want to use the weighted Levenshtein distance for extrapolation (which is more precise on larger databases), select the checkbox below the list and choose the Levenshtein score table file to be used.

Statistics during/from the evaluation




Evaluation information for single sentence pairs


During a manual or extrapolated evaluation, the list box in the lower half of the evaluation status window will show some information about each sentence pair:

Evaluation status for single sentence pairs


Multi reference word error rate


The multi reference word error rate (mWER) is the word error rate against the most similar reference sentence which has been evaluated as "perfect" (i.e. has been assigned a score of 10). It is calculated as Levenshtein operations per reference word (and can thus exceed 100%). The average mWER for an evaluation corpus (manual/extrapolated evaluation) is calculated word-wise, not sentence-wise.
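
Word-wise averaging weights each sentence by the length of its reference. A small worked example (the per-sentence operation counts and reference lengths are made up for illustration):

errors      = [2, 0, 5]   # Levenshtein operations per sentence (made up)
ref_lengths = [4, 10, 6]  # words in the closest perfect reference (made up)

# Word-wise averaging, as used for the corpus mWER:
mwer_word_wise = sum(errors) / sum(ref_lengths)   # 7 / 20 = 35.0%

# Sentence-wise averaging (not used) would weight every sentence equally:
mwer_sentence_wise = sum(e / l for e, l in zip(errors, ref_lengths)) / len(errors)
# (50.0% + 0.0% + 83.3%) / 3 = 44.4%

print(f"word-wise: {mwer_word_wise:.1%}  sentence-wise: {mwer_sentence_wise:.1%}")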

Information items


Each sentence transports several pieces of information. In this software, these pieces of information are called "information items".

It can ease the evaluation of a source/target sentence pair (and improve the comparability of the evaluation) if the user is obliged to check whether each information item has been transferred correctly. For this reason, and for possible extensions using this data, it is possible to specify the information items in a sentence by clicking the Information Item button in the source sentence window.

A new information item can be added using the new button, and a selected item can be deleted (delete) or changed (edit).

You can also re-order the information items by selecting one and moving it up or down.

Be careful when changing or deleting information item definitions: There can be a lot of confusion if this information item has been evaluated during a former evaluation.

During a manual evaluation, the user can select for each information item of the source sentence whether it has been transferred correctly, is missing, has been replaced by a synonym, has a changed meaning, or contains some other error in the target sentence.

The score of a translation


Each source/target sentence pair is assigned a score by the user. This score is an integer in the range of -1 to 10, where -1 means "not yet evaluated" and 10 means a perfect translation.

Generate an evaluation report


After a manual or an extrapolated evaluation of a test corpus has been performed, an evaluation report can be created by clicking on the report button. The content of this report, which will be written into the chosen file, is more or less identical to the content of the (manual/automatic) evaluation window. You can choose whether or not you want to include the evaluated sentences literally.

Check consistency of the database: extrapolation consistency


As the quality of an extrapolated evaluation depends directly on the quality of the database, a measure for the database quality with respect to extrapolation purposes can be calculated (Source/Test extrapolation consistency):

For each target sentence of each selected source sentence, an extrapolated evaluation will be performed using the whole database except for the sentence itself. The extrapolated score will be compared to the manually assigned score for the sentence. The resulting average absolute and signed differences indicate how much an extrapolated evaluation score would differ from a manually performed one (absolute difference) and in which direction (signed difference). The absolute estimation error |EE| indicates the same for the SSER (subjective sentence error rate).
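
The check can be sketched roughly as follows. This is a simplified Python illustration which assumes that extrapolation simply copies the score of the most similar remaining database sentence, with difflib's ratio as a stand-in for the Levenshtein-based similarity:

import difflib

def extrapolate(sentence, references):
    # references: list of (target_sentence, manual_score) pairs.
    # Copy the score of the most similar remaining reference
    # (simplified stand-in for the Levenshtein-based extrapolation).
    best = max(references,
               key=lambda r: difflib.SequenceMatcher(
                   None, sentence.split(), r[0].split()).ratio())
    return best[1]

def extrapolation_consistency(database):
    # database: list of (target_sentence, manual_score) pairs.
    diffs = []
    for i, (sent, true_score) in enumerate(database):
        rest = database[:i] + database[i + 1:]   # leave the sentence itself out
        diffs.append(extrapolate(sent, rest) - true_score)
    avg_abs = sum(abs(d) for d in diffs) / len(diffs)   # how far off, on average
    avg_signed = sum(diffs) / len(diffs)                # in which direction
    return avg_abs, avg_signed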

Optionally, all pairs of extrapolated and true score can be printed out into a file for further analysis.

Check consistency of the database: multi reference word error rate consistency


Similar to the extrapolation consistency check, the correlation between the multi-ref WER and the assigned score can be analysed (Source/Test WER consistency): For each target sentence of each selected source sentence, the multi reference word error rate will be calculated using the database's reference sentences (without the sentence itself). All pairs of WER and assigned score will be printed out into a file, which can be used for further analysis.

XML database savefile format


A database saved in this format looks as follows:

<database>
<version_id>$Id: EvalTransHelp.html,v 1.15 2000/07/28 14:03:06 leusch Exp $</version_id>
  <source>
    <s_sent> First source sentence </s_sent>
    <ielist>
      <iedef id="0"> First information item definition </iedef>
      <iedef id="1"> Second information item definition </iedef>

[... more information item definitions ...]

    </ielist>
    <targets>
     <tgt><t_sent> First target sentence </t_sent>
      <eval val=" Score of this sentence "/>
      <ie id="0" val=" ok|miss|syn|mean|oth "/>

 [... more information item error classes ...]

     </tgt>

 [... more target sentences ...]

    </targets>
   </source>

 [... more source sentences ...]

</database>
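
A database in this format could be read, for example, with Python's standard XML parser. This is a minimal sketch based on the structure shown above; the file name is a placeholder:

import xml.etree.ElementTree as ET

# Parse an .etdb.xml database and print its contents.
root = ET.parse("example.etdb.xml").getroot()        # the <database> element
for source in root.findall("source"):
    print("SOURCE:", source.findtext("s_sent", "").strip())
    items = {ie.get("id"): ie.text.strip()
             for ie in source.findall("ielist/iedef")}
    for tgt in source.findall("targets/tgt"):
        score = tgt.find("eval").get("val")
        print("  TARGET (score %s):" % score, tgt.findtext("t_sent", "").strip())
        for ie in tgt.findall("ie"):
            print("    item %s: %s" % (items.get(ie.get("id")), ie.get("val")))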

Aachen Evaluation Database savefile format


A database exported in this format looks as follows:

SOURCE [Source sentence]

TARGET [Score] [First target sentence]

SOURCE [Source sentence]

TARGET [Score] [Second target sentence]

[... more sentence pairs ...]

The order of the sentences does not matter.
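
A simple reader for this format could look like this (a sketch; the file name is a placeholder, and the score is assumed to be the integer described above):

# Read an Aachen Evaluation Database file into (source, score, target) triples.
triples = []
with open("database.txt", encoding="utf-8") as f:
    source = None
    for line in f:
        line = line.strip()
        if line.startswith("SOURCE "):
            source = line[len("SOURCE "):]
        elif line.startswith("TARGET ") and source is not None:
            score, _, target = line[len("TARGET "):].partition(" ")
            triples.append((source, int(score), target))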


Gregor Leusch
Last modified: Thu Jul 13 11:56:06 CEST