EvalTrans Help

Introduction


EvalTrans is a tool for the evaluation of automatic translations.
It contains tools to maintain evaluation databases, to evaluate a translation corpus manually while showing the most similar translations in a database, and to extrapolate an evaluation by searching for the most similar translation in the database.

Load/save a database


The standard database format of EvalTrans is a proprietary but XML-conformant hierarchical format. (See the description of the XML database savefile format below.)

Its standard extension is ".etdb.xml".

Use the entries Open, Save and Save as in the Database menu to load and save the database in this format.

You can also export or import a database in the Aachen Evaluation Database format (i.e. "SOURCE source sentence" / "TARGET score target sentence") by using Export and Import, respectively, in this menu.

Load will always replace the database in memory with the selected database; Import will merge the selected database into the database in memory. When the definitions and/or scores in the database in memory and in the selected database differ, you will be asked which version to use.

To merge two (or more) databases, load the first one and import the second one.

It is also possible to Clear the database in memory by choosing the menu entry of the same name. Note that all changes since the last save will be lost.

Choose the old or new version of a database entry


When a database is imported, there can be sentence pairs where the scores or the information items differ. In these cases, you will be asked whether you want to keep the version from the database in memory (called the old version) or take the version from the selected database (the new version) - either once or for the whole import process.

Be careful when information item definitions differ, since then the scores for these information items will differ in their meaning, too - you have to correct them manually.

Source sentence list


Using Source / Show sentences, a list of all source sentences in the current database can be shown.

The first number in each line is the internal number of the source sentence. The leftmost button opens the target sentence list of this source sentence; the number of target sentences is shown on this button.

The right button shows the number of information items defined for this sentence. A click on this button will open the information item edit window.

To the left of each sentence there is a checkbox which can be selected. You can select all sentences at once by choosing Source / Select all; in the same way you can deselect all sentences, or invert the selection (select each sentence that is not selected and vice versa).

It is also possible to select those sentences that are listed in a file using Source / Select those in file. Since no sentences will be de-selected, you can perform this operation on several files. This is useful when you want to remove sentences you no longer need from the database: perform this operation on each source corpus file you still need, then invert the selection and delete the remaining sentences (Delete will remove all selected sentences from the database in memory). Another use of this feature is to limit database statistics operations to a certain part of the sentences.

Ok will close the window; Help will show this text.

Target sentence list


When you click on the target button to the right of a source sentence in the source sentence list window, another window will open, showing each target sentence of the selected source sentence. Sort order is the assigned score of the target sentence, i.e. the number in front of the sentence. A click on the arrow button next to the sentence will open the manual evaluation window for this sentence.

Operations for the selection, deletion etc. of target sentences are similar to those in the source sentence list (except that the appropriate menu is the Target menu). See source sentence list to read more.

You can compare two target sentences by selecting them and calling Target / Compare two sentences. If you choose Target / Compare weighted instead, you will be asked for a Levenshtein score file to be used for comparison.

The Ok button will close the window, Help shows this text.

Manual evaluation of a particular source/target sentence pair


The manual evaluation window allows the user to evaluate a source/target sentence pair by giving a score from 0 to 10, and by assigning an information entity error class to each information item of this sentence pair. A score of -1 means "not yet evaluated".

In the upper half of the window, the source and the target sentence are shown, each followed by its internal number.

Below this, the score buttons are located - select one by clicking on it.

TIP: Clicking a score button with the right mouse button also closes the window, so an extra click on the Ok button is not necessary.

Below this, a list of similar sentences is shown: the three most similar sentences of each score, if that many exist. The first sort order is the score of the target sentence, the second sort order is the similarity (which is calculated using the Levenshtein distance algorithm on the words - not the letters - of the sentences). A row of exclamation marks indicates the similarity between the listed sentence and the target sentence; the background color of the exclamation marks has the same meaning. Four exclamation marks mean that the sentences are identical (which usually means that you are looking at the sentence you want to evaluate).
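
The word-based similarity can be illustrated with a minimal Python sketch of a word-level Levenshtein (edit) distance; this is only an illustration of the idea, not the code used by EvalTrans:

def word_levenshtein(hyp, ref):
    # Edit distance on word lists: the minimal number of word
    # substitutions, insertions and deletions to turn hyp into ref.
    hyp, ref = hyp.split(), ref.split()
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (h != r))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1]

# Example: one word differs, so the distance is 1.
print(word_levenshtein("this is a test", "this was a test"))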

A click on the arrow button next to a listed sentence will open another evaluation window for this sentence.

While the mouse pointer is located over one of the listed target sentences, the box below the list will show the differences between this sentence and the current target sentence, as far as the mentioned Levenshtein algorithm recognizes them: a word marked green is "new" in this sentence, a yellow one has been "changed", and a red box indicates that a word has been "deleted".
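
The classification behind this marking can be sketched as follows; the sketch uses Python's difflib word alignment as a stand-in for the Levenshtein alignment, and the example sentences are hypothetical:

import difflib

def word_diff(current, listed):
    # Classify the word differences between the current target sentence
    # and a listed similar sentence (sketch of the green/yellow/red marking).
    a, b = current.split(), listed.split()
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "insert":
            print('"new" (green):     ', " ".join(b[j1:j2]))
        elif op == "replace":
            print('"changed" (yellow):', " ".join(b[j1:j2]))
        elif op == "delete":
            print('"deleted" (red):   ', " ".join(a[i1:i2]))

word_diff("the house is red", "the big house is green")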

Next, the state of the information items in this sentence can be selected.

Pressing Ok will accept these changes to the database, Cancel will abandon them. Note that in any case the sentence is added to the database if it was not already in there.

You can save the window position if you want future evaluation windows to appear (almost) at the same position.

Choose a test corpus to evaluate


To evaluate a test corpus manually or automatically, the source and target corpus files for this action have to be chosen. Either you enter a filename for each, or you click on the ... button next to the entry line to open a file browser.

The input format is one line per sentence, without any further information, in each file; corresponding sentences must be on corresponding lines. Ok will perform the action on these files, Cancel will abandon it.
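
Such a parallel test corpus can be read, for example, like this (a minimal sketch; the file names are placeholders):

# Read a parallel test corpus: one sentence per line,
# line n of the source file corresponds to line n of the target file.
with open("source.txt", encoding="utf-8") as src, \
     open("target.txt", encoding="utf-8") as tgt:
    pairs = [(s.strip(), t.strip()) for s, t in zip(src, tgt)]

for source_sentence, target_sentence in pairs:
    ...  # evaluate the sentence pair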

Evaluate a test corpus manually


After the test corpus has been chosen, the manual evaluation overview window opens. The upper half contains some statistics about the test corpus and its state of evaluation. Below this there is a list showing the evaluation status of each sentence pair of the test corpus.

Note that all sentences will be added to the database, even if no score has been assigned to them.

Automatic extrapolation of an evaluation


The automatic corpus evaluation extrapolation window is similar to the manual corpus evaluation window. In the upper half, some statistics are shown, including the average extrapolation distance, i.e. the average Levenshtein distance (normalized to the length of the target sentence). The evaluation status listbox below shows status information for each sentence pair.

If you want to use the weighted Levenshtein distance for extrapolation (which is more precise on larger databases), select the checkbox below the list and choose the Levenshtein score table file to be used.

Statistics during/from the evaluation




Evaluation information for single sentence pairs


During a manual or extrapolated evaluation, the list box in the lower half of the evaluation status window will show some information about each sentence pair:

Evaluation status for single sentence pairs


Multi reference word error rate


The multi reference word error rate (mWER) is the word error rate against the most similar reference sentence which has been evaluated as "perfect" (i.e. has been assigned a score of 10). It is calculated as Levenshtein operations per reference word (and can thus exceed 100%). The average mWER for an evaluation corpus (manual/extrapolated evaluation) is calculated word-wise, not sentence-wise.
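
Word-wise averaging weights each sentence by the length of its reference. A small worked example (the per-sentence operation counts and reference lengths are made up for illustration):

errors      = [2, 0, 5]   # Levenshtein operations per sentence (made up)
ref_lengths = [4, 10, 6]  # words in the closest perfect reference (made up)

# Word-wise averaging, as used for the corpus mWER:
mwer_word_wise = sum(errors) / sum(ref_lengths)   # 7 / 20 = 35.0%

# Sentence-wise averaging (not used) would weight every sentence equally:
mwer_sentence_wise = sum(e / l for e, l in zip(errors, ref_lengths)) / len(errors)
# (50.0% + 0.0% + 83.3%) / 3 = 44.4%

print(f"word-wise: {mwer_word_wise:.1%}  sentence-wise: {mwer_sentence_wise:.1%}")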

Information items


Each sentence transports several pieces of information. In this software, these pieces of information are called "information items".

It can ease the evaluation of a source/target sentence pair (and improve the comparability of the evaluation) if the user is obliged to check whether each information item has been transferred correctly. For this reason, and for possible extensions using this data, it is possible to specify the information items in a sentence by clicking the Information Item button in the source sentence window.

A new information item can be added using the new button, and a selected item can be deleted (delete) or changed (edit).

You can also re-order the information items by selecting one and moving it up or down.

Be careful when changing or deleting information item definitions: There can be a lot of confusion if this information item has been evaluated during a former evaluation.

During a manual evaluation, the user can select for each information item of the source sentence whether it has been transferred correctly, is missing, has been replaced by a synonym, has a changed meaning, or contains some other error in the target sentence.

The score of a translation


Each source/target sentence pair is assigned a score by the user. This score is an integer in the range of -1 to 10, where -1 means "not yet evaluated" and 10 means a perfect translation.

Generate an evaluation report


After a manual or an extrapolated evaluation of a test corpus has been performed, an evaluation report can be created by clicking on the report button. The content of this report, which will be written into the chosen file, is more or less identical to the content of the (manual/automatic) evaluation window. You can choose whether or not you want to include the evaluated sentences literally.

Check consistency of the database: extrapolation consistency


As the quality of an extrapolated evaluation depends directly on the quality of the database, a measure for the database quality with respect to extrapolation purposes can be calculated (Source/Test extrapolation consistency):

For each target sentence of each selected source sentence, an extrapolated evaluation will be performed using the whole database except for the sentence itself. The extrapolated score will be compared to the manually assigned score for the sentence. The resulting average absolute and signed differences indicate how much an extrapolated evaluation score would differ from a manually performed one (absolute difference) and in which direction (signed difference). The absolute estimation error |EE| indicates the same for the SSER (subjective sentence error rate).
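
The check can be sketched roughly as follows. This is a simplified Python illustration which assumes that extrapolation simply copies the score of the most similar remaining database sentence, with difflib's ratio as a stand-in for the Levenshtein-based similarity:

import difflib

def extrapolate(sentence, references):
    # references: list of (target_sentence, manual_score) pairs.
    # Copy the score of the most similar remaining reference
    # (simplified stand-in for the Levenshtein-based extrapolation).
    best = max(references,
               key=lambda r: difflib.SequenceMatcher(
                   None, sentence.split(), r[0].split()).ratio())
    return best[1]

def extrapolation_consistency(database):
    # database: list of (target_sentence, manual_score) pairs.
    diffs = []
    for i, (sent, true_score) in enumerate(database):
        rest = database[:i] + database[i + 1:]   # leave the sentence itself out
        diffs.append(extrapolate(sent, rest) - true_score)
    avg_abs = sum(abs(d) for d in diffs) / len(diffs)   # how far off, on average
    avg_signed = sum(diffs) / len(diffs)                # in which direction
    return avg_abs, avg_signed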

Optionally, all pairs of extrapolated and true score can be printed out into a file for further analysis.

Check consistency of the database: multi reference word error rate consistency


Similar to the extrapolation consistency check, the correlation between the multi-ref WER and the assigned score can be analysed (Source/Test WER consistency): For each target sentence of each selected source sentence, the multi reference word error rate will be calculated using the database's reference sentences (without the sentence itself). All pairs of WER and assigned score will be printed out into a file, which can be used for further analysis.

XML database savefile format


A database saved in this format looks as follows:

<database>
<version_id>$Id: EvalTransHelp.html,v 1.15 2000/07/28 14:03:06 leusch Exp $</version_id>
  <source>
    <s_sent> First source sentence </s_sent>
    <ielist>
      <iedef id="0"> First information item definition </iedef>
      <iedef id="1"> Second information item definition </iedef>

[... more information item definitions ...]

    </ielist>
    <targets>
     <tgt><t_sent> First target sentence </t_sent>
      <eval val=" Score of this sentence "/>
      <ie id="0" val=" ok|miss|syn|mean|oth "/>

 [... more information item error classes ...]

     </tgt>

 [... more target sentences ...]

    </targets>
   </source>

 [... more source sentences ...]

</database>
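
A database in this format could be read, for example, with Python's standard XML parser. This is a minimal sketch based on the structure shown above; the file name is a placeholder:

import xml.etree.ElementTree as ET

# Parse an .etdb.xml database and print its contents.
root = ET.parse("example.etdb.xml").getroot()        # the <database> element
for source in root.findall("source"):
    print("SOURCE:", source.findtext("s_sent", "").strip())
    items = {ie.get("id"): ie.text.strip()
             for ie in source.findall("ielist/iedef")}
    for tgt in source.findall("targets/tgt"):
        score = tgt.find("eval").get("val")
        print("  TARGET (score %s):" % score, tgt.findtext("t_sent", "").strip())
        for ie in tgt.findall("ie"):
            print("    item %s: %s" % (items.get(ie.get("id")), ie.get("val")))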

Aachen Evaluation Database savefile format


A database exported in this format looks as follows:

SOURCE [Source sentence]

TARGET [Score] [First target sentence]

SOURCE [Source sentence]

TARGET [Score] [Second target sentence]

[... more sentence pairs ...]

The order of the sentences does not matter.
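
A simple reader for this format could look like this (a sketch; the file name is a placeholder, and the score is assumed to be the integer described above):

# Read an Aachen Evaluation Database file into (source, score, target) triples.
triples = []
with open("database.txt", encoding="utf-8") as f:
    source = None
    for line in f:
        line = line.strip()
        if line.startswith("SOURCE "):
            source = line[len("SOURCE "):]
        elif line.startswith("TARGET ") and source is not None:
            score, _, target = line[len("TARGET "):].partition(" ")
            triples.append((source, int(score), target))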


Gregor Leusch
Last modified: Thu Jul 13 11:56:06 CEST