RWTH German Fingerspelling Database
Our database is freely available.
Use it as you wish, but please cite us if you do:
P. Dreuw, T. Deselaers, D. Keysers, and H. Ney. Modeling Image Variability in Appearance-Based Gesture Recognition. In ECCV Workshop on Statistical Methods in Multi-Image and Video Processing (ECCV-SMVP), pages 7-18, Graz, Austria, May 2006.
You can download our database and the scripts we used here.
Please send me an email if you have any questions.
We also have further databases that can be used for sign language recognition.
In the course of my diploma thesis, Appearance-Based Gesture Recognition, a new database of fingerspelling letters of German Sign Language (Deutsche Gebärdensprache, DGS) was created. The RWTH gesture database contains 35 gestures, with video sequences for the signs A to Z and SCH, the German umlauts Ä, Ö, and Ü, and the numbers 1 to 5. Five of the gestures contain inherent motion (J, Z, Ä, Ö, and Ü).
The recordings were made under non-uniform daylight lighting conditions, the background and the camera viewpoints were not constant, and the persons had no restrictions on their clothing while gesturing.
The database consists of 1400 image sequences containing gestures of 20 different persons. Each person had to sign each gesture twice, on two different days. The gestures were recorded by two different cameras, one webcam and one camcorder, from two different points of view. Figure 6.4 shows the recording setup. The webcam recorded the sequences at a resolution of 320x240 pixels at 25 frames per second, the camcorder at a resolution of 352x288 pixels at 25 frames per second. The persons were not trained to perform the signs, so the gestures may deviate from the standard. For recording the gestures, we wrote a shell script that allowed us to record and convert gestures for any number of persons in a flexible and easy way. All videos were recorded in MPEG-4 DivX format using the freely available software MPlayer. The script makes it easy to integrate new recording devices or to change the recording resolution and frame rate.
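The original capture script is shell-based and built around MPlayer; purely as an illustration of the camera settings described above, here is a minimal Python/OpenCV sketch (not the original script; the device index, codec, and file name are assumptions):

```python
import cv2

# Webcam settings as described above: 320x240 at 25 frames per second
# (device index 0 is an assumption).
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 320)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 240)

# MPEG-4 output, comparable to the DivX recordings of the database.
fourcc = cv2.VideoWriter_fourcc(*"XVID")
out = cv2.VideoWriter("gesture.avi", fourcc, 25.0, (320, 240))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(frame)
    cv2.imshow("recording", frame)
    # Stop on RETURN (key code 13), as in the recording protocol below.
    if (cv2.waitKey(1) & 0xFF) == 13:
        break

cap.release()
out.release()
cv2.destroyAllWindows()
```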
We also wrote another shell script to convert the recorded videos into single image files. For each person, session, and camera, a sequence file was generated that lists all images belonging to that sequence. We chose the PNG image format with a high compression factor, but this may be changed to any other setting. These two scripts are also available online.
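For illustration, a minimal Python/OpenCV sketch of this conversion step (an illustrative re-implementation, not the original shell script; file and directory names are assumptions):

```python
import os
import cv2

def video_to_frames(video_path, out_dir):
    # Split a recorded video into single PNG images and write a sequence
    # file listing all images belonging to this recording.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    names = []
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        name = os.path.join(out_dir, "frame_%04d.png" % i)
        # PNG compression level 9 = highest compression factor, as chosen above.
        cv2.imwrite(name, frame, [cv2.IMWRITE_PNG_COMPRESSION, 9])
        names.append(name)
        i += 1
    cap.release()
    with open(os.path.join(out_dir, "sequence.txt"), "w") as f:
        f.write("\n".join(names))
```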
Before recording, each proband was asked whether he agreed to make his sequences publicly available. It was clearly stated that he could abandon the recording session at any time. After a short explanation of the procedure, he had to sign a letter of agreement. This is an important step when recording probands with cameras: on the one hand, the proband knows exactly what will happen with his recordings; on the other hand, he cannot object in hindsight to the publication of the complete database. A more detailed overview of usability evaluation and of working with probands can be found in [Nielsen 00] and [Schweibenz & Thissen 02].
For each gesture, an example video was shown before recording; the proband could view this video as often as he wanted. He then started the recording by hitting the RETURN key and stopped it by hitting it again. His recording was then played back so that it could be compared with the reference example, and he could re-record each gesture as often as he wanted; a sketch of this loop is given below. One recording session took between 10 and 20 minutes. The varying lighting conditions, and the fact that the hand is sometimes located in front of the face, make the hand difficult to track and extract. No instructions concerning clothing or jewellery such as rings, bracelets, or watches were given. We deliberately recorded such a difficult database so that we can later build an online recognition system that works without constraints.
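A minimal Python sketch of this session loop, assuming MPlayer for playback (the player actually used for the recordings) and taking the capture command as a parameter; all invocation details are assumptions, not the original script:

```python
import subprocess

def play(path):
    # Play a video with MPlayer; "-really-quiet" suppresses console output.
    subprocess.run(["mplayer", "-really-quiet", path])

def record_session(example_video, record_cmd, out_path):
    # Show the reference example, record on RETURN, replay the recording,
    # and let the proband re-record as often as he wants.
    while True:
        play(example_video)                      # the example may be watched again
        input("Press RETURN to start recording...")
        subprocess.run(record_cmd + [out_path])  # capture command ending in the output file
        play(out_path)                           # compare with the reference example
        if input("Keep this recording? [y/n] ").strip() == "y":
            return out_path
```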
Results
Using a CamShift tracker on the RWTH gesture database to extract the original images thresholded by their skin probability, we reduced the error rate from 87.1% to 44.0%. With the first time derivative of the skin-probability-thresholded images as feature, in combination with tracking, the error rate improved from 72.1% to 46.2%. This shows the need for a tracking system, or for a feature extraction method that is more position- and scale-independent.
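For illustration, a minimal Python/OpenCV sketch of this preprocessing (not the thesis implementation; the hue/saturation skin histogram and the threshold value are assumptions):

```python
import cv2
import numpy as np

def skin_threshold(frame, skin_hist):
    # Back-project a hue/saturation histogram trained on skin samples to get
    # a per-pixel skin probability map, then keep only high-probability pixels.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0, 1], skin_hist, [0, 180, 0, 256], 1)
    mask = (prob > 32).astype(frame.dtype)       # threshold value is an assumption
    return frame * mask[..., None], prob

def camshift_step(prob, window):
    # One CamShift iteration on the skin probability map; returns the
    # updated search window around the hand.
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    _, window = cv2.CamShift(prob, window, criteria)
    return window

def appearance_feature(frame, window, size=32):
    # Crop the tracked hand region and scale it to the feature size.
    x, y, w, h = window
    return cv2.resize(frame[y:y+h, x:x+w], (size, size))
```

The first time derivative feature is then simply the difference of consecutive such feature frames.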
Using a two-sided tangent distance, we improved the error rate once again, to the currently best result of 35.7%. This shows the advantage of distance measures that are invariant against affine transformations, and the possibility of recognizing sign language with simple appearance-based features.
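As a worked illustration, here is a minimal NumPy sketch of tangent distance restricted to translation tangents (the thesis considers the full set of affine transformations, which adds tangent vectors for rotation, scaling, and shearing):

```python
import numpy as np

def translation_tangents(img):
    # Tangent vectors of the image manifold with respect to horizontal and
    # vertical shifts, approximated by the spatial image gradients.
    gy, gx = np.gradient(img.astype(np.float64))
    return np.stack([gx.ravel(), gy.ravel()], axis=1)   # shape: (pixels, 2)

def one_sided_tangent_distance(x, mu):
    # min over a of ||x - (mu + T a)||^2, with T the tangents at the reference mu.
    T = translation_tangents(mu)
    r = (x.astype(np.float64) - mu.astype(np.float64)).ravel()
    a, *_ = np.linalg.lstsq(T, r, rcond=None)
    return float(np.sum((r - T @ a) ** 2))

def two_sided_tangent_distance(x, mu):
    # The two-sided variant additionally minimizes over shifts of the
    # observation x, so both tangent sets enter the least-squares problem.
    T = np.concatenate([translation_tangents(mu), -translation_tangents(x)], axis=1)
    r = (x.astype(np.float64) - mu.astype(np.float64)).ravel()
    a, *_ = np.linalg.lstsq(T, r, rcond=None)
    return float(np.sum((r - T @ a) ** 2))
```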
With the same features scaled to 16x16, we achieved an error rate of 46.0% with one-sided tangent distance and 42.5% with two-sided tangent distance; the latter is even better than using 32x32 original image features without tangent distance.
We could also improve the error rate obtained with the first time derivative of the skin-probability-thresholded images: with two-sided tangent distance, it dropped from 46.2% to 44.1%.
The confusion matrix was obtained using two-sided tangent distance on the RWTH gesture database, with the skin-probability-thresholded original images as features. The error rate table below lists all results achieved on this database so far.
[Confusion matrix, two-sided tangent distance: per-class results, given as correct/incorrect (C/I) out of 20 test sequences per sign and the resulting error rate.]

Sign | C/I | ER[%] | Sign | C/I | ER[%] | Sign | C/I | ER[%]
A | 8/12 | 60.0 | M | 10/10 | 50.0 | Y | 15/5 | 25.0
B | 11/9 | 45.0 | N | 11/9 | 45.0 | Z | 16/4 | 20.0
C | 12/8 | 40.0 | O | 9/11 | 55.0 | AE | 14/6 | 30.0
D | 5/15 | 75.0 | P | 16/4 | 20.0 | OE | 14/6 | 30.0
E | 17/3 | 15.0 | Q | 16/4 | 20.0 | UE | 10/10 | 50.0
F | 12/8 | 40.0 | R | 5/15 | 75.0 | SCH | 18/2 | 10.0
G | 17/3 | 15.0 | S | 3/17 | 85.0 | Eins | 15/5 | 25.0
H | 16/4 | 20.0 | T | 14/6 | 30.0 | Zwei | 14/6 | 30.0
I | 8/12 | 60.0 | U | 5/15 | 75.0 | Drei | 14/6 | 30.0
J | 13/7 | 35.0 | V | 17/3 | 15.0 | Vier | 15/5 | 25.0
K | 17/3 | 15.0 | W | 16/4 | 20.0 | Fuenf | 18/2 | 10.0
L | 14/6 | 30.0 | X | 15/5 | 25.0 | | |
Feature | Feature Size | Distance | ER[%]
Original thresholded by skin color prob. | 32x32 | Euclidean | 44.0
Original thresholded by skin color prob. | 32x32 | One-sided tangent | 39.4
Original thresholded by skin color prob. | 32x32 | Two-sided tangent | 35.7
Original thresholded by skin color prob. | 16x16 | One-sided tangent | 46.0
Original thresholded by skin color prob. | 16x16 | Two-sided tangent | 42.5
First time derivative of orig. thresholded by skin color prob. | 32x32 | Euclidean | 46.2
First time derivative of orig. thresholded by skin color prob. | 32x32 | Two-sided tangent | 44.1