Seminar "Selected Topics in Human Language Technology and Pattern Recognition"

In the Summer Term 2018 the Lehrstuhl Informatik 6 will host a seminar entitled "Selected Topics in Human Language Technology and Pattern Recognition".

Registration for the seminar

Registration for the seminar is only possible online via the central registration page from Friday, Jan. 19 to Friday, Feb. 02, 2018. A link can also be found on the Computer Science Department's homepage.

Prerequisites for participation in the seminar

Seminar format and important dates

Please note the following deadlines:

Note: failure to meet deadlines, unexcused absence from compulsory sessions (presentations and the preliminary meeting, as announced by email to each participating student), or dropping out of the seminar more than 3 weeks after the preliminary meeting/topic distribution results in the grade 5.0 (not appeared).

Topics, relevant references and participants

    1. Speaker Diarization

      1. Methods (Engelke; Supervisor: Wilfried Michel)
        Initial References:
        • M.H. Moattar and M.M. Homayounpour, "A review on speaker diarization systems and approaches," Speech Communication, Volume 54, Issue 10, 2012.
        • Q. Wang, C. Downey, L. Wan, P.A. Mansfield and I.L. Moreno, "Speaker Diarization with LSTM," arXiv:1710.10468 [eess.AS], 2018.

      2. Applications and Challenges (Thull; Supervisor: Wilfried Michel)
        Initial References:
        • T. L. Nwe, H. Sun, H. Li and S. Rahardja, "Speaker diarization in meeting audio," 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, 2009, doi: 10.1109/ICASSP.2009.4960523.
        • K. Church, W. Zhu, J. Vopicka, J. Pelecanos, D. Dimitriadis and P. Fousek, "Speaker diarization: A perspective on challenges and opportunities from theory to practice," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, doi: 10.1109/ICASSP.2017.7953098.

    2. Speaker Separation

      1. Deep Clustering
        Initial References:
        • John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe: "Deep Clustering: Discriminative Embeddings for Segmentation and Separation," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, March 20-25, 2016.
        • Zhuo Chen, Yi Luo, Nima Mesgarani: "Deep Attractor Network for Single-Microphone Speaker Separation," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, March 2017, pp. 246-250.

      2. Permutation invariant training
        Initial References:
        • Dong Yu, Morten Kolbaek, Zheng-Hua Tan, Jesper Jensen: "Permutation Invariant Training of Deep Models for Speaker-Independent Multi-Talker Speech Separation," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, March 2017.
        • Yanmin Qian, Xuankai Chang, Dong Yu: "Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training," submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing, arXiv:1707.06527.

    3. Speaker Identification

      1. Speaker Recognition
        Initial References:
        • A Novel Scheme for Speaker Recognition Using a Phonetically-Aware Deep Neural Network, ICASSP 2014.
        • Deep Neural Network Approaches to Speaker and Language Recognition, IEEE Signal Processing Letters 2015.

      2. Named Entity Recognition
        Initial References:
        • Neural Architectures for Named Entity Recognition, NAACL 2016.
        • End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, ACL 2016.

    4. Sentiment Analysis of Text

      1. Document/Sentence Level Sentiment Analysis
        Initial References:
        • Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, "Hierarchical Attention Networks for Document Classification," NAACL-HLT 2016.
        • X. Wang, W. Jiang, Z. Luo, "Combination of Convolutional and Recurrent Neural Network for Sentiment Analysis of Short Texts," COLING 2016.

      2. Aspect Level Sentiment Analysis
        Initial References:
        • Y. Wang, M. Huang, L. Zhao, X. Zhu, "Attention-based LSTM for Aspect-level Sentiment Classification," EMNLP 2016.
        • P. Chen, Z. Sun, L. Bing, W. Yang, "Recurrent Attention Network on Memory for Aspect Sentiment Analysis," EMNLP 2017.

    5. Sentiment Analysis from Audio

      1. Emotion detection
        Initial References:
        • G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 5200-5204.
        • Ghosh, S., Laksana, E., Morency, L.P. and Scherer, S., 2016, September. Representation Learning for Speech Emotion Recognition. In INTERSPEECH (pp. 3603-3607).

      2. Multimodal sentiment analysis
        Initial References:
        • Poria, S., Cambria, E. and Gelbukh, A., 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 2539-2544).
        • Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, Maja Pantic, "A survey of multimodal sentiment analysis," Image and Vision Computing, Volume 65, 2017, Pages 3-14.

    6. Word Embeddings and Natural Language Understanding

      1. Word embeddings and their applications to natural language processing
        Initial References:
        • J. Pennington, R. Socher, and C. D. Manning. "GloVe: Global Vectors for Word Representation," in Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Doha, Qatar, October 2014.
        • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, "From Word Embeddings To Document Distances," in Proc. Int. Conf. on Machine Learning (ICML), pages 957-966, Lille, France, July 2015.

      2. Neural network based natural language understanding
        Initial References:
        • [Intent classification] S. Ravuri, and A. Stolcke, "Recurrent Neural Network and LSTM Models for Lexical Utterance Classification," in Proc. Interspeech, pages 135-139, Dresden, Germany, September 2015.
        • [Slot filling] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, "Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 3, March 2015, pages 530-539.

    7. Speech Synthesis

      1. Auto-regressive models
        Initial References:
        • Efficient Neural Audio Synthesis.
        • PixelCNN++.

      2. Inverse autoregressive flows
        Initial References:
        • Parallel WaveNet: Fast High-Fidelity Speech Synthesis.
        • Improving Variational Inference with Inverse Autoregressive Flow.

      3. End-to-end text-to-speech
        Initial References:
        • VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop.
        • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

    8. Speech Enhancement

      1. Speech Enhancement for Human Listeners
        Initial References:
        • X. Xu, R. Flynn, and M. Russell, "Speech intelligibility and quality: A comparative study of speech enhancement algorithms," in 2017 28th Irish Signals and Systems Conference (ISSC), 2017, pp. 1-6.
        • P. C. Loizou and G. Kim, "Reasons why Current Speech-Enhancement Algorithms do not Improve Speech Intelligibility and Suggested Solutions," IEEE Trans. Audio. Speech. Lang. Processing, vol. 19, no. 1, pp. 47-56, Jan. 2011.
        • Y. Xu, J. Du, L. Dai, and C. Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE Trans. Audio, Speech Lang. Process., vol. 23, no. 1, pp. 7-19, 2015.

      2. Speech Enhancement for ASR
        Initial References:
        • F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Roux, J. R. Hershey, and B. Schuller, "Speech Enhancement with LSTM Recurrent Neural Networks and Its Application to Noise-Robust ASR," in Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation - Volume 9237, 2015, pp. 91-99.
        • T. Ochiai, S. Watanabe, and S. Katagiri, "Does speech enhancement work with end-to-end ASR objectives?: Experimental analysis of multichannel end-to-end ASR," in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), 2017, pp. 1-6.

    9. Constituency Parsing

      1. Neural Network-based Parsing
        Initial References:
        • Danqi Chen and Christopher D. Manning, "A Fast and Accurate Dependency Parser using Neural Networks," Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2014.
        • Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews and Noah A. Smith, "Transition-Based Dependency Parsing with Stack Long Short-Term Memory," Proceedings of the Association for Computational Linguistics, ACL 2015.
        • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, "Attention Is All You Need," Advances in Neural Information Processing Systems, NIPS 2017.

      2. Universal Semantic Parsing
        Initial References:
        • Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer and Noah A. Smith, "Many Languages, One Parser," Transactions of the Association for Computational Linguistics, TACL 2016.
        • Raymond Hendy Susanto and Wei Lu, "Neural Architectures for Multilingual Semantic Parsing," Proceedings of the Association for Computational Linguistics, ACL 2017.
        • Long Duong, Trevor Cohn, Steven Bird and Paul Cook, "A Neural Network Model for Low-Resource Universal Dependency Parsing," Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2015.

    10. Text Summarization

      1. Extractive Text Summarization
        Initial References:
        • Text Summarization Techniques: A Brief Survey. Mehdi Allahyari, Seyedamin Pouriyeh et al. 2017.
        • Automatic Text Summarization (book). Torres-Moreno, Juan-Manuel, 2014. (RWTH Aachen Network)

      2. Abstractive Text Summarization (with Deep Learning)
        Initial References:
        • Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. Ramesh Nallapati, Bowen Zhou et al. 2016. CoNLL.
        • Get To The Point: Summarization with Pointer-Generator Networks. Abigail See, Peter J. Liu et al. 2017. ACL.

    11. Voice Activity Detection

      1. Challenges
        Initial References:
        • Damianos Karakos, Scott Novotney, Le Zhang, Rich Schwartz, "Model Adaptation and Active Learning in the BBN Speech Activity Detection System for the DARPA RATS program", Interspeech 2016.
        • Tomi Kinnunen, Alexey Sholokhov, Elie Khoury, Dennis Thomsen, Md Sahidullah, Zheng-Hua Tan, "HAPPY Team Entry to NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment i-Vector Based Speech Activity Detectors", Interspeech 2016.

      2. Google Home
        Initial References:
        • Fabio Vesperini, Paolo Vecchiotti, Emanuele Principi, Stefano Squartini, and Francesco Piazza, "Deep Neural Networks for Multi-Room Voice Activity Detection: Advancements and Comparative Evaluation", IJCNN 2016.
        • Shuo-Yiin Chang, Bo Li, Tara N. Sainath, Gabor Simko, Carolina Parada, "Endpoint Detection using Grid Long Short-Term Memory Networks for Streaming Speech Recognition", Interspeech 2017.

      3. Feature Selection
        Initial References:
        • Ruben Zazo, Tara N. Sainath, Gabor Simko, Carolina Parada, "Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection", Interspeech 2016.
        • Elie Khoury, Matt Garland, "I-Vectors for Speech Activity Detection", Odyssey 2016.
        • Longbiao Wang, Khomdet Phapatanaburi, Zeyan Oo, Seiichi Nakagawa, Masahiro Iwahashi, Jianwu Dang, "Phase Aware Deep Neural Network for Noise Robust Voice Activity Detection", ICME 2017.

      4. Audio-Visual Combination
        Initial References:
        • David Dov, Ronen Talmon and Israel Cohen, "Kernel Method for Speech Source Activity Detection in Multi-modal Signals", ICSEE 2016.
        • Ido Ariav, David Dov, Israel Cohen, "A deep architecture for audio-visual voice activity detection in the presence of transients", Signal Processing 142 (2018) p. 69-74.
        • Foteini Patrona, Alexandros Iosifidis, Anastasios Tefas, Nikolaos Nikolaidis and Ioannis Pitas, "Visual Voice Activity Detection in the Wild", IEEE TRANSACTIONS ON MULTIMEDIA, Vol. 18, No. 6, June 2016.

    12. Language Identification

      1. Language Identification
        Initial References:
        • Reviewing automatic language identification.
        • A covariance kernel for SVM language recognition (2008).
        • The MITLL NIST LRE 2009 language recognition system (2009).

      2. Language Identification with Deep Learning
        Initial References:
        • Automatic language identification using deep neural networks (2014).
        • Convolutional ANN: Deep learning for spoken language identification (2009).

Guidelines for the article and presentation

The article (roughly 20 pages) and the presentation slides (between 20 and 30) should be prepared in LaTeX; document templates for both are provided below, along with links to LaTeX documentation available online. Presentations consist of 30 to 40 minutes of presentation time followed by 15 minutes of discussion. Both the article and the slides must be submitted electronically in PDF format; other formats will not be accepted.



Inquiries should be directed to the respective supervisors or to:

Markus Kitza
RWTH Aachen University
Lehrstuhl Informatik 6
Ahornstr. 55
52074 Aachen

Room 6110