Documat


Factoid question answering for spoken documents

  • Author: Pere Ramon Comas Umbert
  • Thesis supervisors: Jordi Turmo (supervisor), Lluís Márquez i Villodre (supervisor)
  • Defence: Universitat Politècnica de Catalunya (UPC), Spain, 2012
  • Language: Spanish
  • Thesis examination committee: Horacio Rodríguez Hontoria (chair), Lluís Padró Cirera (secretary), José-Luis Vicedo González (member), Sophie Rosset (member), Maarten de Rijke (member)
  • Full text not available
  • Abstract
    • In this dissertation, we present a factoid question answering system, specifically tailored for Question Answering (QA) on spoken documents.

      This work explores, for the first time, which techniques can be robustly adapted from standard QA on written documents to the more difficult scenario of spoken documents. More specifically, we study new information retrieval (IR) techniques designed for speech, and we exploit several levels of linguistic information for the speech-based QA task: named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and coreference resolution.
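
      As an illustration of the kind of phonetic-level matching this involves, the following is a minimal sketch, not the system described in the thesis: query terms and transcript tokens are mapped to crude sound codes so that ASR-garbled proper names can still be matched. The sound_code() transform and the example transcript are illustrative assumptions.

```python
# Minimal sketch of phonetic keyword matching against a noisy ASR transcript.
# The sound_code() transform and the example transcript are illustrative
# assumptions, not the named-entity detection used in the thesis.

def sound_code(word: str) -> str:
    """Map a word to a crude phonetic code (Soundex-like)."""
    groups = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
              **dict.fromkeys("dt", "3"), "l": "4",
              **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    code = word[0]
    for ch in word[1:]:
        digit = groups.get(ch, "")
        if digit and code[-1] != digit:
            code += digit
    return (code + "000")[:4]

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def phonetic_matches(query_term: str, transcript: list[str], max_dist: int = 1):
    """Return transcript tokens whose phonetic code is close to the query term's."""
    q = sound_code(query_term)
    return [w for w in transcript if edit_distance(q, sound_code(w)) <= max_dist]

# ASR often garbles proper names; their phonetic codes may still line up.
print(phonetic_matches("Barroso", ["mister", "baroso", "said", "borosso"]))
```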

      Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages.
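
      As an illustration only, a minimal sketch of a learned pointwise ranker over answer candidates follows, using scikit-learn logistic regression over toy features; the feature set and training data are invented for the example and are not the ones used in the thesis.

```python
# Minimal sketch of a learned ranker for answer candidates, in the spirit of
# supervised answer extraction. The features and toy training data are
# illustrative assumptions, not the feature set used in the thesis.
from sklearn.linear_model import LogisticRegression

# Each candidate is described by simple features, e.g.:
# [keyword overlap with the question,
#  candidate is a named entity of the expected type (0/1),
#  distance in words to the closest question keyword]
X_train = [
    [0.9, 1, 2],   # correct answer in its passage
    [0.1, 0, 15],  # incorrect candidate
    [0.7, 1, 4],
    [0.2, 0, 9],
    [0.8, 0, 3],
    [0.3, 1, 12],
]
y_train = [1, 0, 1, 0, 0, 0]  # 1 = correct answer, 0 = incorrect

ranker = LogisticRegression().fit(X_train, y_train)

# At question time, score every extracted candidate and keep the best one.
candidates = {"1979": [0.85, 1, 3], "Strasbourg": [0.4, 0, 10]}
scores = {name: ranker.predict_proba([feats])[0][1]
          for name, feats in candidates.items()}
print(max(scores, key=scores.get))
```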

      As part of the work of this thesis, we have promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst, provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008 and 2009, thus helping the development of state-of-the-art techniques for this particular topic.

      The presented QA system and all its modules are extensively evaluated on the English corpus of European Parliament Plenary Sessions, composed of manual transcripts and of automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts.
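
      For reference, the word error rate mentioned above is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length: WER = (substitutions + deletions + insertions) / reference words. A minimal sketch with made-up sentences:

```python
# Minimal word error rate (WER) computation: word-level edit distance between
# the reference and the ASR hypothesis, normalized by the reference length.
# The example sentences are made up for illustration.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance (substitutions + deletions + insertions).
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1] / len(ref)

print(wer("the session is resumed", "the session was resumed"))  # 0.25
```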

      The main results confirm that syntactic information is very useful for learning to rank answer candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

