Negation and speculation detection in medical and review texts

Noa Patricia Cruz Díaz

Authors: Noa Patricia Cruz Díaz
Thesis supervisors: Manuel Jesús Maña López (supervisor)
Defense: At the Universidad de Huelva (Spain) in 2014
Language: English
Number of pages: 209
Thesis examination committee: Manuel de Buenaga Rodríguez (chair), Jacinto Mata Vázquez (secretary), Mariana Lara Neves (member)
Links
- Open-access thesis at: Arias Montano
Abstract
- Spanish (English translation)
  The detection of negation and speculation has been an active research area in recent years in the Natural Language Processing community, including some competitive tasks at relevant conferences. Indeed, many applications could benefit from the accurate identification of this kind of information (for example, interaction detection, information extraction, sentiment analysis). This thesis aims to contribute to ongoing research on negation and speculation in the Language Technology community through the development of machine-learning systems that identify negation and speculation cues and resolve their linguistic scope. By resolving the linguistic scope we mean identifying, at sentence level, the tokens affected by the cues. It focuses on the two domains in which negation and speculation have received the most attention: the biomedical domain and the review domain. In the former, the proposed method improves on the results to date for the clinical-document sub-collection of the BioScope corpus. In the latter, the novelty of the contribution lies in the fact that, to the best of our knowledge, this is the first system trained and evaluated on the Simon Fraser University review collection annotated with negative and speculative information, and at the same time it represents the first attempt to detect speculation in this domain. In addition, because of the tokenization problems encountered while preprocessing the BioScope document collection and the small number of studies in the literature that offer solutions to this problem, this thesis describes the issue in depth, providing a comprehensive analysis and evaluating several tokenization tools. This contribution is the first comparative evaluation study of tokenizers in the biomedical domain, which could help Natural Language Processing developers choose the best tokenization tool to use.
- English
  Negation and speculation detection has been an active research area in recent years in the Natural Language Processing community, including some shared tasks at relevant conferences. In fact, many applications can benefit from identifying this kind of information (e.g., interaction detection, information extraction, sentiment analysis). This thesis aims to contribute to the ongoing research on negation and speculation in the Language Technology community through the development of machine-learning systems which determine the speculation and negation cues and resolve their scope (i.e., identify at sentence level which tokens are affected by the cues). It focuses on the two domains in which negation and hedging have drawn the most attention: the biomedical and the review domains. In the first, the proposed method improves on the results to date for the sub-collection of clinical documents of the BioScope corpus. In the second, the novelty of the contribution lies in the fact that, to the best of our knowledge, this is the first system trained and tested on the SFU Review corpus annotated with negative and speculative information. At the same time, this is the first attempt to detect speculation in the review domain. Additionally, due to the tokenization problems encountered during the preprocessing of the BioScope corpus and the small number of works in the literature which propose solutions for this problem, this thesis describes this issue in detail and provides both a comprehensive analysis and an evaluation of a set of tokenization tools. This constitutes the first comparative evaluation study of tokenizers in the biomedical domain, which could help Natural Language Processing developers choose the best tokenizer to use.