Multilingual Information Extraction in Clinical Texts Using Deep Learning Approaches

Elena Zotova Romanova

Authors: Elena Zotova Romanova
Thesis supervisors: Germán Rigau Claramunt (supervisor), Montserrat Cuadros Oller (co-supervisor)
Defense: At the Universidad del País Vasco - Euskal Herriko Unibertsitatea (Spain) in 2025
Language: English
Links
- Open-access thesis at: ADDI
Abstract
- Healthcare practice and biomedical research generate large volumes of digitized, unstructured data in multiple languages, which remain underutilized despite their potential to enhance healthcare delivery, support trainee education, and advance biomedical research. Transforming this data into structured, actionable information requires Natural Language Processing (NLP) techniques; within NLP, this task is known as Information Extraction (IE). This thesis belongs to the growing field of biomedical NLP and addresses key challenges in biomedical information extraction, focusing on entity recognition, entity linking, and the interoperability of clinical terminologies. It makes three primary contributions: (i) a method for clinical identifier mapping and data augmentation, (ii) the design and evaluation of biomedical entity linking systems based on semantic textual similarity, and (iii) an exploration of generative approaches to biomedical entity linking. State-of-the-art deep learning techniques are used throughout. First, the thesis presents ClinIDMap, a prototype tool for clinical ID mapping that integrates multiple biomedical knowledge bases (e.g., ICD-10, SNOMED CT, UMLS) and connects them with general-purpose ontologies (Wikidata and WordNet). The tool facilitates corpus annotation and data augmentation. Experiments demonstrate that model performance remains high when corpus annotations are transferred between terminologies, underscoring the method's utility for overcoming data scarcity. Second, the thesis explores methods for biomedical entity linking (BioEL) in non-English languages, particularly Spanish. By leveraging semantic textual similarity methods and supervised ranking via cross-encoders, the entity linking models outperform symbolic methods. The proposed methods are validated through participation in shared tasks, where the systems achieved top rankings.
Third, the thesis investigates generative models for biomedical entity linking, employing encoder-decoder and decoder-only architectures. These systems generate the descriptions that entities have in knowledge bases (KBs), which turns linking to the KBs into a text-to-text problem. Experiments reveal that incorporating context and augmenting data improve the models' capacity to generalize. However, challenges remain in handling unseen data and stabilizing performance in zero-shot settings.
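
To illustrate the idea of entity linking as similarity-based ranking mentioned in the abstract, the following is a minimal sketch, not the thesis's method. It swaps the deep learning similarity models and cross-encoders for a plain character-trigram cosine similarity so that it runs with the Python standard library alone. The knowledge base entries, the identifiers (`C001`, `C002`, `C003`), and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy knowledge base: hypothetical identifiers mapped to Spanish descriptions.
# A real system would draw these from a terminology such as SNOMED CT or UMLS.
KB = {
    "C001": "diabetes mellitus tipo 2",
    "C002": "hipertensión arterial",
    "C003": "insuficiencia cardíaca",
}


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link(mention, kb=KB, top_k=1):
    """Rank KB entries by similarity to the mention; return (id, score) pairs."""
    mention_vec = char_ngrams(mention)
    scored = [(cosine(mention_vec, char_ngrams(desc)), cid)
              for cid, desc in kb.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(cid, score) for score, cid in scored[:top_k]]


if __name__ == "__main__":
    for mention in ["diabetes tipo 2", "hipertension"]:
        print(mention, "->", link(mention, top_k=2))
```

In this sketch the surface-form similarity stands in for a learned semantic similarity. In a setup like the one the abstract describes, the candidate list returned by `link` would presumably be re-scored by a supervised ranker such as a cross-encoder.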