Unifying Named Entity Recognition and Extreme Multi-Label Classification for Explainable Clinical Coding

Alicia Ramírez Arrabe; Andrés Duque Fernández; Juan Martínez Romo

Authors: Alicia Ramírez Arrabe, Andrés Duque Fernández, Juan Martínez Romo
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 75, 2025 (issue dedicated to: Procesamiento del Lenguaje Natural, Revista nº 75, September 2025), pp. 41-52
Language: English
Parallel titles:
- Integración del Reconocimiento de Entidades Nombradas y la Clasificación Extrema Multi-Etiqueta para una Codificación Clínica Explicable
Abstract
- English
  Automatic clinical coding of medical reports sits at the intersection of healthcare and Natural Language Processing (NLP), facilitating the extraction of relevant information from unstructured clinical documents. This study introduces a three-stage explainable automatic coding system, developed within the experimental framework of the 2020 CodiEsp competition, a task devoted to automatic clinical coding in Spanish. The proposed system integrates two Named Entity Recognition (NER)-based models, a supervised text classification model, and an unsupervised similarity model enhanced with keyphrase extraction. This methodology allows for the detection of overlapped and discontinuous evidence texts, as well as for the inclusion of Out-Of-Distribution (OOD) codes. Our approach outperforms most state-of-the-art models, achieving an F1-score improvement of 4.2%, 0.2%, and 4.1% in the CodiEsp-D, CodiEsp-P and CodiEsp-X subtasks, respectively, and an increase of up to 2.4% in the MAP values.