EriBERTa Private Surpasses her Public Alter Ego: Enhancing a Bilingual Pretrained Encoder with Limited Private Medical Data

  • Authors: Iker de la Iglesia, Adrián Sánchez Freire, Oier Urquijo Durán, Ander Barrena Madinabeitia, Aitziber Atutxa Salazar
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 75, 2025 (Issue dedicated to: Procesamiento del Lenguaje Natural, Revista nº 75, septiembre de 2025), pp. 283-296
  • Language: English
  • Parallel titles:
    • EriBERTa Privada Supera su Alter Ego Pública: Mejorando un Codificador Bilingüe Preentrenado con Datos Médicos Privados Limitados
  • Abstract

      The secondary use of clinical reports is essential for improving patient care. While NLP tools have become instrumental in extracting insights from such reports, domain-specific language models for clinical Spanish remain scarce. Therefore, we introduce EriBERTa, the first open-source bilingual clinical language model for English and Spanish, designed to advance clinical NLP in under-resourced settings. We evaluate its performance across multiple dimensions: public vs. proprietary pretraining data, data availability, and cross-lingual transfer. Results show that pretraining on in-domain Electronic Health Records yields strong gains, especially for complex tasks like clinical document section identification. EriBERTa also performs well on monolingual tasks and transfers effectively across languages, making it a valuable tool for multilingual clinical NLP. The model is publicly released to support further research.

