Automating Early Disease Prediction Via Structured and Unstructured Clinical Data

Ane G. Domingo Aldama; Marcos Merino Prado; Alain García Olea; Josu Goikoetxea Salutregi; Koldobika Gojenola Galletebeitia; Aitziber Atutxa Salazar

Authors: Ane G. Domingo Aldama, Marcos Merino Prado, Alain García Olea, Josu Goikoetxea Salutregi, Koldobika Gojenola Galletebeitia, Aitziber Atutxa Salazar
Source: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 76, 2026 (Issue dedicated to: Procesamiento del Lenguaje Natural, Revista nº 76, March 2026), pp. 39-52
Language: Spanish
Parallel titles:
- Automatización de la predicción temprana de enfermedades mediante datos clínicos estructurados y no estructurados
Abstract
This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in coded electronic health records (EHRs), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and stronger correlation with true outcomes than models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.
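The three steps named in the abstract (cohort selection, dataset generation, outcome labeling) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the abstract does not specify the NLP techniques used, so plain keyword matching stands in for them here, with no negation or context handling. All names (`build_dataset`, `COHORT_TERM`, `OUTCOME_TERM`, `TEXT_VARIABLES`), the choice of "permanent atrial fibrillation" as a progression marker, and the toy records are assumptions made purely for illustration.

```python
import re

# Hypothetical keyword patterns; a real system would use a clinical NLP model.
COHORT_TERM = re.compile(r"atrial fibrillation", re.IGNORECASE)
OUTCOME_TERM = re.compile(r"permanent atrial fibrillation", re.IGNORECASE)
TEXT_VARIABLES = {
    "hypertension": re.compile(r"hypertension", re.IGNORECASE),
    "diabetes": re.compile(r"diabetes", re.IGNORECASE),
}


def build_dataset(structured, reports):
    """Build one row per cohort patient.

    structured: dict of patient_id -> dict of coded EHR variables
    reports:    dict of patient_id -> list of (ISO date, report text)
    """
    rows = []
    for pid, notes in reports.items():
        notes = sorted(notes)  # ISO dates sort chronologically as strings
        # Step 1, cohort selection: the index report is the first AF mention.
        index = next(
            (i for i, (_, text) in enumerate(notes) if COHORT_TERM.search(text)),
            None,
        )
        if index is None:
            continue  # patient never enters the cohort
        baseline = " ".join(text for _, text in notes[: index + 1])
        follow_up = " ".join(text for _, text in notes[index + 1:])
        # Step 2, dataset generation: fill gaps in coded data from the text.
        row = {"patient_id": pid, **structured.get(pid, {})}
        for name, pattern in TEXT_VARIABLES.items():
            row[name] = bool(row.get(name)) or bool(pattern.search(baseline))
        # Step 3, outcome labeling: look for progression after the index report.
        row["progressed"] = bool(OUTCOME_TERM.search(follow_up))
        rows.append(row)
    return rows


# Toy records, invented for this example only.
structured = {"p1": {"age": 71, "hypertension": False}, "p2": {"age": 64}}
reports = {
    "p1": [
        ("2020-01-10", "Paroxysmal atrial fibrillation. History of hypertension."),
        ("2021-06-02", "Now in permanent atrial fibrillation."),
    ],
    "p2": [
        ("2020-03-05", "Atrial fibrillation detected on ECG."),
        ("2021-02-01", "Sinus rhythm maintained."),
    ],
    "p3": [("2020-05-05", "Admitted for pneumonia.")],
}
rows = build_dataset(structured, reports)
```

In this toy run, patient `p1` has hypertension recorded as absent in the coded data but mentioned in the discharge report, which is the kind of gap the abstract says text enrichment is meant to close. Only reports up to the index report feed the predictor variables, while later reports feed the label, so that outcome information does not leak into the features.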