Lexical Complexity Assessment of Spanish in Ecuadorian Public Documents

  • Authors: Jenny Alexandra Ortiz Zambrano, César Espin Riofrio, Arturo Montejo Ráez
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 74, 2025, pp. 291-303
  • Language: English
  • Parallel titles:
    • Evaluación de Complejidad Léxica del Español en Documentos Públicos Ecuatorianos
  • Abstract
    • Spanish

      Este estudio presenta una evaluación integral de la complejidad léxica (CL) en textos de instituciones públicas ecuatorianas, con un enfoque particular en el desarrollo y aplicación de técnicas avanzadas de procesamiento del lenguaje natural (PLN). El análisis incluye una evaluación comparativa de varios modelos y enfoques aplicados al corpus GovAIEc, una colección recientemente desarrollada de textos gubernamentales ecuatorianos. El estudio examina el impacto de la incorporación de características lingüísticas y la variación del número de épocas de entrenamiento, proporcionando un análisis profundo de su contribución al rendimiento del modelo. Además, se propone una solución práctica y accesible a través de una plataforma web diseñada para facilitar la comprensión de palabras complejas en documentos públicos, que a menudo obstaculizan la ejecución exitosa de procesos burocráticos. Este trabajo tiene como objetivo mejorar las interacciones con los sistemas gubernamentales promoviendo una comunicación más eficiente y comprensible. El mejor rendimiento se alcanzó con bert-base-spanish-wwm-uncased, combinando características lingüísticas y codificaciones, con un MAE = 0.1551. Los resultados indican que las características lingüísticas son esenciales para mejorar el rendimiento, sugiriendo que los enfoques híbridos son más efectivos que los basados únicamente en aprendizaje profundo.

    • English

      This study presents a comprehensive assessment of lexical complexity (LC) in texts from Ecuadorian public institutions, with a particular focus on the development and application of advanced natural language processing (NLP) techniques. The analysis includes a comparative evaluation of several models and approaches applied to the GovAIEc corpus, a recently developed collection of Ecuadorian government texts. The study examines the impact of incorporating linguistic features and varying the number of training epochs, providing an in-depth analysis of their contribution to model performance. Furthermore, a practical and accessible solution is proposed through a web platform designed to facilitate the understanding of complex words in public documents, which often hinder the successful execution of bureaucratic processes. This work aims to improve interactions with government systems by promoting more efficient and comprehensible communication. The best performance was achieved with bert-base-spanish-wwm-uncased, combining linguistic features and encodings, with an MAE of 0.1551. The results indicate that linguistic features are essential for improving performance, suggesting that hybrid approaches are more effective than those based solely on deep learning.
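
The following is a minimal Python sketch of the hybrid setup the abstract describes: contextual encodings from bert-base-spanish-wwm-uncased concatenated with hand-crafted linguistic features and passed to a small regression head that predicts a continuous complexity score, trained and evaluated with mean absolute error (MAE). It is not the authors' released code; the dccuchile Hugging Face checkpoint path, the particular features (word length, a rough syllable count, log frequency), and the frozen encoder are illustrative assumptions.

import math

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Assumed hub path for the BETO uncased checkpoint named in the abstract.
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def linguistic_features(word: str, freq_per_million: float) -> torch.Tensor:
    # Toy hand-crafted features: length, vowel-based syllable estimate, log frequency.
    syllables = sum(ch in "aeiouáéíóúü" for ch in word.lower())
    return torch.tensor([float(len(word)), float(syllables), math.log1p(freq_per_million)])

class HybridComplexityRegressor(nn.Module):
    # BERT context vector + linguistic features -> complexity score in [0, 1].
    def __init__(self, n_features: int = 3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(encoder.config.hidden_size + n_features, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )

    def forward(self, sentence: str, word: str, freq_per_million: float) -> torch.Tensor:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():  # encoder kept frozen in this sketch
            hidden = encoder(**inputs).last_hidden_state
        cls_vec = hidden[:, 0, :]  # [CLS] vector as a simple summary of the context
        feats = linguistic_features(word, freq_per_million).unsqueeze(0)
        return self.head(torch.cat([cls_vec, feats], dim=-1)).squeeze(-1)

# Training would minimise L1 loss, i.e. the MAE for which the paper reports 0.1551.
model = HybridComplexityRegressor()
loss_fn = nn.L1Loss()
score = model("Debe presentar la declaración juramentada ante el notario.", "juramentada", 3.2)

In the paper the comparison also varies the number of fine-tuning epochs and contrasts this hybrid input with deep encodings alone; the encoder is frozen here only to keep the sketch short.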

