Discriminative Benchmarking of Spanish Language Models: Findings from the ODESIA Challenge 2024

Alejandro Benito Santos; Roser Morante; Adrián Ghajari Espinosa; Iker García Ferrero; Robiert Sepúlveda Torres; Germán Rigau Claramunt; Rodrigo Agerri Gascón; Juan Pablo Consuegra Ayala; Ernesto Luis Estevanell Valladares; Fabio Yáñez Romero; Miquel Canal Esteve; Yoan Gutiérrez Vázquez; Rafael Muñoz Guillena; Manuel Palomar Sanz; Eva Sánchez Salido; Guillermo Marco Remón; Andrés Fernández García; Víctor Fresno Fernández; Enrique Amigó; Laura Plaza Morales; Jorge Carrillo de Albornoz; Miguel Lucas; Julio Gonzalo Arroyo

Published in: Procesamiento del Lenguaje Natural, ISSN 1135-5948, No. 76, 2026 (Issue dedicated to: Procesamiento del Lenguaje Natural, Journal No. 76, March 2026), pp. 225-238
Language: English
Parallel titles:
- Evaluación Discriminativa de Modelos de Lenguaje en Español: Resultados del ODESIA Challenge 2024
Abstract
- Spanish (translated into English)
  We present the results of the ODESIA Challenge 2024, an open competition based on private test sets that evaluates Spanish natural language processing (NLP) systems on ten discriminative tasks.

  The winning system, an LLM (Qwen2.5-14B), stood out for its performance in extractive Question Answering, whereas encoders outperformed LLMs in tasks such as sequence labeling and soft classification. We conclude that, although large generative models may dominate reasoning tasks with long contexts, encoders achieve comparable or superior performance in many discriminative scenarios, calling into question the belief that model size is a more decisive factor than using a specialized architecture for this type of task.
- English
  This paper presents the results from the 2024 ODESIA Challenge, a public competition aimed at benchmarking natural language processing (NLP) systems in Spanish across ten discriminative tasks using a standardized methodology based on private, held-out test sets. Results show the winning system (Qwen2.5-14B) prevailed due to structural advantages in extractive Question Answering, whereas encoders outperformed LLMs in other tasks such as sequence labeling and soft classification. We conclude that, while generative models may dominate reasoning-heavy tasks involving long contexts, encoder architectures obtain on-par or even better performance in many other discriminative scenarios, challenging the assumption that massive scale universally supersedes specialized architectural design.