Conclusiones de la evaluación de Modelos del Lenguaje en Español

  • Authors: Rodrigo Agerri Gascón, Eneko Agirre Bengoa
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 70, 2023, pp. 157-170
  • Language: Spanish
  • Parallel titles:
    • Lessons learned from the evaluation of Spanish Language Models
  • Abstract
    • Spanish

      Several language models for Spanish (also known as BERTs) are currently available, developed both within large projects that use very large private corpora and through smaller-scale academic efforts that leverage freely available data. In this article we present an exhaustive comparison of language models for Spanish with the following results: (i) the inclusion of previously ignored multilingual models substantially changes the evaluation landscape for Spanish, since they turn out to be generally better than their monolingual counterparts; (ii) the differences in results across the monolingual models are not conclusive, since supposedly smaller and inferior models obtain more than competitive results. The outcome of our evaluation shows that further research is needed to understand the factors underlying these results. In this regard, the effect of corpus size, corpus quality and pre-training techniques needs to be investigated further in order to obtain monolingual Spanish models that are significantly better than the existing multilingual ones. Although this recent activity reflects a growing interest in the development of language technology for Spanish, our results show that building language models remains an open problem that requires combining resources (monetary and/or computational) with the best NLP research expertise and practices.

    • English

      Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller-scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) Previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) Results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need for more research to understand the factors underlying them. In this sense, the effects of corpus size, quality and pre-training techniques need to be further investigated to be able to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, especially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires marrying resources (monetary and/or computational) with the best research expertise and practice.

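The abstracts describe a head-to-head evaluation of Spanish encoder-only masked language models against multilingual ones. This record does not include the authors' evaluation code; the sketch below is only a minimal illustration of how such a comparison is typically set up with the Hugging Face transformers and datasets libraries, by fine-tuning each encoder on the same downstream task and comparing held-out accuracy. The model checkpoints (BETO and XLM-RoBERTa base), the choice of the Spanish XNLI slice, the training subset size and all hyperparameters are illustrative assumptions, not the paper's actual experimental setup.

```python
"""Minimal sketch: fine-tune a monolingual Spanish encoder and a multilingual
one on the same task and compare accuracy. Illustrative only; not the paper's
actual benchmark suite or hyperparameters."""

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed model pairing: BETO (monolingual Spanish) vs. XLM-RoBERTa (multilingual).
MODELS = {
    "monolingual-es": "dccuchile/bert-base-spanish-wwm-cased",
    "multilingual": "xlm-roberta-base",
}

# Spanish slice of XNLI: premise/hypothesis pairs with three NLI labels.
xnli_es = load_dataset("xnli", "es")


def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}


def fine_tune_and_score(model_name: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(batch):
        return tokenizer(batch["premise"], batch["hypothesis"],
                         truncation=True, max_length=128)

    encoded = xnli_es.map(encode, batched=True)
    # Keep the sketch cheap: train on a 20k-example subset instead of full XNLI.
    train_subset = encoded["train"].shuffle(seed=0).select(range(20_000))

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"out/{model_name.split('/')[-1]}",
            per_device_train_batch_size=32,
            num_train_epochs=3,
            learning_rate=2e-5,
            report_to="none",
        ),
        train_dataset=train_subset,
        eval_dataset=encoded["validation"],
        tokenizer=tokenizer,  # enables dynamic padding via the default collator
        compute_metrics=accuracy,
    )
    trainer.train()
    return trainer.evaluate()["eval_accuracy"]


if __name__ == "__main__":
    for tag, checkpoint in MODELS.items():
        print(f"{tag} ({checkpoint}): accuracy = {fine_tune_and_score(checkpoint):.3f}")
```

Since the paper's evaluation is described as comprehensive, a sketch like this would correspond to just one task/model cell of the full comparison.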
