MarIA: Modelos del Lenguaje en Español

  • Authors: Aitor González Agirre, Marta Villegas Montserrat, Asier Gutiérrez Fandiño, Jordi Armengol Estapé, Marc Pàmies, Joan Llop Palao, Joaquín Silveira Ocampo, Casimiro Pio Carrino, Carme Armentano i Oller, Carlos Rodríguez Penagos
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 68, 2022, pp. 39-60
  • Language: Spanish
  • Parallel titles:
    • MarIA: Spanish Language Models
  • Abstract
    • Spanish

      This article presents MarIA, a family of Spanish language models and their associated resources, released to industry and to the research community. MarIA currently includes the Spanish RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large language models, which can be regarded as the largest and best-performing models for Spanish. The models were pretrained on a massive corpus of 570 GB of clean, deduplicated text, comprising a total of 135 billion words extracted from the Spanish Web Archive built by the National Library of Spain between 2009 and 2019. We evaluate the performance of the models on nine existing datasets and on a new extractive question-answering dataset created ex novo. Across the tasks and settings presented, the MarIA models outperform the existing Spanish models in practically every case.

    • English

      This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which can arguably be presented as the largest and most proficient language models in Spanish. The models were pretrained using a massive corpus of 570GB of clean and deduplicated texts with 135 billion words extracted from the Spanish Web Archive crawled by the National Library of Spain between 2009 and 2019. We assessed the performance of the models with nine existing evaluation datasets and with a novel extractive Question Answering dataset created ex novo. Overall, MarIA models outperform the existing Spanish models across a variety of NLU tasks and training settings.

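The abstract notes that the MarIA models and resources are made available to industry and the research community. As a non-authoritative illustration, the sketch below shows how such checkpoints are typically loaded with the Hugging Face transformers library in Python; the repository identifiers PlanTL-GOB-ES/roberta-base-bne and PlanTL-GOB-ES/gpt2-base-bne are assumptions not given in this record and should be checked against the actual release.

    # Minimal sketch: loading MarIA-style Spanish checkpoints via Hugging Face pipelines.
    # The model identifiers below are assumptions; verify them against the official release.
    from transformers import pipeline

    # Masked-word prediction with the RoBERTa-base checkpoint (RoBERTa models use the <mask> token).
    fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")
    for pred in fill_mask("La capital de España es <mask>."):
        print(pred["token_str"], round(pred["score"], 3))

    # Open-ended text generation with the GPT2-base checkpoint.
    generator = pipeline("text-generation", model="PlanTL-GOB-ES/gpt2-base-bne")
    print(generator("La Biblioteca Nacional de España", max_new_tokens=30)[0]["generated_text"])

The same pipeline interface would apply to the RoBERTa-large and GPT2-large checkpoints, and a question-answering pipeline could be used with a checkpoint fine-tuned on the extractive QA dataset mentioned in the abstract.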
