Bertinho: Galician BERT Representations

  • Authors: David Vilares Calvo, Marcos García González, Carlos Gómez Rodríguez
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 66, 2021, pp. 13-26
  • Language: English
  • Parallel titles:
    • Bertinho: Representaciones BERT para el gallego
  • Abstract
    • Spanish

      This article presents a monolingual BERT model for Galician. We build on the current trend, which has shown that it is possible to create robust monolingual BERT models even for languages with relatively scarce resources, and that these models perform better than the official multilingual BERT model (mBERT). Specifically, we release two monolingual models for Galician, created with 6 and 12 transformer layers, respectively, and trained with a limited amount of resources (~45 million words on a single 24GB GPU). To evaluate them, we run an exhaustive set of experiments on tasks such as POS tagging, dependency parsing, and named entity recognition. To this end, we cast these tasks as sequence labeling, so that the BERT models can be run without adding any extra layers (we only add the output layer that maps the contextualized representations to the predicted label). The experiments show that our models, especially the 12-layer one, improve on the results of mBERT in most tasks.

    • English

      This paper presents a monolingual BERT model for Galician. We follow the recent trend showing that it is feasible to build robust monolingual BERT models even for relatively low-resource languages, and that such models perform better than the well-known official multilingual BERT (mBERT). In particular, we release two monolingual Galician BERT models, built with 6 and 12 transformer layers, respectively, and trained with limited resources (~45 million tokens on a single 24GB GPU). We then provide an exhaustive evaluation on a number of tasks, such as POS tagging, dependency parsing, and named entity recognition. For this purpose, all of these tasks are cast as pure sequence labeling, so that BERT can be run without any additional layers on top (we only use an output classification layer that maps the contextualized representations to the predicted label). The experiments show that our models, especially the 12-layer one, outperform mBERT on most tasks.
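
  The sequence-labeling setup described in the abstract can be sketched in a few lines of Python. The snippet below is an illustration, not the authors' released code: it loads a pretrained BERT checkpoint through the Hugging Face transformers library and puts a single token-classification layer on top, the kind of architecture the paper evaluates. The model identifier and the toy POS tag set are assumptions made for the example, and the output layer is randomly initialized until it is fine-tuned on labeled Galician data.

      import torch
      from transformers import AutoTokenizer, AutoModelForTokenClassification

      # Assumed Hub identifier for the 12-layer Bertinho model; substitute
      # whichever checkpoint you actually want to use.
      MODEL_ID = "dvilares/bertinho-gl-base-cased"
      LABELS = ["DET", "NOUN", "VERB", "ADJ", "ADP", "PUNCT"]  # toy tag set

      tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
      # AutoModelForTokenClassification adds one linear output layer over the
      # encoder, mapping each contextualized representation to label scores.
      model = AutoModelForTokenClassification.from_pretrained(
          MODEL_ID, num_labels=len(LABELS)
      )

      sentence = "O galego é unha lingua romance."
      inputs = tokenizer(sentence, return_tensors="pt")

      with torch.no_grad():
          logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

      # One prediction per subword token; word-level tags are usually read
      # off each word's first subword during fine-tuning and evaluation.
      tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
      for token, label_id in zip(tokens, logits.argmax(dim=-1).squeeze(0)):
          print(f"{token}\t{LABELS[label_id]}")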

  • References
    • Agerri, R., X. Gómez Guinovart, G. Rigau, and M. A. Solla Portela. 2018. Developing new linguistic resources and tools for the Galician language....
    • Agerri, R., I. San Vicente, J. A. Campos, A. Barrena, X. Saralegi, A. Soroa, and E. Agirre. 2020. Give your text representation models some...
    • Bengio, Y., R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
    • Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association...
    • Cañete, J., G. Chaperon, R. Fuentes, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In Practical ML for Developing...
    • Collobert, R. and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In...
    • Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural Language Processing (Almost) From Scratch. Journal...
    • Dai, A. M. and Q. V. Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
    • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding....
    • Ettinger, A. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association...
    • Freixeiro Mato, X. R. 2003. Gramática da Lingua Galega IV. Gramática do texto. A Nosa Terra, Vigo.
    • Garcia, M. and P. Gamallo. 2010. Análise Morfossintáctica para Português Europeu e Galego: Problemas, Soluções e Avaliação. Linguamática,...
    • Garcia, M., C. Gómez-Rodríguez, and M. A. Alonso. 2016. Creación de un treebank de dependencias universales mediante recursos existentes para...
    • Garcia, M., C. Gómez-Rodríguez, and M. A. Alonso. 2018. New treebank or repurposed? on the feasibility of cross-lingual parsing of romance...
    • Guinovart, X. G. and S. L. Fernández. 2009. Anotación morfosintáctica do Corpus Técnico do Galego. Linguamática, 1(1):61–70.
    • Guinovart, X. 2017. Recursos integrados da lingua galega para a investigación lingüística. Gallaecia. Estudos de lingüística portuguesa e...
    • IGE. 2018. Coñecemento e uso do galego. Instituto Galego de Estatística, http://www.ige.eu/web/mostrar_actividade_estatistica.jsp?idioma=gl&codigo=0206004.
    • Jiang, N. and M.-C. de Marneffe. 2019. Evaluating BERT for natural language inference: A case study on the CommitmentBank. In Proceedings...
    • Karthikeyan, K., Z. Wang, S. Mayhew, and D. Roth. 2020. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. In International Conference...
    • Kingma, D. P. and J. Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR...
    • Kitaev, N. and D. Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association...
    • Koutsikakis, J., I. Chalkidis, P. Malakasiotis, and I. Androutsopoulos. 2020. GREEK-BERT: The Greeks visiting Sesame Street. In 11th Hellenic...
    • Kuratov, Y. and M. Arkhipov. 2019. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. Computational Linguistics...
    • Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations....
    • Landauer, T. K. and S. T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and...
    • Lee, S., H. Jang, Y. Baik, S. Park, and H. Shin. 2020. KR-BERT: A Small-Scale Korean-Specific Language Model. arXiv preprint arXiv:2008.03979.
    • Lin, Y., Y. C. Tan, and R. Frank. 2019. Open sesame: Getting inside BERT’s linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP:...
    • Lindley Cintra, L. F. and C. Cunha. 1984. Nova Gramática do Português Contemporâneo. Livraria Sá da Costa, Lisbon.
    • Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A Robustly Optimized...
    • Malvar, P., J. R. Pichel, O. Senra, P. Gamallo, and A. García. 2010. Vencendo a escassez de recursos computacionais. Carvalho: Tradutor automático...
    • McDonald, S. and M. Ramscar. 2001. Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In...
    • Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In Workshop Proceedings...
    • Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality....
    • Nivre, J., M.-C. de Marneffe, F. Ginter, J. Hajič, C. D. Manning, S. Pyysalo, S. Schuster, F. Tyers, and D. Zeman. 2020. Universal Dependencies...
    • Ortiz Suárez, P. J., L. Romary, and B. Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In...
    • Padró, L. 2011. Analizadores Multilingües en FreeLing. Linguamatica, 3(2):13–20.
    • Pennington, J., R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical...
    • Peters, M., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings...
    • Pires, T., E. Schlinger, and D. Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association...
    • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning...
    • Rojo, G., M. López Martínez, E. Domínguez Noya, and F. Barcala. 2019. Corpus de adestramento do Etiquetador/Lematizador do Galego Actual (XIADA),...
    • Salant, S. and J. Berant. 2018. Contextualized word representations for reading comprehension. In Proceedings of the 2018 Conference of the...
    • Samartim, R. 2012. Língua somos: A construção da ideia de língua e da identidade coletiva na Galiza (pré-)constitucional. In Novas achegas...
    • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings...
    • Schnabel, T., I. Labutov, D. Mimno, and T. Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015...
    • Souza, F., R. Nogueira, and R. Lotufo. 2019. Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649.
    • Strzyz, M., D. Vilares, and C. Gómez-Rodríguez. 2019. Viable dependency parsing as sequence labeling. In Proceedings of the 2019 Conference...
    • TALG. 2016. CTG Corpus (Galician Technical Corpus). TALG Research Group. SLI resources, 1.0, ISLRN 437-045-879-366-6.
    • TALG. 2018. SLI NERC... TALG Research Group. SLI resources, 1.0, ISLRN 435-026-256-395-4.
    • Teyssier, P. 1987. História da Língua Portuguesa. Livraria Sá da Costa,...
    • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention Is All You Need. arXiv...
    • Vilares, D. and C. Gómez-Rodríguez. 2018. Transition-based parsing with lighter feedforward networks. In Proceedings of the Second Workshop...
    • Vilares, D., M. Strzyz, A. Søgaard, and C. Gómez-Rodríguez. 2020. Parsing as pretraining. In Proceedings of the Thirty-Fourth AAAI Conference...
    • Virtanen, A., J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. 2019. Multilingual is not enough: BERT...
    • Vulić, I., E. M. Ponti, R. Litschko, G. Glavaš, and A. Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings...
    • Wenzek, G., M.-A. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave. 2020. CCNet: Extracting high quality monolingual...
    • Wolf, T., L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von...
    • Wu, S. and M. Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning...
