Open Generative Large Language Models for Galician

Pablo Gamallo Otero; Pablo Rodríguez; Iria de Dios Flores; Susana Sotelo Docío; Silvia Paniagua; Daniel Bardanca Outeiriño; José Ramón Pichel Campos; Marcos García González

Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 73, 2024, pp. 259-270
Language: English
Parallel titles:
- Grandes Modelos de Lengua Generativos y Abiertos para Gallego

Dialnet Metrics: 4 citations

Abstract
  Large language models (LLMs) have transformed natural language processing. Yet their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for lower-resource languages such as Galician. To bridge this gap, we present the first two generative LLMs focused on Galician. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt two existing LLMs trained on larger corpora to Galician, thus mitigating the data constraints that would arise if training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal promising performance, underscoring the importance of linguistic diversity in generative models.