Adapting Large Language Models for Underrepresented Languages

    1. [1] Universidade da Coruña

      A Coruña, Spain

  • Published in: Proceedings XoveTIC 2024: Impulsando el talento científico / coordinated by Manuel Lagos Rodríguez, Tirso Varela Rodeiro, Javier Pereira-Loureiro, Manuel Francisco González Penedo, 2024, pp. 25-32
  • Language: English
  • Abstract
    • The popularization of Large Language Models (LLMs), especially through the development of conversational systems, makes it imperative to facilitate access to artificial intelligence (AI) for everyone. Most models neglect minority languages, prioritizing widely spoken ones. This exacerbates their underrepresentation in the digital world and negatively affects their speakers. We present two resources aimed at improving natural language processing (NLP) for Galician: (i) a Llama 3.1 instruct model adapted through continuous pre-training on the CorpusNós dataset; and (ii) a Galician version of the Alpaca dataset, used to assess the improvement over the base model. In this evaluation, our model outperformed both the base model and another Galician model in quantitative and qualitative terms.
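The continuous pre-training step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration assuming the HuggingFace Transformers stack, not the authors' actual pipeline: the 8B checkpoint size, the local corpus file path, the sequence length, and all hyperparameters are placeholder assumptions; only the Llama 3.1 instruct base model and the CorpusNós corpus come from the abstract.

```python
# Minimal sketch: continued (causal LM) pre-training of a Llama 3.1 instruct
# model on a plain-text corpus. Checkpoint size, data path, and hyperparameters
# are illustrative assumptions, not values reported in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base checkpoint
DATA_PATH = "corpus_nos.txt"  # hypothetical local dump of CorpusNós, one document per line

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Load the raw Galician text and tokenize it.
raw = load_dataset("text", data_files={"train": DATA_PATH})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False selects the standard next-token (causal) language modeling objective,
# i.e. the same objective as the original pre-training, now on Galician text.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="llama31-galician",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=50,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The key design point is that adaptation reuses the causal language modeling objective rather than instruction tuning: the instruct model simply continues learning next-token prediction on Galician text, shifting its distribution toward the target language while retaining its instruction-following behavior.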

