EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque

  • Authors: Aitor García Pablos, Naiara Pérez Miguel, Montse Cuadros, Jaione Bengoetxea Azurmendi
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 73, 2024, pp. 125-137
  • Language: English
  • Parallel titles:
    • EuSQuAD: SQuAD2.0 Traducido y Alineado Automáticamente para Euskera
  • Abstract
    • Spanish (translated into English)

      The wide availability of question answering datasets in English has greatly facilitated progress in the field of Natural Language Processing (NLP). However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, the translation and alignment of datasets play a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque. We demonstrate the value of EuSQuAD through an extensive qualitative analysis and QA experiments, for which a new human-annotated dataset has also been created.

    • English

      The widespread availability of Question Answering (QA) datasets in English has greatly facilitated the advancement of the Natural Language Processing (NLP) field. However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, the translation and alignment of existing QA datasets play a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque, resulting in more than 142k QA examples. We demonstrate EuSQuAD’s value through extensive qualitative analysis and QA experiments that use EuSQuAD as training data. These experiments are evaluated on a new human-annotated dataset.

