Towards Quality Benchmarking in Question Answering over Tabular Data in Spanish

Jorge Osés Grijalba; Luis Alfonso Ureña López; José Camacho Collados; Eugenio Martínez Cámara

Authors: Jorge Osés Grijalba, Luis Alfonso Ureña López, José Camacho Collados, Eugenio Martínez Cámara
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 73, 2024, pp. 283-296
Language: English
Parallel titles:
- Una Evaluación de Calidad en Preguntas y Respuestas sobre Datos Tabulares en Español
Abstract
- Spanish (English translation)
  The constant, rapid evolution of the language understanding and generation capacity of large language models (LLMs) is accompanied by the discovery of new abilities. Evaluating these abilities requires the scientific community to provide evaluation frameworks that allow the new capabilities to be studied, compared and analysed across different LLMs. Question answering over tabular data is one of these new LLM capabilities, and it still lacks an evaluation benchmark that allows it to be analysed in different scenarios. This work therefore presents Spa-DataBench, an evaluation benchmark made up of ten datasets on different aspects of Spanish society. Each dataset has an associated set of questions in Spanish with their respective answers, which probe the LLM's capacity to answer questions involving one or several columns of different data types, and to generate source code that resolves the question. Six LLMs are evaluated on Spa-DataBench, and their performance is compared using the same prompt written in English, because the evaluated LLMs have not been tuned to use prompts in Spanish. The results indicate that LLMs can reason over tabular data, but their performance in Spanish is lower than in English, showing that further work is needed to improve how LLMs process Spanish.
- English
  The rapid and sustained progress in the language understanding and generation capacity of large language models (LLMs) is accompanied by the discovery of new capabilities. The research community has to provide evaluation benchmarks to assess these emerging capabilities by studying, analysing and comparing different LLMs under fair and realistic settings. Question answering on tabular data is an important task that lacks reliable benchmarks for evaluating LLMs in distinct scenarios, particularly for Spanish. Hence, in this paper we present Spa-DataBench, an evaluation benchmark composed of ten datasets on different topics concerning Spanish society. Each dataset is linked to a set of questions written in Spanish and their corresponding answers. These questions are used to assess LLMs and analyse their capacity for answering questions that involve a single column or multiple columns of different data types, and for generating source code to resolve the questions. We evaluate six LLMs on Spa-DataBench, and we compare their performance using both Spanish and English prompts. The results on Spa-DataBench show that LLMs are able to reason on tabular data, but their performance in Spanish is worse than in English, which means that there is still room for improvement of LLMs in the Spanish language.