Analysing the Problem of Automatic Evaluation of Language Generation Systems

  • Authors: Iván Martínez Murillo, Paloma Moreda Pozo, Elena Lloret Pastor
  • Location: Procesamiento del Lenguaje Natural, ISSN 1135-5948, No. 72, 2024, pp. 123-136
  • Language: English
  • Parallel titles:
    • Analizando el Problema de la Evaluación Automática de los Sistemas de Generación de Lenguaje
  • Abstract
    • Spanish

      Las métricas automáticas de evaluación de texto se utilizan ampliamente para medir el rendimiento de un sistema de Generación de Lenguaje Natural (GLN). Sin embargo, estas métricas tienen varias limitaciones. Este artículo propone un estudio empírico donde se analiza el problema que tienen las métricas de evaluación actuales, como la falta de capacidad de estas métricas para medir la calidad semántica de un texto, o la alta dependencia que tienen sobre los textos contra los que se comparan. Además, se comparan sistemas de GLN tradicionales contra sistemas más actuales basados en redes neuronales. Finalmente, se propone una experimentación con GPT-4 para determinar si es una fuente fiable para evaluar la calidad de un texto. A partir de los resultados obtenidos, se puede concluir que con las métricas automáticas actuales la mejora de los sistemas neuronales frente a los tradicionales no es tan significativa. En cambio, si se analizan los aspectos cualitativos de los textos generados, sí que se refleja esa mejora.

    • English

      Automatic text evaluation metrics are widely used to measure the performance of a Natural Language Generation (NLG) system. However, these metrics have several limitations. This article empirically analyses the problems with current evaluation metrics, such as their inability to measure the semantic quality of a text and their high dependence on the reference texts against which generated texts are compared. Additionally, traditional NLG systems are compared against more recent systems based on neural networks. Finally, an experiment with GPT-4 is proposed to determine whether it is a reliable source for evaluating the quality of a text. From the results obtained, it can be concluded that, with current automatic metrics, the improvement of neural systems over traditional ones does not appear so significant. In contrast, when the qualitative aspects of the generated texts are analysed, this improvement is clearly reflected. (A minimal sketch illustrating the reference dependence of overlap-based metrics follows the reference list below.)

  • Bibliographic references
    • Aghahadi, Z. and A. Talebpour. 2022. Avicenna: a challenge dataset for natural language generation toward commonsense syllogistic reasoning....
    • Anderson, P., B. Fernando, M. Johnson, and S. Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Computer Vision–ECCV...
    • Appelt, D. 1985. Planning English Sentences. Cambridge University Press.
    • Banerjee, S. and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings...
    • Bateman, J. A. 1997. Enabling technology for multilingual natural language generation: the KPML development environment. Natural Language...
    • Bhargava, P. and V. Ng. 2022. Commonsense knowledge reasoning and generation with pre-trained language models: A survey. In Proceedings of...
    • Braun, D., K. Klimt, D. Schneider, and F. Matthes. 2019. SimpleNLG-DE: Adapting SimpleNLG 4 to German. In Proceedings of the 12th International...
    • Carlsson, F., J. Öhman, F. Liu, S. Verlinden, J. Nivre, and M. Sahlgren. 2022. Fine-grained controllable text generation using non-residual...
    • Cascallar-Fuentes, A., A. Ramos-Soto, and A. Bugarín Diz. 2018. Adapting SimpleNLG to Galician language. In Proceedings of the 11th International...
    • Chen, G., K. van Deemter, and C. Lin. 2018. SimpleNLG-ZH: a linguistic realisation engine for Mandarin. In Proceedings of the 11th International...
    • Dong, C., Y. Li, H. Gong, M. Chen, J. Li, Y. Shen, and M. Yang. 2023. A survey of natural language generation. ACM Computing Surveys, 55:1–38,...
    • Fu, J., S.-K. Ng, Z. Jiang, and P. Liu. 2023. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
    • Gatt, A. and E. Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal...
    • Gatt, A. and E. Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European workshop on...
    • Guo, K. 2022. Testing and validating the cosine similarity measure for textual analysis. Available at SSRN 4258463.
    • Han, J., M. Kamber, and J. Pei. 2012. 2 - getting to know your data. In Data Mining (Third Edition), The Morgan Kaufmann Series in Data Management...
    • He, X., Y. Gong, A.-L. Jin, W. Qi, H. Zhang, J. Jiao, B. Zhou, B. Cheng, S. Yiu, and N. Duan. 2022. Metric-guided distillation: Distilling...
    • Hovy, E. 1987. Generating natural language under pragmatic constraints. Journal of Pragmatics, 11(6):689–719.
    • Ji, Z., N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. 2023. Survey of hallucination in natural language...
    • Khapra, M. M. and A. B. Sai. 2021. A tutorial on evaluation metrics used in natural language generation. NAACL-HLT 2021 - 2021 Conference...
    • Kincaid, J. P., R. P. Fishburne Jr, R. L. Rogers, and B. S. Chissom. 1975. Derivation of new readability formulas (automated readability index,...
    • Koller, A. and M. Stone. 2007. Sentence generation as a planning problem. In Proceedings of the 45th Annual Meeting of the Association of...
    • Kusner, M., Y. Sun, N. Kolkin, and K. Weinberger. 2015. From word embeddings to document distances. In International conference on machine...
    • Lemon, O. 2011. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation....
    • Levelt, W. 1989. Speaking: From Intention to Articulation. MIT Press, Cambridge, MA.
    • Lin, B. Y., W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren. 2020. CommonGen: A constrained text generation challenge for generative...
    • Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
    • Lo, C.-k., A. K. Tumuluru, and D. Wu. 2012. Fully automatic semantic MT evaluation. In Proceedings of the Seventh Workshop on Statistical...
    • Mann, W. C. and J. A. Moore. 1981. Computer generation of multiparagraph English text. American Journal of Computational Linguistics, 7(1):17–29.
    • McDonald, D. D. 2010. Natural language generation. Handbook of natural language processing, 2:121–144.
    • Mirza, M., B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, I. J. Goodfellow, and J. Pouget-Abadie. 2014. Generative adversarial...
    • Nakatsu, C. and M. White. 2010. Generating with discourse combinatory categorial grammar. Linguistic Issues in Language Technology, 4.
    • Nirenburg, S., V. R. Lesser, and E. Nyberg. 1989. Controlling a language generation planner. In IJCAI, pages 1524–1530.
    • OpenAI. 2023. GPT-4 technical report.
    • Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the...
    • Popović, M. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the tenth workshop on statistical machine...
    • Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog,...
    • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning...
    • Ramos-Soto, A., J. Janeiro-Gallardo, and A. Bugarín. 2017. Adapting SimpleNLG to Spanish. pages 144–148. Association for Computational Linguistics.
    • Reiter, E. 1994. Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible? In Proceedings of the Seventh...
    • Rieser, V. and O. Lemon. 2009. Natural language generation as planning under uncertainty for spoken dialogue systems. Empirical Methods in...
    • Roos, Q. 2022. Fine-tuning pre-trained language models for CEFR-level and keyword conditioned text generation: A comparison between Google’s...
    • Sai, A. B., A. K. Mohankumar, and M. M. Khapra. 2022. A survey of evaluation metrics used for NLG systems. ACM Comput. Surv., 55(2), jan.
    • Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2008. The graph neural network model. IEEE transactions on neural...
    • Stanchev, P., W. Wang, and H. Ney. 2019. EED: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference...
    • Sukhbaatar, S., J. Weston, R. Fergus, et al. 2015. End-to-end memory networks. Advances in neural information processing systems, 28.
    • Sutskever, I., O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing...
    • Tang, T., H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X. Zhao, and F. Wei. 2023. Not all metrics are guilty: Improving NLG evaluation with...
    • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. Advances...
    • Vedantam, R., C. Lawrence Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference...
    • Wang, H., Y. Liu, C. Zhu, L. Shou, M. Gong, Y. Xu, and M. Zeng. 2021. Retrieval enhanced model for commonsense generation. In Findings of...
    • Wang, J., Y. Liang, F. Meng, Z. Sun, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou. 2023. Is ChatGPT a good NLG evaluator? a preliminary study....
    • Yu, W., C. Zhu, L. Qin, Z. Zhang, T. Zhao, and M. Jiang. 2022. Diversifying content generation for commonsense reasoning with mixture of knowledge...
    • Yuan, W., G. Neubig, and P. Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing...
    • Zhang, H., S. Si, H. Wu, and D. Song. 2023. Controllable text generation with residual memory transformer. arXiv preprint arXiv:2309.16231.
    • Zhang, T., V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference...
    • Zhang, Y. and X. Wan. 2024. SituatedGen: Incorporating geographical and temporal contexts into generative commonsense reasoning. Advances...
    • Zhu, W. and S. Bhat. 2020. GRUEN for evaluating linguistic quality of generated text. In Findings of the Association for Computational Linguistics:...
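
As the abstract points out, widely used reference-based metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) score a generated text by its surface overlap with the reference texts it is compared against, which makes them largely insensitive to meaning. The Python sketch below is a minimal, hypothetical illustration of that dependence: a toy unigram-overlap F1 score (a stripped-down stand-in for ROUGE-1, not any implementation used in the paper), applied to invented example sentences.

    from collections import Counter

    def unigram_f1(candidate: str, reference: str) -> float:
        """Toy ROUGE-1-style score: F1 over unigram overlap with a single reference."""
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())  # shared words, counted with multiplicity
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    # Invented example: the same generated sentence scored against two references.
    generated   = "the dog chased the ball across the park"
    reference_a = "the dog chased the ball across the park"        # same wording
    reference_b = "a puppy ran after its toy through the garden"   # same meaning, different words

    print(unigram_f1(generated, reference_a))  # 1.0: maximal surface overlap
    print(unigram_f1(generated, reference_b))  # ~0.12: the paraphrase is barely rewarded

The score changes drastically with the choice of reference even though the conveyed meaning is the same, which is the kind of limitation the article analyses empirically with the real metrics.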
