Automatic and Manual Evaluation of a Spanish Suicide Information Chatbot

Pablo Ascorbe Fernández; María Soledad Campos Burgui; César Domínguez; Jónathan Heras Vicente; Magdalena Pérez; Ana Rosa Terroba Reinares

Authors: Pablo Ascorbe Fernández, María Soledad Campos Burgui, César Domínguez, Jónathan Heras Vicente, Magdalena Pérez, Ana Rosa Terroba Reinares
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 73, 2024, pp. 151-164
Language: English
Parallel titles:
- Evaluación automática y manual de un chatbot para proporcionar información sobre suicidio en castellano


Abstract
- Spanish (translated)
  Chatbots have great potential in sensitive fields such as mental health, but ensuring that they work correctly requires careful evaluation, whether by manual or automatic methods. This work presents a library for automatically evaluating Spanish Retrieval Augmented Generation (RAG) chatbots using large language models (LLMs). A thorough evaluation is then carried out of several candidate models for a RAG system that provides information on suicide prevention, using a manual evaluation, an automatic metric-based evaluation, and an automatic LLM-based evaluation. All methods agree on the best model, but they show subtle differences. The automatic metric-based methods correlate with human evaluation on precision and completeness, but not on faithfulness; and some automatic LLM-based methods fail to detect certain errors, such as answers unrelated to the question, or may overlook unsafe answers. In conclusion, automatic methods can reduce the manual evaluation effort; nevertheless, manual evaluation remains essential, especially in sensitive contexts such as those related to mental health.
- English
  Chatbots have great potential in sensitive fields like mental health; however, careful evaluation, by either manual or automatic methods, is a must to ensure the reliability of these systems. In this work, a library for automatically evaluating Spanish Retrieval Augmented Generation (RAG) chatbots using Large Language Models (LLMs) is presented. Then, a thorough analysis is conducted of several candidate LLMs for use in a RAG system that provides suicide prevention information. To that end, we use a manual evaluation, an automatic evaluation based on metrics, and an automatic evaluation based on LLMs. All evaluation methods agree on a preferred model, but they exhibit subtle differences. Automatic methods may overlook unsafe answers; the automatic methods based on metrics are correlated with human evaluation on precision and completeness but not on faithfulness; and some automatic methods based on LLMs do not detect some errors. As a general conclusion, even if automatic methods can reduce manual evaluation efforts, manual evaluation remains essential, particularly in sensitive contexts like those related to mental health.
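
The abstract reports that metric-based scores correlate with human ratings on some criteria but not others. The following is a minimal sketch of how such agreement could be checked for one criterion, assuming Spearman rank correlation; this record does not state which coefficient the authors used, and the function names and scores below are hypothetical illustrations, not data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical per-answer scores for one criterion (not from the paper).
    human_scores = [5, 4, 2, 3, 1, 4]
    metric_scores = [0.91, 0.80, 0.35, 0.60, 0.20, 0.75]
    print(f"Spearman rho: {spearman(human_scores, metric_scores):.3f}")
```

A coefficient near 1 for a criterion would suggest that the automatic metric orders answers much as human evaluators do; a value near 0 would suggest the metric is a poor proxy for that criterion and that manual review is still needed.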