NoticIA: A Clickbait Article Summarization Dataset in Spanish

Iker García Ferrero; Begoña Altuna

Authors: Iker García Ferrero, Begoña Altuna
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 73, 2024, pp. 191-207
Language: English
Parallel titles:
- NoticIA: Un Dataset para el Resumen de Artículos Clickbait en Español

Dialnet Metrics: 3 citations

Abstract
  We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative summarizations written by humans. This task demands advanced text understanding and summarization abilities, challenging the models’ capacity to infer and connect diverse pieces of information to meet the user’s informational needs generated by the clickbait headline. We evaluate the Spanish text comprehension capabilities of a wide range of state-of-the-art large language models. Additionally, we use the dataset to train ClickbaitFighter, a task-specific model that achieves near-human performance in this task.