LLM for Untargeted Adversarial Attack Against Language Models in Spanish

Adrián Moreno Muñoz; Luis Alfonso Ureña López; Eugenio Martínez Cámara

Authors: Adrián Moreno Muñoz, Luis Alfonso Ureña López, Eugenio Martínez Cámara
Source: Procesamiento del Lenguaje Natural, ISSN 1135-5948, No. 75, 2025 (issue devoted to: Procesamiento del Lenguaje Natural, Journal No. 75, September 2025), pp. 317-336
Language: English
Parallel titles:
- Ataque de Adversario sin Objetivo Específico Basado en LLM Contra Modelos de Lenguaje en Español (Spanish)
Abstract
  Language models have inherent security vulnerabilities: even subtle modifications to the input can manipulate their outputs, which is a significant concern. This research explores untargeted adversarial attacks against Spanish language models using a two-stage approach: (1) identifying the words that most influence the model's decision and (2) replacing them with appropriate synonyms. Tests on several datasets against pre-trained Spanish language models show that generative models, guided by salient words selected with explainable AI (XAI) methods, can significantly alter the predictions of these models.
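The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the victim model is a toy lexicon scorer (`toy_score`, `toy_predict`), word influence is estimated by leave-one-out occlusion as a simple stand-in for an XAI attribution method, and a hand-written `SYNONYMS` table stands in for the synonyms a generative model would propose.

```python
from typing import Callable, Dict, List

# Hypothetical victim model: a tiny lexicon-based sentiment scorer.
POSITIVE = {"excelente", "bueno"}
NEGATIVE = {"horrible", "malo"}


def toy_score(tokens: List[str]) -> float:
    """Positive-class score: count of positive words minus negative words."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return float(pos - neg)


def toy_predict(tokens: List[str]) -> int:
    """Label 1 (positive) if the score is above zero, else 0."""
    return 1 if toy_score(tokens) > 0 else 0


def rank_by_occlusion(tokens: List[str],
                      score: Callable[[List[str]], float]) -> List[int]:
    """Stage 1 (assumed stand-in for XAI): rank token positions by how much
    deleting each token changes the model score, most influential first."""
    base = score(tokens)
    impact = [abs(base - score(tokens[:i] + tokens[i + 1:]))
              for i in range(len(tokens))]
    return sorted(range(len(tokens)), key=lambda i: impact[i], reverse=True)


# Hypothetical synonym table standing in for generative-model suggestions.
SYNONYMS: Dict[str, List[str]] = {
    "excelente": ["estupendo"],
    "bueno": ["agradable"],
}


def untargeted_attack(tokens: List[str],
                      score: Callable[[List[str]], float],
                      predict: Callable[[List[str]], int],
                      synonyms: Dict[str, List[str]],
                      max_swaps: int = 2) -> List[str]:
    """Stage 2: replace the most influential words with synonyms until the
    predicted label changes (any change counts: the attack is untargeted)."""
    original_label = predict(tokens)
    adversarial = list(tokens)
    swaps = 0
    for i in rank_by_occlusion(tokens, score):
        if swaps == max_swaps:
            break
        candidates = synonyms.get(adversarial[i], [])
        if not candidates:
            continue  # no substitute available for this word
        adversarial[i] = candidates[0]
        swaps += 1
        if predict(adversarial) != original_label:
            break  # prediction flipped, stop editing
    return adversarial


sentence = "la película es excelente".split()
adversarial = untargeted_attack(sentence, toy_score, toy_predict, SYNONYMS)
print(toy_predict(sentence), toy_predict(adversarial), " ".join(adversarial))
```

The toy scorer flips only because it does not know the substituted word; a real evaluation would use a pre-trained classifier, a proper attribution method, and a check that each substitution preserves the meaning of the sentence.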