Analysis of the Precision of Large Language Models in the Identification of Security Vulnerabilities and Weaknesses in Generated Code

Federico Muñoz Babiano; Paula Lamo Anuarbe; Ricardo S. Alonso Rincón

Muñoz-Babiano, Federico [1]; Lamo, Paula [1]; Alonso, Ricardo S. [1]
[1] Universidad de La Rioja, Logroño, Spain
Published in: ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, e-ISSN 2255-2863, Vol. 14, No. 1, 2025
Language: English
DOI: 10.14201/adcaij.32926
Abstract
- Large Language Models (LLMs) have accelerated code generation, yet their security implications remain underexplored. This work evaluates the ability of eight general-purpose and code-specialized LLMs to detect weaknesses and vulnerabilities in generated code, using CVE and CWE references as ground truth. The models are assessed on a stratified sample of the CVEfixes dataset spanning five programming languages, using precision, recall, F1-score, and relevance-oriented metrics (REINP, REINR, and REINF1). Results show limited detection of specific vulnerabilities (CVE) but better performance on general weaknesses (CWE), especially in Python and Ruby. We discuss limitations in model training and prompt conditioning, and outline improvements through dataset diversification, prompt engineering, and hybrid human-in-the-loop approaches. The study highlights both the current potential and the limitations of LLMs for practical vulnerability detection in generated code.
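
As an illustration of the standard metrics named in the abstract, the following minimal Python sketch computes micro-averaged precision, recall, and F1-score when each code sample has a set of ground-truth CWE identifiers and a set of identifiers reported by a model. The function name `micro_prf` and the set-based data layout are assumptions made for this example; the paper's actual evaluation pipeline is not described in this record, and the relevance-oriented metrics (REINP, REINR, REINF1) are not defined here, so they are not sketched.

```python
from typing import Iterable, Set, Tuple

Sample = Tuple[Set[str], Set[str]]  # (identifiers predicted by the model, ground-truth identifiers)


def micro_prf(samples: Iterable[Sample]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-sample identifier sets.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    true_pos = 0    # identifiers reported by the model that are in the ground truth
    n_predicted = 0  # all identifiers reported by the model
    n_actual = 0     # all ground-truth identifiers
    for predicted, actual in samples:
        true_pos += len(predicted & actual)
        n_predicted += len(predicted)
        n_actual += len(actual)

    # Guard against division by zero when nothing was predicted or labelled.
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data invented for the example; not taken from CVEfixes.
    toy = [
        ({"CWE-79", "CWE-89"}, {"CWE-79"}),
        ({"CWE-22"}, {"CWE-22", "CWE-78"}),
        (set(), {"CWE-20"}),
    ]
    p, r, f1 = micro_prf(toy)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The same function applies unchanged to CVE identifiers, which makes it straightforward to compare detection of specific vulnerabilities against detection of general weaknesses on the same samples.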