Logroño, Spain
Large Language Models (LLMs) have accelerated code generation, yet their security implications remain underexplored. This work evaluates the ability of eight general-purpose and code-specialized LLMs to detect weaknesses and vulnerabilities in generated code using CVE and CWE references. The models are assessed on a stratified sample of the CVEfixes dataset spanning five programming languages, using precision, recall, F1-score, and relevance-oriented metrics (REINP, REINR, and REINF1). Results show limited detection of specific vulnerabilities (CVE) but better performance for general weaknesses (CWE), especially in Python and Ruby. We discuss limitations in model training and prompt conditioning, and outline improvements through dataset diversification, prompt engineering, and hybrid human-in-the-loop approaches. The study highlights both the current potential and the limitations of LLMs for practical vulnerability detection in generated code.
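As a rough illustration of the standard metrics named above, the sketch below computes set-based precision, recall, and F1 for a single code sample by comparing the CWE identifiers a model reports against the reference labels. The function name, the per-sample set-based formulation, and the example labels are assumptions made for illustration and are not taken from the study; the relevance-oriented metrics (REINP, REINR, REINF1) are not defined in this abstract and are therefore not covered.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one code sample.

    Hypothetical helper: the study's exact scoring procedure may differ.
    """
    # Labels that the model reported and that also appear in the reference.
    true_positives = len(predicted & reference)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only, not data from the paper.
reference = {"CWE-79", "CWE-89"}   # weaknesses recorded for the sample
predicted = {"CWE-89", "CWE-20"}   # weaknesses reported by a model
p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here one of the two predicted labels is correct and one of the two reference labels is found, so all three scores come out to 0.50. The same function applies unchanged to CVE identifiers, where exact matches are harder to obtain than for the broader CWE categories.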