Structural Prompt Variants and Their Impact on LLM-Based Vulnerability Detection Accuracy

Keywords

prompt structure
vulnerability detection
prompt ambiguity
LLM reasoning
CWE classification

Abstract

The structure of prompts used in vulnerability detection tasks significantly affects the reasoning process of large language models. This study systematically evaluates how formatting, decomposition, and semantic density influence detection accuracy across five major vulnerability categories (CWE-20, CWE-79, CWE-89, CWE-120, CWE-798). We construct a benchmark of 12,000 prompt–code pairs and test three LLMs: GPT-4, Claude-3, and Llama-3-70B. Results show that structured prompts with explicit reasoning stages improve true-positive rates by 24–41%, while overly verbose prompts increase hallucination-induced false alarms by 15%. Prompt–model mismatch analysis reveals that models differ in their tolerance of prompt ambiguity. These findings highlight the necessity of prompt normalization for trustworthy LLM-based security auditing.
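
To make the contrast concrete, the sketch below illustrates what a "structured prompt with explicit reasoning stages" might look like next to a free-form baseline. The stage wording, the function names (build_structured_prompt, build_freeform_prompt), and the template text are hypothetical illustrations, not the paper's actual benchmark templates.

```python
# Hypothetical sketch: a staged vulnerability-detection prompt vs. a
# free-form baseline. Names and wording are illustrative assumptions;
# the study's real templates are not reproduced here.

TARGET_CWES = ["CWE-20", "CWE-79", "CWE-89", "CWE-120", "CWE-798"]

REASONING_STAGES = [
    "1. Identify all external inputs and the sinks they reach.",
    "2. Check each input path for validation or sanitization.",
    "3. Map any unsanitized path to the most specific CWE from the list.",
    "4. Give a verdict: vulnerable (with CWE ID and line) or not vulnerable.",
]

def build_structured_prompt(code: str) -> str:
    """Staged template: the model is walked through explicit reasoning steps."""
    stages = "\n".join(REASONING_STAGES)
    return (
        "You are a security auditor. Analyze the code below for these CWEs: "
        + ", ".join(TARGET_CWES) + ".\n"
        "Work through each stage in order and show your reasoning for each:\n"
        f"{stages}\n\n"
        f"```\n{code}\n```"
    )

def build_freeform_prompt(code: str) -> str:
    """Unstructured baseline: a single open-ended question."""
    return f"Is the following code vulnerable? Explain.\n\n```\n{code}\n```"

if __name__ == "__main__":
    sample = "query = \"SELECT * FROM users WHERE name = '\" + user_input + \"'\""
    print(build_structured_prompt(sample))
```

Decomposing the task this way constrains where the model can skip steps, which is one plausible mechanism for the true-positive gains the abstract reports.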


This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2026 Mariana Torres, Javier Martínez, Alejandro Ruiz