Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Source: arXiv:2605.23243 · Published 2026-05-22 · By Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

TL;DR

This paper evaluates how ready leading general-purpose large language models (LLMs) are for cybersecurity work, using a dual-mode benchmark that combines white-box function-level vulnerability detection with black-box web application security testing on realistic, production-style targets. Six frontier LLMs (including GPT-5.4, Codex 5.3, and Claude Opus 4.6) and two domain-specialized vertical models are tested under four paradigms ranging from direct prompting to deterministic-hybrid architectures. Frontier models show high false positive rates (10-50%) in white-box detection and low ground-truth coverage (4-8%) in black-box testing, and external tools improve the latter only marginally. Encoding structured penetration-testing methodology into domain-specialized agents yields over 50% per-family detection, which the authors take as evidence that methodology and training data quality, rather than raw model scale, are the primary levers for cybersecurity performance. A vertical Defense-LLM specialized for blue-team use achieves the best precision (0.904) and lowest false positive rate (9.7%) while running on a single GPU. The authors identify fundamental training data bottlenecks (the absence of end-to-end security testing traces, failure sequences, and multi-step attack chains) and propose self-play security testing as a data generation strategy to overcome them. Overall, the study argues for vertical foundation models purpose-built for cybersecurity workloads and professional threat hunting.

Key findings

Frontier models produce 10-50% false positive rates in white-box vulnerability detection on VulnLLM-R benchmark, making them impractical for production triage (Table 2).
In black-box web app testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% when augmented with external tools like Playwright MCP and Burp Suite MCP (Table 4).
Methodology-guided agents encoding structured penetration-testing workflows raise per-family vulnerability detection to over 50%, a 4x improvement over tool-augmented frontier models (Table 4).
An Agentic Reasoning Graph (ARG) pipeline with deterministic confirmation reduces false positives to under 20% while maintaining 30.2% overall coverage across five benchmark applications (Table 4).
The domain-specialized Defense-LLM achieves highest overall F1 (0.873), precision (0.904), and lowest FPR (9.7%) on VulnLLM-R white-box benchmark, surpassing all frontier models including Codex 5.3 (F1=0.833) (Table 2).
Defense-LLM uses 18.2x fewer tokens than Sonnet 4.6 and achieves 2.3s mean latency on a single B200 GPU, 10x faster than Codex and 30x faster than Claude (Table 3).
Frontier LLMs systematically over-predict vulnerabilities, flagging 60-69% of samples as vulnerable regardless of ground truth, which is consistent with the confirmation bias documented by He et al. (2026).
Universal blind spots exist across all models, e.g. confusion between CWE-327 and CWE-798 (weak cryptography vs hardcoded credentials), indicating persistent failure modes requiring targeted data.

Threat model

The adversary is either a black-box tester with authorized user credentials who tries to discover and exploit vulnerabilities in web applications without source code access, or a white-box analyst with full source code access tasked with identifying vulnerable functions. Out of scope are zero-day discovery on uncontrolled systems, network-layer attacks, social engineering, and physical attacks. Testing is conducted in authorized, instrumented, controlled environments with ground-truth verification. The attacker aims to find exploitable security flaws within realistic authorization and business logic constraints.

Methodology — deep read

Threat Model & Assumptions: The authors consider two attacker models: a black-box adversary with valid user accounts attacking web applications without source code visibility, targeting vulnerabilities such as IDOR, SQL injection, auth bypass, and XSS across 20+ CWE families; and a white-box analyst with full source code access performing per-function binary vulnerable/benign classification. They exclude zero-day discovery in uncontrolled systems, network-layer attacks, social engineering, and physical attacks. Benchmarks run in controlled, deterministic environments.
Data: The authors design a dual-mode benchmark that includes five realistic Python-based, JWT-authenticated web applications with 118 verified ground-truth vulnerabilities spanning 20+ CWE families, open-sourced for community use. White-box vulnerability detection is evaluated on the VulnLLM-R benchmark, which comprises C, Java, and Python functions drawn from the UCSB-SURFI dataset. The benchmark provides severity labels mapped to CVSS v3.1 for consistency.
Architectures & Algorithms: Eight LLMs are evaluated: six frontier general-purpose models, namely GPT-5.4 (OpenAI), Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash (Google), plus two domain-specialized vertical models, Defense-LLM (blue team, high precision, conservative) and Attack-LLM (red team, offensive capability). The Defense and Attack models are LoRA-tuned variants of Qwen3-Next-80B-A3B-Instruct trained on generated attack and defense chains. Four black-box testing paradigms are compared:

P1: Direct prompting using native LLM capabilities (HTTP, bash, file IO).
P2: Tool-augmented with external security tools (Playwright MCP, Burp Suite MCP).
P3: Methodology-guided agents encoding professional penetration-testing workflows with multi-session management, multi-signal confirmation, endpoint enumeration, using external tools.
P4: Agentic Reasoning Graph (ARG) pipeline combining LLM-driven reconnaissance and payload generation with deterministic logic for vulnerability confirmation, eliminating hallucinations and false positives.

Training Regime: The Defense and Attack models are trained with supervised LoRA fine-tuning on approximately 7,000 model-generated attack and defense chains, which yield around 20,000 offensive and 20,000 defensive samples. Curriculum learning is used: 3 epochs on the full dataset, followed by 2 epochs on high-effectiveness samples (detailed in Section F of the paper). Hyperparameters and hardware specifics are not fully detailed, but the models run on single-GPU setups.
Evaluation Protocol: White-box detection is evaluated as binary vulnerable/benign classification per function on VulnLLM-R, with exact CWE matching required for a true positive. Temperature is zero for all models except Claude, which runs at 1 for reasoning mode. Metrics include precision, recall, F1, MCC, false positive rate, non-actionable rate, duplicate rate, and exploitability rate. Black-box results are measured as ground-truth vulnerability coverage and false positive reliability across the 5 benchmark apps. Multiple LLM backends are evaluated with the methodology-guided and ARG pipelines. Ablations isolate the impact of external tools (P2 vs P1), methodology (P3 vs P2), and deterministic confirmation (P4 vs P3).
Reproducibility: The benchmark applications and black-box targets are planned for open-source release. The Defense and Attack model training data are generated and described, but the availability of code and weights is unclear. The VulnLLM-R benchmark is public. The details are sufficient for partial replication, but the full pipeline depends on proprietary model variants and datasets.
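The white-box metrics named in the evaluation protocol can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the function names and the toy samples are hypothetical, and it assumes that a vulnerable sample flagged with the wrong CWE counts as a false negative, which is one plausible reading of "exact CWE matching" and may differ from the paper's harness.

```python
from math import sqrt
from typing import Dict, List, Optional


def confusion_counts(labels: List[Optional[str]],
                     preds: List[Optional[str]]) -> Dict[str, int]:
    """Tally TP/FP/TN/FN; None means benign, a string is a CWE id."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, pred in zip(labels, preds):
        if label is None:
            counts["tn" if pred is None else "fp"] += 1
        elif pred == label:
            counts["tp"] += 1  # exact CWE match required
        else:
            # ASSUMPTION: a miss or a wrong-CWE flag is a false negative
            counts["fn"] += 1
    return counts


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1, false positive rate, and MCC."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "fpr": fpr, "mcc": mcc}


# Toy data (hypothetical): one benign function is flagged, and one
# CWE-798 sample is mislabelled as CWE-327.
labels = ["CWE-89", "CWE-79", None, None, "CWE-798", None]
preds = ["CWE-89", "CWE-79", None, "CWE-22", "CWE-327", None]

counts = confusion_counts(labels, preds)
result = metrics(**counts)
print(counts, result)
```

On the toy data this gives 2 true positives, 1 false positive, 2 true negatives, and 1 false negative. The CWE-327 versus CWE-798 mix-up is scored as a miss under the stated assumption.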

Concrete example: For black-box testing on the TeamLedger app (21 vulnerabilities), P1 frontier models detected 1 to 3 vulnerabilities, with 4-8% coverage reported; P2 with external tools improved this to 6 vulnerabilities detected; the P3 methodology-guided attack model detected 15 vulnerabilities (>50% coverage) using multi-signal confirmation and session management; and the P4 ARG pipeline detected 12 vulnerabilities, with deterministic confirmation sharply reducing false positives. This illustrates the incremental gains from moving beyond raw LLM scale to structured methodology and then to architectural changes that control confirmation bias and hallucinations.
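Ground-truth coverage for a single app can be computed as in the sketch below. The finding identifiers, the `coverage_report` helper, and the definition of duplicate rate (repeated reports divided by total reports) are assumptions made for illustration; only the 21-vulnerability total and the 15 detections come from the example above.

```python
from typing import Dict, List, Set


def coverage_report(ground_truth: Set[str],
                    reported: List[str]) -> Dict[str, float]:
    """Compare an agent's reported finding ids against ground truth."""
    unique = set(reported)
    detected = unique & ground_truth
    return {
        "detected": len(detected),
        "coverage": len(detected) / len(ground_truth),
        "false_positives": len(unique - ground_truth),
        # ASSUMPTION: duplicate rate = repeated reports / all reports
        "duplicate_rate": ((len(reported) - len(unique)) / len(reported)
                           if reported else 0.0),
    }


# 21 ground-truth vulnerabilities, as in the TeamLedger example.
ground_truth = {f"V{i:02d}" for i in range(1, 22)}

# Hypothetical run: 15 real findings, one duplicate, two spurious ids.
reported = [f"V{i:02d}" for i in range(1, 16)] + ["V03", "X-1", "X-2"]

report = coverage_report(ground_truth, reported)
print(report)
```

With 15 of 21 vulnerabilities found, coverage is about 71%, which is consistent with the ">50%" figure quoted for P3 on this app.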

Technical innovations

A dual-mode vulnerability benchmark combining white-box function-level detection (VulnLLM-R) and large-scale black-box web application testing across five realistic production-style apps with 118 ground-truth vulnerabilities and 20+ CWE families.
Development of domain-specialized vertical foundation models (Defense-LLM and Attack-LLM) trained on synthetically generated attack and defense chains encoding professional security workflows.
Methodology-guided agent scaffolds encoding structured penetration-testing procedures with multi-session state, multi-signal confirmation, and family-specific workflows that raise per-family detection above 50%.
The Agentic Reasoning Graph (ARG) architecture that separates creative LLM-driven reconnaissance and payload generation from deterministic, programmatic confirmation nodes to eliminate hallucinated false positives.
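The separation between LLM-proposed candidates and deterministic confirmation can be illustrated with a small sketch. Everything here is hypothetical: the `Candidate` and `Response` types, the evidence keys, and the two confirmation rules are stand-ins chosen for illustration, since this summary does not describe the paper's actual ARG nodes. The idea shown is only that a finding is reported when a programmatic check passes, and dropped otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Response:
    status: int
    body: str


@dataclass
class Candidate:
    """A finding proposed by the LLM, plus the HTTP evidence collected."""
    family: str
    endpoint: str
    evidence: Dict[str, Response] = field(default_factory=dict)


def confirm_idor(c: Candidate) -> bool:
    """Confirmed only if a second user receives the owner's exact data."""
    owner = c.evidence.get("owner")
    other = c.evidence.get("other_user")
    if owner is None or other is None:
        return False
    return (owner.status == 200 and other.status == 200
            and other.body == owner.body)


def confirm_boolean_sqli(c: Candidate) -> bool:
    """Confirmed only if true/false payloads produce a differential."""
    base = c.evidence.get("baseline")
    true_r = c.evidence.get("true_condition")
    false_r = c.evidence.get("false_condition")
    if base is None or true_r is None or false_r is None:
        return False
    return true_r.body == base.body and false_r.body != base.body


# Hypothetical registry: one deterministic check per vulnerability family.
CONFIRMERS: Dict[str, Callable[[Candidate], bool]] = {
    "IDOR": confirm_idor,
    "SQLI_BOOLEAN": confirm_boolean_sqli,
}


def confirm_all(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates whose family has a check and whose check passes."""
    confirmed = []
    for c in candidates:
        check = CONFIRMERS.get(c.family)
        if check is not None and check(c):
            confirmed.append(c)
    return confirmed


invoice = Response(200, '{"id": 42, "owner": "alice"}')
candidates = [
    Candidate("IDOR", "/api/invoices/42",
              {"owner": invoice, "other_user": invoice}),
    Candidate("IDOR", "/api/invoices/43",
              {"owner": invoice,
               "other_user": Response(403, '{"error": "forbidden"}')}),
    Candidate("XSS", "/search"),  # no confirmer registered: never reported
]

confirmed = confirm_all(candidates)
print([c.endpoint for c in confirmed])
```

Only the first candidate survives: the second is rejected because the other user receives a 403, and the third is dropped because no deterministic check exists for its family. Failing closed in this way trades some coverage for fewer false positives, which matches the direction of the P3 to P4 change described in this summary.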

Datasets

VulnLLM-R — function-level vulnerability detection benchmark over C, Java, Python — UCSB-SURFI open-source
Five production-style web applications — 118 ground-truth vulnerabilities across 20+ CWE families — designed and open-sourced by authors

Baselines vs proposed

Codex 5.3 white-box detection: F1 = 0.833 vs Defense-LLM proposed: F1 = 0.873
Sonnet 4.6 white-box detection: recall = 0.923, FPR = 43.1% vs Defense-LLM: recall = 0.846, FPR = 9.7%
P1 Direct prompting black-box coverage: 4-8% ground-truth vs P3 Methodology-guided agents: >50% per-family detection
P2 Tool-augmented black-box coverage 10-19% vs P4 ARG pipeline coverage 30.2% overall
Gemini 3.1 Pro black-box with P3: recall 40.7%, precision 51.6% vs Claude Opus 4.6 with P3: recall 80.5%, precision 84.8%
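The precision and recall pairs above can be combined into F1 scores with the usual harmonic mean, as a quick consistency check. This is plain arithmetic on the figures quoted in this summary: the P3 F1 values it produces are derived here, not reported by the paper, and the dictionary keys are just descriptive labels.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in this summary.
quoted = {
    "Defense-LLM (white-box)": (0.904, 0.846),
    "Claude Opus 4.6 with P3": (0.848, 0.805),
    "Gemini 3.1 Pro with P3": (0.516, 0.407),
}

derived = {name: round(f1_score(p, r), 3) for name, (p, r) in quoted.items()}
for name, value in derived.items():
    print(f"{name}: F1 = {value}")
```

The derived Defense-LLM value is 0.874, close to the reported F1 of 0.873; the difference in the third decimal may reflect rounding or truncation in the reported figures. The derived P3 values are 0.826 for Claude Opus 4.6 and 0.455 for Gemini 3.1 Pro.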

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.23243.

Fig 1: Four black-box security testing paradigms evaluated in this work. P1 and P2 use frontier models with

Limitations

Frontier models sometimes refuse security tasks because of consumer-oriented guardrails, which limits reliable offensive capability (e.g., GPT-5.4 refused in 2-3 of 5 runs).
Training data lacks critical end-to-end security testing traces, failure-heavy sequences, and multi-step attack chains, constraining model performance.
The benchmark uses locally run, deterministic applications; results may not generalize to large-scale or constantly changing real-world deployments.
No evaluation of adversarial evasion or active targeted attacks on models' detection capabilities was performed.
Limited transparency on fine-tuning details and hardware resources hinders reproducibility of domain-specialized models.
White-box benchmarks rely on static labeling; practical exploit generation or patch validation beyond proof-of-concept is limited.

Open questions / follow-ons

How can self-play data generation strategies effectively scale to diverse security contexts and more complex multi-step attack scenarios?
What are the limits of vertical specialization versus multitask generalization for LLMs applied to cybersecurity?
Can adversarial training or few-shot in-the-loop reinforcement improve frontier models' reliability and reduce false positives in the absence of vertical fine-tuning?
How do detection and offensive reasoning capabilities transfer across domains with different programming languages, protocols, or deployment environments?

Why it matters for bot defense

This work provides rigorous, data-driven evidence that general-purpose large language models—as of mid-2026—are structurally unprepared for precision-critical cybersecurity tasks relevant to bot defense and automated vulnerability detection. High false positive rates and low coverage indicate that deploying such models directly for vulnerability scanning or penetration testing in production will cause excessive noise and missed findings. The significant gains observed when encoding professional penetration testing methodology and applying vertical foundation models suggest that effective bot-defense systems integrating AI need to incorporate domain-specialized models and controlled confirmation logic rather than relying on raw frontier LLM scale. Moreover, the identification of training data gaps (e.g., lack of multi-step attack sequences and security testing traces) signals important directions for future data collection and model fine-tuning in automated security tooling pipelines. Engineers developing CAPTCHAs or bot-detection workflows should be cautious about over-reliance on frontier LLMs' native safety judgments alone and consider architecting pipeline modules that embed security domain knowledge and deterministic verification to reduce false alarms and improve trustworthiness.

Cite

bibtex

@article{arxiv2605_23243,
  title={Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks},
  author={Vivek Dahiya and Sunny Nehra and Vipul Dholariya and Bhavik Shangari and Chandra Khatri},
  journal={arXiv preprint arXiv:2605.23243},
  year={2026},
  url={https://arxiv.org/abs/2605.23243}
}
