SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

Source: arXiv:2605.26548 · Published 2026-05-26 · By Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang, Chunqiu Steven Xia, Lingming Zhang

TL;DR

SEC-bench Pro addresses gaps in how large language models (LLMs) are evaluated on automated software security tasks by providing a benchmark for real-world, long-horizon bug hunting in complex, critical software projects. Unlike prior benchmarks that rely on fuzzing harnesses or overly informative inputs, SEC-bench Pro targets realistic code auditing: it collects 183 validated vulnerabilities from the JavaScript engines V8 and SpiderMonkey, including high-profile bugs with over $1.5 million in Google VRP awards. The benchmark is built with a three-phase pipeline (vulnerability report collection, environment reconstruction with autonomous coding agents, and oracle-based validation) that produces reproducible, graded tasks. PoC attempts are run against vulnerable, fixed, and latest builds, and an LLM judge assesses correctness beyond simple crash matching. State-of-the-art coding agents (OpenAI GPT-5.4 Codex, Anthropic Opus 4.6 ClaudeCode, and the open-weight Moonshot Kimi-K2.6 baseline) all achieve success rates below 40%, exposing significant challenges in automated long-horizon bug hunting on real-world projects.

Key findings

The SEC-bench Pro dataset includes 183 validated vulnerabilities: 103 from V8 and 80 from SpiderMonkey, spanning classes like use-after-free, type confusion, sandbox bypass, and JIT issues.
The strongest LLM-based agent on V8 (OpenAI GPT-5.4 Codex) achieves 32.0% success (33/103 instances), while ClaudeCode reaches 22.3% and the Moonshot Kimi-K2.6 open-weight baseline reaches 11.7%.
On SpiderMonkey, ClaudeCode verifies 38.8% (31/80) and Codex verifies 23.8% (19/80), with their combined union reaching 48.8% success.
Crash-only grading inflates success counts by up to 43.6% due to unrelated crashes; the three-image LLM-based judge eliminates these false positives by considering fixed and latest behavior.
Failed PoC runs consume more tokens than successful ones; ClaudeCode submits more speculative PoCs at lower yield, while Codex submits fewer, higher-confidence PoCs.
Vulnerability classes like type confusion and use-after-free dominate and remain among the hardest for automated agents.
The benchmark’s automated pipeline enables self-evolving dataset construction as new bugs are disclosed, preserving reproducibility with Docker image triples (vulnerable, fixed, latest).
The agent-driven environment reconstruction automates historically brittle build setups with caching and iterative retries, producing 100% validated reproducible images.

Threat model

The adversary is an automated coding agent tasked with discovering and reproducing vulnerabilities by auditing source code and generating PoCs without privileged internal debugging information, sanitizer traces, or fuzzing harnesses. Agents have access to source revisions, build environments, and public bug reports but must reason about complex, multi-layered engine behaviors to trigger bugs through realistic public interfaces. The model excludes adversaries with direct memory access or manual human intervention.

Methodology — deep read

Threat model and assumptions: As in the threat model above, the adversary is an automated coding agent that audits source code to discover vulnerabilities and generate proof-of-concept (PoC) exploits, with no privileged access to sanitizer reports or internal fuzzing harnesses. The evaluation assumes the agent has the source code and build system but must reason about bug triggers through public interfaces, as a real-world security engineer would.
Data: SEC-bench Pro collects real-world vulnerability reports with linked PoCs and fixes from public issue trackers (Chromium Issue Tracker, Mozilla Bugzilla), curated advisories (MFSA, Pwn2Own, CISA KEV), restricted to high-severity or bounty-qualified bugs. Dataset contains 183 total instances (103 V8, 80 SpiderMonkey). Each instance includes vulnerable source revision, patched source, canonical PoC artifact, crash metadata, and reproducible Docker images representing vulnerable, patched, and latest builds.
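The per-instance contents listed above can be pictured as a small typed record. The following is a minimal sketch, not the benchmark's actual schema: the class names, field names, category labels, and image tags are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Crash categories named in this summary; the string labels are assumed.
CRASH_CATEGORIES = {"sanitizer", "sandbox", "dcheck", "runtime_crash"}
PROJECTS = {"v8", "spidermonkey"}


@dataclass(frozen=True)
class ImageTriple:
    """Docker image tags for the three builds (tag format is hypothetical)."""
    vulnerable: str
    fixed: str
    latest: str


@dataclass(frozen=True)
class BenchInstance:
    """Hypothetical record for one benchmark instance."""
    project: str
    instance_id: str
    vulnerable_rev: str   # source revision that still contains the bug
    fixed_rev: str        # revision with the patch applied
    poc_path: str         # canonical PoC artifact
    crash_category: str   # expected crash metadata, reduced to a category
    images: ImageTriple


def validate(inst: BenchInstance) -> list[str]:
    """Return consistency problems; an empty list means the record looks sane."""
    problems = []
    if inst.project not in PROJECTS:
        problems.append(f"unknown project: {inst.project}")
    if inst.crash_category not in CRASH_CATEGORIES:
        problems.append(f"unknown crash category: {inst.crash_category}")
    if inst.vulnerable_rev == inst.fixed_rev:
        problems.append("vulnerable and fixed revisions are identical")
    tags = {inst.images.vulnerable, inst.images.fixed, inst.images.latest}
    if len(tags) != 3:
        problems.append("image triple must contain three distinct tags")
    return problems
```

A cheap structural check like `validate` would only catch malformed records; whether a PoC actually reproduces still has to be established by running it.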
Architecture/algorithm: The benchmark construction uses an agentic pipeline. First, an ingestion engine normalizes heterogeneous security reports into a uniform schema capturing bug metadata, linked PoC inputs, commits, severity, and patch references. Second, autonomous coding agents run inside Docker containers to reconstruct the historical build environment: they check out pinned revisions, apply instrumented sanitizer/build flags, and iteratively fix build errors until the PoC reproducibly triggers the expected crash signature. The agents use task-specific prompts and are instructed not to fabricate outputs. The outcome is a standardized artifact bundle including compiled binaries, patches, PoCs, and metadata.

Third, two automated oracles validate each candidate instance: (a) the vulnerable-image oracle confirms that the PoC reproducibly triggers the expected crash category (sanitizer violation, sandbox violation, DCHECK failure, or runtime crash) with high confidence, and (b) the fixed-image oracle verifies that the patch stops the PoC from reproducing (blocked or no reproduction). Only instances passing both oracles enter the released dataset.
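The two-oracle acceptance rule can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the outcome labels, the `run_poc` callable, and the repetition count of 5 are all hypothetical, and "reproducibly" is modeled here as every repeated run matching.

```python
from typing import Callable

# Outcome labels are assumptions for this sketch.
CRASH_CATEGORIES = {"sanitizer", "sandbox", "dcheck", "runtime_crash"}
SUPPRESSED = {"blocked", "no_repro"}

# run_poc(image) replays the canonical PoC on "vulnerable" or "fixed" and
# returns one outcome label; how the label is extracted is out of scope here.
RunPoc = Callable[[str], str]


def vulnerable_oracle(run_poc: RunPoc, expected: str,
                      runs: int = 5, min_hits: int = 5) -> bool:
    """Pass if the expected category shows up in at least min_hits of runs."""
    hits = sum(run_poc("vulnerable") == expected for _ in range(runs))
    return hits >= min_hits


def fixed_oracle(run_poc: RunPoc, runs: int = 5) -> bool:
    """Pass if every replay on the fixed image is blocked or does not reproduce."""
    return all(run_poc("fixed") in SUPPRESSED for _ in range(runs))


def accept_instance(run_poc: RunPoc, expected: str) -> bool:
    """An instance is released only if both oracles pass."""
    if expected not in CRASH_CATEGORIES:
        raise ValueError(f"unknown crash category: {expected}")
    return vulnerable_oracle(run_poc, expected) and fixed_oracle(run_poc)
```

Requiring both checks filters out two distinct failure modes: a PoC that crashes for the wrong reason, and a fix commit that does not actually address the reported bug.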

Training regime: N/A for benchmark construction. Evaluated coding agents run unmodified at default settings with fixed compute budgets (5400 seconds per instance, 300-second per-execution timeout, 3 retries). Agents use underlying LLMs including OpenAI GPT-5.4, Anthropic Opus 4.6, and Moonshot Kimi-K2.6.
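The budget figures above can be read as a simple execution policy. The sketch below is one possible interpretation, not the harness used in the paper: it assumes a retry is triggered by a timed-out execution and that no attempt may outlive the per-instance deadline. The function names are hypothetical.

```python
import subprocess
import time
from typing import Callable, Optional

PER_INSTANCE_BUDGET_S = 5400   # total wall-clock budget per instance
PER_EXEC_TIMEOUT_S = 300       # timeout for a single execution
MAX_RETRIES = 3


def run_command(cmd: list[str], timeout_s: float) -> int:
    """Run one command; raises subprocess.TimeoutExpired on timeout."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout_s).returncode


def run_with_budget(execute: Callable[[float], int], deadline: float,
                    timeout_s: float = PER_EXEC_TIMEOUT_S,
                    retries: int = MAX_RETRIES) -> tuple[Optional[int], int]:
    """Call execute(timeout) until it finishes, retries run out, or the
    deadline (a time.monotonic() value) passes.

    Returns (exit code or None, number of attempts made).
    """
    attempts = 0
    while attempts < retries:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        attempts += 1
        try:
            # Never let one execution run past the instance deadline.
            return execute(min(timeout_s, remaining)), attempts
        except subprocess.TimeoutExpired:
            continue
    return None, attempts
```

In use, `deadline` would be set once per instance as `time.monotonic() + PER_INSTANCE_BUDGET_S`, and `execute` would wrap `run_command` with the concrete command for that instance.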
Evaluation protocol: Agents submit generated PoCs for each benchmark instance. Each PoC is replayed against the three Docker images (vulnerable, fixed, latest) with up to 3 retries. The exit codes, stdout, and stderr captured from all three runs form the evidence passed to an LLM-based judge. The judge classifies each outcome as verified (matches the target vulnerability and is mitigated by the patch), unsure (needs manual review), or illegal (off-target crash or failure). An instance counts as solved if at least one of its PoCs is verified, and success rates are reported over instances. Tokens consumed, tool calls, runtime, and pass rates are also recorded. Pattern-based grading baselines (crash-only, exit-code-based) are compared against the triple-image judge to expose false positives and false negatives.
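To see why crash-only grading overcounts, it helps to compare it with a rule that looks at all three images. The sketch below is a deliberately simple rule-based stand-in for the LLM judge, not the judge itself: the `Run` type, the nonzero-exit-code crash test, the substring signature match, and the handling of a crash on the latest build are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Run:
    exit_code: int
    stderr: str


def crashed(run: Run) -> bool:
    # Assumption: any nonzero exit code counts as a crash.
    return run.exit_code != 0


def crash_only_grade(vulnerable: Run) -> bool:
    """Baseline: accept any crash on the vulnerable build."""
    return crashed(vulnerable)


def triple_image_grade(vulnerable: Run, fixed: Run, latest: Run,
                       signature: str) -> str:
    """Rule-based stand-in for the judge: verified, unsure, or illegal."""
    if not crashed(vulnerable) or signature not in vulnerable.stderr:
        return "illegal"   # no crash, or an off-target crash
    if crashed(fixed):
        return "illegal"   # the target patch does not stop this PoC
    if crashed(latest):
        return "unsure"    # assumption: defer to manual review
    return "verified"


def instance_solved(verdicts: list[str]) -> bool:
    """An instance counts as solved if at least one PoC is verified."""
    return "verified" in verdicts
```

A PoC that crashes all three builds with an unrelated error passes `crash_only_grade` but is rejected by `triple_image_grade`, which is the kind of false positive the three-image setup is meant to catch.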
Reproducibility: The benchmark artifacts and automated pipeline standardize environment construction with Docker images, and the code framework is described as pluggable for other projects. The underlying datasets (V8 and SpiderMonkey security reports) are largely public. Exact code releases or weights for the evaluated agents (Codex, ClaudeCode) were not discussed in the provided text.

Concrete example: For one V8 vulnerability, the coding agent receives the raw bug report and PoC. It attempts to build the exact instrumented binary for the pinned revision inside Docker, iteratively adjusting build options until the PoC triggers the expected ASAN crash (validated by matching stderr crash signatures). The agent packages the patched binary with the fix commit and verifies the PoC no longer crashes. During evaluation, generated PoCs are run on vulnerable, patched, and latest images; their combined outputs go to the LLM judge that confirms exploitation correctness. This full end-to-end pipeline enables rigorous, reproducible measurement of agent success against complex, real-world vulnerabilities.
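Matching stderr against an expected ASAN crash signature, as in the example above, could look roughly like this. It is a sketch that relies on the standard AddressSanitizer report header (`==PID==ERROR: AddressSanitizer: <bug-type> ...`); the function names are hypothetical, and the paper's actual matching logic is not described in this summary.

```python
import re
from typing import Optional

# Header line of an AddressSanitizer report, e.g.
# "==12345==ERROR: AddressSanitizer: heap-use-after-free on address ..."
ASAN_HEADER = re.compile(r"==\d+==ERROR: AddressSanitizer: ([\w-]+)")


def asan_bug_type(stderr: str) -> Optional[str]:
    """Return the bug type from the first ASAN header in stderr, if any."""
    match = ASAN_HEADER.search(stderr)
    return match.group(1) if match else None


def matches_expected(stderr: str, expected_type: str) -> bool:
    """True only if an ASAN report is present and names the expected bug type."""
    return asan_bug_type(stderr) == expected_type
```

Comparing the bug type rather than just checking for a nonzero exit code is what lets a harness tell the target crash apart from, say, an unrelated stack overflow.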

Technical innovations

A self-evolving, project-parameterized pipeline for automatically ingesting, reconstructing, and validating real-world vulnerability instances with linked PoCs and patches.
Automated environment reconstruction delegated to coding agents that reproduce historical builds and validate PoC triggering dynamically inside Docker containers.
A three-image execution grading framework combining vulnerable, patched, and latest binaries to differentiate on-target vulnerabilities from unrelated crashes or infrastructure failures.
An LLM-based judge prompt designed to robustly classify PoC correctness and patch mitigation by analyzing multi-image execution traces and metadata, outperforming simpler crash-based heuristics.

Datasets

SEC-bench Pro V8 subset — 103 validated vulnerabilities — sourced from Chromium Issue Tracker, supplemented with Google VRP bounty reports
SEC-bench Pro SpiderMonkey subset — 80 validated vulnerabilities — sourced from Mozilla Bugzilla, MFSA advisories, Pwn2Own, and CISA KEV

Baselines vs proposed

Moonshot Kimi-K2.6 open-weight baseline on V8: success = 11.7% vs best agent (Codex GPT-5.4) 32.0%
Codex GPT-5.4 on V8: success = 32.0% vs ClaudeCode Opus 4.6 22.3%
ClaudeCode Opus 4.6 on SpiderMonkey: success = 38.8% vs Codex GPT-5.4 23.8%
Union of Codex + ClaudeCode on V8: success = 37.9% vs individual max 32.0%
Union of Codex + ClaudeCode on SpiderMonkey: success = 48.8% vs individual max 38.8%
Crash-only grading inflates successes by up to 43.6%; e.g. Codex on V8 has 45 crash-only successes vs 33 judged successes
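The percentages above can be cross-checked with a little arithmetic. In the sketch below, the counts 33/103, 31/80, 19/80 and the 45 vs 33 example come from this summary; the remaining counts are inferred by inverting the reported one-decimal percentages, which is an assumption about how they were rounded. The overlap between agents follows from inclusion-exclusion.

```python
def pct(count: int, total: int) -> float:
    """Percentage of total, unrounded."""
    return 100 * count / total


def count_from_pct(percent: float, total: int) -> int:
    """Integer count implied by a percentage reported to one decimal."""
    return round(percent * total / 100)


V8_TOTAL, SM_TOTAL = 103, 80

# Counts stated in the summary.
codex_v8, claude_sm, codex_sm = 33, 31, 19

# Counts inferred from reported percentages (not stated directly).
claude_v8 = count_from_pct(22.3, V8_TOTAL)
union_v8 = count_from_pct(37.9, V8_TOTAL)
union_sm = count_from_pct(48.8, SM_TOTAL)

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap_v8 = codex_v8 + claude_v8 - union_v8
overlap_sm = codex_sm + claude_sm - union_sm

# Crash-only vs judged successes for Codex on V8 (45 vs 33).
extra = 45 - 33
inflation_vs_judged = pct(extra, 33)    # excess relative to judged successes
false_positive_share = pct(extra, 45)   # share of crash-only successes rejected
```

Under these inferred counts, both agents solve 17 of the same V8 instances and 11 of the same SpiderMonkey instances, so each agent also finds bugs the other misses. The 45 vs 33 example works out to roughly 36% over the judged count, which is consistent with 43.6% being an upper bound reached in some other setting.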

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.26548.

Fig 1 (page 1): SEC-bench Pro overview. The construction pipeline collects disclosed security

Figs 3-8 (page 5).

Limitations

The current best LLM agents verify fewer than 40% of challenging, real-world JS engine vulnerabilities, indicating substantial room for improvement.
Evaluations limited to two JavaScript engines; applicability to other system software or languages is plausible but untested here.
LLM-as-judge classification, while superior to simple heuristics, relies on API calls to proprietary models, possibly limiting full reproducibility.
The open-weight baseline is evaluated only on V8 due to resource constraints, so there are no corresponding SpiderMonkey results.
No detailed adversarial or robustness testing under adaptive or obfuscated bug conditions reported.
Reconstruction relies on existing public reports and patched commits; unknown or zero-day vulnerabilities without PoCs are outside current scope.

Open questions / follow-ons

How can LLM agents improve reasoning about deep semantic execution features like JIT tier states and garbage collection timing to increase bug discovery success?
What architectural or training innovations in LLM scaffolds lead to better balance between speculative PoC generation and high-confidence exploitation?
Can the SEC-bench Pro pipeline and LLM judge methodology be generalized to other complex software domains beyond JavaScript engines, like OS kernels or database engines?
What role can hybrid human-machine workflows play in complementing LLM-based agents for long-horizon bug hunting, especially in triaging uncertain PoCs?

Why it matters for bot defense

This paper is relevant to bot-defense and CAPTCHA practitioners interested in automated adversarial code understanding and exploitation detection. The difficulty LLM agents have with long-horizon bug hunting shows that even leading models struggle with complex logic and multi-step exploit synthesis, which parallels the difficulty bots face with multi-factor, time-extended interaction challenges such as advanced CAPTCHAs. The SEC-bench Pro approach to multi-image environment validation and LLM-based judgement suggests a way to combine dynamic response environments with AI-based oracles to distinguish valid from spurious adversarial inputs. The relatively low success rates also imply that sophisticated systems remain largely out of reach for AI-driven automated attacks, which provides some security margin. Practitioners can adapt the benchmark's principles (reproducible environments, multi-modal validation, and oracle-driven scoring) to evaluate CAPTCHAs or interaction puzzles against evolving AI capabilities, moving defense assessments beyond simple pass/fail metrics.

Cite

@article{arxiv2605_26548,
  title={SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?},
  author={Hwiwon Lee and Jiawei Liu and Dongjun Kim and Ziqi Zhang and Chunqiu Steven Xia and Lingming Zhang},
  journal={arXiv preprint arXiv:2605.26548},
  year={2026},
  url={https://arxiv.org/abs/2605.26548}
}
