Benchmarking Mythos-Linked Bug Rediscovery

Source: arXiv:2605.17416 · Published 2026-05-17 · By Isaac David, Arthur Gervais

TL;DR

This paper presents a controlled, reproducible benchmark that evaluates state-of-the-art language models on rediscovering six publicly disclosed Mythos-linked system-level security bugs, given the vulnerable source file. The key novelty is isolating the intra-file rediscovery subproblem: each model receives the exact target file(s), stripped of any patch, advisory, or vulnerability metadata, and must reason about low-level C code invariants to identify the specific patched bug. Over 54 counted attempts (3 models, 6 tasks, 3 repeats), GPT-5.5 xhigh rediscovered 5/18 targets covering 3/6 distinct bugs, Claude Opus 4.7 rediscovered one bug once, and Kimi K2 rediscovered none. The dominant failure mode is premature commitment to plausible but incorrect same-file bug hypotheses, showing that the difficulty of vulnerability discovery persists in local invariant synthesis and candidate prioritization even after repository and file localization. These results do not invalidate Anthropic's Mythos claims; they clarify how hard precise bug rediscovery is with fixed source access and highlight the gap between source-grounded hypothesis generation and accurate bug matching in automated systems auditing.

Key findings

GPT-5.5 xhigh rediscovered 5/18 target vulnerabilities across 2/6 tasks and 3/6 distinct core bugs.
Claude Opus 4.7 rediscovered 1/18 targets covering 1/6 tasks; Kimi K2 rediscovered none in 18 attempts.
The FreeBSD RPCSEC_GSS bug was rediscovered in all 3 attempts by GPT-5.5 xhigh and once by Claude Opus 4.7.
The FFmpeg JPEG-XS stale buffer bug was rediscovered twice by GPT-5.5 xhigh and never by the other models.
The FFmpeg MPEG-TS descriptor bug was found once by GPT-5.5 xhigh while running the JPEG-XS task, but not in its dedicated task.
Most failures involved early commitment to a plausible alternate local bug within the assigned source file.
Models produced multiple candidate bugs (16 candidates each for GPT-5.5 xhigh and Claude Opus 4.7; 21 for Kimi K2) but rarely matched the target patch invariant.
Resource use varied widely: over 18 attempts each, GPT-5.5 xhigh cost $60.87, Claude Opus 4.7 cost $88.08, and Kimi K2 cost $3.50.

Threat model

n/a — The paper evaluates intrinsic model capability to rediscover known bugs given read-only access to the vulnerable source files without adversary interaction or manipulation.

Methodology — deep read

The study uses a strict 'target-file rediscovery' experimental design to evaluate three language models on rediscovering system-level software bugs linked to Anthropic's Mythos claims. No adversary is modeled; the focus is on whether a model can find the bug given only the vulnerable source file, without any explicit patch or advisory hints.

The data consists of six publicly disclosed Mythos-linked bug tasks from OpenBSD, FreeBSD, Linux, and FFmpeg, each with the vulnerable source file(s) provided read-only. Files range from 518 to 4461 lines of code, with up to ~40k tokens. The tasks cover a heterogeneous set of bug classes, including memory safety (stack buffer overflow, use-after-free), protocol state invariants, integer sentinel collisions, and descriptor accounting.

Each model (GPT-5.5 xhigh, Claude Opus 4.7, and Kimi K2) is run on all six tasks, with three repeated attempts per task, giving 18 attempts per model and 54 attempts in total. The prompt framework provides file metadata, symbol sketches, and a systems-audit-style worker prompt that instructs the model to identify bugs; it does not disclose CVE or patch identifiers.
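The attempt count above follows from a simple cross product. The snippet below is a minimal sketch of that grid; the task labels are placeholders (this summary does not name all six tasks), and the dictionary layout is an assumption, not the paper's manifest format.

```python
from itertools import product

MODELS = ["GPT-5.5 xhigh", "Claude Opus 4.7", "Kimi K2"]
# Placeholder labels: hypothetical names standing in for the six tasks.
TASKS = [f"task_{i}" for i in range(1, 7)]
REPEATS = 3

# One record per (model, task, repeat) combination.
attempts = [
    {"model": m, "task": t, "repeat": r}
    for m, t, r in product(MODELS, TASKS, range(1, REPEATS + 1))
]

# Count how many attempts each model contributes to the total.
per_model = {m: sum(a["model"] == m for a in attempts) for m in MODELS}

print(len(attempts))
print(per_model)
```

Enumerating the grid explicitly makes the denominator (18 per model, 54 overall) unambiguous when rediscovery rates are reported.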

Models have access to a bounded set of read-only tools like file tree listing, symbol reading, file search, and a constrained terminal for static inspection. They can submit structured bug candidates with location, root cause, trigger, impact, and confidence.
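A structured candidate with the five fields named above could be represented roughly as follows. This is a sketch only: the class name, field types, the 0 to 1 confidence scale, and the example values are assumptions for illustration and are not taken from the paper's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BugCandidate:
    """Hypothetical record for one submitted bug candidate."""

    location: str      # file and function where the bug is believed to live
    root_cause: str    # the violated invariant
    trigger: str       # input or state that reaches the bug
    impact: str        # consequence if triggered
    confidence: float  # assumed scale: 0.0 (guess) to 1.0 (certain)

    def __post_init__(self) -> None:
        # Reject confidence values outside the assumed 0..1 scale.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Example values are illustrative placeholders, not a real submission.
candidate = BugCandidate(
    location="path/to/file.c:some_function",
    root_cause="length field copied into a fixed-size stack buffer unchecked",
    trigger="request carrying an oversized credential",
    impact="stack buffer overflow",
    confidence=0.6,
)

# Serialize to JSON, as a tool-mediated submission might require.
print(json.dumps(asdict(candidate), indent=2))
```

Forcing every hypothesis into the same fields is what lets a grader compare a candidate's root cause against the patched invariant rather than against free-form prose.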

After candidate submission, a validator model confirms or rejects candidates based on the same access. All results undergo manual adjudication against a carefully constructed hidden answer key derived from public patches and advisories.

The evaluation metrics include exact target rediscovery (a correct vulnerability match), the proportion of tasks covered, candidate volume, and resource usage (token counts, cost, runtime). The three models are compared against each other under identical scaffolding and prompts to isolate model differences. The key challenge is disambiguating multiple plausible bug candidates within the correct file to find the exact invariant fixed by the patch.
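The headline metrics can be recomputed from the per-task hit counts listed in the key findings. The sketch below does that under one stated assumption: the MPEG-TS bug that GPT-5.5 xhigh found while running the JPEG-XS task is counted toward distinct bugs but not as an additional attempt-level hit. That reading is inferred from the reported totals, not confirmed by the paper, and the function and variable names are hypothetical.

```python
TOTAL_TASKS = 6
REPEATS = 3

# Attempts that matched the task's own target, per the key findings.
hits = {
    "GPT-5.5 xhigh": {"FreeBSD RPCSEC_GSS": 3, "FFmpeg JPEG-XS": 2},
    "Claude Opus 4.7": {"FreeBSD RPCSEC_GSS": 1},
    "Kimi K2": {},
}

# Bugs found while running a different task (assumed to add only to
# the distinct-bug count, not to the attempt-level hit count).
cross_task_bugs = {"GPT-5.5 xhigh": {"FFmpeg MPEG-TS descriptor"}}


def summarize(model: str) -> dict:
    task_hits = hits[model]
    matched = sum(task_hits.values())
    covered = {task for task, n in task_hits.items() if n > 0}
    distinct = covered | cross_task_bugs.get(model, set())
    return {
        "rediscovered": f"{matched}/{TOTAL_TASKS * REPEATS}",
        "tasks_covered": f"{len(covered)}/{TOTAL_TASKS}",
        "distinct_bugs": f"{len(distinct)}/{TOTAL_TASKS}",
    }


for model in hits:
    print(model, summarize(model))
```

Keeping attempt-level hits, tasks covered, and distinct bugs as separate numbers avoids conflating repeated success on one easy target with breadth across the suite.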

The experiment intentionally excludes internet access, external search, live interaction, or patch texts to measure intrinsic model ability for intra-file vulnerability reasoning under fixed context. Resource accounting captures token usage, costs, and timing.

In a concrete example, for the FreeBSD RPCSEC_GSS task where the bug is a stack buffer overflow via uncontrolled credential length, GPT-5.5 xhigh correctly reconstructs the specific stack overflow in all three attempts by analyzing the kernel and userland source files and reasoning about the fixed buffer size and copy operations.

Reproducibility is supported via fixed prompt bundles, shared source snapshots, documented manual grading rubrics, stable task manifests, and archived raw transcripts and outputs. However, some datasets (like Mythos-linked bugs) are partially filtered and curated for public disclosure readiness.

This method isolates the difficulty of bug rediscovery from that of repository search, and shows that current frontier models can generate plausible bug hypotheses but struggle to select the publicly validated vulnerability under these controlled conditions.

Technical innovations

A reproducible benchmark for target-file vulnerability rediscovery isolating intra-file bug reasoning without patch or advisory hints.
A unified, read-only tool-mediated model workflow with explicit candidate submission, validation, and manual adjudication to measure precise bug matching.
Cross-model controlled evaluation on a curated suite of six system-level Mythos-linked bugs with publicly verifiable ground truth.
Resource accounting and candidate volume reporting to separate bug hypothesis generation from correct target identification.

Datasets

Mythos target-file rediscovery suite — 6 tasks, 1-2 source files per task, 518–4461 lines of code each — curated from Anthropic Mythos public claims and associated open-source repositories

Baselines vs proposed

GPT-5.5 xhigh: target rediscovery rate = 5/18 attempts vs Claude Opus 4.7: 1/18 vs Kimi K2: 0/18
GPT-5.5 xhigh distinct core bugs rediscovered = 3/6 vs Claude Opus 4.7: 1/6 vs Kimi K2: 0/6
Cost per 18 attempts: GPT-5.5 xhigh = $60.87 vs Claude Opus 4.7 = $88.08 vs Kimi K2 = $3.50
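The cost and rediscovery figures above can be combined into per-attempt and per-rediscovery costs. These derived numbers are not reported in this summary; the snippet is a small arithmetic sketch using only the totals listed, with hypothetical variable names.

```python
ATTEMPTS_PER_MODEL = 18

# Total cost in USD and exact target rediscoveries, as listed above.
totals = {
    "GPT-5.5 xhigh": {"cost": 60.87, "hits": 5},
    "Claude Opus 4.7": {"cost": 88.08, "hits": 1},
    "Kimi K2": {"cost": 3.50, "hits": 0},
}

derived = {}
for model, t in totals.items():
    per_attempt = t["cost"] / ATTEMPTS_PER_MODEL
    # Cost per rediscovery is undefined when a model has no hits.
    per_hit = t["cost"] / t["hits"] if t["hits"] else None
    derived[model] = {"per_attempt": per_attempt, "per_hit": per_hit}

for model, d in derived.items():
    per_hit = "n/a" if d["per_hit"] is None else f"${d['per_hit']:.2f}"
    print(f"{model}: ${d['per_attempt']:.2f} per attempt, {per_hit} per rediscovery")
```

On these totals the cheapest model per attempt has no defined cost per rediscovery, which is why the success counts need to be reported alongside raw cost.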

Limitations

Small task suite with only six tasks limits generalization across all vulnerability types or codebases.
Benchmark excludes many open Mythos claims such as closed-source browser bugs and firmware vulnerabilities.
No assessment of adversarial model guidance or dynamic exploitation capabilities; purely static audit style.
No distribution shift testing or evaluation of different prompt styles or temperature settings beyond fixed baseline.
Manual adjudication and rubric creation inject subjectivity into target matching despite careful design.
Limiting models to read-only source access restricts real-world developer workflows that could include debugging or test execution.

Open questions / follow-ons

How do multi-session or interactive search and ranking workflows affect rediscovery success in Mythos-like contexts?
Can expanding benchmark tasks and incorporating negative or unseen bugs improve diagnostic power?
What role does validation and candidate prioritization architecture play compared to pure generative ability?
How would real-time debugging, dynamic analysis, or test inputs modify bug rediscovery rates?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners focused on security-oriented AI evaluation, this paper highlights the challenge of precise vulnerability localization by LLMs even under idealized conditions where file discovery is solved. It demonstrates that state-of-the-art models can generate many plausible bug hypotheses but struggle to match and confirm the exact patched invariant, emphasizing the difficulty of reliable exploit identification without extensive search and verification scaffolds.

This suggests that in safety-critical automated code audit or security tool workflows, simply presenting relevant source code to a large language model is insufficient for accurate bug detection. Additional engineering is needed in prompt design, multi-agent search orchestration, candidate validation, and cost-sensitive exploration. The benchmark also underscores the importance of rigorous target matching and transparent denominator reporting when validating autonomous bug-finding claims, lessons broadly applicable to bot-defense systems that must gauge and rank suspicious behavior hypotheses under uncertainty.

Cite

bibtex

@article{arxiv2605_17416,
  title={Benchmarking Mythos-Linked Bug Rediscovery},
  author={Isaac David and Arthur Gervais},
  journal={arXiv preprint arXiv:2605.17416},
  year={2026},
  url={https://arxiv.org/abs/2605.17416}
}
