Agentic Vulnerability Reasoning on Windows COM Binaries

Source: arXiv:2605.05000 · Published 2026-05-06 · By Hwiwon Lee, Jongseong Kim, Lingming Zhang

TL;DR

SLYP targets a very specific and nasty class of Windows local-privilege-escalation bugs: race conditions in COM services that run privileged, are reachable from authenticated users, and often expose concurrent RPC entry points without framework-level serialization. The paper’s core claim is not just that an LLM can point at suspicious code, but that an agent can be scaffolded to move from static suspicion to debugger-verified proof-of-concept generation. The key design move is to break the task into two stages: discovery from binary decompilation plus vtable resolution, then PoC synthesis with COM metadata and live debugging feedback.

What is new here is the end-to-end agentic plumbing: three MCP tool surfaces for binary exploration, COM inspection, and dynamic debugging; structured memory/compaction; and task-verification middleware that keeps the agent from stopping early. On their benchmark of 20 COM objects and 40 labeled cases, SLYP reaches 0.973 F1 for vulnerability discovery, beats production coding agents by up to 0.208 F1, and beats their COMRace++ reproduction by 3.3x in bug discovery (0.973 vs 0.299). For PoC generation, the strongest SLYP configuration verifies working PoCs on 67.5% of cases (27/40), while default production agents without the COM inspection and debugger tools verify essentially none. In production deployment, they report 28 previously unknown vulnerabilities across nine COM services, all confirmed by MSRC, with 16 CVEs and $140,000 in bounties.

Key findings

Benchmark size: 20 COM objects covering 40 vulnerability cases, with manually verified ground-truth labels and 8 independently discovered 1-day CVEs used for circularity-free validation.
Discovery accuracy: SLYP reaches 0.973 F1 (GPT-5.4 backbone) on bug discovery.
Compared with production coding agents, SLYP improves F1 by up to 0.208 on the same benchmark; the cited production-agent result is 0.910 F1 for Codex/GPT-5.4 with minimal binary-analysis tools.
Compared with the static analyzer COMRace++, SLYP reports 3.3x better bug discovery, with scores 0.973 vs 0.299 F1.
For PoC generation, the strongest SLYP configuration verifies working PoCs for 67.5% of cases (27/40).
Default production coding agents without COM inspection and dynamic debugging tools verify essentially no PoCs on either frontier model.
In deployment on production Windows services, SLYP found 28 previously unknown vulnerabilities across 9 COM services; MSRC confirmed all 28, assigned 16 CVEs, and paid $140,000 in bounties.

Threat model

The attacker is a low-privileged but authenticated local user who can activate reachable COM services and issue concurrent RPC calls against privileged, multi-threaded COM objects. The attacker is assumed to know the target CLSID or to be able to enumerate it, but has no source code, symbols, or elevated privileges. They cannot bypass the Windows COM access/launch permissions that gate reachability, and they are not assumed to have kernel compromise; the main challenge is finding and exploiting race windows in service code.

Methodology — deep read

The threat model is a local attacker who already has authenticated access to the machine and can invoke reachable COM interfaces over RPC/ALPC, but does not have elevated privileges or source code. The paper focuses on race conditions in privileged COM services where concurrent requests from separate RPC threads access shared object state without proper synchronization. The attacker is assumed to be able to trigger concurrent method calls and observe crashes or service behavior; the system is meant to discover bugs in closed-source Windows binaries, not to defend against kernel-level tampering or already-compromised systems.
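To make the race pattern concrete, here is a minimal sketch of two concurrent calls touching one shared field without a lock. It is a toy model, not Windows code: `Heap`, `SharedObject`, and `set_value` are hypothetical names, and generators stand in for RPC threads so that the interleaving is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Heap:
    """Toy heap that only tracks which handles are live (hypothetical model)."""
    live: set = field(default_factory=set)
    next_id: int = 1

    def alloc(self) -> int:
        handle = self.next_id
        self.next_id += 1
        self.live.add(handle)
        return handle

    def free(self, handle: int) -> None:
        if handle not in self.live:
            raise RuntimeError(f"double free of handle {handle}")
        self.live.remove(handle)

    def write(self, handle: int) -> None:
        if handle not in self.live:
            raise RuntimeError(f"use-after-free on handle {handle}")


@dataclass
class SharedObject:
    buffer: int  # shared member field, accessed without any lock


def set_value(heap: Heap, obj: SharedObject):
    """One entry method: read, free, allocate, write. Each yield is a preemption point."""
    old = obj.buffer
    yield "read"
    heap.free(old)
    yield "free"
    obj.buffer = heap.alloc()
    yield "alloc"
    heap.write(obj.buffer)
    yield "write"


def run(schedule: str) -> str:
    """Run two calls (A and B) on one object, stepping them in the given order."""
    heap = Heap()
    obj = SharedObject(buffer=heap.alloc())
    calls = {"A": set_value(heap, obj), "B": set_value(heap, obj)}
    try:
        for name in schedule:
            next(calls[name])
    except RuntimeError as err:
        return str(err)
    return "ok"


print(run("AAAABBBB"))  # serialized calls: ok
print(run("ABAB"))      # both read before either frees: double free of handle 1
print(run("AAABBA"))    # B frees what A is about to write: use-after-free on handle 2
```

The serialized schedule is what framework-level serialization would enforce; the other two schedules are the kind of interleaving that separate RPC threads can produce when the method takes no lock.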

The evaluation data consist of a benchmark of 20 COM objects spanning 40 vulnerability cases. The paper says the labels were manually verified, and that 8 independently discovered 1-day CVEs were included to avoid circularity in validation. Beyond that, the excerpt does not specify an exact train/test split, whether any of the benchmark objects were held out from prompt design, or whether the same objects were used to tune tool prompts. For the real-world deployment, the system was run against production Windows services, but the excerpt provides neither a public dataset release nor a fully enumerated service list. Because the paper is about binary analysis rather than supervised learning in the usual sense, there is no conventional data preprocessing pipeline; instead, the key “data” are decompiled functions, vtable-resolved call graphs, registry metadata, interface definitions from proxy/stub DLLs, and debugger outputs from live PoC execution.

Architecturally, SLYP is a two-stage ReAct agent. Stage 1 (Discovery) uses only the IDA-backed binary exploration tools to decompile functions, recover vtable targets, inspect xrefs/callers/callees, and trace shared-state accesses across entry functions. The agent is encouraged to reason about race patterns such as read-free-write on shared fields, free/free on aliased pointers, and missing synchronization around member accesses. Stage 2 (PoC Generation) consumes the structured vulnerability report from Stage 1 and adds COM inspection plus dynamic debugging. COM inspection surfaces registry/service metadata, threading model, CLSIDs, interface signatures, security descriptors, and skeleton code generation; dynamic debugging provides compile/execute/capture-crash functionality plus debugger command execution, process/service control, and VM control. A notable technical component is integrated virtual dispatch resolution during decompilation: the system traces object definitions backward to candidate vtables, maps method offsets to concrete functions, and annotates pseudocode with resolved callees. The authors explicitly note this is incomplete when vtables are loaded dynamically from external binaries or when polymorphic initialization obscures the target class.
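The virtual dispatch resolution step can be illustrated with a small sketch: given candidate classes for an object and the byte offset used in an indirect call, look up the vtable slot and annotate the pseudocode line. This is not the paper's implementation; the class names, the `VTABLES` table, and the 64-bit pointer size are assumptions made for illustration.

```python
from typing import Dict, List

POINTER_SIZE = 8  # assumption: 64-bit target, one pointer per vtable slot

# Hypothetical recovered vtables: class name -> slot targets in order.
VTABLES: Dict[str, List[str]] = {
    "CExampleService": ["QueryInterface", "AddRef", "Release", "SetValue"],
    "CExampleServiceV2": ["QueryInterface", "AddRef", "Release", "SetValueEx", "Reset"],
}


def resolve_virtual_call(candidate_classes: List[str], byte_offset: int) -> List[str]:
    """Map a vtable byte offset to concrete callees for each candidate class."""
    if byte_offset % POINTER_SIZE != 0:
        raise ValueError("offset is not aligned to a vtable slot")
    slot = byte_offset // POINTER_SIZE
    targets = []
    for cls in candidate_classes:
        table = VTABLES.get(cls)
        # A missing table models a vtable loaded from an external binary.
        if table is None or slot >= len(table):
            continue
        targets.append(f"{cls}::{table[slot]}")
    return targets


def annotate(pseudocode_line: str, candidate_classes: List[str], byte_offset: int) -> str:
    """Append resolved callees to a decompiled line as a trailing comment."""
    targets = resolve_virtual_call(candidate_classes, byte_offset)
    note = ", ".join(targets) if targets else "unresolved"
    return f"{pseudocode_line}  // resolved: {note}"


line = "(*(*(void ***)obj + 3))(obj, arg);"
print(annotate(line, ["CExampleService", "CExampleServiceV2"], 24))
print(annotate(line, ["CExternalPlugin"], 24))
```

The second call returns an "unresolved" annotation, which mirrors the limitation the authors note for vtables that live outside the analyzed binary.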

Training in the ML sense is not the main story here; the paper is more about orchestration and prompting than model fine-tuning. The excerpt does not report any model training epochs, optimizer, batch size, or seed strategy because the same framework is run across multiple backbone models and production agents without modification. Instead, the important “regime” is iterative tool use: the agent runs a think-act-observe loop, with auto-compaction triggered as trajectories grow, a structured checkpoint written to external memory, and a post-compaction directive that forces the agent to re-verify tentative conclusions. The memory system stores durable findings in topic files with YAML frontmatter and uses BM25 retrieval over those files to restore evidence after compaction. Task-verification middleware intercepts premature stopping or empty final answers and nudges the agent to continue until there is an actual artifact.
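As a rough sketch of the memory design, the code below stores findings as topic files with a frontmatter header and ranks them with a plain Okapi BM25 scorer. The file names, frontmatter keys, file contents, and the `k1`/`b` values are assumptions; the summary does not say how the paper's retrieval is configured.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical topic files: a frontmatter header followed by a free-text finding.
TOPIC_FILES: Dict[str, str] = {
    "vtable.md": "---\ntopic: dispatch\n---\nResolved vtable slot three to SetValue in the service class",
    "race.md": "---\ntopic: race\n---\nShared buffer pointer is freed without a lock in SetValue, race candidate",
    "com.md": "---\ntopic: metadata\n---\nService runs privileged and the threading model is free threaded",
}


def parse_topic_file(text: str) -> Tuple[Dict[str, str], str]:
    """Split a topic file into (frontmatter dict, body). Handles flat key: value pairs only."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return document names ordered by Okapi BM25 score, best first."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))
    scores = {}
    for name, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


bodies = {name: parse_topic_file(text)[1] for name, text in TOPIC_FILES.items()}
print(bm25_rank("lock race", bodies)[0])
```

After compaction, an agent could issue a query like this to pull the relevant finding back into context before re-verifying it.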

Evaluation is split into vulnerability discovery and PoC generation. Discovery is measured with F1 over the 40 labeled cases. The paper compares SLYP against production coding agents (Codex and Claude Code) on six backbone models using identical task prompts, and against COMRace++ (their enhanced reproduction of COMRace) as the static-analysis baseline. The reported results are 0.973 F1 for SLYP versus 0.910 F1 for Codex/GPT-5.4 with minimal tools and 0.724 F1 for Claude Code/Opus 4.6; adding all binary exploration tools narrows but does not close the gap, with SLYP still ahead by up to 0.208 F1. For PoC generation, the evaluation checks whether the compiled PoC triggers a debugger-attached crash and whether the faulting call stack still reaches the target function from Stage 1. The paper says the strongest configuration solves 27/40 cases, and that the four top configurations collectively solve 34/40. The excerpt does not mention statistical tests, confidence intervals, or cross-validation. One concrete end-to-end example is the SetPrintTicket method from PrintWorkflowUserSvc (CVE-2024-49095 in the example): Stage 1 would mark the read-free-allocate-write sequence on a shared heap pointer as a race candidate; Stage 2 would retrieve the interface/activation metadata, generate a C++ skeleton calling the right COM activation API, add concurrent calls to the method, and iterate until page-heap or debugger evidence shows the use-after-free or double-free at the faulting instruction. Reproducibility looks partial: the paper provides agent trajectory artifacts across three scaffolds and six backbones, but the excerpt does not state that code, weights, or the full benchmark are publicly released.
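The PoC acceptance check described above (a debugger-attached crash whose call stack still reaches the Stage 1 target) can be sketched as follows. The `CrashReport` structure and the `module!function+offset` frame format are assumptions for illustration, and the frame strings are made up rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrashReport:
    """Hypothetical summary of one debugger-attached PoC run."""
    crashed: bool
    stack: List[str]  # innermost frame first, e.g. "module!Function+0x1a"


def frame_function(frame: str) -> str:
    """Extract the function name from a 'module!Function+0xNN' frame string."""
    symbol = frame.split("!", 1)[-1]
    return symbol.split("+", 1)[0]


def poc_verified(report: CrashReport, target_function: str) -> bool:
    """Accept a PoC only if it crashed and the target function is on the faulting stack."""
    if not report.crashed:
        return False
    return any(frame_function(frame) == target_function for frame in report.stack)


report = CrashReport(
    crashed=True,
    stack=[
        "ntdll!RtlFreeHeap+0x51",
        "examplesvc!CExampleService::SetValue+0x9c",
        "rpcrt4!Invoke+0x73",
    ],
)
print(poc_verified(report, "CExampleService::SetValue"))  # True
```

Requiring the target on the stack guards against counting an unrelated crash as success, which matters when the PoC hammers a service with concurrent calls.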

The concrete novelty is the interface design and the operationalization of debugging feedback. SLYP is not simply “LLM + IDA.” It is an agent with reusable tool abstractions for binary exploration, COM registry/IDL inspection, and live debugging, plus memory management that survives long analyses. The paper’s strongest claim is that these interactions move the system from bug suspicion to verified PoC generation in a way that default coding agents cannot do on the same tasks.

Technical innovations

Three MCP tool surfaces specialized for binary exploration, COM inspection, and dynamic debugging, rather than a monolithic analyzer.
Integrated virtual dispatch resolution in decompilation to turn opaque vtable indirections into concrete callees for downstream reasoning.
A two-stage agent pipeline that separates vulnerability discovery from debugger-verified PoC synthesis.
Structured checkpointing plus file-based memory to preserve cross-function findings across context compaction.
Task-verification middleware that blocks premature termination and forces tool-backed progress.
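The task-verification idea in the last item can be sketched as a wrapper around the agent loop: a final answer is accepted only if it is non-empty and passes an artifact check, otherwise a nudge is appended and the loop continues. The message shape, `NUDGE` text, and limits below are hypothetical, and the scripted agent stands in for a real model.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
NUDGE = "No verified artifact yet. Continue with tool calls before finishing."


def run_with_verification(
    agent_step: Callable[[List[Message]], Message],
    is_complete: Callable[[str], bool],
    max_nudges: int = 3,
    max_steps: int = 50,
) -> Dict[str, Optional[object]]:
    """Drive an agent, rejecting empty or unverified final answers up to max_nudges times."""
    history: List[Message] = []
    nudges = 0
    for _ in range(max_steps):
        reply = agent_step(history)
        history.append(reply)
        if reply["kind"] != "final":
            continue  # tool call or intermediate thought: keep looping
        content = reply["content"].strip()
        if content and is_complete(content):
            return {"answer": content, "nudges": nudges}
        if nudges >= max_nudges:
            break
        nudges += 1
        history.append({"kind": "nudge", "content": NUDGE})
    return {"answer": None, "nudges": nudges}


def scripted_agent(replies: List[Message]) -> Callable[[List[Message]], Message]:
    """Build a fake agent that replays a fixed list of replies (for demonstration only)."""
    queue = iter(replies)
    return lambda history: next(queue)


agent = scripted_agent([
    {"kind": "final", "content": ""},                      # premature empty stop
    {"kind": "tool", "content": "ran PoC under debugger"},
    {"kind": "final", "content": "crash confirmed in target function"},
])
result = run_with_verification(agent, lambda text: "crash confirmed" in text)
print(result)
```

In this run the first empty answer is intercepted, one nudge is issued, and the second final answer is accepted.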

Datasets

Benchmark of 20 COM objects — 40 vulnerability cases — manually verified ground truth; includes 8 independently discovered 1-day CVEs.
Production Windows services evaluation set — 9 COM services — non-public Microsoft production services, confirmed by MSRC.

Baselines vs proposed

Production coding agents (Codex, minimal binary tools): F1 = 0.910 vs proposed: 0.973
Production coding agents (Claude Code / Opus 4.6, minimal binary tools): F1 = 0.724 vs proposed: 0.973
COMRace++: F1 = 0.299 vs proposed: 0.973
PoC generation with default production agents (no COM inspection/dynamic debugging): verified cases = essentially 0 vs proposed: 27/40 (67.5%)
Top four SLYP configurations collectively: verified cases = 34/40 vs proposed strongest config: 27/40
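For readers who want to sanity-check the figures above, the snippet below recomputes the derived numbers from the reported ones and shows how F1 follows from true positives, false positives, and false negatives. The counts passed to `f1` are a made-up example; the summary does not give the per-system confusion counts.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Reported F1 scores from the comparison above.
slyp, codex_minimal, claude_minimal, comrace_pp = 0.973, 0.910, 0.724, 0.299

print(round(slyp / comrace_pp, 2))     # 3.25, which rounds to the quoted 3.3x
print(round(slyp - codex_minimal, 3))  # 0.063
print(27 / 40)                         # 0.675, i.e. 67.5% of cases
print(34 / 40)                         # 0.85 for the top four configurations combined

# Hypothetical counts, only to illustrate the formula.
print(f1(tp=8, fp=2, fn=2))            # 0.8
```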

Limitations

The benchmark is small: 20 COM objects and 40 cases, so the reported F1 may not generalize to the broader Windows ecosystem.
The excerpt does not report statistical significance, confidence intervals, or variance across seeds/backbones.
The paper focuses on race conditions in COM services; it does not show equally strong evidence for other vulnerability classes beyond saying the architecture could be adapted.
Some key implementation/evaluation details are missing from the excerpt, including exact prompt templates, full hyperparameters, and whether the benchmark or artifacts are publicly released.
The vtable-resolution technique is acknowledged to miss dynamically loaded vtables and some polymorphic initialization patterns.
PoC success depends on live-debugging conditions and page-heap behavior, so results may be environment-sensitive.

Open questions / follow-ons

How well does the same agentic pipeline transfer to non-COM Windows binaries with different activation and threading models?
Can the tool interfaces be extended to other bug classes, such as type confusion or logic flaws, without collapsing precision?
What is the cost in analyst time and runtime per discovered bug relative to manual reverse engineering or static analysis?
How robust is the PoC synthesis loop under distribution shift, e.g., patched services, different Windows builds, or hardened heap settings?

Why it matters for bot defense

For bot-defense practitioners, the main takeaway is that agents can be driven by structured tools and verification loops to solve long-horizon binary-reasoning tasks that are hard for plain prompting. The specific COM race-condition target is not a CAPTCHA problem, but the engineering pattern transfers: expose domain-specific inspection tools, keep the agent grounded in concrete runtime feedback, and add verification gates so the system cannot stop on a plausible but unproven hypothesis. If you are building bot detection or abuse-resistant systems, this paper is a reminder that sophisticated attackers can use agentic workflows to move from enumeration to working exploit evidence, not just signatures. The practical response is to assume that vulnerable surfaces will be searched with iterative, tool-augmented automation, and to harden interfaces, reduce concurrent state-sharing bugs, and make crash/verification signals less informative where possible.

Cite

@article{arxiv2605_05000,
  title={Agentic Vulnerability Reasoning on Windows COM Binaries},
  author={Hwiwon Lee and Jongseong Kim and Lingming Zhang},
  journal={arXiv preprint arXiv:2605.05000},
  year={2026},
  url={https://arxiv.org/abs/2605.05000}
}
