
Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software

Source: arXiv:2605.03956 · Published 2026-05-05 · By Shravya Kanchi, Xiaoyan Zang, Ying Zhang, Danfeng Yao, Na Meng

TL;DR

This paper tackles the automatic generation of proof-of-vulnerability (PoV) tests that demonstrate the real exploitability of vulnerabilities in third-party libraries (Libs) as they affect downstream Java applications (Apps). The motivation is that developers need concrete, executable evidence of vulnerability impact to assess risk and prioritize fixes, but writing PoV tests by hand is difficult and prior automated tools fail on many real cases. The authors present PoVSmith, an AI-agent-based approach that combines call path analysis, exemplar tests, iterative test code generation via Codex, automatic build/execution validation, and LLM-based quality assessment via GPT. Evaluated on 33 ⟨App,Lib⟩ pairs covering 20 CVEs, PoVSmith revealed 158 unique app entry points calling vulnerable APIs (152 correctly found), generated 152 tests (141 of which compiled), and produced 84 tests that successfully triggered vulnerabilities, a 55% success rate. Compared with prior LLM-based approaches, PoVSmith improves test quality and greatly reduces the human effort required. The authors also contribute a new dataset of 207 verified call paths and report previously unreported vulnerabilities, demonstrated with PoVSmith-generated tests. This work advances automated PoV test generation by tightly integrating static/dynamic analysis with large language models and iterative refinement guided by execution feedback.

Key findings

  • PoVSmith identified 158 unique public methods calling vulnerable library APIs across 33 Java apps; 152 (96%) of these methods and their call paths were correctly found.
  • PoVSmith generated 152 PoV tests for the found methods; 141 (93%) compiled successfully.
  • Of the generated tests, 84 (55%) successfully demonstrated exploitable behavior of the vulnerabilities when executed.
  • Phase I (call path analysis) outperformed CodeQL and other static tools by leveraging Codex with customized prompts to find attack surfaces and call chains in modern Java code.
  • Phase II’s iterative test generation and self-critique via Codex fixed build and runtime errors within up to five refinement attempts per test.
  • Phase IV’s GPT-based evaluation accurately assessed whether tests triggered vulnerabilities, complementing automated build/execution logs.
  • PoVSmith performed better than a state-of-the-art standalone LLM test generation approach [78], yielding substantially more triggered tests and less human involvement.
  • Using alternative coding agents (Gemini Code Assist and Mistral Vibe) produced worse results than Codex under the same approach.
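The headline percentages above follow directly from the raw counts; a quick sketch in plain Python (counts taken from the paper) reproduces them:

```python
# Raw counts reported in the paper's evaluation.
entry_points_found = 158    # public methods calling vulnerable APIs
entry_points_correct = 152  # verified correct by manual inspection
tests_generated = 152       # one PoV test per verified method
tests_compiled = 141
tests_triggered = 84

# Reproduce the percentages quoted in the findings.
print(round(entry_points_correct / entry_points_found * 100))  # 96
print(round(tests_compiled / tests_generated * 100))           # 93
print(round(tests_triggered / tests_generated * 100))          # 55
```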

Threat model

An adversary aware of vulnerabilities in third-party Java libraries aims to exploit them by reaching exposed vulnerable APIs indirectly through an application's public methods. The adversary can manipulate inputs at these entry points but cannot modify the application or library source code. The threat is that a reachable vulnerable API, when fed crafted inputs, causes abnormal, exploitable behavior in the downstream application.

Methodology — deep read

The threat model assumes an adversary who knows about vulnerabilities in third-party Java libraries used by downstream applications and who exploits them by supplying malicious inputs to the applications' public methods, which transitively call the vulnerable APIs.

The data consists of 33 Java ⟨App,Lib⟩ pairs curated from a prior dataset [78], covering 20 CVEs whose vulnerabilities can cause denial of service, directory traversal, remote code execution, and other attacks. The dataset comprises 29 Maven, 3 Gradle, and 1 plain Java project, each depending on a vulnerable version of a library such as Apache Commons Codec or Spring Security.

The approach, PoVSmith, has four phases:

  1. Agent-Based Call Path Analysis: Uses GPT-5.2 Codex as an AI coding agent, guided by a custom prompt template describing the sink (the vulnerable API), to find all public methods in the App that call that sink, along with their full call paths. The analysis operates on source code (not bytecode) and filters redundant call paths to produce a minimal attack surface.
  2. Agent-Based Iterative Test Generation: For each call path/source method, Codex uses another prompt template containing nine key inputs (sink, source method, call path, vulnerable libs and versions, exemplar test function from the Lib, etc.) to generate an initial JUnit PoV test. Quality control rules are provided, including no mocks for call path classes, the Arrange-Act-Assert pattern, self-contained test data, and semantic payload reuse. An iterative self-critique-and-repair loop then fixes compilation and runtime failures over up to five refinement attempts.
  3. Test Compilation and Execution: Tests were automatically compiled and run using project-specific build systems (Maven, Gradle, or plain Java), capturing logs that include build outcomes, runtime exceptions, and test results.
  4. LLM-Based Evaluation: GPT-5.1 was prompted with the test code and combined human-readable logs and structured JSON summaries of execution to decide if the test successfully triggered the vulnerability. The output included triggered: yes/no/unknown, confidence level, and reasoning.
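The Phase II loop above can be sketched roughly as follows; `generate_test`, `compile_and_run`, and `repair` are hypothetical stand-ins for the agent and build-system calls the paper describes, not PoVSmith's actual API:

```python
# Sketch of Phase II's generate -> build -> critique -> repair loop,
# assuming hypothetical agent/build helpers (not PoVSmith's real API).
MAX_ATTEMPTS = 5  # the paper allows up to 5 refinement attempts per test

def refine_pov_test(prompt_inputs, generate_test, compile_and_run, repair):
    """Return (test_code, log) on success, or (None, last_log) on give-up."""
    test_code = generate_test(prompt_inputs)  # initial JUnit test draft
    log = compile_and_run(test_code)
    for _ in range(MAX_ATTEMPTS):
        if log["compiled"] and log["passed_sanity"]:
            return test_code, log             # hand off to Phases III/IV
        # Self-critique: feed the failure log back to the agent for repair.
        test_code = repair(test_code, log)
        log = compile_and_run(test_code)
    return None, log                          # exhausted refinement budget
```

The key design choice mirrored here is that execution feedback, not just static self-review, drives each repair round.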

A concrete example: for a Dom4j version vulnerable to XML injection, PoVSmith identified a public App method that calls the vulnerable API DocumentHelper.createElement. Using an exemplar function from Dom4j’s own test suite, it generated a PoV test that called the App method with a malicious "element>name" payload. Running this test against the vulnerable Dom4j produced an AssertionError because the expected IllegalArgumentException was never thrown, demonstrating that the vulnerability was triggered.
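The Dom4j example illustrates a common PoV oracle: the test expects the patched behavior (an exception rejecting the payload) and fails when the vulnerable version silently accepts it. A toy Python stand-in for dom4j's DocumentHelper.createElement (hypothetical, not the real library's validation logic) makes this concrete:

```python
# Toy stand-in for dom4j's DocumentHelper.createElement -- hypothetical,
# only to illustrate the PoV oracle, not dom4j's actual implementation.
def create_element(name, patched):
    if patched and any(c in name for c in "<>&"):
        raise ValueError(f"invalid element name: {name!r}")
    return name  # vulnerable behavior: malformed name accepted silently

def pov_test(patched):
    """Mirrors the generated JUnit test: expect rejection of the payload."""
    try:
        create_element("element>name", patched)
    except ValueError:
        return "rejected"   # patched behavior: exception fires, no PoV
    return "triggered"      # vulnerable behavior: expected exception absent

# Against a vulnerable version the expected exception never fires,
# which the JUnit test surfaces as an AssertionError.
```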

For evaluation, two authors manually inspected the methods, call paths, tests, and GPT assessments to verify correctness and successful PoV generation. Several coding agents were tried, and baselines from prior LLM approaches were compared under the same inputs and datasets. The code and detailed prompt templates are described but not publicly released, owing to their dependence on proprietary GPT-5.1/5.2 models.
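Phase IV's structured verdict lends itself to a small schema check before any downstream triage. The field names below follow the triggered/confidence/reasoning output described in the methodology; the three-level confidence scale and the validation code itself are illustrative assumptions, not PoVSmith's actual format:

```python
import json

# Minimal validator for a Phase IV verdict, assuming the fields described
# in the paper: triggered (yes/no/unknown), confidence, reasoning.
ALLOWED_TRIGGERED = {"yes", "no", "unknown"}
ALLOWED_CONFIDENCE = {"low", "medium", "high"}  # assumed scale

def parse_verdict(raw):
    v = json.loads(raw)
    if v.get("triggered") not in ALLOWED_TRIGGERED:
        raise ValueError(f"bad triggered value: {v.get('triggered')!r}")
    if v.get("confidence") not in ALLOWED_CONFIDENCE:
        raise ValueError(f"bad confidence value: {v.get('confidence')!r}")
    # Only 'yes' verdicts count toward the triggered-test tally.
    return v["triggered"] == "yes", v

counts_as_pov, verdict = parse_verdict(
    '{"triggered": "yes", "confidence": "high", '
    '"reasoning": "expected IllegalArgumentException was not thrown"}'
)
```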

Technical innovations

  • Using an AI coding agent (Codex) for precise call path analysis in Java apps targeting vulnerable library API sinks, overcoming limitations of existing static call graph tools.
  • An iterative AI-driven PoV test generation process integrating exemplar tests, code context, and execution feedback for test code refinement and auto-debugging.
  • Combining automatic test build/run logs with LLM-based (GPT) assessment for reliable, self-critiqued evaluation of vulnerability triggering without human oracle.
  • A prompt engineering strategy segmenting tasks into phases and subtasks, tightly coupling static and dynamic analysis with AI agents for comprehensive PoV test generation.

Datasets

  • 33 ⟨App,Lib⟩ Java program pairs — 33 apps with 20 CVEs — curated from prior work [78], covering Maven, Gradle, and plain Java projects

Baselines vs proposed

  • Prior state-of-the-art LLM-based PoV test approach [78]: achieved a substantially lower triggered-test rate than PoVSmith's 55%
  • Alternative AI coding agents: Gemini Code Assist and Mistral Vibe achieved worse test generation and refinement success than Codex with the same pipeline
  • CodeQL static call graph analysis: failed to accurately identify call paths and produced many false positives/negatives compared to PoVSmith's Codex-based analysis

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.03956.

Fig 1: The threat model of software supply chain attacks

Fig 2: PoVSmith has four phases

Fig 3: A simplified version of the prompt template

Limitations

  • The dataset is limited to 33 compilable Java apps; 16 original apps could not be compiled or were incompatible, limiting generalizability.
  • PoVSmith relies on proprietary, non-open GPT-5.x Codex and GPT models, raising reproducibility and accessibility concerns.
  • No evaluation against adversarially crafted App or Lib code designed to mislead AI agents or evade call path detection.
  • The approach is validated only on Java and JUnit-based tests; applicability to other languages, test frameworks, or build systems is untested.
  • Manual inspection was required to verify a portion of results, leaving potential for subjective bias and limited scale in evaluation.
  • LLM assessment may still produce hallucinations or errors; confidence calibration and error analysis remain limited.

Open questions / follow-ons

  • How well would PoVSmith generalize to other programming languages, dependency ecosystems, and build systems beyond Java and Maven/Gradle?
  • Can the iterative AI test generation approach be combined with symbolic or fuzz testing to improve coverage and exploit discovery?
  • What defenses or mitigations can be developed against adversarial inputs or code artifacts designed to mislead AI-based call path or test generators?
  • How might the evaluation and confidence scoring in Phase IV be improved by combining LLM assessments with formal verification or anomaly detection?

Why it matters for bot defense

Bot-defense and CAPTCHA practitioners managing complex software systems or client applications often rely on third-party libraries with potential vulnerabilities. PoVSmith’s approach of combining static analysis, AI-guided test generation, and dynamic validation could inspire methods to automatically generate security tests demonstrating exploitability of dependencies in bots or automated clients. Its agent-based iterative refinement and execution feedback mechanisms highlight how automated proof-of-exploit generation can be feasible even for complex call chains.

However, the reliance on heavyweight AI models and a curated Java ecosystem limits immediate applicability. Still, the principles of combining contextual AI coding agents with execution logs and self-critique could guide future automated test generation or attack simulation in bot-defense systems, particularly for supply chain risk management and vulnerability triage. Understanding real exploit paths remains key to informed risk assessment beyond static vulnerability detection.

Cite

bibtex
@article{arxiv2605_03956,
  title={Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software},
  author={Shravya Kanchi and Xiaoyan Zang and Ying Zhang and Danfeng Yao and Na Meng},
  journal={arXiv preprint arXiv:2605.03956},
  year={2026},
  url={https://arxiv.org/abs/2605.03956}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution