Adversarial SQL Injection Generation with LLM-Based Architectures

Source: arXiv:2605.11188 · Published 2026-05-11 · By Ali Karakoc, H. Birkan Yilmaz

TL;DR

This paper studies whether large language models can generate SQL injection payloads that evade modern defenses, and it does so as a benchmarking paper rather than a single-model demo. The authors compare seven generators: three baselines (a SQLMap-like deterministic payload engine, vanilla zero-shot GPT-4o, and a reimplementation of GenSQLi) plus two proposed systems, RADAGAS (instantiated with GPT-4o, Claude 3.7 Sonnet, and DeepSeek-R1) and RefleXQLi (built on GPT-4o). The testbed is broad: six rule-based WAF configurations (ModSecurity PL1–3 and Coraza PL1–3), two AI/ML WAFs (WAF-Brain and CNN-WAF), two commercial WAFs (AWS WAF and Cloudflare WAF), and a MySQL 8.0 execution validator.

The main novelty is in the generation pipelines. RADAGAS combines retrieval-augmented generation over a curated SQLi knowledge base with maximum marginal relevance retrieval and multi-stage diversity filtering; RefleXQLi uses four-step chain-of-thought planning plus a dual-LLM generator/discriminator loop. The result is a careful comparison of how model choice and decoding temperature interact with attack success, diversity, and execution validity. The headline finding is that these LLM-based systems can be quite effective against AI/ML-based WAFs, but they remain weak against rule-based WAFs, and there is no single configuration that wins everywhere.

Key findings

RADAGAS-GPT4o achieved the highest reported average WAF bypass rate at 22.73%, beating the traditional SQLi baseline (15.01%), vanilla GPT-4o (12.90%), and GenSQLi (20.35%).
RADAGAS variants were much stronger on AI/ML WAFs than on rule-based WAFs: RADAGAS-DeepSeek reached 92.49% bypass on WAF-Brain, and RADAGAS-Claude reached 80.48% on CNN-WAF.
Rule-based defenses were far harder to evade: the authors report only 0–5.70% bypass on ModSecurity and Coraza for the RADAGAS variants.
Cloudflare was the most permissive commercial WAF for RADAGAS-GPT4o in their study, with a reported 49.50% bypass rate.
Against MySQL execution validation, RADAGAS systems achieved 60–64% execution success, compared with 21–37% for the baselines.
RefleXQLi achieved a 21.21% average bypass rate with low run-to-run variability (σ = 0.37), but its MySQL execution success was only 27.80%.
The authors state the study ran 240 experiments, generated 240,000 payloads, and performed about 2.2 million WAF/execution tests.
Temperature effects were model-specific and non-monotonic: GPT-4o performed best at T=0.1 (22.73%), DeepSeek-R1 at T=0.9 (22.09%), and Claude 3.7 Sonnet at T=0.6 (21.73%).

Threat model

The adversary is an automated SQLi payload generator trying to evade detection by a target WAF and, separately, to produce payloads that execute successfully against a MySQL-backed vulnerable application. The generator knows the WAF family or characteristics used in prompting and can iterate over multiple candidate payloads, but it cannot modify or disable the WAF itself, alter the vulnerable app beyond the query input, or rely on privileged access to the target environment. In RefleXQLi, the adversary additionally receives feedback from a discriminator LLM; in RADAGAS, it receives retrieved context from a curated SQLi corpus and uses diversity filters. The paper does not model human defenders adapting in real time during the attack loop.

Methodology — deep read

The threat model is an automated adversarial tester that attempts to generate SQL injection payloads to probe and bypass WAFs, not a live attacker exploiting a real target site. The paper assumes the defender has one of ten evaluated protections: six rule-based open-source WAF configurations, two AI/ML WAFs, or two commercial cloud WAFs, plus a MySQL-backed vulnerable application used as an execution oracle. The attacker is the generator system itself; it knows the target WAF category and, in RefleXQLi, receives a discriminator-style rejection signal. The paper does not claim stealth against human review or operational constraints like rate limits, ban policies, or adaptive anti-automation controls.

On data provenance, the paper’s “data” is not a labeled external benchmark in the usual ML sense, but a curated SQLi knowledge base plus generated payload corpora. For RADAGAS, the offline retrieval corpus is described as an 82 KB document assembled from OWASP, PortSwigger Web Security Academy SQLi/bypass guidance, GitHub repositories such as PayloadsAllTheThings, MySQL syntax notes, and common obfuscation/mutation/encoding techniques. That corpus is chunked with RecursiveCharacterTextSplitter using chunk size 200 and overlap 50, embedded with sentence-transformers/paraphrase-MiniLM-L3-v2, and indexed in FAISS. The experiment set spans 240,000 generated payloads across 48 RADAGAS configurations: GPT-4o at 4 temperatures (0.10, 0.30, 0.60, 0.90) × 6 diversity thresholds (0.70, 0.75, 0.80, 0.85, 0.90, 0.95) × 5 runs, and Claude/DeepSeek at 2 temperatures each × 6 thresholds × 5 runs. RefleXQLi and the baselines were also run in 5 repeated trials of 1,000 payloads per run, but the excerpt does not fully expose all split details for the non-RADAGAS systems beyond that repeated-run structure.
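The experiment count can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this summary; the dictionary layout is illustrative, and the two Claude and DeepSeek temperatures are left as placeholders because their values are not listed in the paragraph above.

```python
from itertools import product

# Temperatures per model. Only the GPT-4o values are listed in the summary;
# "t1"/"t2" are placeholders for the two unspecified Claude/DeepSeek settings.
TEMPERATURES = {
    "GPT-4o": [0.10, 0.30, 0.60, 0.90],
    "Claude 3.7 Sonnet": ["t1", "t2"],
    "DeepSeek-R1": ["t1", "t2"],
}
DIVERSITY_THRESHOLDS = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
RUNS_PER_CONFIG = 5
PAYLOADS_PER_RUN = 1_000

# One configuration = (model, temperature, diversity threshold).
configs = [
    (model, temp, threshold)
    for model, temps in TEMPERATURES.items()
    for temp, threshold in product(temps, DIVERSITY_THRESHOLDS)
]
experiments = len(configs) * RUNS_PER_CONFIG
payloads = experiments * PAYLOADS_PER_RUN

print(len(configs), experiments, payloads)  # 48 240 240000
```

The totals match the 48 configurations, 240 experiments, and 240,000 payloads reported by the authors.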

Architecturally, RADAGAS is a three-stage pipeline. First, the offline phase builds the knowledge base embeddings and FAISS index. Second, the online phase receives a query q, converts it to an embedding, and uses MMR retrieval with k=3 to select context that balances relevance and novelty. Third, the generation phase calls an LLM (GPT-4o, Claude 3.7 Sonnet, or DeepSeek-R1) to produce a payload conditioned on the retrieved context and previously accepted payloads. New payloads are filtered in several ways: the SQL must execute validly against a vulnerable MySQL-backed web app, semantic similarity to previously accepted payloads must stay below a threshold, normalized Levenshtein distance must be above a threshold, and BERTScore similarity must remain below a threshold. The paper also defines separate diversity metrics for analysis: semantic diversity via SBERT cosine distance, lexical diversity via Levenshtein, n-gram Jaccard distance, contextual diversity via BERTScore, syntactic diversity via AST tree distance, and functional diversity via execution outcome categories. In effect, the system is designed to optimize for both bypass potential and payload variety, while explicitly rejecting near-duplicates.
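To make the retrieval step concrete, here is a minimal pure-Python sketch of maximal marginal relevance selection. It is not the authors' implementation (they index with FAISS): the function names, the toy 2-D vectors, and the trade-off weight `lam=0.5` are assumptions for illustration, and only `k=3` comes from the summary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lam=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    Each step picks the candidate maximizing
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    lam is a hypothetical default; the summary does not state its value.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings: indices 0, 1, and 3 are near-duplicates; index 2 is distinct.
query = [1.0, 1.0]
docs = [[1.0, 0.20], [1.0, 0.21], [0.2, 1.0], [1.0, 0.19]]
picks = mmr_select(query, docs, k=3)
print(picks)
```

In this toy run the most relevant chunk (index 1) is chosen first, and the second pick is the distinct chunk (index 2) rather than a near-duplicate, which is the relevance-versus-novelty balance the paragraph describes.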

RefleXQLi is a different design choice: it uses GPT-4o as both generator and discriminator, with a four-step chain-of-thought prompt structure. The generator first analyzes the target WAF, then formulates an evasion strategy, designs a payload, and refines it for subtlety. The discriminator evaluates each payload and provides rejection feedback when it flags patterns. The process loops for at most Imax=3 iterations with a rejection threshold τ=7 out of 10. This is a constrained adversarial loop, closer to iterative self-refinement than retrieval. The paper says RefleXQLi produced 100% unique payloads, but it does not provide a formal diversity table in the excerpt to disentangle uniqueness from deeper semantic or functional diversity.
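The control flow of such a loop can be sketched as follows. This is a hedged illustration, not the paper's code: `refine`, the stub generator, and the stub discriminator are hypothetical names, the LLM calls are replaced by placeholders, and the reading that a score at or above τ means rejection is an assumption. Only `I_MAX = 3` and `TAU = 7` come from the summary.

```python
from typing import Callable, Optional, Tuple

I_MAX = 3  # maximum refinement iterations (reported value)
TAU = 7    # rejection threshold on a 0-10 scale (reported value)

Generator = Callable[[str, Optional[str]], str]
Discriminator = Callable[[str], Tuple[int, str]]

def refine(generate: Generator, discriminate: Discriminator, waf_profile: str,
           i_max: int = I_MAX, tau: int = TAU) -> Tuple[Optional[str], int]:
    """Return (accepted candidate or None, number of iterations used)."""
    feedback: Optional[str] = None
    for iteration in range(1, i_max + 1):
        candidate = generate(waf_profile, feedback)
        score, feedback = discriminate(candidate)
        # ASSUMPTION: a score at or above tau means "flagged, reject and retry".
        if score < tau:
            return candidate, iteration
    return None, i_max

def stub_generate(waf_profile: str, feedback: Optional[str]) -> str:
    """Placeholder for the generator LLM call; returns an inert string."""
    return f"<candidate for {waf_profile}; feedback={feedback!r}>"

def make_stub_discriminator(scores):
    """Placeholder for the discriminator LLM; replays a fixed score sequence."""
    remaining = iter(scores)
    def discriminate(candidate: str) -> Tuple[int, str]:
        score = next(remaining)
        return score, f"flagged with score {score}"
    return discriminate

accepted, used = refine(stub_generate, make_stub_discriminator([9, 8, 5]), "demo-waf")
print(used)  # 3
```

Capping the loop at three iterations bounds the number of LLM calls per payload; a candidate that is still rejected after the last iteration is simply dropped in this sketch.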

The generators are not trained end to end in the paper excerpt; they are inference-time architectures over proprietary or hosted foundation models. The main tunable hyperparameters are temperature and the RADAGAS diversity threshold. GPT-4o was explored at four temperatures to find a finer-grained optimum, while Claude and DeepSeek were each tested at two temperatures for comparative coverage. RefleXQLi and the baseline systems used a default temperature of 0.7 to isolate the effect of reasoning and architecture. The excerpt describes the compute environment as a mix of bare-metal and cloud infrastructure (AWS EC2, SageMaker, Bedrock), but the hardware section is truncated before full CPU/GPU details, and it gives no training epochs, batch sizes, seeds, or optimizer settings because there is no conventional training loop.

Evaluation is straightforward but broad. Each generated payload is tested against all ten WAFs and the MySQL execution oracle, and the paper reports bypass rate, mean bypass rate across WAFs, execution success, standard deviation over five runs, and diversity metrics. They also compute Pearson and Spearman correlations between diversity and bypass rate with p<0.05 used as the significance threshold. The authors compare RADAGAS, RefleXQLi, GenSQLi, vanilla GPT-4o, and the SQLMap-like traditional baseline, and they additionally slice results by WAF family, model, temperature, and diversity threshold. One concrete end-to-end example implied by the pipeline is: a user query is embedded, MMR retrieves three relevant SQLi snippets from the OWASP/PortSwigger/GitHub corpus, GPT-4o generates a candidate payload at a chosen temperature, the payload is executed against the MySQL-backed vulnerable app, then filtered by semantic/lexical/contextual thresholds before being sent to each WAF; accepted payloads are accumulated and later measured for bypass and diversity. Reproducibility is only partial: the authors specify models, temperatures, chunking, thresholds, and the knowledge-base sources, but the excerpt does not mention public code, frozen weights, or a released dataset; moreover, at least the GenSQLi baseline was not publicly available, so it was reimplemented from the paper.
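The reported statistics are simple to compute. The sketch below shows one way to do so using only the standard library; the function names are hypothetical, the numbers are toy values rather than results from the paper, and the choice of sample standard deviation is an assumption since the summary does not say which estimator the authors used.

```python
from statistics import mean, stdev

def bypass_rate(verdicts):
    """Percentage of payloads that passed the WAF (True = bypassed)."""
    return 100.0 * sum(verdicts) / len(verdicts)

def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))

# Toy inputs only; none of these numbers come from the paper.
toy_run_rates = [20.0, 22.0, 24.0]
print(bypass_rate([True, False, False, True]))     # 50.0
print(mean(toy_run_rates), stdev(toy_run_rates))   # 22.0 2.0
```

A significance test (the p<0.05 threshold mentioned above) would need an additional step, for example a permutation test or a statistics library, which this sketch leaves out.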

Technical innovations

RADAGAS combines RAG, FAISS, MMR retrieval, and multi-stage diversity filtering to generate SQLi payloads that are both relevant and non-redundant.
RefleXQLi introduces a four-step chain-of-thought generation loop plus a dual-LLM generator/discriminator feedback mechanism for iterative payload refinement.
The paper evaluates diversity with six separate metrics, including AST-based syntactic diversity and functional diversity from execution outcomes, rather than relying on uniqueness alone.
The benchmark spans both open-source and commercial WAFs plus a live MySQL execution oracle, giving a broader comparison than prior single-defense studies.
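Two of the lexical diversity measures named above are easy to sketch without external libraries. The code below is an illustration under stated assumptions: character trigrams are assumed for the n-gram Jaccard distance, the `min_distance=0.3` cutoff in `is_novel` is a hypothetical value, and the embedding-based metrics (SBERT, BERTScore) and AST distance are omitted because they need model or parser dependencies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def ngram_jaccard_distance(a: str, b: str, n: int = 3) -> float:
    """1 - |A ∩ B| / |A ∪ B| over character n-grams (n=3 is an assumption)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)

def is_novel(candidate: str, accepted: list, min_distance: float = 0.3) -> bool:
    """Keep a candidate only if it is far enough from every accepted string.

    min_distance is a hypothetical cutoff, not a value from the paper.
    """
    return all(normalized_levenshtein(candidate, p) >= min_distance
               for p in accepted)

print(levenshtein("kitten", "sitting"))        # 3
print(is_novel("kitten", ["kitten"]))          # False
print(is_novel("kitten", ["sitting"]))         # True
```

A filter of this shape rejects exact and near-exact repeats cheaply before the more expensive embedding-based checks run.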

Datasets

Curated SQLi knowledge base — 82 KB — OWASP, PortSwigger Web Security Academy, PayloadsAllTheThings GitHub, MySQL syntax and obfuscation notes
Generated payload corpus — 240,000 payloads — synthetic outputs from RADAGAS variants across 48 configurations
Baseline payload runs — 5 runs × 1,000 payloads per run per baseline system — synthetic outputs from SQLMap-like, vanilla GPT-4o, and GenSQLi reimplementation

Baselines vs proposed

Traditional SQLi (SQLMap-like): average WAF bypass rate = 15.01% vs RADAGAS-GPT4o = 22.73%, RADAGAS-DeepSeek = 22.09%, RADAGAS-Claude = 21.73% (each at its best temperature)
Vanilla GPT-4o: average WAF bypass rate = 12.90% vs RADAGAS-GPT4o = 22.73%
GenSQLi: average WAF bypass rate = 20.35% vs RADAGAS-GPT4o = 22.73%
RefleXQLi: average WAF bypass rate = 21.21% vs traditional SQLi = 15.01%, vanilla GPT-4o = 12.90%, GenSQLi = 20.35%
MySQL execution success: RADAGAS systems = 60–64% vs baseline systems = 21–37%
WAF-Brain: RADAGAS-DeepSeek bypass rate = 92.49% (no per-WAF baseline figure appears in this summary)
CNN-WAF: RADAGAS-Claude bypass rate = 80.48% (no per-WAF baseline figure appears in this summary)
Cloudflare: RADAGAS-GPT4o bypass rate = 49.50% (no per-WAF baseline figure appears in this summary)
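The size of these gaps is easier to judge as percentage-point and relative gains. The short calculation below uses only the average bypass rates listed above; the helper name `gain` is illustrative.

```python
# Average WAF bypass rates (percent) as reported in this summary.
BASELINES = {
    "Traditional SQLi (SQLMap-like)": 15.01,
    "Vanilla GPT-4o": 12.90,
    "GenSQLi": 20.35,
}
RADAGAS_GPT4O = 22.73

def gain(proposed: float, baseline: float):
    """Return (percentage-point gain, relative gain in percent)."""
    points = proposed - baseline
    return round(points, 2), round(100 * points / baseline, 1)

gains = {name: gain(RADAGAS_GPT4O, rate) for name, rate in BASELINES.items()}
for name, (points, relative) in gains.items():
    print(f"{name}: +{points:.2f} points, +{relative:.1f}% relative")
# Traditional SQLi (SQLMap-like): +7.72 points, +51.4% relative
# Vanilla GPT-4o: +9.83 points, +76.2% relative
# GenSQLi: +2.38 points, +11.7% relative
```

The margin over the strongest baseline, GenSQLi, is 2.38 percentage points, much smaller than the margin over vanilla GPT-4o.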

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.11188.

Fig 6: Bypass performance heatmap shows higher bypass rates in darker shades.

Limitations

The paper reports strong performance on AI/ML WAFs but very weak bypass on rule-based WAFs, so the gains are not universal.
The excerpt does not provide full reproducibility details such as code release, seeds, exact prompt text, or all infrastructure specifications; the hardware section is truncated.
The GenSQLi baseline was reimplemented from the paper because the original implementation was not publicly available, which can introduce fidelity gaps.
The study uses a curated knowledge base and synthetic payload generation rather than live red-team targets, so external validity to real-world application stacks is limited.
RefleXQLi’s discriminator is another LLM, so its rejection behavior may not match an actual defensive system and may partly measure self-consistency rather than exploitability.
The correlation analysis is described, but the excerpt does not include actual correlation coefficients or full statistical tables, limiting interpretation of diversity-versus-bypass claims.

Open questions / follow-ons

How much of the gain comes from retrieval over curated SQLi examples versus the diversity filters versus the specific foundation model choice?
Would the same pipelines work against WAFs that adapt online or use per-tenant behavioral baselines rather than static classifiers or signatures?
How sensitive are the results to the quality and scope of the retrieval corpus; would a smaller or more stale corpus reduce bypass rates materially?
Can the diversity metrics be linked to exploit effectiveness in a causal way, or are they only weak proxies for better payload search?

Why it matters for bot defense

For a bot-defense engineer, the key lesson is that LLM-driven adversarial generation is no longer just a curiosity: a retrieval-augmented or self-reflective generator can systematically explore payload variants, and the success profile depends heavily on the defense class being tested. If your stack relies mostly on pattern matching or ML classifiers trained on narrow SQLi templates, this paper suggests a determined generator may find a substantial bypass surface. The practical reaction is to evaluate defenses against broader, model-generated payload families, not just canonical SQLMap-like signatures, and to include execution-grounded validation plus diversity-aware test corpora when building or tuning WAF rules.

Cite

bibtex

@article{arxiv2605_11188,
  title={Adversarial SQL Injection Generation with LLM-Based Architectures},
  author={Ali Karakoc and H. Birkan Yilmaz},
  journal={arXiv preprint arXiv:2605.11188},
  year={2026},
  url={https://arxiv.org/abs/2605.11188}
}
