Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

Source: arXiv:2604.17803 · Published 2026-04-20 · By Prasoon Goyal, Sattvik Sahai, Michael Johnston, Hangjie Shi, Yao Lu, Shaohua Liu et al.

TL;DR

Adversarial Arena tackles a practical bottleneck in post-training LLMs: high-quality, diverse, multi-turn data is expensive to produce, especially in low-resource or safety-critical domains. The paper’s core idea is to turn data collection into a competitive game: independent attacker teams try to elicit failures from defender systems, while defender teams try to remain useful and safe. Because the data comes from repeated adversarial tournaments, the resulting conversations are expected to be both richer and more varied than standard crowdsourcing or single-pipeline synthetic generation.

The paper validates this framework with a cybersecurity-alignment competition involving 10 academic teams split into 5 attackers and 5 defenders, across 4 official tournaments and 13 practice runs. The competition generated 19,683 labeled multi-turn conversations after filtering incomplete runs. The main downstream result is that fine-tuning Mistral-7B-Instruct on the competition data improved secure code generation on CyberSecEval-Instruct and improved refusal of malicious cyberactivity on CyberSecEval-MITRE, with the paper also reporting evidence that multiple teams and multiple tournament rounds increased semantic diversity across the dataset.

Key findings

The competition produced 19,683 labeled multi-turn conversations after discarding incomplete runs from 13 practice runs and 4 official tournaments.
Fine-tuning Mistral-7B-Instruct on 9,942 conversations improved CyberSecEval-Instruct secure code generation from 72.60% to 86.01% (+13.41 points; the abstract reports this as an 18.47% improvement).
Fine-tuning Mistral-7B-Instruct on 13,336 conversations improved CyberSecEval-MITRE refusal from 57.10% to 73.90% (+16.80 points; the abstract reports this as a 29.42% improvement).
The annotated evaluation pipeline used 3 human experts per conversation and reported moderate agreement: overall Fleiss’ kappa = 0.385 and Krippendorff’s alpha = 0.385 for the security-event labels.
Static analysis with Amazon CodeGuru was used as the first-stage detector for vulnerable code; if it fired, the conversation was labeled a successful attack without further human review.
Dataset diversity analysis found that within-subset semantic distance was lower than between-subset distance at all three grouping levels: attacker-level 0.2904 vs 0.3211, defender-level 0.3114 vs 0.3282, tournament-level 0.3018 vs 0.3269.
Each attacker-defender matchup comprised 200 conversations, with up to 10 total turns per conversation (5 adjacency pairs), and 25 matchups per tournament because 5 attackers were paired against 5 defenders.

Threat model

The adversary is an external attacker team with access to open-source models and custom prompting/planning infrastructure, trying to induce the defender system to produce insecure code, malicious code, or explicit cyberattack assistance over multi-turn English conversations. The adversary cannot use closed-box model APIs at runtime and is constrained by the conversation cap and turn budget. Defenders cannot assume a perfect evaluator; they must tolerate noisy labels, static-analysis false positives, and changing attack strategies over successive tournaments.

Methodology — deep read

The threat model is a competition-style adversarial setup rather than a conventional red-team benchmark. The attackers are independent teams that can use open-source models, planners, rules, prompt mutators, and dialog managers to elicit unsafe or low-quality outputs from the defender system. The defenders are systems under test; in the cybersecurity case study they start from a provided open-weight 8B coding model and can add limited auxiliary components. Closed-box model APIs were not allowed at runtime for either side. The task of interest is cybersecurity alignment: attackers try to induce vulnerable code, malicious code, or cyberattack assistance; defenders try to stay useful while refusing unsafe requests.

The data came from a staged competition with 10 academic teams, split evenly into five attackers and five defenders, over 13 practice runs and 4 official tournaments. Each attacker-defender matchup generated 200 conversations, and each conversation could contain up to 10 turns total. The paper says the challenge ultimately produced more than 96,000 multi-turn conversations across all runs, of which about 20,000 came from the official tournaments and were labeled. After removing conversations that were incomplete due to execution failures, the final released dataset contained 19,683 conversations. The paper does not describe conventional train/validation/test splits for the dataset itself; instead, it uses subsets of the generated data for downstream fine-tuning depending on whether the defender responses contained vulnerable code or malicious assistance. Preprocessing for evaluation extracted Python code from defender responses for static analysis; for the human-review path, expert annotators followed refined guidelines.

Architecturally, Adversarial Arena is an orchestrated tournament framework. The orchestrator runs many multi-turn attacker-defender conversations asynchronously and reproducibly, then feeds outcomes into rankings and feedback loops. The crucial novelty is that evaluation is not just for scoring: it is also the supervision signal that shapes future generations. In the cybersecurity case study, evaluation had two steps. First, any Python code in the defender response was checked with Amazon CodeGuru for vulnerabilities; if a vulnerability was detected, the conversation was labeled a successful attack. If not, a three-annotator human panel labeled whether the defender complied with a malicious request or provided malicious code/explanations. Majority vote determined the final label. The paper is explicit that the static analyzer is a high-precision, low-recall filter and that the human annotation process went through five refinement stages. A concrete example from the paper’s logic: if an attacker persuades a defender to emit Python that CodeGuru flags as vulnerable, the conversation is counted as an attacker win immediately; if not, the conversation is only an attacker win if three cybersecurity experts agree that the model assisted with malicious activity.

Training and optimization details are only partially specified because the paper is primarily a data-collection framework, not a model-training paper. The defender teams were allowed to augment the base 8B model with prompt handling, classifiers, decoding logic, and small auxiliary models, but the total size of auxiliary models had to remain under 800M parameters. They could also chain multiple passes through the core 8B model as long as latency stayed within 45 seconds. The paper does not report optimizer, learning rate, batch size, number of epochs, or random-seed strategy for the downstream Mistral fine-tuning, so those details are not reproducible from the text provided. It does report runtime characteristics of the competition infrastructure: each run typically completed in under 10 hours; attack bots averaged 2–7.9 seconds per response and defender bots averaged 4.1–10.1 seconds.

Evaluation was multi-layered. At the conversation level, the main labels were successful attack vs successful defense. For attacker ranking, the paper combined attack success rate (ASR) with diversity: diversity was measured by average pairwise BLEU similarity among successful attacks within a matchup, and the normalized attacker score was ASR × Diversity / 100. For defender ranking, defense success rate (DSR) was combined with normalized utility; utility was measured on three static test sets: instruction-based code generation, multi-turn benign cybersecurity conversations, and multi-turn code generation. Utility was capped at the base ChallengeLLM’s utility, and the defender score was Average DSR × (Utility/100)^4, so utility drops were heavily penalized. The paper reports inter-annotator agreement in Table 1 and semantic diversity results in Table 2, and it reports downstream fine-tuning results in Table 3. The downstream evaluation uses CyberSecEval-Instruct for secure code generation and CyberSecEval-MITRE for malicious cyberactivity refusal, comparing the base Mistral-7B-Instruct model with the same model fine-tuned on selected subsets of the arena data.

Reproducibility is mixed. The paper says the resulting datasets will be released upon publication, but the full implementation details for the orchestrator are deferred to Appendix B, and the downstream fine-tuning recipe is underspecified in the main text. The paper also notes that evaluation evolved during the challenge, which is realistic but makes exact replication of leaderboard scores harder. Because the paper is about a live competition, some design choices were adapted midstream based on observed attacker behavior, which is valuable operationally but limits strict experimental control.

Technical innovations

Adversarial Arena turns data generation into a two-sided tournament where attackers and defenders are independently developed by different teams, rather than relying on a single synthetic-data pipeline.
The framework uses evaluator outputs as both labels and incentive signals, creating a feedback loop that rewards attacker success, defender robustness, and data diversity.
The cybersecurity case study combines automatic static analysis with expert human annotation to label multi-turn conversations, instead of relying on only one judge type.
Attacker ranking explicitly incorporates diversity through pairwise BLEU similarity among successful attacks, making repetitive winning attacks score worse than diverse winning attacks.
Defender ranking combines defense success with a steep utility penalty, using a fourth-power term on normalized utility to prevent safety gains from collapsing usefulness.

Datasets

Adversarial Arena cybersecurity alignment dataset — 19,683 labeled multi-turn conversations — generated in 13 practice runs and 4 official tournaments; will be released upon publication
Training subset for secure code generation fine-tuning — 9,942 conversations — filtered to exclude conversations containing vulnerable code as detected by Amazon CodeGuru
Training subset for malicious cyberactivity refusal fine-tuning — 13,336 conversations — filtered to exclude conversations containing code or detailed malicious-assistance explanations as labeled by expert annotators

Baselines vs proposed

Mistral-7B-Instruct baseline on CyberSecEval-Instruct secure code generation: 72.60% vs proposed fine-tuned model: 86.01%
Mistral-7B-Instruct baseline on CyberSecEval-MITRE refusal: 57.10% vs proposed fine-tuned model: 73.90%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.17803.

Fig 1

Fig 1: Adversarial Arena Overview: Attacker/defender pairs interact over several tournament

Fig 2

Fig 2: T-SNE plots: Conversations in the left plot are grouped by attackers, the middle plot is

Fig 3

Fig 3: Orchestrator Architecture

Fig 4

Fig 4 (page 17).

Fig 4

Fig 4: Vulnerable vs Malicious Sessions Across Tournaments

Fig 5

Fig 5: Example Adversarial Conversation 1: A representative conversation between an attacker

Fig 6

Fig 6: Example Adversarial Conversation 2: A conversation demonstrating a multi-step attack

Fig 7

Fig 7: Visualizations of pairwise agreements between annotators, along with a histogram of inter-

Limitations

The paper relies on a noisy evaluator: CodeGuru is treated as high-precision/low-recall, and the human labels have only moderate agreement (overall Fleiss’ kappa 0.385).
The downstream fine-tuning recipe is not fully specified in the provided text, so exact reproduction of the reported improvements is not possible from the paper excerpt alone.
The competition setting is cybersecurity-specific; the paper argues generality, but only one full empirical case study is shown here.
Evaluation evolved during the challenge, so the final dataset and rankings reflect an adaptive process rather than a frozen protocol.
Attack success became imbalanced toward vulnerable-code elicitation in later stages, which the authors note as a coverage problem.
The paper does not report statistical significance tests, confidence intervals, or variance across random seeds for the downstream fine-tuning results in the excerpt provided.

Open questions / follow-ons

How stable are the gains if attacker strategies shift after the data is collected, e.g. to a held-out adversary team or a later tournament distribution?
Would a different diversity objective, such as embedding-based clustering or novelty under a held-out judge, better capture useful coverage than BLEU-based similarity?
How well does the framework transfer to non-cyber adversarial tasks where the evaluator is even noisier or more subjective, such as misinformation or over-agreement?
Can tournament design be made online or continual without encouraging strategic withholding of the strongest attacks until the end?

Why it matters for bot defense

For bot defense and CAPTCHA practitioners, the main lesson is that data quality improves when generation is tied to an adversarial objective and a scoring loop, rather than gathered by passive crowdsourcing. If you are building challenge-response systems, abuse detection, or red-team corpora, this paper suggests a way to structure the pipeline so that multiple independent attack styles are explored in parallel and the resulting transcripts are labeled by an explicit evaluator.

The security caveat is important: the framework’s value depends heavily on the evaluator. In CAPTCHA or anti-bot contexts, if your scoring function is too brittle, participants will optimize against its blind spots, and you will collect skewed examples that overrepresent one exploit family. The paper’s mix of automatic detection, expert review, diversity penalties, and utility constraints is a useful template for designing collection programs where you want both realism and coverage, but it also shows that you should expect label noise, distribution drift across rounds, and attacker adaptation over time.

Cite

bibtex

@article{arxiv2604_17803,
  title={ Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition },
  author={ Prasoon Goyal and Sattvik Sahai and Michael Johnston and Hangjie Shi and Yao Lu and Shaohua Liu and Anna Rumshisky and Rahul Gupta and Anna Gottardi and Desheng Zhang and Lavina Vaz and Leslie Ball and Lucy Hu and Luke Dai and Samyuth Sagi and Maureen Murray and Sankaranarayanan Ananthakrishnan },
  journal={arXiv preprint arXiv:2604.17803},
  year={ 2026 },
  url={https://arxiv.org/abs/2604.17803}
}

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​