Benchmark Everything Everywhere All at Once

Source: arXiv:2606.06462 · Published 2026-06-04 · By Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han et al.

TL;DR

This paper addresses three problems in building benchmarks for large language models (LLMs) and multimodal large language models (MLLMs): labor-intensive manual design, lack of sustainability, and rapid performance saturation on existing benchmarks. The authors propose Benchmark Agent, a fully autonomous agentic system that automates the entire benchmark construction pipeline, from user query parsing and subtask formulation to data grounding, allocation, and quality-controlled benchmark item generation. The architecture has two components: the Benchmark Planner formulates feasible benchmark specifications from user requirements, and the Benchmark Executor turns these into concrete, evaluation-ready benchmark samples through multi-tool orchestration and stringent verification. Extensive evaluation, including human annotation, LLM-based judges, and consistency tests, shows that Benchmark Agent generates high-quality, customizable, and discriminative benchmarks spanning text, audio, image, and multimodal domains with minimal human effort.

The experimental results show that the agent-produced benchmarks achieve high human acceptance rates (~96-98%) and strong alignment with user intent (UIA scores 68.5–81.5), outperform benchmarks generated by direct prompting of LLMs alone, and reveal model performance differences across scales and families. Ablation studies examine key components such as the Design Agent and transformability scoring; the largest drop reported in this summary comes from removing transformability checking and plan scoring. The method also drastically reduces construction cost, cutting annotation time from roughly 5-6 minutes per sample for human annotators to 0.2-0.3 minutes per sample for the agent. The continual and customizable nature of Benchmark Agent promises a sustainable approach to evolving benchmarks that better probe emerging model capabilities, particularly in underexplored domain-specific reasoning tasks.

Key findings

Benchmark Agent achieves human acceptance rates between 96.1% and 98.5% across 5 representative benchmarks.
LLM-as-Judge user-intention alignment (UIA) scores range from 68.54 to 81.48, indicating strong reflection of user evaluation goals.
The Qwen3.5 model series scales consistently on Agent-generated benchmarks: for example, accuracy on the Multi-Perspective benchmark rises from 71.06 at 2B parameters to 87.23 at 27B.
Direct LLM prompting without the agentic framework yields much lower overall quality scores, especially on target-signal dependency and skill-specific challenge metrics.
Ablations on the Multi-Perspective benchmark show that removing Transformability Checking and Plan Scoring lowers the overall LLM-as-Judge score from 72.55 to 64.59, while removing the Design Agent has a negligible effect there (72.55 to 72.34).
Benchmark Agent reduces annotation time from approximately 5-6 minutes per sample for humans to 0.2-0.3 minutes per sample.
Generated benchmarks cover diverse modalities and domains, including text-only, audio-text, image-text, and tri-modal (audio-text-image) scenarios.
Qualitative error analysis confirms that failure cases predominantly arise from model limitations rather than benchmark annotation errors.

Threat model

The paper’s threat model concerns constructing reliable and discriminative benchmarks that evaluate model capabilities accurately, not defending against adversarial attacks or manipulation. The adversary is implicitly the evaluated model, which may game benchmarks or reach performance saturation. Benchmark Agent assumes that no external attacker can corrupt the benchmark construction process or the underlying dataset pool, and it does not address adversarial robustness or manipulation of benchmark samples.

Methodology — deep read

As the threat model indicates, the methodology assumes no adversarial attacks on the benchmarks themselves; the focus is on reliable, adaptive benchmark construction rather than security against model evasion.

The core data is a large dataset pool (General-Bench) of varied datasets across modalities and domains. The system searches this pool to ground user-specified evaluation tasks in real data. Benchmark items are constructed on demand via composable transformation plans.

Architecturally, Benchmark Agent consists of two tightly coupled modules: the Benchmark Planner and the Benchmark Executor.

The Planner decomposes a user query into subtasks (Design Agent), validates grounding feasibility via dataset search and transformation plan scoring (Grounding Agent), and allocates quotas under resource constraints (Allocation Agent). It operates in a closed loop to refine task sets and plans until feasible.
The Executor realizes each subtask-dataset-transformation triple by orchestrating a sequence of fine-grained sample-level actions, leveraging LLM-based planning and multiple non-LLM tools (e.g., OCR, ASR, image resizing, text-to-speech). It continuously verifies quality and enforces quota fulfillment, with error recycling.
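The Planner's closed loop can be pictured with a minimal Python sketch. Everything named here (`Plan`, `ground`, `allocate`, `plan_benchmark`, the 0.5 score threshold, the even quota split, and the toy pool) is a hypothetical stand-in chosen for illustration, not the paper's code; in the described system each step is carried out by an LLM-driven agent.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    subtask: str
    dataset: str
    score: float  # hypothetical transformability score in [0, 1]


# Toy dataset pool (illustrative only): subtask -> [(dataset name, score)]
TOY_POOL = {
    "counterfactual reasoning": [("toy_vqa", 0.8), ("toy_caption", 0.4)],
    "audio event counting": [("toy_audio", 0.7)],
}


def ground(subtasks, pool, min_score=0.5):
    """Grounding step: keep the best-scoring feasible plan for each subtask."""
    plans, infeasible = [], []
    for subtask in subtasks:
        candidates = [
            Plan(subtask, name, score)
            for name, score in pool.get(subtask, [])
            if score >= min_score
        ]
        if candidates:
            plans.append(max(candidates, key=lambda p: p.score))
        else:
            infeasible.append(subtask)
    return plans, infeasible


def allocate(plans, total):
    """Allocation step: split the sample budget evenly; leftovers go to the first plans."""
    base, extra = divmod(total, len(plans))
    return {p.subtask: base + (1 if i < extra else 0) for i, p in enumerate(plans)}


def plan_benchmark(subtasks, pool, total, max_rounds=3):
    """Closed loop: revise the subtask set until every remaining subtask is grounded."""
    for _ in range(max_rounds):
        plans, infeasible = ground(subtasks, pool)
        if plans and not infeasible:
            return plans, allocate(plans, total)
        # Stand-in for a Design Agent revision: drop subtasks with no feasible grounding.
        subtasks = [s for s in subtasks if s not in infeasible]
    raise RuntimeError("no feasible benchmark specification found")


plans, quotas = plan_benchmark(
    ["counterfactual reasoning", "audio event counting", "olfactory reasoning"],
    TOY_POOL,
    total=11,
)
```

In this toy run the third subtask has no entry in the pool, so it is dropped in the first round and the budget of 11 is split 6/5 across the two grounded subtasks. A real Design Agent would more plausibly rewrite an infeasible subtask than discard it.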

Training plays little role because the system relies on large off-the-shelf LLMs (primarily GPT-5.1) and predefined procedural tools rather than a trainable model. Performance instead depends on carefully designed prompts, scoring heuristics, and multi-agent collaboration.

Evaluation protocols include human expert assessment on correctness, clarity, and relevance; an LLM-as-Judge framework scoring various quality dimensions (format correctness, coherence, grounding, dependency on target skill, challenge level); and consistency checks measuring coherent model ranking across variants. Results are reported on 15 benchmarks spanning text, audio, images, and multimodal tasks, with ablation studies isolating key components.
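The consistency check can be illustrated with a rank-correlation computation. This summary does not state which statistic the paper uses, so the sketch below assumes Kendall's tau (the tau-a variant) as one plausible choice; the model names and scores are made-up toy values, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts covering the same models."""
    models = sorted(scores_a)
    if sorted(scores_b) != models:
        raise ValueError("both variants must score the same set of models")
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both variants order this pair the same way.
        product = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Toy accuracies on two variants of the same benchmark (illustrative only).
variant_1 = {"model_s": 60.0, "model_m": 70.0, "model_l": 80.0}
variant_2 = {"model_s": 58.0, "model_m": 75.0, "model_l": 72.0}

tau_same = kendall_tau(variant_1, variant_1)  # identical rankings give 1.0
tau_diff = kendall_tau(variant_1, variant_2)  # one swapped pair out of three gives 1/3
```

A value near 1 means the two variants rank models almost identically, which is the behavior a consistent benchmark generator should show.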

The paper reports human acceptance and LLM-judge scores as percentages, model performance as accuracy for Qwen3.5 models, and cost as annotation minutes per sample. Benchmark Agent’s code and demo are planned for public release. Some details of the dataset pool are proprietary or not clearly enumerated.

One concrete example: given a user query to evaluate omni-modal understanding, Benchmark Planner decomposes it into subtasks like multimodal counterfactual reasoning, forms dataset-transformation plans grounded in existing multimodal datasets, assigns quotas, and then Benchmark Executor invokes tools to generate concrete, verified QA pairs with images/audio/text. This cycle iteratively optimizes coverage and quality before finalizing the benchmark.
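The Executor side of this example, filling a quota with verified items and recycling failures, might look like the following sketch. The function `execute_subtask`, the `transform` and `verify` callables, and the retry budget are all assumptions made for illustration; in the described system they correspond to LLM-planned tool calls and verification steps.

```python
from collections import deque


def execute_subtask(sources, quota, transform, verify, max_attempts=2):
    """Fill `quota` verified items from `sources`, recycling failed items for a retry."""
    queue = deque((src, 1) for src in sources)
    accepted = []
    while queue and len(accepted) < quota:
        src, attempt = queue.popleft()
        item = transform(src, attempt)
        if verify(item):
            accepted.append(item)
        elif attempt < max_attempts:
            queue.append((src, attempt + 1))  # error recycling: try again later
    return accepted


def toy_transform(src, attempt):
    # Toy behavior: the first attempt on odd-length sources yields an empty answer.
    answer = "" if (attempt == 1 and len(src) % 2) else src.upper()
    return {"question": f"What is shown in {src}?", "answer": answer}


def toy_verify(item):
    # Toy verification: an item is valid only if it has a non-empty answer.
    return bool(item["answer"])


items = execute_subtask(
    ["cat", "bird", "horse"], quota=2, transform=toy_transform, verify=toy_verify
)
answers = [item["answer"] for item in items]  # ["BIRD", "CAT"]
```

Here "cat" fails verification on its first attempt, is pushed back onto the queue, and succeeds on the second attempt, at which point the quota of two is met and the loop stops.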

Overall, the methodology blends agentic LLM prompting, multi-agent collaboration, automated data grounding, and multi-tool orchestration with rigorous verification to achieve autonomous, adaptive benchmark construction.

Technical innovations

A dual-component agentic architecture (Benchmark Planner and Benchmark Executor) adapted from brain-cerebellum hierarchical principles for scalable autonomous benchmark construction.
Multi-agent collaborative design involving Design, Grounding, and Allocation agents to iteratively convert abstract user requirements into executable benchmark specifications with feasible data grounding and allocated quotas.
A novel transformability validation process combining LLM-based and non-LLM tools to assess candidate dataset transformations for alignment, robustness, and evaluation signal preservation.
Interleaved sample-level planning and execution with continuous feedback loops to finely control benchmark sample generation and quality verification.
Integration of multi-modal tool pools (e.g., OCR, speech synthesis, image transformation) under agent orchestration to flexibly generate complex benchmark samples spanning modalities.
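As a rough illustration of transformability validation, the sketch below combines the three criteria named above into a single accept/reject decision. The `TransformScores` type, the use of the minimum as the aggregate, and the 0.6 threshold are assumptions for this example; the summary does not specify how the paper scores or combines these criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformScores:
    alignment: float   # does the transformed item still target the requested skill?
    robustness: float  # does the transformation work reliably across samples?
    signal: float      # is the original evaluation signal (the answer) preserved?


def transformability(scores, threshold=0.6):
    """Return (score, accepted), scoring a plan by its weakest criterion."""
    score = min(scores.alignment, scores.robustness, scores.signal)
    return score, score >= threshold


weak_signal = transformability(TransformScores(0.9, 0.8, 0.5))  # (0.5, False)
balanced = transformability(TransformScores(0.9, 0.8, 0.7))     # (0.7, True)
```

Taking the minimum rather than an average means a plan that destroys the evaluation signal cannot be rescued by high alignment or robustness scores.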

Datasets

General-Bench — unspecified size — proprietary mixed-domain and mixed-modal dataset pool used for dataset search and grounding

Baselines vs proposed

Directly prompting LLMs for benchmark creation: overall LLM-as-Judge score = 41.86 vs Benchmark Agent: 72.55 (Multi-Perspective, Table 2)
Removing the Design Agent: overall score drops from 72.55 to 72.34 (negligible); removing Transformability Checking + Plan Scoring (TC+Scoring): overall score drops sharply from 72.55 to 64.59 (Multi-Perspective, Table 4)
Qwen3.5-2B on Multi-Perspective benchmark: 71.06 vs Qwen3.5-27B: 87.23 (Table 1 shows model scaling signal preserved)
Benchmark Agent annotation cost: ~0.2-0.3 min/sample vs human annotation: 5-6 min/sample (Table 5)
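The size of each gap follows directly from the figures quoted above. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Figures quoted in the comparisons above (Multi-Perspective unless noted).
agent, direct = 72.55, 41.86          # overall LLM-as-Judge score
no_design, no_tc = 72.34, 64.59       # ablated variants
small, large = 71.06, 87.23           # Qwen3.5-2B vs Qwen3.5-27B accuracy
human_min, agent_min = (5.0, 6.0), (0.2, 0.3)  # annotation minutes per sample

gain_over_direct = round(agent - direct, 2)      # 30.69 points
drop_no_design = round(agent - no_design, 2)     # 0.21 points
drop_no_tc = round(agent - no_tc, 2)             # 7.96 points
scaling_gap = round(large - small, 2)            # 16.17 points

# Speedup range: slowest agent vs fastest human, then fastest agent vs slowest human.
speedup_low = round(human_min[0] / agent_min[1], 1)   # 16.7x
speedup_high = round(human_min[1] / agent_min[0], 1)  # 30.0x
```

On these numbers the agent is roughly 17 to 30 times faster per sample than human annotation, and removing transformability checking and plan scoring costs about 8 points, compared with 0.21 for removing the Design Agent.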

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.06462.

Fig 1: Our Benchmark Agent, as the first fully autonomous benchmark building system, can

Fig 2: Benchmark performance saturation on Qwen.

Limitations

The approach depends heavily on the coverage and quality of the underlying dataset pool (General-Bench), which may limit domain breadth or novelty of benchmarks.
Evaluation of the generated benchmarks relies on LLM-as-Judge scoring and human annotation, without rigorous adversarial or stress testing against evaluation gaming.
Quality dimensions such as target-signal dependency and skill-specific challenge show larger variability and remain challenging to control precisely.
The current system primarily handles evaluation construction in controlled settings; real-world deployment and continuous adaptation details are limited.
Some tool-chain executions and transformations involve handcrafted or heuristic steps that may be brittle or hard to generalize extensively.
Benchmark Agent’s dependence on a large backbone LLM (GPT-5.1) may raise reproducibility and accessibility concerns across wider research groups.

Open questions / follow-ons

How well do the dynamically generated benchmarks generalize to adversarial or out-of-distribution model evaluations?
Can the agentic benchmark construction framework incorporate real-time user feedback to continuously refine evaluation criteria with minimal human oversight?
How can target-signal dependency and skill-specific challenge be better quantified and improved to ensure deeper capability probing?
What are the long-term effects of continually evolving benchmarks on preventing model overfitting and benchmark saturation?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Benchmark Agent’s framework offers a promising direction for constructing continually refreshed, customizable evaluation benchmarks that can help rigorously assess state-of-the-art language and multimodal models underpinning bot or CAPTCHA circumvention. The fully automated pipeline reduces reliance on costly manual annotation and enables rapid adaptation to emerging model capabilities or new modalities (e.g., audio-visual inputs). This can inform the design of CAPTCHA challenges resistant to prevailing attack model strengths by identifying domain-specific weaknesses or modalities where models underperform.

Moreover, the agent’s modular and iterative design approach allows tailoring benchmarks toward particular application needs or adversarial threat models relevant to bot defense, offering a pathway to more robust, fine-grained evaluation that evolves with attackers’ sophistication. Nonetheless, practitioners should remain cautious about the current limitations related to dataset coverage, challenge difficulty control, and adversarial robustness, which are crucial for secure CAPTCHA evaluation and defense strategy development.

Cite

@article{arxiv2606_06462,
  title={Benchmark Everything Everywhere All at Once},
  author={Shiyun Xiong and Dongming Wu and Peiwen Sun and Yuang Ai and Bokang Yang and Wencheng Han and Xiao-Hui Li and Xiangyu Yue},
  journal={arXiv preprint arXiv:2606.06462},
  year={2026},
  url={https://arxiv.org/abs/2606.06462}
}
