WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

Source: arXiv:2606.01869 · Published 2026-06-01 · By Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang, Yongcan Yu, Yubin Wang et al.

TL;DR

WorldCoder-Bench addresses the emerging problem of evaluating large language models (LLMs) tasked with generating fully executable, physically grounded 3D worlds in browser-native Three.js environments. Unlike prior benchmarks which assess only rendered pixels or DOM elements, this work introduces a benchmark with 2,026 expert-curated tasks that require correctness in physical simulation, asset integration, and interactive state synchronization. Alongside the benchmark, they propose StateProbe, a novel evaluation protocol that runs generated programs in a sandboxed browser, probes hidden runtime state variables, and verifies execution against mutation-hardened, expert-authored behavioral contracts.

The results reveal major capability gaps: the best model only achieves 27.8% verification coverage on core tasks and 19.9% on more challenging perturbed variants. Failures predominantly arise from state-schema drift and broken interaction chains rather than missing scene elements, showing that visual plausibility is a poor proxy for functional correctness. Complementary utility metrics demonstrate that despite low formal correctness, faster or cheaper models can still provide value on simpler domains. WorldCoder-Bench represents a significant step toward benchmarking not just visually plausible, but behaviorally correct 3D world generation by autonomous agents.

Key findings

The strongest model (GPT-5.4) reaches only 27.8% Verification Coverage (V-Cov) on WORLDCODER-CORE and 19.9% on WORLDCODER-ROBUST (Table 1, 3).
Affordance Coverage (A-Cov) ranges 40.6–67.4%, State Coverage (S-Cov) 30.7–46.1%, but Transition Coverage (T-Cov) is much lower at 15.8–26.3%, indicating failures mostly occur at runtime state transitions.
DOM-based evaluation correlates poorly with ground truth V-Cov (Kendall τb=−0.021) and has a false positive pass rate of 33.7% on defective outputs (Table 2).
An 8-step pure agent inspection approach has a positive but weak correlation (τb=+0.111) and misclassifies 45.6% of low-quality outputs as passing.
Mutation-hardened behavioral contracts successfully reject common program defects such as stale state updates and broken event handlers, ensuring evaluation rigor (Section 3.2).
Robust evaluation variants with randomized physical constants expose template memorization; Qwen3.6-Plus drops from 25.3% to 13.5% V-Cov under perturbations (Table 3).
Error taxonomy shows 83.6% of zero Transition Coverage failures arise from State Schema Drift (42.8%) and Broken Interaction Chain (40.8%) (Figure 7).
Return on Automation (RoA) and Time Efficiency Multiplier (TEM) reveal that despite low V-Cov, some models (e.g. DeepSeek) offer high cost-adjusted utility by trading correctness for faster inference.

Threat model

The adversary is an autonomous large language model generating executable Three.js 3D worlds from natural language prompts, potentially producing incorrect or incomplete programs that appear visually plausible but fail to meet physical or interaction correctness. The adversary cannot access hidden evaluation contracts, action scripts, or runtime thresholds, and cannot interfere with sandboxed execution or API rate limiting. The threat is inadvertent or intentional generation of non-functional or defective code rather than adversarial attempts to break the evaluation system itself.

Methodology — deep read

Threat Model and Assumptions: The benchmark targets autonomous code generation systems (LLMs) that generate entire executable 3D worlds from natural language prompts without access to internal evaluation contracts or hidden action sequences. The assumption is that adversarial models might produce superficially plausible worlds but fail in underlying correctness. The evaluator StateProbe assumes no arbitrary model interference during sandboxed runtime probes but expects the generated HTML to expose a standardized state interface.
Data: WORLDCODER-BENCH comprises 2,026 expert-curated tasks split into Simulation, Rendering, and Application macro-categories covering 15 fine-grained domains (e.g., soft_body physics, shaders, games). Tasks come as prompt.json with optional GLB asset folders. The dataset construction pipeline involves expert seed curation (5 PhD researchers), LLM-assisted expansion (10k candidates), human filtering for clarity and evaluability, runtime verification in headless browsers, and anti-contamination via randomized variants with perturbed physics constants and asset filenames. The final splits include WORLDCODER-CORE (205 tasks), WORLDCODER-EXTENDED (1,621 tasks), WORLDCODER-ROBUST (randomized 615 variants), and WORLDCODER-DEV (200 public tasks with contracts).
Architecture / Algorithm: The nine evaluated models include proprietary LLMs like GPT-5.4, Gemini 3.1 Pro, Claude Opus/Sonnet, and open-weight models DeepSeek, Qwen, Kimi, MiniMax. These models receive only the prompt and task schema and output a single self-contained HTML page embedding Three.js code. No direct details on models’ architectures are provided—evaluation focuses on output behavioral correctness.
Training Regime: Not applicable for the benchmark construction or evaluation. Models were tested zero-shot on the benchmark tasks.
Evaluation Protocol: The core innovation is the StateProbe evaluation. Each generated HTML is loaded in a headless Chromium browser with a locked version of Three.js within a sandbox. The program must provide a global runtime interface exposing internal state variables (e.g., object positions, velocities, UI states). StateProbe runs scripted action sequences (simulating user interactions, physics steps), recording before and after snapshots of runtime state, and checks these against hidden behavioral contracts encoded as assertions on physical constraints, asset integration, and interaction logic. Behavioral contracts were hardened using mutation testing, where intentional faults (stale updates, swapped event targets, manipulated physics parameters) are injected to ensure contracts only pass fully correct runs.

Metrics include Affordance Coverage (presence of expected objects), State Coverage (intermediate reachable states), Transition Coverage (correct state changes on actions), and overall Verification Coverage (proportion of passed assertions). Practical utility metrics combine correctness with developer labor cost and runtime to produce Return on Automation and Time Efficiency Multiplier.

Reproducibility: WORLDCODER-BENCH dataset (prompt, assets), StateProbe evaluation code, and contracts are publicly released for benchmarking and evaluator integration at the provided URL. Proprietary LLMs are not open source. Hardware used for evaluation is headless Chromium via Playwright. Exact random seeds for task variants are controlled but not exhaustively documented.

End-to-end example: Given an input task with a prompt specifying creation of N bouncing balls under gravity with elasticity E, an LLM generates a self-contained HTML that uses Three.js to render balls with physics. StateProbe loads this HTML, verifies the window.3D_STATE exposure, applies scripted actions (e.g., advancing simulation steps), snapshots states, and checks if positions, velocities, collisions, and energy conservation obey the behavioral contract thresholds. Failure to update the state or broken interaction handlers lead to CHECK_FAIL; correct runs get CHECK_PASS and count toward verification coverage.

This thorough methodology ensures evaluation goes beyond pixel-level assessment to rigorous, actionable behavioral correctness in physically grounded 3D synthesis.

Technical innovations

Introduction of WORLDCODER-BENCH, the first large-scale expert-curated benchmark specifically targeting executable, physically grounded 3D world synthesis in Three.js with 2,026 tasks across simulation, rendering, and application domains.
Development of STATEPROBE, an execution-based evaluation protocol that probes runtime internal state variables through a standardized interface and validates task-specific hidden behavioral contracts, replacing unreliable pixel or DOM-based judgments.
Mutation-hardening of behavioral contracts via injection of deliberate program faults to ensure evaluators strictly reject permissive or incorrect generated worlds, enhancing metric rigor.
Definition of new practical utility metrics—Return on Automation and Time Efficiency Multiplier—that combine model correctness, inference cost, and latency with human developer labor rates to measure real-world deployment value.

Datasets

WORLDCODER-BENCH — 2,026 tasks — expert-curated from internal seed creation and LLM-assisted expansions
WORLDCODER-CORE — 205 hidden canonical tasks — primary leaderboard set
WORLDCODER-EXTENDED — 1,621 hidden tasks — large-scale evaluation
WORLDCODER-ROBUST — 615 randomized perturbation variants — stress test for mechanism-level generalization
WORLDCODER-DEV — 200 public tasks with contracts and reference outputs — for debugging and integration

Baselines vs proposed

DOM-only evaluation: Kendall τb = -0.021 correlation with V-Cov, False positive pass rate (FPR) = 33.7% vs STATEPROBE V-Cov as ground truth
Screenshot + Vision-Language Model judge (GPT-5.4): τb = -0.068 correlation, FPR = 3.2%, cost ~50x DOM-only
Pure agent 8-step exploration: τb = +0.111 correlation, FPR = 45.6%, cost ~400x DOM-only
GPT-5.4 model: V-Cov = 27.8%, second-best Gemini 3.1 Pro: V-Cov = 26.5% on WORLDCODER-CORE
Qwen3.6-Plus: V-Cov drops from 25.3% to 13.5% on WORLDCODER-ROBUST showing decline under perturbation
DeepSeek-V4-Flash renders highest Return on Automation ($19,150/$) despite lower verification coverage
GPT-5.4 achieves highest Time Efficiency Multiplier (68.9 hr/hr) balancing correctness and latency

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.01869.

Fig 1

Fig 1: Representative 3D worlds generated in WORLDCODER-BENCH, spanning three macro-

Fig 2

Fig 2: Motivation for WORLDCODER-BENCH. Existing benchmarks under-test physical correct-

Fig 3

Fig 3: Data curation of WORLDCODER-BENCH, from expert seed creation and LLM-assisted

Fig 4

Fig 4: Distribution and composition of

Fig 5

Fig 5 (page 2).

Fig 6

Fig 6 (page 2).

Fig 7

Fig 7 (page 2).

Fig 8

Fig 8 (page 2).

Limitations

Verification coverage remains low (below 30%) even for best models, indicating large capability gaps in generating fully correct executable worlds.
The benchmark focuses on Three.js and WebGL-based 3D worlds; insights may not generalize directly to other 3D platforms or native applications.
Evaluated models receive no fine-tuning on WORLDCODER-BENCH tasks (zero-shot), so performance might improve with domain-specific adaptation but remains untested.
Proprietary models outperformed most open-weight models but the gap was modest on core tasks, limiting conclusions about model architecture impacts.
Error analyses highlight dominant failure modes but deeper causal analysis of architectural shortcomings in models is absent.
The complexity of behavioral contracts means some subtle bugs or emergent behaviors may evade detection or be underrepresented.

Open questions / follow-ons

How to improve large language models’ capability to maintain exact and synchronized runtime state interfaces and interaction chains over long temporal sequences in 3D world programs?
What architectural or training modifications best enhance a model’s physical reasoning and algorithmic correctness to reduce state-schema drift and broken interaction chains?
Can StateProbe be extended or adapted to evaluate other 3D or interactive programming domains beyond browser-native Three.js (e.g., VR/AR apps, game engines)?
How does fine-tuning or few-shot adaptation on WORLDCODER-BENCH impact model robustness and verification coverage, especially on perturbed and randomized variants?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, WorldCoder-Bench presents critical insights into evaluating and detecting sophisticated autonomous agents capable of generating complex 3D interactive content—not just static images or webpages. The StateProbe approach demonstrates that superficial visual or DOM inspection is insufficient to verify functional correctness, with invisible runtime state driving success or failure. This lesson applies directly to understanding how adversaries might use advanced LLMs to evade detection by producing superficially plausible but behaviorally incorrect interactive bots or scripts.

Practitioners designing CAPTCHA challenges or bot defenses can leverage insights from mutation-hardened contracts to create robust checks that probe hidden state transitions or internal consistency, going beyond pixel or element-level signals. The evaluation methodology highlights the importance of examining internal model runtime states and user interaction synchronization to detect automation that superficially mimics human actions but fails deeper functional tests. Overall, WorldCoder-Bench and StateProbe provide a valuable framework for moving toward security mechanisms that verify not only appearance but true behavior under interaction, a capability increasingly necessary as LLMs grow more advanced.

Cite

bibtex

@article{arxiv2606_01869,
  title={ WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis },
  author={ Shuo Lu and Yinuo Xu and Kecheng Yu and Siru Jiang and Yongcan Yu and Yubin Wang and Haitao Yang and Yuxiang Zhang and Bin Wang and Ran He and Jian Liang },
  journal={arXiv preprint arXiv:2606.01869},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.01869}
}

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​