WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Source: arXiv:2606.03220 · Published 2026-06-02 · By Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang et al.

TL;DR

WebRISE addresses a major evaluation gap in benchmarks for multimodal large language model (MLLM)-generated interactive web artifacts. Existing protocols focus on local visual or script-based signals without explicitly modeling requirement-induced states and transitions that determine whether a generated webpage actually works as intended. To fix this, WebRISE introduces requirement-driven Interaction Contract Graphs (ICGs) that represent stable UI states, user-intent transitions, and DOM/visual assertions for verifying behavior. It then evaluates 442 tasks spanning five input modalities (Text, Markdown, Sketch, Image, Video) against these contracts using adaptive browser agents, reporting detailed state, transition, and requirement-level conformance metrics.

Across 14 prominent MLLMs, WebRISE shows that even the strongest models reach only about 66% transition validity and requirement coverage, leaving a large gap to fully functional interactive web generation. Visual quality metrics are also poorly correlated with behavioral correctness, confirming that appearance is an unreliable proxy for usability. Video input improves interactive compliance over Text, with a gain of 10.6 percentage points on implicit constraints. In defect injection experiments, ICG-based evaluation detects 2x to 16x as many injected defects as prior checkpoint-style methods (16/25 versus 8/25 and 1/25), demonstrating higher diagnostic sensitivity.

Taken together, WebRISE reframes MLLM web artifact evaluation as requirement-induced state-transition conformance, enabling more rigorous and implementation-agnostic benchmarking. Its large-scale multimodal task suite and contract-guided, oracle-verified testing expose key bottlenecks around implicit state constraints, show which input modalities help most, and set a more comprehensive standard for assessing generated interactive webpages.

Key findings

WebRISE benchmark spans 442 tasks and 5 input modalities, with 5,495 transitions and 5,271 requirement checks.
Across 14 MLLMs, the strongest model, GPT-5.5, achieves only 65.6% transition validity (T) and 66.3% requirement coverage (R) under its best modality (Video).
Visual quality scores can be misleading: Qwen3.6-35B-A3B on Markdown achieves V=80.8 but T=15.5 and R=19.2, showing good visuals but poor interaction correctness.
Video modality improves implicit requirement coverage by +10.6 percentage points over Text input.
Defect injection shows ICG-based evaluation detects 16/25 injected defects, compared to 8/25 for checkpoint-style broad evaluation and only 1/25 for strict checkpoint evaluation.
Open-weight models like Kimi-K2.6 achieve competitive overall scores (63.3), surpassing some proprietary models.
Safety and robustness checks reveal uniformly low pass rates across all models; even GPT-5.5 passes only 41.3% of HTML safety checks.
Model scaling within the Qwen3.5 family shows improvement only beyond 122B parameters, especially for Text and Markdown modalities.

Threat model

The adversary is a multimodal large language model generating interactive, frontend-only HTML/CSS/JavaScript web artifacts from rich multimodal input specifications. The adversary’s goal is to produce functional webpages that satisfy explicit user requirements and implicit consistency constraints under browser execution. They cannot modify the evaluation environment or backend services and are evaluated solely on the executability and conformance of generated client-side code and interaction behavior.

Methodology — deep read

Threat model & assumptions: The adversary is a multimodal LLM tasked with generating web artifacts from rich input specifications (text, markdown, sketches, images, videos). The evaluation framework assumes no external backend or manual state initialization and tests whether the generated frontend HTML/CSS/JS meets explicit and implicit user requirements under browser execution. The attacker does not manipulate the evaluation environment.
Data: WebRISE compiles 442 real-world, domain-grounded, user-facing web tasks from anonymous industry practitioners, covering 8 domains and 35 scenarios, including e-commerce, utilities, social media, and productivity. Each task's requirements include explicit user-stated functions (such as search or filtering) plus implicit state constraints (such as pagination reset or synchronization). Tasks are expressed across 5 modalities: Text, Markdown, Sketch, Image, and Video. Ground-truth HTML pages verify that each task is executable; they serve only for validation, not as exact reference outputs.
Architecture / algorithm: Each task is converted into an Interaction Contract Graph (ICG) that represents observable UI states (S), user-intent transitions (T), DOM/visual assertions (Φ), and a mapping (M) from test items back to explicit and implicit requirements. States model stable UI configurations only; transient effects are excluded from states and attached to transitions as transition-level predicates. Transitions encode user actions and expected outcomes, verified with dual oracles that combine DOM process logs and visual screenshot comparisons. This contract-driven approach decouples evaluation from implementation details such as fixed DOM selectors.
Training regime: N/A (evaluation-only benchmark). Models generate self-contained HTML artifacts from each modality specification without auxiliary backend code.
Evaluation protocol: WebRISE uses an adaptive browser agent that, guided by the ICG, executes transitions sequentially. At each transition, the agent checks preconditions, performs the interaction goal, and collects DOM event logs and pre/post screenshots. DOM assertions verify intermediate and final conditions, while visual oracles verify user-visible changes. Each transition is marked PASS, FAIL, BLOCKED, or SKIPPED. Metrics are aggregated at state, transition, and requirement levels, with requirement coverage reported separately for explicit (Re%), implicit (Ri%), and overall (R%) requirements. The final benchmark macro-averages scores across tasks. Defect injection with known bugs validates evaluator sensitivity.
Reproducibility: The paper references a public project page (https://iigroup.github.io/WebRISE) but does not explicitly state code release or frozen weights. The large-scale human and expert validation of contracts and test items supports reliability, but some proprietary models and datasets may be closed.
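The contract structure and metric aggregation described in the Architecture and Evaluation protocol items above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the class names (`Transition`, `ICG`), the field names, and the scoring rule (a requirement counts as covered only when every transition mapped to it passes) are assumptions. The source states only that an ICG holds states S, transitions T, assertions Φ, and a mapping M, and that scores are macro-averaged across tasks.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"
    SKIPPED = "SKIPPED"


@dataclass
class Transition:
    tid: str
    source: str               # id of a state in S
    target: str               # id of a state in S
    assertions: list[str]     # Φ: DOM/visual assertions, kept as plain text here
    requirements: list[str]   # M: ids of the requirements this transition tests


@dataclass
class ICG:
    states: set[str]
    transitions: list[Transition]
    implicit: set[str] = field(default_factory=set)  # ids of implicit requirements


def task_scores(icg, results):
    """Return T, Re, Ri and R percentages for one task.

    Assumed rule: a requirement is covered only if every transition
    mapped to it has status PASS.
    """
    passed = sum(results.get(t.tid) is Status.PASS for t in icg.transitions)
    covered = {}
    for t in icg.transitions:
        ok = results.get(t.tid) is Status.PASS
        for r in t.requirements:
            covered[r] = covered.get(r, True) and ok

    def pct(reqs):
        reqs = list(reqs)
        # An empty category is scored 0.0 here purely for simplicity.
        return 100 * sum(covered[r] for r in reqs) / len(reqs) if reqs else 0.0

    return {
        "T": 100 * passed / len(icg.transitions),
        "Re": pct(r for r in covered if r not in icg.implicit),
        "Ri": pct(r for r in covered if r in icg.implicit),
        "R": pct(covered),
    }


def macro_average(per_task):
    """Average each metric over tasks, giving every task equal weight."""
    return {k: mean(s[k] for s in per_task) for k in per_task[0]}
```

Under these assumptions, a task with one passing transition tied to an explicit requirement and one failing transition tied to an implicit requirement scores T=50, Re=100, Ri=0, R=50, which is the kind of explicit/implicit split the benchmark reports.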

Concrete example (shopping-cart interaction): The benchmark defines an ICG encoding the states before and after the user unchecks an item, with transitions modeling the click and the resulting state propagation. The agent executes the transition, collects DOM and visual evidence, and checks assertions verifying that the checkout button is disabled and the totals reset. A failure occurs if the state update does not propagate even though the UI elements reflect the click. WebRISE reports a detailed diagnosis that localizes the failure to specific DOM and visual discrepancies.
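The shopping-cart check above can be illustrated with a small dual-oracle sketch. Everything here is hypothetical scaffolding rather than WebRISE code: `Snapshot`, the DOM fact names (`item_checked`, `checkout_disabled`, `total`), the event label `click:item-checkbox`, and the use of a string as a stand-in for a screenshot are all assumptions. The sketch also assumes a cart with a single item, so that unchecking it should reset the total to 0.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    dom: dict         # simplified DOM facts captured from the page
    screenshot: str   # stand-in for rendered pixels (assumption)


def dom_oracle(after, event_log):
    """DOM channel: the click was dispatched and the state propagated."""
    failures = []
    if "click:item-checkbox" not in event_log:
        failures.append("no click event recorded on the item checkbox")
    if after.dom.get("item_checked") is not False:
        failures.append("item checkbox is still checked")
    if after.dom.get("checkout_disabled") is not True:
        failures.append("checkout button is still enabled")
    if after.dom.get("total") != 0:
        failures.append(f"total is {after.dom.get('total')}, expected 0")
    return failures


def visual_oracle(before, after):
    """Visual channel: the user must be able to see that something changed."""
    if before.screenshot == after.screenshot:
        return ["no user-visible change between screenshots"]
    return []


def check_transition(before, after, event_log):
    """Combine both channels; any failure marks the transition FAIL."""
    failures = dom_oracle(after, event_log) + visual_oracle(before, after)
    return ("PASS" if not failures else "FAIL"), failures


before = Snapshot({"item_checked": True, "checkout_disabled": False, "total": 59}, "cart-v1")
# Buggy page: the checkbox toggles, but the cart state never updates.
buggy = Snapshot({"item_checked": False, "checkout_disabled": False, "total": 59}, "cart-v2")
status, failures = check_transition(before, buggy, ["click:item-checkbox"])
print(status, failures)
```

The buggy page reproduces the failure mode in the example: the click is visible, yet the failure list points at the two facts that did not propagate, which is the localization the diagnosis is meant to provide.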

Technical innovations

Formulating interactive web artifact evaluation as requirement-induced observable state-transition conformance using Interaction Contract Graphs (ICGs).
Adaptive, contract-guided browser agent execution that replays verified trajectories to reach source states and localizes unreachable states from contract violations.
Dual-channel oracle combining transient DOM event logs and visual screenshot comparisons to verify process and user-visible outcomes for each state transition.
Separating explicit user-stated functions from implicit product-level state and interaction constraints in the evaluation metrics, so that failures can be attributed to one category or the other.
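The trajectory-replay idea in the second item above can be sketched as a search over transitions that have already passed. This is an assumed reading, not the paper's algorithm: the function names, the `execute` callback, and the rule that a transition is BLOCKED when no verified path reaches its source state are all hypothetical. The sketch also assumes transitions are listed so that a source state's incoming transitions come before its outgoing ones.

```python
from collections import deque


def replay_path(initial, goal, verified):
    """Breadth-first search over verified transitions.

    Returns the list of transition ids leading from `initial` to `goal`,
    or None when `goal` cannot be reached through verified transitions.
    """
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for tid, src, dst in verified:
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [tid]))
    return None


def run_contract(initial, transitions, execute):
    """Evaluate transitions given as (tid, source, target) tuples.

    `execute(tid, replay)` is a hypothetical callback that reloads the
    page, replays the transitions in `replay`, attempts `tid`, and
    returns True when its assertions hold.
    """
    verified, results = [], {}
    for tid, src, dst in transitions:
        path = replay_path(initial, src, verified)
        if path is None:
            results[tid] = "BLOCKED"   # source state unreachable
        elif execute(tid, path):
            results[tid] = "PASS"
            verified.append((tid, src, dst))
        else:
            results[tid] = "FAIL"
    return results
```

With transitions home→list, list→detail, detail→cart, and home→about, a failure on list→detail leaves detail→cart BLOCKED rather than FAIL, which separates the root-cause violation from the states it makes unreachable.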

Datasets

WebRISE benchmark: 442 tasks with 5,495 transitions and 5,271 requirement checks, sourced from anonymous industry practitioners and normalized by experts.

Baselines vs proposed

Checkpoint-style WebGen-broad evaluator: 8/25 injected defects detected vs WebRISE (ICG): 16/25 detected.
Checkpoint-style WebGen-strict evaluator: 1/25 injected defects detected vs WebRISE (ICG): 16/25 detected.
Qwen3.6-35B-A3B on Markdown: visual score V=80.8 vs transition validity T=15.5, showing that visual quality does not track interaction correctness.
GPT-5.5 on Video (its best modality): transition validity T=65.6 vs Text, where T is about 8.8 percentage points lower.
Open-weight model Kimi-K2.6: overall score 63.3, surpassing several proprietary models.
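The defect-injection counts above translate directly into detection rates and into the 2x to 16x range quoted in the TL;DR. The short calculation below uses only the counts reported here; the variable names are illustrative.

```python
from fractions import Fraction

TOTAL_INJECTED = 25
detected = {"WebRISE (ICG)": 16, "WebGen-broad": 8, "WebGen-strict": 1}

# Exact detection rates as fractions of the 25 injected defects.
rates = {name: Fraction(n, TOTAL_INJECTED) for name, n in detected.items()}
icg_rate = rates["WebRISE (ICG)"]

for name, rate in rates.items():
    ratio = float(icg_rate / rate)
    print(f"{name}: {float(rate):.0%} detected, ICG detects {ratio:g}x as many")
```

This gives 64%, 32%, and 4% detection for the three evaluators, so the ICG-based evaluator finds twice as many defects as the broad checkpoint evaluator and sixteen times as many as the strict one, while still missing 9 of the 25.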

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03220.

Fig 1: Overview of WebRISE.

Fig 2: The benchmark pipeline.

Limitations

Evaluation limited to frontend-only HTML/CSS/JavaScript without backend or external service generation.
No exhaustive adversarial evaluation against maliciously corrupted HTML beyond defect injection.
Potentially incomplete coverage of implicit state constraints: some injected defects remain undetected.
Proprietary models evaluated without full transparency or weights released, limiting full reproducibility.
Visual oracle relies on screenshot comparison, which may not capture all nuanced interactive errors.
The per-transition evaluation budget may limit how deeply the agent explores, which can leave rare edge cases untested.

Open questions / follow-ons

How to extend WebRISE to incorporate backend state or full-stack web artifact generation including server logic?
Can exploration-based or reinforcement learning methods improve contract-guided agent effectiveness to detect deeper state violations?
How to better model and verify more complex implicit state constraints that remain challenging for current ICG inference and oracles?
What are the effects of richer multimodal inputs (e.g., combined video+sketch) or interactive feedback loops on generation quality and evaluation coverage?

Why it matters for bot defense

WebRISE evaluates generated interactive web artifacts through requirement-induced state transitions and implicit consistency constraints, which is relevant to bot-defense and CAPTCHA practitioners who need to detect or simulate realistic UI interactions. Because it models explicit stateful contracts rather than relying on visual snapshots or superficial script outcomes, it supports finer-grained analysis of whether an interactive component behaves correctly under complex user workflows. That can surface subtle functional or synchronization errors that bots might exploit or fail to replicate, informing more robust challenge designs. The multimodal, contract-guided adaptive agent could also inspire automated bot-interaction detectors or CAPTCHA solvers that reason about state transitions and implicit site logic beyond static elements. WebRISE currently covers generation evaluation rather than adversarial attack scenarios or active defense, so direct integration requires adaptation. Still, its framework for attributing failures to explicit versus implicit constraints offers a useful lens for analyzing the attack surface of complex UIs.

Cite

bibtex

@article{arxiv2606_03220,
  title={WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts},
  author={Yuxin Meng and Yuhan Suo and Junjie Wang and Yuhan Sun and Yiyao Yu and Ruixu Zhang and Ruining Hu and Yubin Wang and Shouwei Ruan and Bin Wang and Yuxiang Zhang and Yujiu Yang},
  journal={arXiv preprint arXiv:2606.03220},
  year={2026},
  url={https://arxiv.org/abs/2606.03220}
}
