Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Source: arXiv:2605.29927 · Published 2026-05-28 · By Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

TL;DR

This paper addresses the challenge that large language model (LLM)-based web agents still face in planning complex, multi-step tasks on the web. Although prior work has focused on improving planning algorithms, the authors identify a less explored dimension: how the representation or format of the plan, expressed in natural language, affects downstream performance. To systematically study this, they propose PLANAHEAD, a static planner-executor framework that separates plan generation from execution, and evaluate four plan representations—sequential subgoals, narrative, pseudocode, and checklist—on a set of 158 hard tasks picked from the WebArena benchmark. They also introduce two novel metrics, Achievement Rate (AR) and Solved-Task Consistency (STC), to better quantify task achievability and execution robustness under LLM inherent stochasticity. Experiments across multiple LLM families (OpenAI GPT-4.1-mini, Qwen, Gemini) reveal that both the choice of plan representation and the planner-executor LLM combination substantially influence agent success and stability, with different LLMs exhibiting distinct preferences and strengths. Notably, static planning outperforms dynamic single-agent planning by avoiding failure modes like action loops. Overall, the work demonstrates that plan format is a critical, underexplored factor in web agent design, with implications for modular agent architectures and evaluation methodology.

Key findings

Using static plans with a planner-executor separation outperforms dynamic single-agent plans on the 158 Hard WebArena tasks, avoiding failures such as action loops and premature subtask marking.
GPT-4.1-mini achieves highest performance using narrative plan representations, while Qwen-2.5-VL favors checklist plans; Gemini performs well across multiple formats, notably exhibiting highest Achievement Rate (AR) with pseudocode plans when used as both planner and executor.
The best overall performance (AR=10.7%, STC=47%) is achieved by GPT-4.1-mini as planner combined with Gemini 2.5 Flash as executor using checklist plans, showing complementary model strengths.
Task difficulty classification (Easy, Medium, Hard) was automated using GenericAgent success rates, with 158 of 381 WebArena tasks labeled Hard (0% success by all models over 5 runs), enabling focused evaluation on challenging tasks without human annotation.
Achievement Rate (AR) and Solved-Task Consistency (STC) metrics better capture task achievability and consistency over multiple stochastic runs compared to traditional Success Rate alone; bootstrap confidence intervals confirm observed AR differences are statistically significant.
Plan representation affects performance beyond the underlying LLM: for example, pseudocode (a non-standard NL form not designed for readability) yielded among the highest AR for Gemini, challenging assumptions that sequential subgoals are always best.
Mixed planner-executor pairs generally outperform homogeneous model setups, underscoring benefits of modular agent design to leverage different model capabilities for planning versus execution.
Static planner-executor approach requires only a single plan generation at the start, simplifying agent design while improving robustness compared to dynamic replanning methods.

Methodology — deep read

The authors introduce PLANAHEAD, a framework separating planning from execution: the planner LLM observes the task goal and initial web page screenshot, then generates a static plan once per task; the executor LLM receives this plan along with dynamic browser state at each step and issues low-level actions.

For evaluation, they use WebArena, a closed, reproducible web-agent benchmark with 381 tasks over domains like shopping and GitLab. To focus on challenging settings, they develop an automated difficulty grading pipeline using BrowserGym's GenericAgent with 5 LLM backbones. Tasks are labeled Easy (all models succeed 100% runs), Hard (all fail 0%), or Medium otherwise, producing 158 hard tasks.

Four distinct plan representations are used:

Sequential subgoals — ordered explicit list of high-level subtasks.
Requirement checklist — unordered list of necessary requirements, leaving execution order flexible.
Pseudocode — plans with program-like constructs (loops, conditionals) expressing contingencies.
Narrative — verbose natural language description of the task.

Planner LLMs generate plans at temperature 0.6 to encourage plan diversity; executor LLMs act deterministically (temp 0) to isolate plan influence. Executors have up to 30 steps per task.

Three multimodal LLMs—GPT-4.1-mini, Qwen-2.5-VL-72B, Gemini 2.5 Flash—are tested as planners and executors in all configurations, including homogeneous and mixed pairs.

Evaluation uses multiple stochastic runs (N=5) per task and calculates two novel metrics:

Achievement Rate (AR): percentage of tasks where at least one run succeeds, capturing achievability beyond average success rate.
Solved-Task Consistency (STC): average success rate over runs for tasks solved at least once, measuring robustness.

A dynamic single-agent planning baseline is included, where the agent generates and updates plans each step.

Statistical confidence intervals are applied via bootstrap resampling over the hard task set to verify significance of observed metric differences.

Plan prompting templates are designed per representation type and illustrated with real task examples.

Three ablation analyses explore sensitivity to number of stochastic runs, combinations of planner-executor models, and plan representations.

Qualitative failure analysis shows dynamic plans induce issues such as infinite action loops and premature step marking that static plans largely avoid.

The entire study hinges critically on the automated task difficulty partitioning, static planning framework structure, and multi-run performance metrics to robustly assess impact of plan format across multiple LLM backbones simulating real-world web agent scenarios.

Code and datasets are released for reproducibility.

Technical innovations

PLANAHEAD: a static planner-executor framework isolating the effect of plan representation on web agent performance.
Automated task difficulty grading pipeline classifying WebArena tasks into Easy, Medium, and Hard via multi-model success evaluation across runs, eliminating need for human annotations.
Introduction of two novel evaluation metrics: Achievement Rate (AR) for task achievability across stochastic runs, and Solved-Task Consistency (STC) measuring intra-task robustness.
Systematic empirical evaluation of four distinct natural language plan representations—sequential subgoals, narrative, pseudocode, and checklist—highlighting the critical role of plan form in task success.
Demonstration that mixing different LLM backbones for planner and executor modules yields superior performance by leveraging complementary strengths.

Datasets

WebArena — 381 tasks — closed benchmark for reproducible web agent evaluation, covering consumer-facing tasks across multiple domains

Baselines vs proposed

Dynamic single-agent planner-executor baseline: AR ~2-4%, STC 30-40% across various models vs static planner-executor (best AR=10.7%, STC=47%)
GPT-4.1-mini planner + Gemini executor with checklist plans: AR=10.7%, STC=47%; outperforms homogeneous setups (e.g., GPT-4.1-mini only AR=5.1%, STC=42%)
GPT-4.1-mini performs best with narrative plans: AR=5.1-5.4%, STC ~42-75% depending on executor
Qwen-2.5-VL planner and executor perform best with checklist plans: AR=5.1%, STC=75%, outperforming narrative and pseudocode
Gemini planner-executor pairs achieve AR up to ~7.6% and STC ~73% with narrative and pseudocode plans

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.29927.

Fig 1

Fig 1: Overview of PLANAHEAD, in comparison to single-agent and proactive planning frameworks

Fig 2

Fig 2: The four representations used in our evaluation of PLANAHEAD. On the left, we show the standard plan

Fig 3

Fig 3 (page 2).

Fig 4

Fig 4 (page 2).

Fig 5

Fig 5 (page 2).

Fig 6

Fig 6 (page 3).

Fig 4

Fig 4: Examples of single-agent planner-executor behavior across different WebArena tasks. (a-c) show task

Fig 5

Fig 5: Prompt for sequential plans.

Limitations

Evaluation restricted to three LLM backbones and one benchmark (WebArena), limiting generalizability across models and real-world environments.
Task set and difficulty labels depend on behavior of GenericAgent with selected LLMs; intrinsic subjectivity or bias in difficulty categorization is possible.
Only 5 stochastic runs per task are used; higher numbers might uncover further performance variance.
Static planning only is studied; dynamic or adaptive replanning strategies could yield different results under other scenarios.
Evaluation metrics AR and STC focus on success rates and consistency but do not incorporate qualitative aspects of plan interpretability or executor reasoning traces.
Closed environment WebArena may not capture all complexities of live web interfaces or user behavior variability.
Pseudocode plans, while effective for some models, are not designed for human readability which may limit interpretability and debugging.

Open questions / follow-ons

How do alternative, dynamic or hybrid planning representations impact robustness and adaptability under live web environment changes?
What is the influence of increasing stochastic evaluation runs on statistical confidence and metric stability for planning performance assessments?
Can the modular planner-executor approach be extended to multiple planners or hierarchical planning layers to further improve task generalization?
How generalizable are these results to other benchmarks, languages, or modalities beyond WebArena and the evaluated LLM families?

Why it matters for bot defense

This work is directly relevant to bot-defense and CAPTCHA practitioners interested in designing robust autonomous web agents or detecting automated behaviors. The findings show that the internal plan representation used by an LLM-based agent significantly impacts its ability to complete complex web tasks reliably. Recognizing that some plan formats (e.g., checklist vs narrative) lend themselves better to certain model families can inform the design of more interpretable and robust automation pipelines, or analysis tools to differentiate legitimate human-like plans from brittle scripted behaviors. Furthermore, the novel multi-run evaluation metrics AR and STC provide better lenses for measuring stochastic execution robustness, a crucial characteristic when assessing agent consistency and reliability in adversarial or noisy web environments. Finally, the demonstration that static planner-executor separation reduces action loops and common failure modes advises that bot detection systems or CAPTCHAs might exploit dynamic plan generation weaknesses or atypical execution patterns characteristic of fully single-agent systems. Overall, the paper emphasizes the critical yet often overlooked role of plan representation and evaluation methodology in analyzing and defending against LLM-driven web automation.

Cite

bibtex

@article{arxiv2605_29927,
  title={ Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents },
  author={ Alejandra Zambrano and Sara Vera Marjanovic and Imene Kerboua and Xing Han Lù and Leila Kosseim },
  journal={arXiv preprint arXiv:2605.29927},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.29927}
}

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents ​

TL;DR ​

Key findings ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​