
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Source: arXiv:2509.21072 · Published 2025-09-25 · By Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu

TL;DR

This paper addresses the challenge of enabling intelligent browser-use agents to complete complex, multi-turn, long-horizon tasks on real-world webpages. Existing multimodal large language models (MLLMs) suffer from disordered action sequencing and inefficiency due to excessive trial and error. Recon-Act proposes a self-evolving multi-agent framework structured around a Reconnaissance–Action paradigm. The system consists of a Reconnaissance Team that contrasts failed trajectories with successful ones, synthesizes corrective strategies, and generates generalized tools (hints or rule-based code). These tools are registered and used in real time by an Action Team that manages intent decomposition, tool orchestration, and task execution, closing a feedback loop of data, tools, action, and evaluation. In its current Level 3 implementation (partial human-in-the-loop), Recon-Act achieves state-of-the-art performance on the VisualWebArena dataset, raising the overall success rate to 36.48%, which is 2.74 percentage points above the best prior automated agent (33.74%). This suggests that targeted tool generation grounded in contrastive failure analysis can enhance robustness and efficiency in web navigation tasks.

Key findings

  • Recon-Act achieves a 36.48% overall success rate on VisualWebArena, surpassing the prior best automated agent (ExAct R-MCTS MAD) at 33.74%, though it remains far below human performance at 88.7%.
  • On the Shopping domain, Recon-Act attains 39.27%, exceeding the previous best of 32.3% by nearly 7 percentage points.
  • Reconnaissance Team synthesizes 11 distinct generalized tools (e.g., AuthorFinder, CategoryGuide) that improve navigation and task completion.
  • Recon-Act requires fewer self-corrective actions than prior agents despite moderate average step counts (approx. 5 steps), indicating improved stability in execution.
  • Human analysts are currently involved in contrastive trajectory analysis and tool merging, which avoids tool proliferation and keeps the Level 3 system robust.
  • Use of contrastive analysis of failure vs success trajectories forms a closed loop enabling iterative tool generation and policy improvement.
  • Reconnaissance-based tool generation reduces random-walk style exploration, enabling efficient data curation tailored to browser environments.
  • Action Team’s hybrid use of hint and decision mode tools allows flexible orchestration with fallback to default Execution Agent actions.

Threat model

The adversary is effectively the complexity and variability of real-world webpages presenting information-dense and heterogeneous environments. The system assumes no direct adversarial users but aims to overcome environmental uncertainty and error-prone action spaces. It does not address intentional attacks or adversarial manipulations against the agent.

Methodology — deep read

The core methodology involves a dual-team multi-agent system: the Reconnaissance Team and the Action Team collaborating in a closed-loop training and inference pipeline.

  1. Threat Model & Assumptions: The adversary is implicitly the complex and dynamic web environment presenting information-dense pages with hidden structures. The system assumes partial observability and unknown page layouts requiring exploratory actions. There is no explicit adversarial human attacker scenario.

  2. Data: The system is evaluated on VisualWebArena, a benchmark with ~910 queries across the Classifieds, Shopping, and Reddit domains. Each query requires multi-turn navigation and reasoning over text and images. To train the Reconnaissance Team, the authors manually wrote a small curated training set (<10 examples per domain) instead of relying on random-walk exploration, emphasizing efficiency and coverage.

  3. Architecture/Algorithm: The Reconnaissance Team comprises Analyst and Coder agents. The Analyst performs contrastive step-level analysis of failed versus successful trajectories using preconfigured reconnaissance tools (e.g., get_url, image extraction), identifies failure causes, and prescribes remediation strategies. The Coder translates these into executable tool code following a unified input-output schema: every function accepts a broad superset of arguments and returns a string. The resulting generalized tools are either hint tools, which return advisory signals, or decision-mode tools, which yield browser actions directly. Both kinds are registered in a tool archive accessible to the Action Team.

The Action Team has a Master agent that interprets queries and browser context to select and orchestrate tools, a Tool Manager (human-assisted) that decides on tool addition/update and manages functional branching, and an Execution Agent which generates fallback default actions.

  4. Training Regime: Training proceeds iteratively. Cold-start queries are executed to produce failure trajectories, which are paired with success trajectories and context to generate new tools that augment the toolset. The process repeats until no further tool improvements occur or training plateaus. Human analysts remain in the loop for tool merging and naming to prevent tool proliferation.

  5. Evaluation Protocol: Success is measured on VisualWebArena with exact match, semantic equivalence, and visual goal completion metrics. Recon-Act’s success rates are compared to state-of-the-art baselines (e.g., ExAct, TreeSearch, WebDreamer) using the same dataset splits. Average action steps and stability are additionally evaluated. The paper does not report adversarial robustness or distribution shift analyses beyond unseen websites in VisualWebArena.

  6. Reproducibility: Code is publicly released (https://github.com/inclusionAI/AWorld/tree/main/examples/visualwebarena). Details on fixed prompt designs, tool interfaces, and human involvement in tool management are disclosed. The dataset is VisualWebArena, an established benchmark.
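The tool schema and orchestration described in the architecture step can be illustrated with a minimal sketch. It is not the authors' code: the names `Tool`, `ToolArchive`, `master_step`, and `execution_agent`, the action strings, and the way a hint is passed on to the Execution Agent are all assumptions made for illustration. The only properties taken from the summary are that tools share one input-output schema (broad arguments in, string out), come in hint and decision modes, and that the Execution Agent supplies the fallback action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Assumed unified schema: every tool receives the same broad context dict
# (a superset of arguments) and returns a string.
ToolFn = Callable[[dict], str]


@dataclass
class Tool:
    name: str
    mode: str  # "hint" (advisory text) or "decision" (a browser action)
    fn: ToolFn


class ToolArchive:
    """Hypothetical registry shared by the two teams."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.mode not in ("hint", "decision"):
            raise ValueError(f"unknown mode: {tool.mode}")
        self._tools[tool.name] = tool  # re-registering a name updates the tool

    def get(self, name: str) -> Optional[Tool]:
        return self._tools.get(name)


def execution_agent(context: dict, hint: str = "") -> str:
    """Stand-in for the default Execution Agent (a model call in the real system)."""
    suffix = f" [hint: {hint}]" if hint else ""
    return f"default_action(url={context['url']}){suffix}"


def master_step(archive: ToolArchive, tool_name: Optional[str], context: dict) -> str:
    """Use the selected tool if it exists; otherwise fall back to the Execution Agent."""
    tool = archive.get(tool_name) if tool_name else None
    if tool is None:
        return execution_agent(context)
    output = tool.fn(context)
    if tool.mode == "decision":
        return output  # decision tools emit the browser action directly
    return execution_agent(context, hint=output)  # hints only steer the agent


# Illustrative tools; the names and action strings are made up.
archive = ToolArchive()
archive.register(Tool("price_sorter", "decision", lambda ctx: "click [sort_by_price]"))
archive.register(Tool("category_guide", "hint", lambda ctx: "open the Electronics category"))
ctx = {"url": "http://shop.example/search", "query": "cheapest red mug"}

print(master_step(archive, "price_sorter", ctx))
print(master_step(archive, "category_guide", ctx))
print(master_step(archive, None, ctx))
```

Keeping one signature for every tool means the Master only has to choose a tool name, not assemble tool-specific arguments, which is one plausible reason for the "broad argument superset" design.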

Example end-to-end: On a failed shopping query, the Reconnaissance Team’s Analyst compares failed trajectories to successful ones, identifies failure due to improper price sorting, and prescribes a price sorter tool coded by Coder which is added to the archive. The Action Team Master invokes this new price sorter decision tool during task execution, improving navigation and leading to successful query completion. The evaluator signals success, terminating training for that instance.
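The closed loop in this example (run a query, evaluate it, analyze the failure, register a tool, retry) can be sketched as a toy program. Everything here is a simplification under stated assumptions: `run_query`, `evaluate`, `recon_team`, and `train` are hypothetical names, trajectories are plain lists of action strings, and the rule-based `recon_team` stands in for what the paper describes as Analyst and Coder agents with human involvement.

```python
from typing import Callable, Dict, List

Trajectory = List[str]
ToolSet = Dict[str, Callable[[str], str]]


def run_query(query: str, tools: ToolSet) -> Trajectory:
    """Toy rollout: a 'cheapest' query sorts by price only if a price_sorter tool exists."""
    if "cheapest" in query and "price_sorter" in tools:
        return ["search", tools["price_sorter"](query), "open_first_item"]
    return ["search", "open_first_item"]


def evaluate(query: str, trajectory: Trajectory) -> bool:
    """Toy evaluator: 'cheapest' queries succeed only if the results were sorted."""
    return "cheapest" not in query or "sort_by_price" in trajectory


def recon_team(failed: Trajectory, success: Trajectory) -> ToolSet:
    """Toy Analyst + Coder: propose a tool for a step the failed run is missing."""
    missing = [step for step in success if step not in failed]
    if "sort_by_price" in missing:
        return {"price_sorter": lambda query: "sort_by_price"}
    return {}


def train(queries: List[str], references: Dict[str, Trajectory], max_rounds: int = 5) -> ToolSet:
    """Repeat rollouts until a full round adds no new tool (or max_rounds is hit)."""
    tools: ToolSet = {}
    for _ in range(max_rounds):
        added = False
        for query in queries:
            trajectory = run_query(query, tools)
            if evaluate(query, trajectory):
                continue  # success ends training for this instance
            for name, fn in recon_team(trajectory, references[query]).items():
                if name not in tools:
                    tools[name] = fn
                    added = True
        if not added:
            break
    return tools


queries = ["cheapest red mug", "red mug"]
references = {
    "cheapest red mug": ["search", "sort_by_price", "open_first_item"],
    "red mug": ["search", "open_first_item"],
}
tools = train(queries, references)
print(sorted(tools))
print(run_query("cheapest red mug", tools))
```

The stopping rule mirrors the summary's description that training repeats until no further tool improvements occur; the `max_rounds` cap is an added safeguard, not something the paper specifies.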

Technical innovations

  • Introduction of a dual-team self-evolving multi-agent framework grounded in a Reconnaissance–Action behavioral paradigm for browser agents.
  • Formalization of in-browser reconnaissance operations as exploratory actions to distill salient information and enable targeted tool generation.
  • Contrastive step-level analysis of failure vs successful trajectories to synthesize generalized tools (hints or decision-mode code) dynamically.
  • Closed-loop iterative pipeline integrating data, tools, action, and feedback to continuously improve agent performance on long-horizon web tasks.
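As a rough illustration of what "contrastive step-level analysis" could reduce to in its simplest form, the sketch below finds the first step at which a failed trajectory departs from a successful one. This is an assumption-laden stand-in: the paper's Analyst is an agent (with human assistance at Level 3) that reasons over richer observations, while the hypothetical `first_divergence` helper here only compares action strings position by position.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

Divergence = Tuple[int, Optional[str], Optional[str]]


def first_divergence(failed: List[str], success: List[str]) -> Optional[Divergence]:
    """Return (step index, failed action, successful action) at the first mismatch.

    A None action means that trajectory had already ended at that step.
    Returns None when the two trajectories are identical.
    """
    for index, (bad, good) in enumerate(zip_longest(failed, success)):
        if bad != good:
            return index, bad, good
    return None


failed_run = ["goto search", "click first_result", "stop"]
successful_run = ["goto search", "click sort_by_price", "click first_result", "stop"]
print(first_divergence(failed_run, successful_run))
```

In this toy case the mismatch at step 1 points at the missing price sort, which is the kind of localized finding a Coder could then turn into a tool.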

Datasets

  • VisualWebArena — ~910 queries — public benchmark for visual and textual browser task evaluation

Baselines vs proposed

  • ExAct R-MCTS MAD: success rate = 33.74% vs Recon-Act (GPT-5-Chat): 36.48%
  • TreeSearch (GPT-4o): 26.40% vs Recon-Act: 36.48%
  • WebDreamer (Dreamer-7B + In-Domain): 23.20% vs Recon-Act: 36.48%
  • ICAL (GPT-4o): 23.40% vs Recon-Act: 36.48%
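The margins implied by these figures are differences in percentage points, which is not the same as relative improvement. The short check below uses only the success rates listed above; the variable names are arbitrary.

```python
# Success rates (%) as listed in the comparison above.
recon_act = 36.48
baselines = {
    "ExAct R-MCTS MAD": 33.74,
    "TreeSearch (GPT-4o)": 26.40,
    "ICAL (GPT-4o)": 23.40,
    "WebDreamer (Dreamer-7B + In-Domain)": 23.20,
}

# Absolute gap in percentage points.
margin_pp = {name: round(recon_act - rate, 2) for name, rate in baselines.items()}

# Relative improvement over each baseline, in percent.
relative_pct = {
    name: round(100 * (recon_act - rate) / rate, 1) for name, rate in baselines.items()
}

for name in baselines:
    print(f"{name}: +{margin_pp[name]} pp, +{relative_pct[name]}% relative")
```

Against the strongest baseline the gap is 2.74 percentage points, which corresponds to a relative improvement of roughly 8%.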

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2509.21072.

Fig 1: Success Rates on VisualWebArena Dataset (Left) …

Fig 2 (page 1).

Fig 2: System Architecture. The system comprises two integrated teams: Reconnaissance and Action

Fig 4 (page 4).

Fig 5 (page 4).

Limitations

  • Current implementation is Level 3 requiring human-in-the-loop for Analyst and Tool Manager roles, limiting full autonomy.
  • Reconnaissance capabilities are restricted to a fixed set of target websites; not yet generalized to broad heterogeneous web environments.
  • Training relies on manually curated small datasets with both successful and failed trajectories; lacks fully autonomous exploration.
  • Tool synthesis and merging involves manual human intervention, particularly to maintain feature branch isolation and consolidate fragmented functions.
  • Master agent occasionally errs in tool invocation, which impacts orchestration robustness.
  • No adversarial evaluation or robustness testing against actively manipulated web content or attacker strategies.

Open questions / follow-ons

  • How to achieve higher autonomy by enabling the Reconnaissance Team to perform random-walk style self-exploration to generate successful trajectories without human labeling?
  • How to improve the reasoning and coding capacities of Analyst and Tool Manager agents to remove human-in-the-loop and scale to Level 5+ implementations?
  • How to generalize the reconnaissance and tool generation process to operate robustly across a wider and more diverse range of websites beyond the limited fixed set?
  • What are the tradeoffs and limits of merging tool functions towards easier orchestration without overloading the Master agent?

Why it matters for bot defense

For bot-defense or CAPTCHA teams, Recon-Act offers valuable insights into improving autonomous browser agents’ efficiency and robustness through iterative tool generation informed by failure analysis. The approach highlights that embedding a reconnaissance capability—actively exploring and reasoning about state failures to synthesize specialized tools—enables more reliable multi-turn interactions on complex, real-world websites.

This design may also inform defenders looking to detect or disrupt similar self-evolving multi-agent browser bots, for example by monitoring how agents adapt or generate new scripted behaviors and tools after prior failures. CAPTCHAs could be tailored to target points in this iterative feedback loop, impeding autonomous tool synthesis or degrading reconnaissance reliability. Understanding tool augmentation and the hint versus decision modes may likewise guide the design of bot-detection layers sensitive to tool invocation patterns or exploration behaviors uncommon in humans.
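One way to make "sensitivity to tool invocation patterns" concrete is to look for identical action sequences recurring across otherwise unrelated sessions, on the assumption that a rule-based decision-mode tool replays the same steps each time it fires. The sketch below is purely illustrative and is not taken from the paper: the `shared_ngrams` helper, the session format, and the thresholds are hypothetical, and a real detector would need far more signal than this.

```python
from collections import Counter
from typing import Dict, List, Tuple


def shared_ngrams(
    sessions: Dict[str, List[str]], n: int = 3, min_sessions: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Return action n-grams that appear in at least `min_sessions` sessions."""
    counts: Counter = Counter()
    for actions in sessions.values():
        # A set ensures each n-gram is counted once per session.
        grams = {tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)}
        counts.update(grams)
    return {gram: count for gram, count in counts.items() if count >= min_sessions}


sessions = {
    "s1": ["search", "sort_price", "open_first", "add_cart"],
    "s2": ["home", "search", "sort_price", "open_first"],
    "s3": ["search", "scroll", "scroll", "open_third"],
}
print(shared_ngrams(sessions))
```

Repeated sequences alone are weak evidence, since many human shoppers also search and then sort; they would at most serve as one feature among several.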

Cite

@article{arxiv2509_21072,
  title={ Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution },
  author={ Kaiwen He and Zhiwei Wang and Chenyi Zhuang and Jinjie Gu },
  journal={arXiv preprint arXiv:2509.21072},
  year={ 2025 },
  url={https://arxiv.org/abs/2509.21072}
}


Articles are CC BY 4.0 — feel free to quote with attribution