COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Source: arXiv:2606.02372 · Published 2026-06-01 · By Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

TL;DR

Existing textual world models for language agents stay static after training, which limits how well they adapt to the on-policy state-action distributions induced by an evolving agent policy. At the same time, methods for improving agent policies often depend on external rewards or verifiers, which restricts their use in real-world interactive environments. To address both limitations, the authors propose COMAP, a closed-loop framework that co-evolves the textual world model and the agent policy. At each decision step, the world model predicts the next-state feedback for candidate actions, so the agent can perform future-aware reflection and refine its action according to how reliable that feedback is. On-policy interaction trajectories are then used to update the world model via self-distillation, aligning it with the agent’s evolving behavior distribution.

Experiments on embodied task planning (ALFWorld), web navigation (WebShop), scientific reasoning (ScienceWorld), and tool use (StableToolBench) show that COMAP consistently outperforms competitive baselines such as ReAct, Imagine-and-Act, and prior world-modeling approaches including ITPR, with up to a +16.75% relative improvement in success rate on Qwen3-4B. Further analysis shows that co-evolutionary training improves the world model’s prediction accuracy (Delta-F1) and the agent’s long-horizon decision-making. Ablations confirm the roles of future-aware reflection, on-policy self-distillation for the world model, and gating mechanisms that balance predicted against environmental states during training. Overall, COMAP points to the synergy between dynamic world modeling and policy optimization as a key driver of LLM agent improvement without external rewards.

Key findings

COMAP improves average task success rate over ITPR by +16.75% relative with Qwen3-4B (51.52% to 60.15%) and by +2.6% absolute with Qwen3-8B (69.53% to 72.11%), averaged across the ALFWorld, ScienceWorld, WebShop, and StableToolBench benchmarks.
World model Delta-F1, measuring prediction accuracy of action-induced state changes, improves from 76.0% to 85.4% with Qwen3-4B, and from 83.98% to 90.29% with Qwen3-8B under COMAP co-evolving training.
Ablations removing the future-state input reduce policy success by 9.8% (Qwen3-4B) and 24.1% (Qwen3-8B), highlighting the importance of lookahead feedback for policy refinement.
Freezing the world model after initial training causes a 26.9% (4B) and 29.1% (8B) drop in Delta-F1, confirming the necessity of continuous co-evolution to maintain accurate environment modeling.
On-policy self-distillation (WMSD) reduces the negative log-likelihood on change tokens from 2.02 to 1.45 (Qwen3-4B) and from 1.80 to 1.38 (Qwen3-8B), showing improved fidelity in learning transitions.
Future-aware reflection yields low unnecessary/harmful revision rates (URR 0.16/0.08 ALFWorld) and high beneficial revision precision (BRP 0.92), indicating effective but controlled policy corrections.
The world state gate gradually increases adoption of predicted world states during training, balancing noisy early predictions with reliable later adaptation.
COMAP outperforms strong baselines including GPT-5.4, DeepSeek-V4-Pro, ReAct prompting, Imagine-and-Act prompting, Reflexion, RAP, ITPI, WKM, IWM, and ITPR across multiple tasks.

Threat model

The paper does not model an explicit adversary. The challenge is implicit: the agent faces a potentially changing, partially observable environment with uncertain transitions, and both its world model and its policy must adapt to the non-stationary distributions caused by the agent’s own evolving behavior. The agent cannot observe true states directly and must refine its predictions and policy without external reward signals.

Methodology — deep read

The paper tackles the problem of non-adaptive textual world models for LLM-based agents by proposing co-evolution of the world model and the agent policy through closed-loop interaction.

Threat Model & Assumptions: The agent interacts with a partially observable Markov decision process (POMDP) with state space S, action space A, observation space O, transition function T, and reward R. The true environment state is not directly accessible; the agent relies on textual observations. The agent policy and world model are assumed to share the same backbone LLM architecture.
Data: The framework is warm-started by supervised fine-tuning on expert demonstrations D_exp and real on-policy rollouts D_roll collected in environment simulators. Each example is a (state, action, next state) tuple, with states and actions expressed as text describing observations, commands, or API calls. Evaluation uses seen and unseen task splits.
Architecture / Algorithm:
- The agent policy π_θ maps histories or states to draft actions.
- The textual world model W_ϕ predicts the next textual state given current state and candidate action.
- During inference, the policy drafts an action a_draft, the world model imagines a next state ŝ_{t+1}, which is provided back to the policy as a future feedback signal.
- The policy uses a future-aware reflection module to generate a refined action a_ref and a refinement probability p_t based on the reliability of the world model's prediction.
- A gating mechanism decides whether to adopt the draft action or the refined one for execution.
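
The inference steps listed above can be sketched as a single decision function. This is a minimal illustration, not the authors' implementation: the callables `draft`, `world_model`, and `reflect`, the `StepResult` record, and the fixed `gate_threshold` are all hypothetical names, and the threshold rule is only one plausible way to realize the action gate.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces; the paper's actual signatures may differ.
DraftFn = Callable[[str], str]                # s_t -> a_draft
WorldModelFn = Callable[[str, str], str]      # (s_t, a) -> predicted s_{t+1}
ReflectFn = Callable[[str, str, str], Tuple[str, float]]  # -> (a_ref, p_t)


@dataclass
class StepResult:
    draft_action: str
    predicted_state: str
    refined_action: str
    refine_prob: float
    executed_action: str


def comap_step(state: str, draft: DraftFn, world_model: WorldModelFn,
               reflect: ReflectFn, gate_threshold: float = 0.5) -> StepResult:
    """One decision step: draft, imagine the next state, reflect, then gate."""
    a_draft = draft(state)
    s_pred = world_model(state, a_draft)
    a_ref, p_t = reflect(state, a_draft, s_pred)
    # Assumed gate: adopt the refined action only when p_t clears a threshold.
    executed = a_ref if p_t >= gate_threshold else a_draft
    return StepResult(a_draft, s_pred, a_ref, p_t, executed)


# Toy stand-ins for the policy and world model (illustration only).
def toy_draft(state: str) -> str:
    return "open fridge"


def toy_world_model(state: str, action: str) -> str:
    if action == "open fridge":
        return "The fridge is locked."
    return "You unlock the fridge."


def toy_reflect(state: str, a_draft: str, s_pred: str) -> Tuple[str, float]:
    if "locked" in s_pred:
        return "unlock fridge", 0.9
    return a_draft, 0.1


result = comap_step("You are in the kitchen.", toy_draft,
                    toy_world_model, toy_reflect)
print(result.executed_action)
```

With the toy stand-ins, the predicted state signals a failure, so the refined action is adopted; raising `gate_threshold` above the refinement probability makes the step fall back to the draft action.
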
Training Regime:
- Initialization: Supervised fine-tuning of both policy and world model on expert demonstrations and offline rollouts.
- Reflection-mode initialization: Rollouts from both draft and expert actions are used to produce refinement labels y_ref, which train the action and refinement heads.
- Co-evolving training: The world model is trained with on-policy self-distillation (WMSD), using a teacher-student scheme in which the teacher sees the real next state and the student sees only predicted states, minimizing token-level divergence.
- The policy is trained with a combined loss from expert imitation, reflection-conditioned learning, and refinement probability supervision.
- An exponential moving average (EMA) controls teacher updates.
- World state and action gates regulate training inputs and policy adoption decisions to mitigate noise.
- Training uses Qwen3-4B and Qwen3-8B backbones, with batch sizes, epochs, and hyperparameters optimized empirically (details unspecified).
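
The two numerical ingredients of the self-distillation step, an EMA teacher update and a token-level divergence, can be sketched in plain Python. The summary does not state which divergence or decay value the paper uses, so the KL(teacher || student) direction, the `decay` default, and the function names `ema_update` and `token_kl` are assumptions made for illustration.

```python
import math
from typing import Dict, List


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float = 0.99) -> Dict[str, float]:
    """Move each teacher parameter a small step toward the student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


def token_kl(teacher_probs: List[List[float]],
             student_probs: List[List[float]],
             eps: float = 1e-12) -> float:
    """Mean over token positions of KL(teacher || student).

    Each inner list is a probability distribution over the vocabulary
    at one token position. The KL direction is an assumption.
    """
    total = 0.0
    for p_row, q_row in zip(teacher_probs, student_probs):
        total += sum(p * math.log((p + eps) / (q + eps))
                     for p, q in zip(p_row, q_row) if p > 0.0)
    return total / len(teacher_probs)


# Toy example: two token positions over a three-word vocabulary.
teacher = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.3, 0.3], [0.1, 0.8, 0.1]]
loss = token_kl(teacher, student)
new_teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
print(round(loss, 4), new_teacher)
```

In a real setup the teacher distribution would come from a forward pass conditioned on the real next state, and the student distribution from a pass that sees only predicted states; here both are given directly as lists.
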
Evaluation Protocol:
- Metrics: Task success rate (SR) for agent policies, including seen and unseen splits where applicable; Solvable pass rate (SoPR) for StableToolBench; Delta-F1 for world model next-state prediction accuracy.
- Baselines: Prompting-only API models (GPT-5.4, DeepSeek-V4-Pro) and methods built on open-source backbones, including ReAct, Imagine-and-Act, Reflexion, RAP, ITPI, WKM, IWM, and ITPR.
- Ablations: Systematic leave-one-out component removal for both policy and world model components to test contributions.
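- Reflection analysis: Quality of future-aware reflection is measured via unnecessary revision rate (URR), harmful revision rate (HRR), and beneficial revision precision (BRP).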
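
To make the Delta-F1 idea concrete, here is one possible way to score predicted state changes. The summary only says Delta-F1 measures accuracy on action-induced state changes, so the representation below (states as sets of textual facts, changes as added or removed facts) and the names `state_delta` and `delta_f1` are assumptions, not the paper's definition.

```python
from typing import Set, Tuple


def state_delta(before: Set[str], after: Set[str]) -> Set[Tuple[str, str]]:
    """Facts added ('+') or removed ('-') when moving from before to after."""
    added = {("+", fact) for fact in after - before}
    removed = {("-", fact) for fact in before - after}
    return added | removed


def delta_f1(before: Set[str], predicted_after: Set[str],
             true_after: Set[str]) -> float:
    """F1 between predicted and true state changes (assumed definition)."""
    pred = state_delta(before, predicted_after)
    gold = state_delta(before, true_after)
    if not pred and not gold:
        return 1.0  # nothing changed and nothing was predicted to change
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


before = {"fridge closed", "apple in fridge"}
true_after = {"fridge open", "apple in fridge"}
predicted_after = {"fridge open", "apple on table"}
score = delta_f1(before, predicted_after, true_after)
print(round(score, 3))
```

In this toy case the prediction gets the fridge change right but also hallucinates that the apple moved, so recall is perfect while precision is halved.
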
Reproducibility:
- Code and data released at https://github.com/loyiv/CoMAP.
- Datasets are publicly available benchmarks.

End-to-end example: At time step t in ALFWorld, the agent generates a draft action a_draft_t from state s_t. The world model predicts the next state ŝ_{t+1}. The agent then computes a refined action a_ref_t from s_t, a_draft_t, and ŝ_{t+1}, along with a confidence p_t. An action gate decides whether to execute a_ref_t or fall back to a_draft_t. The executed action causes an environment transition to s_env_{t+1}. The world model is updated with the real transition s_env_{t+1} through on-policy self-distillation, which compares student and teacher outputs. The updated world model then informs subsequent steps. The loop continues, improving the world model and the policy in tandem.

Technical innovations

A closed-loop framework (COMAP) that co-evolves textual world models and agent policies by mutually reinforcing them through interaction.
On-policy self-distillation of world models using a teacher-student regime that incorporates privileged real next states to improve prediction fidelity.
Future-aware reflection in the agent policy to refine draft actions based on model-predicted future states with gating mechanisms to adaptively decide action adoption.
Dynamic gating of predicted world-model states versus environment states during training to balance early noise and later reliable predictions.
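
The world state gate can be illustrated with a simple agreement rule. The summary does not describe how the gate decides, so everything here is hypothetical: the token-overlap score, the fixed `threshold`, and the names `token_overlap`, `world_state_gate`, and `adoption_ratio`. The sketch only shows how an adoption ratio could rise as predictions come to match the environment.

```python
from typing import List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def world_state_gate(predicted: str, env: str,
                     threshold: float = 0.6) -> Tuple[str, bool]:
    """Return the state used as training input and whether it was predicted.

    Assumed rule: trust the predicted state only if it agrees closely
    enough with the real environment state.
    """
    use_predicted = token_overlap(predicted, env) >= threshold
    return (predicted if use_predicted else env), use_predicted


def adoption_ratio(pairs: List[Tuple[str, str]],
                   threshold: float = 0.6) -> float:
    """Fraction of (predicted, env) pairs where the prediction is adopted."""
    if not pairs:
        return 0.0
    decisions = [world_state_gate(p, e, threshold)[1] for p, e in pairs]
    return sum(decisions) / len(decisions)


early = [("you see a cat", "the fridge is now open"),
         ("nothing happens", "you pick up the apple")]
late = [("the fridge is open", "the fridge is now open"),
        ("you pick up the apple", "you pick up the apple")]
print(adoption_ratio(early), adoption_ratio(late))
```

Early in training the toy predictions share no tokens with the environment states, so none are adopted; later predictions overlap heavily and all pass the gate.
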

Datasets

ALFWorld — thousands of language-conditioned household tasks — public benchmark
ScienceWorld — multi-step scientific reasoning and experimentation tasks — public benchmark
WebShop — e-commerce website navigation and purchasing tasks — public benchmark
StableToolBench — tool-use via API-calling in a stable virtual environment — public benchmark

Baselines vs proposed

ReAct (Qwen3-4B): Average success rate = 42.44% vs COMAP: 60.15%
ITPR (Qwen3-4B): Average success rate = 51.52% vs COMAP: 60.15%
ReAct (Qwen3-8B): Average success rate = 55.21% vs COMAP: 72.11%
ITPR (Qwen3-8B): Average success rate = 69.53% vs COMAP: 72.11%
GPT-5.4 (ReAct): Average success rate = 67.83% vs COMAP (Qwen3-8B): 72.11%
Delta-F1 World Model (Qwen3-4B, On-policy Training): 76.0% vs COMAP: 85.40%
Delta-F1 World Model (Qwen3-8B, On-policy Training): 83.98% vs COMAP: 90.29%
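
The headline gains mix relative and absolute improvements, so it helps to recompute both from the averages listed above. The helper names `absolute_gain` and `relative_gain` are introduced here for illustration; the input numbers are the ITPR and COMAP averages from this section.

```python
def absolute_gain(baseline: float, proposed: float) -> float:
    """Difference in percentage points."""
    return proposed - baseline


def relative_gain(baseline: float, proposed: float) -> float:
    """Improvement as a percentage of the baseline value."""
    return 100.0 * (proposed - baseline) / baseline


# ITPR vs COMAP average success rates (percent).
rel_4b = relative_gain(51.52, 60.15)   # about 16.75 (relative, Qwen3-4B)
abs_4b = absolute_gain(51.52, 60.15)   # about 8.63 points
abs_8b = absolute_gain(69.53, 72.11)   # about 2.58 points (reported as +2.6)
rel_8b = relative_gain(69.53, 72.11)   # about 3.71 (relative, Qwen3-8B)

print(f"4B: +{abs_4b:.2f} points, +{rel_4b:.2f}% relative")
print(f"8B: +{abs_8b:.2f} points, +{rel_8b:.2f}% relative")
```

The +16.75% figure is therefore a relative gain of 8.63 percentage points over a 51.52% baseline, while the +2.6% figure for Qwen3-8B is an absolute difference in percentage points.
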

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.02372.

Fig 1: Conceptual illustration of the co-evolution of …
Fig 2: Overview of the proposed COMAP framework. The textual world model and the agent policy co-evolve …
Fig 3 (page 3).
Fig 3: Component ablations of COMAP on ALFWorld. We report leave-one-component-out results on Qwen3-4B …
Fig 4: Learning dynamics of co-evolving. The upper …
Fig 5: Adoption ratio of the world state gate (Qwen3- …

Limitations

COMAP assumes environment states and actions can be fully represented as text, limiting applicability in multimodal tasks with visual or auditory signals.
The approach incurs additional inference latency due to the extra world-model call at each decision step for future-state imagination.
The experiments focus on simulated benchmarks; real-world noisy environments or adversarial settings remain unexplored.
The world model improvements depend on accurate privileged states during training, which may not always be available in complex environments.
Action refinement gating heuristic parameters require tuning and may add brittleness in different application contexts.

Open questions / follow-ons

How to extend COMAP-style co-evolution to multimodal environments combining vision and language?
Can the framework be adapted to online continual learning settings with non-stationary real-world data?
What are the limits of future-aware reflection when model predictions are systematically biased or adversarially attacked?
How to reduce inference latency introduced by world-model lookahead while maintaining policy quality?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, COMAP shows how language agents can build and refine internal world models over textual environment states and improve their policies without external rewards. This co-evolutionary mechanism lets agents better anticipate environment changes and make more robust long-horizon decisions. For CAPTCHA practitioners, the implication is that attacker bots built on LLM agents could become more capable by training their world models jointly with their policies, adapting to evolving interaction patterns and predicting the future states of challenge-response interactions more accurately. Defenses that rely solely on static challenge designs or fixed verification signals may therefore lose effectiveness over time as adaptive agents emerge. Applying similar co-evolution principles defensively, by continuously updating challenge models against live attacker behavior, might be an important countermeasure. However, the computational cost of maintaining co-evolving agent models, and the added latency, could limit applicability in real-time CAPTCHA scenarios. Overall, COMAP underscores the need for dynamic, adaptive bot-defense mechanisms that track and respond to bots’ evolving internal modeling strategies rather than relying on fixed detection rules.

Cite

@article{arxiv2606_02372,
  title={COMAP: Co-Evolving World Models and Agent Policies for LLM Agents},
  author={Youwei Liu and Jian Wang and Hanlin Wang and Wenjie Li},
  journal={arXiv preprint arXiv:2606.02372},
  year={2026},
  url={https://arxiv.org/abs/2606.02372}
}
