Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Source: arXiv:2605.22138 · Published 2026-05-21 · By Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu et al.

TL;DR

This paper addresses the inefficiencies and shortcomings of current large language model (LLM)-based agentic reasoning systems that rely on end-to-end trained reactive policies with unconstrained chain-of-thought for planning. Without explicit control over when, how extensively, and with what structure planning occurs, existing approaches suffer from dramatically increased reasoning length without proportional accuracy improvements. To overcome this, the authors propose decomposing agent decision-making into three interacting systems: (1) simulative reasoning (System II), which grounds planning in explicit future-state prediction via a learned world model; (2) self-regulation (System III), which learns when and how deeply to plan by a learned configurator module; and (3) reactive execution (System I), which handles fine-grained immediate actions without planning.

They instantiate this architecture in SR2AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing the configurator and simulative planner as distinct stages in an LLM’s chain-of-thought, with the LLM itself serving as the language-based world model. They explore two versions: v0.1, which records decisions from a prompted multi-module system (8B params), and v1.0, which reconstructs structured plans from pretrained reasoning LLM traces followed by supervised then reinforcement learning (30B params). Across multiple reasoning domains (math, science, tabular analysis, web info-seeking), SR2AM models achieve Pass@1 accuracy competitive with much larger 120B-1T parameter systems, while consuming 25.8% to 95.3% fewer reasoning tokens than comparable agentic LLMs. Reinforcement learning further increases average planning horizon by 22.8% without meaningfully increasing planning frequency (2.0% growth), demonstrating more efficient deeper planning.

Overall, this work provides a novel and generalizable mechanism to enable efficient, adaptive deliberation in agentic LLMs by explicitly separating world-model-based simulative planning from learned self-regulation of planning invocation. This leads to substantial improvements in accuracy-efficiency tradeoffs on challenging interactive reasoning tasks.

Key findings

SR2AM-v0.1-8B achieves Pass@1 of 57.0, competitive with unregulated agentic LLMs at 30–32B and pretrained LLMs with tools at 120–355B parameters.
SR2AM-v1.0-30B reaches Pass@1 of 71.3, competitive with DeepSeek-V3.2 (685B, 73.2) and Kimi-K2.5 (1.0T, 70.9) while using 25.8–95.3% fewer reasoning tokens than comparable agentic LLMs.
Ablation shows removing free-form reasoning (System I) drops accuracy from 66.6 to 46.8 Pass@1; removing selective planning (System III) increases token usage from 4,925 to 5,451, highlighting its role in efficiency.
Reinforcement learning (RL) increases average planning horizon by 22.8% while planning frequency grows only 2.0 percentage points, indicating learning to plan deeper rather than more often.
Planning horizon gains vary by domain: science tasks see 32.7% increase, web tasks 20.9%, consistent with environmental uncertainty limiting horizon.
Compared to unregulated deliberation, SR2AM consumes fewer tokens during RL training while achieving higher Pass@1 (72.8 vs baseline around 60) with lower out-of-context rates.
SR2AM’s self-regulated decomposed approach exceeds teacher CoT supervised finetuning baseline (66.6 vs 65.3 Pass@1) and further improves with RL.
Among 30–32B agentic LLMs, SR2AM-v1.0-30B outperforms MiroThinker-v1.5-30B in accuracy (71.3 vs 74.2) while using 51.2% fewer reasoning tokens (5,518 vs 11,295).

Threat model

n/a — The paper focuses on improving agentic reasoning efficiency and accuracy through better planning regulation rather than addressing adversarial threats or security concerns.

Methodology — deep read

The authors formalize agentic decision-making as a decomposition into three interacting systems: reactive execution (System I) for immediate action selection, simulative reasoning (System II) that plans by predicting future states using a learned world model, and self-regulation (System III) that learns when and how deeply to invoke planning via a configurator.

Threat Model & Assumptions: The agent interacts with an environment modeled as a partially observable Markov decision process (POMDP) where true states are unknown and beliefs are inferred from observations. The agent must maximize expected reward with long-term planning. No specific adversarial threat model is assumed since the focus is on reasoning efficiency and accuracy.

Data: Two supervised data collection methods build the training dataset. (1) Multi-module inference (v0.1): a prompted multi-module LLM system (o4-mini, 8B) produces traces with explicit configurator decisions and plans filtered for correctness and complexity, yielding 4,845 examples. (2) Plan reconstruction (v1.0): annotator LLM (DeepSeek-V3.2, 30B) processes pretrained LLM chain-of-thought traces (DeepSeek-V3.2) to reconstruct configurator decisions and simulative plans, enabling 10,787 supervised examples that better preserve free-form reasoning. Additional datasets cover math, science, tabular, and web reasoning.

Architecture: Both versions embed System II (simulative planner) and System III (configurator) as separate generation stages within the LLM’s chain-of-thought. The world model is implicit within the LLM’s language space. The output plan encodes belief states, proposed actions, and predicted future states. The configurator decides per step whether to plan, continue planning, or skip, controlling invocation frequency and horizon. System I is free-form reasoning and action execution.

Training Regime: Supervised finetuning is done first on the curated datasets from either multi-module inference or plan-reconstruction pipelines. Then policy-gradient reinforcement learning with Group Relative Policy Optimization (GRPO) and asymmetric clipping refines the policies using a composite reward combining answer correctness, structural format compliance, and final answer extractability. RL shapes coordination among the three systems to improve task success and reasoning efficiency. For 30B models, trajectory filtering prevents format collapse.

Evaluation Protocol: Evaluation uses 11 benchmarks across math (AIME-24, AIME-25, MATH-500), science (GPQA-Diamond, SuperGPQA, HLE), tabular analysis (FinQA, MultiHier), and web information seeking (BrowseComp, GAIA-103, XBench-DeepSearch). Pass@K (mostly Pass@1) averaged across datasets measures accuracy. Reasoning efficiency is measured by average reasoning tokens per trajectory (excluding environment/tool outputs). Baselines cover pretrained reasoning LLMs with and without tools, as well as unregulated and partially-regulated agentic LLMs of various parameter sizes. Ablations test removal of each system component. RL training curves compare against unregulated baselines.

Reproducibility: Code and model artifacts are released at https://github.com/sailing-lab/sr2am. The datasets combine open-source benchmark datasets with proprietary LLM-generated traces. The plan reconstruction and multi-module inference pipeline details are provided in appendices.

Example: In the web information seeking setting, the agent receives a query, the configurator decides at each step whether to plan a sequence of tool calls (e.g., web_search, visit_tool, python_repl_tool) simulating future belief states and outcomes. The simulative planner proposes an actionable plan, which the actor executes step-by-step. If uncertainty or urgency is high, the configurator may skip planning and act reactively to save tokens. The plan horizon is adapted accordingly. RL learns to increase average plan lengths when appropriate while limiting invocation frequency to improve token efficiency and task success.

Technical innovations

Decomposition of agentic reasoning into three explicit interacting systems: reactive execution (System I), simulative reasoning/planning grounded in a world model (System II), and self-regulation via a learned configurator to control planning invocation (System III).
Implementation of simulative planning and configurator decision-making as distinct stages embedded in the LLM’s chain-of-thought, with the LLM itself serving as a world model in language space.
Plan reconstruction approach (v1.0) that annotates pretrained LLM reasoning traces to generate supervised data with structured plans and self-regulation signals, enabling scalable end-to-end training.
Use of reinforcement learning with a reward combining answer correctness, format compliance, and extractability to jointly optimize coordination among the three systems, yielding longer planning horizons without increasing planning frequency.
Empirical demonstration that self-regulated simulative planning improves accuracy-efficiency tradeoffs across diverse interactive reasoning tasks compared to unregulated or partially regulated alternatives.

Datasets

Guru + multi-hop QA datasets — 4,845 examples — collected via multi-module prompted system (v0.1)
MegaScience + additional web reasoning datasets — 10,787 examples — collected via plan reconstruction (v1.0)
Benchmark test sets: AIME-24, AIME-25, MATH-500, GPQA-Diamond, SuperGPQA, HLE (500 Q subset), FinQA, MultiHier, BrowseComp, GAIA-103, XBench-DeepSearch — sizes variable — publicly available/open-source

Baselines vs proposed

Qwen3-8B (unregulated deliberation): Pass@1 = ~46.8 (SFT ablation baseline), SR2AM-v0.1-8B: Pass@1 = 57.0
DeepSeek-V3.2 (685B) with tools: Pass@1 = 73.2, SR2AM-v1.0-30B: Pass@1 = 71.3
Kimi-K2.5 (1.0T) with tools: Pass@1 = 70.9, SR2AM-v1.0-30B: Pass@1 = 71.3
MiroThinker-v1.5-30B: Pass@1 = 74.2, reasoning tokens = ~11,300; SR2AM-v1.0-30B: Pass@1 = 71.3, tokens = 5,518 (~51.2% fewer)
Ablation removing free-form reasoning (System I) from SR2AM-v1.0-30B: Pass@1 drop from 66.6 to 46.8
Ablation removing selective planning (System III): tokens increase from 4,925 to 5,451
After RL, SR2AM-v1.0-30B Pass@1 improves to 72.8 with token count rising modestly from 4,925 to 5,414

Limitations

While reasoning efficiency improves, occasional over-planning on simpler tasks indicates configurator calibration for stopping planning is not perfect.
Plan reconstruction approach depends on annotator LLM quality and pretrained reasoning LLM traces, which could limit generality or transferability.
Evaluation domains cover math, science, tabular, and web tasks but do not include embodied or multi-agent environments.
No explicit adversarial robustness or attack evaluation was performed—future threat models involving adversarial inputs are unexplored.
Exact optimization of planning under the world model is intractable, so planning approximations may limit performance ceilings.
The learned configurator is trained only with reward signals related to task success and format compliance—not on explicit costs of compute or latency.

Open questions / follow-ons

Can the self-regulation mechanism be extended beyond inference-time planning to dynamically govern an agent’s ongoing learning or adaptation strategies?
How can calibrating when to start and stop planning be improved to avoid both over-planning and under-planning in diverse environments?
How does the approach perform in settings with unreliable world models or high environment stochasticity where simulation predictions degrade?
Can the system generalize to multimodal agents or embodied settings that require joint vision-language-action planning with stricter real-time constraints?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper provides valuable insights into how large language models can achieve efficient yet accurate multi-step planning through explicit separation of simulative planning and learned self-regulation. In scenarios where bots attempt complex reasoning or multi-step automated interactions (e.g., web scraping, puzzle solving), these findings highlight methods to reduce unnecessary or wasteful planning steps without sacrificing effective decision making.

By adopting a self-regulated simulative reasoning architecture, bot detection systems might discern differences between bots that produce efficient, regulated planning traces and naive bots exhibiting unregulated, lengthy chain-of-thoughts with token inefficiencies. The notion of a learned configurator controlling planning invocation could inspire bot-defense mechanisms that monitor or constrain attacker reasoning behavior adaptively. Moreover, the demonstrated ability to maintain competitive accuracy with drastically fewer reasoning tokens suggests possible efficiency gains in verifying genuine human reasoning patterns versus automated adversaries. However, integration into CAPTCHA challenges would require further work adapting these concepts to real-time constraints and adversarial robustness, which remain open questions.

Cite

bibtex

@article{arxiv2605_22138,
  title={ Efficient Agentic Reasoning Through Self-Regulated Simulative Planning },
  author={ Mingkai Deng and Jinyu Hou and Lara Sá Neves and Varad Pimpalkhute and Taylor W. Killian and Zhengzhong Liu and Eric P. Xing },
  journal={arXiv preprint arXiv:2605.22138},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.22138}
}

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​