Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Source: arXiv:2603.24709 · Published 2026-03-25 · By Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin et al.
TL;DR
This paper addresses the challenge of training large language models (LLMs) for multi-step tool orchestration, that is, executing sequences of interdependent API calls, where prior state-of-the-art models suffer from frequent parameter errors that break entire workflows. The authors identify two core obstacles: the lack of realistic, deterministic environments for training on complex API dependency chains, and sparse binary rewards that give little guidance in long multi-step tasks. They propose a reinforcement learning framework consisting of (1) a deterministic, cache-based environment built on 100k+ real API responses with workflow templates, enabling constrained synthesis of valid multi-step sequences, and (2) a graduated reward function that decomposes correctness into atomic validity (call-level correctness at increasing granularity) and orchestration consistency (correct sequencing respecting dependencies), providing dense learning signals. Empirically, the approach substantially improves turn accuracy on ComplexFuncBench over zero-shot baselines (37.5% to 52.1% for Qwen3-8B, and 16.7% to 33.8%, roughly double, for Qwen2.5-7B-Instruct), and ablation studies indicate that both reward components contribute. Cross-benchmark evaluation on BFCL v4 shows that the learned orchestration skills transfer to distinct API ecosystems such as agentic web search and memory management, improving multi-step task accuracy with no degradation on single-step function calls.
Key findings
- Qwen3-8B improved turn accuracy on ComplexFuncBench from 37.5% (zero-shot) to 52.1% (+14.6 percentage points) after RL training on oracle queries (Table 1).
- Ablation shows Ratomic-only reward yields 32.2% turn accuracy, Rorch-only 37.5%, but combined reward improves accuracy to 52.1% (Table 2).
- GRPO training on constrained synthetic data yields up to 36.4% turn accuracy for Qwen3-8B, versus 34.4% zero-shot in the same setting (Table 3).
- Cross-benchmark evaluation on BFCL v4 Agentic category shows web search accuracy improves from 3.5% to 10.5% and memory tasks from 17.4% to 21.7% with RL trained model (Table 4).
- Error analysis reveals parameter extraction from queries is the main bottleneck rather than dependency propagation errors (Table 6).
- Performance gains increase with dependency depth (up to +31% at depth 4) and hold across workflow patterns including fan-out (Figure 3).
- Increasing samples per workflow template beyond 3 yields diminishing returns, suggesting greater benefit from expanding template diversity rather than per-template data density (Table 5).
- RL training reduces premature stopping errors and balances improvements across function selection, parameter, and sequence errors (Table 6).
Threat model
The adversary is implicitly a malicious or malfunctioning model or agent that may generate incorrect multi-step API call sequences due to parameter errors or misordering, threatening reliability and safety of automated workflows. The training framework assumes the adversary cannot alter or corrupt the deterministic environment or cached API responses, and cannot bypass parameter or dependency validation enforced by the environment. The threat model focuses on improving model correctness rather than defending against active adversarial attacks.
Methodology — deep read
The authors formulate multi-step tool orchestration as a sequential decision-making problem where a model generates a sequence of function calls with parameters, respecting dependencies where later calls utilize outputs from earlier calls. The core is learning a policy πθ(y|q) to maximize reward R(y, y*) reflecting correctness of the predicted workflow y against ground truth y*. They address two foundational obstacles:
Deterministic training environment: To avoid instability and inconsistency from live APIs or synthetic data, they build a cache-based environment seeded with over 100,000 real API responses collected via executing complete valid workflows using live Booking.com APIs. These responses cover 40 functions and are indexed by function, parameter, and value to allow O(1) lookups. They define over 100 workflow templates specifying valid function sequences and parameter dependencies derived from API specs. The cache ensures consistent deterministic execution: a generated function call (f, θ) returns the cached API response or an error in case of invalid calls.
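The lookup behaviour described above can be sketched as a small replay environment. This is a minimal illustration, not the authors' implementation: the class name `CachedToolEnv`, the `make_key` helper, the error payload shape, and the toy `search_destination` entry are all assumptions made for the example.

```python
import json
from typing import Any, Dict, Tuple

CacheKey = Tuple[str, str]


def make_key(function: str, params: Dict[str, Any]) -> CacheKey:
    """Canonicalize a call so that parameter order does not affect the key."""
    return function, json.dumps(params, sort_keys=True)


class CachedToolEnv:
    """Hypothetical replay environment: serves recorded responses, never a live API."""

    def __init__(self) -> None:
        self._cache: Dict[CacheKey, Dict[str, Any]] = {}

    def record(self, function: str, params: Dict[str, Any],
               response: Dict[str, Any]) -> None:
        """Store one response collected ahead of time."""
        self._cache[make_key(function, params)] = response

    def step(self, function: str, params: Dict[str, Any]) -> Dict[str, Any]:
        """Average-case O(1) dict lookup; unknown calls yield an error observation."""
        hit = self._cache.get(make_key(function, params))
        if hit is None:
            return {"status": "error",
                    "message": f"no cached response for {function}"}
        return {"status": "ok", "data": hit}


# Toy usage with made-up data.
env = CachedToolEnv()
env.record("search_destination", {"query": "Paris"}, {"dest_id": "d-1"})
print(env.step("search_destination", {"query": "Paris"}))
print(env.step("search_destination", {"query": "Atlantis"}))
```

Because the same call always maps to the same key, repeated rollouts see identical observations, which is the property the summary attributes to the cache.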
Constrained data synthesis: Using workflow templates, they sample multi-step action-observation sequences from the cache, respecting dependency constraints step by step. Independent steps sample parameters randomly from the cache, while dependent steps intersect reverse indices to find parameter bindings compatible with earlier outputs. If sampling fails, it is retried up to a limit. For each valid action sequence, an LLM is prompted to generate a user query that semantically corresponds to the sampled actions and to echo the exact parameters in JSON, which allows automatic consistency verification.
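A rough sketch of the sampling step, under stated assumptions: the toy `ENTRIES`, the `TEMPLATE` encoding of dependencies as `(earlier_step_index, response_field)`, and the function names are all hypothetical, and the LLM query-generation step is omitted. The point is only to show how intersecting reverse-index sets narrows a dependent step to compatible cache entries.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set, Tuple

# Toy cache entries: (function, params, response). All values are made up.
ENTRIES: List[Tuple[str, Dict[str, Any], Dict[str, Any]]] = [
    ("search_destination", {"query": "Paris"}, {"dest_id": "d-1"}),
    ("search_destination", {"query": "Rome"}, {"dest_id": "d-2"}),
    ("search_hotels", {"dest_id": "d-1", "adults": 2}, {"hotel_id": "h-10"}),
    ("search_hotels", {"dest_id": "d-1", "adults": 1}, {"hotel_id": "h-11"}),
    ("get_hotel_details", {"hotel_id": "h-10"}, {"name": "Hotel A"}),
    ("get_hotel_details", {"hotel_id": "h-11"}, {"name": "Hotel B"}),
]

# Template step: (function, {param_name: (earlier_step_index, response_field)}).
TEMPLATE = [
    ("search_destination", {}),
    ("search_hotels", {"dest_id": (0, "dest_id")}),
    ("get_hotel_details", {"hotel_id": (1, "hotel_id")}),
]

# Build the indices once: entries per function, and entries per parameter value.
by_function: Dict[str, Set[int]] = defaultdict(set)
reverse_index: Dict[Tuple[str, str, Any], Set[int]] = defaultdict(set)
for idx, (fn, params, _) in enumerate(ENTRIES):
    by_function[fn].add(idx)
    for name, value in params.items():
        reverse_index[(fn, name, value)].add(idx)


def sample_once(rng: random.Random) -> Optional[List[int]]:
    """Try to sample one entry index per template step; None if a step has no match."""
    chosen: List[int] = []
    for fn, deps in TEMPLATE:
        candidates = set(by_function[fn])
        for name, (src_step, field) in deps.items():
            value = ENTRIES[chosen[src_step]][2][field]
            # Keep only entries whose parameter equals the earlier output.
            candidates &= reverse_index.get((fn, name, value), set())
        if not candidates:
            return None
        chosen.append(rng.choice(sorted(candidates)))
    return chosen


def sample_workflow(rng: random.Random, max_retries: int = 10) -> Optional[List[int]]:
    """Retry failed samples up to a limit, as the summary describes."""
    for _ in range(max_retries):
        result = sample_once(rng)
        if result is not None:
            return result
    return None


workflow = sample_workflow(random.Random(0), max_retries=64)
if workflow is not None:
    for i in workflow:
        print(ENTRIES[i][0], ENTRIES[i][1])
```

In this toy cache the "Rome" destination has no compatible hotel entries, so a sample that starts there fails at the second step and is retried.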
Graduated reward function design: To overcome sparse binary reward pitfalls, they decompose reward into two complementary parts:
- Ratomic measures atomic validity of each predicted function call at three graduated AST levels—correct function name, parameter structure/type, and exact parameter values—combined with semantic validation via executing the call and checking success.
- Rorch measures orchestration consistency by checking that the predicted sequence satisfies the dependency graph: each predicted step that matches the ground truth must occur only after all of its prerequisites have been executed correctly and in order.
The total reward is a weighted sum Rtotal = λ·Ratomic + (1−λ)·Rorch, with λ = 0.5 chosen empirically.
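The two reward components can be illustrated with a simplified sketch. Several details here are assumptions rather than the paper's definitions: each of the three levels is worth one third, predicted and gold calls are aligned by position, steps are identified by function name, and the semantic execution check is left out. Only the weighted sum with λ = 0.5 is taken from the text above.

```python
from typing import Any, Dict, List, Set, Tuple

Call = Tuple[str, Dict[str, Any]]  # (function name, parameters)


def atomic_score(pred: Call, gold: Call) -> float:
    """Graduated credit in thirds: name, then parameter keys/types, then values."""
    (p_fn, p_args), (g_fn, g_args) = pred, gold
    if p_fn != g_fn:
        return 0.0
    level = 1  # function name matches
    same_shape = p_args.keys() == g_args.keys() and all(
        type(p_args[k]) is type(g_args[k]) for k in g_args
    )
    if same_shape:
        level = 2  # parameter names and types match
        if p_args == g_args:
            level = 3  # exact parameter values match
    return level / 3


def r_atomic(pred: List[Call], gold: List[Call]) -> float:
    """Mean graduated score over gold steps, aligned by position (an assumption)."""
    if not gold:
        return 1.0
    return sum(atomic_score(p, g) for p, g in zip(pred, gold)) / len(gold)


def r_orch(pred: List[Call], deps: Dict[str, Set[str]]) -> float:
    """Fraction of gold steps that are called after all of their prerequisites."""
    satisfied: Set[str] = set()
    for fn, _ in pred:
        if fn in deps and deps[fn] <= satisfied:
            satisfied.add(fn)
    return len(satisfied) / len(deps) if deps else 1.0


def total_reward(pred: List[Call], gold: List[Call],
                 deps: Dict[str, Set[str]], lam: float = 0.5) -> float:
    """Weighted sum: lam * R_atomic + (1 - lam) * R_orch."""
    return lam * r_atomic(pred, gold) + (1 - lam) * r_orch(pred, deps)


# Toy example with made-up calls: the second call has one wrong value.
gold = [("search_destination", {"query": "Paris"}),
        ("search_hotels", {"dest_id": "d-1", "adults": 2})]
deps = {"search_destination": set(), "search_hotels": {"search_destination"}}
near_miss = [gold[0], ("search_hotels", {"dest_id": "d-1", "adults": 3})]
print(total_reward(near_miss, gold, deps))
```

A binary reward would score the near-miss rollout the same as a completely wrong one; the graduated version still separates them, which is the density argument the summary makes.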
Training setup: They use policy gradient RL (GRPO) to optimize the reward on two base LLMs (Qwen3-8B and Qwen2.5-7B-Instruct). Batched rollouts in the cache environment use batch size 16 and rollout length 16, trained for sufficient steps on 8×H100 GPUs. The KL divergence coefficient is 0.001.
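For readers unfamiliar with GRPO, its central step is normalizing each rollout's reward against the other rollouts sampled for the same query. The sketch below shows only that normalization, in the commonly described form (reward minus group mean, divided by group standard deviation); the function name, the `eps` value, and the use of population standard deviation are assumptions, and the clipped policy-gradient loss and KL penalty are not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# With graded rewards, rollouts in a group are ranked against each other.
print(group_relative_advantages([0.0, 0.5, 1.0]))
# If every rollout receives the same reward, every advantage is zero.
print(group_relative_advantages([0.0, 0.0, 0.0]))
```

The second call hints at why sparse binary rewards are a problem in this setting: when all rollouts in a group fail, the group carries no learning signal at all.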
Evaluation: They evaluate on ComplexFuncBench (1,000 test samples, 5 API domains, average 3.26 steps per query), measuring Turn Accuracy (longest prefix of consecutively correct turns per sample) and Call Accuracy (individual call correctness). They run oracle experiments that train and evaluate on the same queries to isolate the impact of the environment, plus experiments that train on disjoint synthetic data generated by the constrained pipeline. Cross-benchmark generalization is tested on BFCL v4 agentic tasks involving different APIs. Ablations vary reward components, data scale, and complexity stratifications. Error analysis categorizes failure modes.
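One plausible reading of the prefix-based metric can be sketched as follows. The aggregation (credited turns divided by total turns across samples) is an assumption, not the benchmark's documented formula, and `flat_accuracy` is included only as an order-insensitive contrast over the same flags, not as ComplexFuncBench's Call Accuracy.

```python
from typing import List


def correct_prefix_length(turn_correct: List[bool]) -> int:
    """Count consecutively correct turns from the start; stop at the first error."""
    count = 0
    for ok in turn_correct:
        if not ok:
            break
        count += 1
    return count


def turn_accuracy(samples: List[List[bool]]) -> float:
    """Credited prefix turns over all turns (assumed aggregation)."""
    total = sum(len(s) for s in samples)
    credited = sum(correct_prefix_length(s) for s in samples)
    return credited / total if total else 0.0


def flat_accuracy(samples: List[List[bool]]) -> float:
    """Order-insensitive contrast: every correct flag counts, wherever it occurs."""
    total = sum(len(s) for s in samples)
    return sum(sum(s) for s in samples) / total if total else 0.0


# Made-up correctness flags for two samples.
samples = [[True, True, False, True], [True, True]]
print(turn_accuracy(samples), flat_accuracy(samples))
```

Under the prefix rule, the correct fourth turn in the first sample earns nothing because it follows an error, which is why a single early parameter mistake is so costly for this metric.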
Reproducibility: Code and environment are publicly released at https://github.com/horizon-rl/ToolOrchestrationReward. The cache includes 100k real Booking.com API responses. Evaluation benchmarks ComplexFuncBench and BFCL are public.
A concrete example: For a hotel booking workflow template (Search Dest → Search Hotels → Get Details), the constrained synthesis samples a real cached Search Dest call with parameters, then finds compatible Search Hotels cache entries dependent on the first call output. An LLM is prompted to produce a query that entails this exact sequence. RL training improves the policy to produce correct parameter values and respect call order, receiving graduated rewards to guide learning.
Technical innovations
- A deterministic, cache-based execution environment using 100k+ real API responses with workflow-aware, dependency-consistent caching for stable multi-step RL training.
- Constrained data synthesis pipeline leveraging workflow templates and reverse indices to sample valid, dependency-respecting multi-step action sequences for training data.
- Graduated reward design combining atomic validity (multi-level AST and semantic execution checks) with global orchestration consistency enforcing dependency-respecting call order.
- Demonstration of transferability of learned orchestration skills across distinct API ecosystems via cross-benchmark evaluation.
Datasets
- ComplexFuncBench — 1,000 test samples spanning five API domains — public benchmark
- BFCL v4 Agentic category — size unspecified; multiple diverse API tool tasks — public benchmark
- Booking.com API cache — 100,000+ real API response entries over 40 functions — collected by authors, not public
Baselines vs proposed
- Zero-shot Qwen3-8B: Turn Accuracy = 37.5%, RL trained: 52.1% (+14.6 percentage points) (Table 1)
- Zero-shot Qwen2.5-7B-Instruct: Turn Accuracy = 16.7%, RL trained: 33.8% (+17.1 percentage points) (Table 1)
- Ratomic only reward: Turn Acc = 32.2%, Call Acc = 15.3% vs Combined reward: Turn Acc = 52.1%, Call Acc = 49.5% (Table 2)
- BFCL v4 Web Search (agentic) accuracy: baseline 3.5%, RL trained 10.5% (Table 4)
- BFCL v4 Memory accuracy: baseline 17.4%, RL trained 21.7% (Table 4)
- SFT alone yields negligible or negative improvements compared to zero-shot; GRPO RL training yields best results (Table 3)
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2603.24709.

Fig 1: Framework overview using hotel booking as a running example. Phase 1: Collect
Limitations
- Training environment and data synthesis focused on Booking.com API ecosystem; extension to other domains requires new workflow templates and data collection.
- Evaluation limited to publicly available benchmarks; real-world robustness and unseen API calls not extensively tested.
- Reward design tuned for multi-step tool calls with clear dependency structures; may not generalize to less-structured or highly dynamic API ecosystems.
- Parameter extraction from natural language queries remains a major bottleneck, limiting overall accuracy.
- No adversarial evaluation against malicious or corrupted inputs to test robustness.
- Scaling experiments show limited gains beyond small increases in data per template; broader template diversity is required but costly to curate.
Open questions / follow-ons
- How to automate workflow template extraction and dependency parsing from arbitrary API specifications to scale training to diverse domains?
- Can the graduated reward design be adapted or extended to support stochastic or live API environments with partial observability and non-determinism?
- What methods can better improve query parameter extraction from natural language to address the main source of errors?
- How robust is the trained orchestration policy to adversarial inputs or API changes, and can continual learning or adaptation help?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, this work offers insights into training LLMs to reliably orchestrate multi-step interactions involving complex APIs, a scenario that is increasingly common as LLMs control or automate system actions. The deterministic cache environment and graduated reward shaping provide a template for building stable training playgrounds that capture real-world dependencies and give fine-grained reward signals, mitigating the sparse feedback problem in multi-step workflows. Such approaches could be valuable when designing bot-detection defenses that evaluate multi-step behavioral patterns or that deploy adaptive CAPTCHA challenges linked to multi-step action verification. The paper's separation of atomic correctness from global orchestration consistency also offers a structured way to analyze failure modes, which is useful for forensic assessment of suspicious multi-step interactions. The cross-domain transfer results further imply that trained models may generalize across API ecosystems, a caution for bot detection because attackers may reuse learned orchestration capabilities. However, the need for curated workflow templates and the reliance on domain-specific cached responses imply significant engineering overhead before deployment in new contexts.
Cite
@article{arxiv2603_24709,
title={Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards},
author={Cheng Jiayang and Xin Liu and Zhihan Zhang and Haoyang Wen and Zixuan Zhang and Qingyu Yin and Shiyang Li and Priyanka Nigam and Bing Yin and Chao Zhang and Yangqiu Song},
journal={arXiv preprint arXiv:2603.24709},
year={2026},
url={https://arxiv.org/abs/2603.24709}
}