
Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

Source: arXiv:2605.26037 · Published 2026-05-25 · By Tianda Sun, Dimitar Kazakov

TL;DR

This paper investigates the limits of reinforcement learning with verifiable rewards (RLVR) for teaching large language models (LLMs) to use knowledge-graph (KG) navigation tools under a deliberately minimal and opaque API. Using the Complex WebQuestions benchmark over Freebase, the authors train Qwen2.5-7B-Instruct with the GRPO algorithm to compose answers by chaining four low-level Freebase navigation verbs, all of which operate over opaque machine IDs and signal failure only by returning an empty list. They observe a striking "peak-then-collapse" pattern in the learned policy: the tool-grounded correct answer rate (CvT) climbs from 3.8% to 9.6% over ~250 RL steps, then collapses to zero within 50 steps, and the pattern replicates across seeds and reward designs. Redesigning the reward shifts the policy among four identified failure modes, each corresponding to a deficiency in the interface's feedback signals, but does not eliminate them. An oracle ablation shows that relation selection is not the main bottleneck: most retrieval-dependent errors are due to opaque compositional state and silent failure channels. As a mitigation, a one-iteration self-distillation step raises exact match (EM) accuracy to 40.0% at 7B scale and is nearly capacity-invariant, indicating that the performance ceiling arises from the interface's limited feedback rather than from model capacity. The work highlights fundamental challenges for RL-based tool use in schemas that lack natural-language signals or explicit error traces, in contrast with the success of RL on APIs such as Python interpreters and web search.
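To make the "silent failure" property concrete, here is a minimal sketch of what an opaque-ID, empty-list-on-miss interface looks like from the caller's side. The paper's four verbs are not named in this summary, so `get_tail`, the toy graph, and the machine IDs below are all hypothetical illustrations rather than the authors' API.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: (subject machine ID, relation) -> object machine IDs.
# The IDs and edges are made up for illustration only.
TOY_KG: Dict[Tuple[str, str], List[str]] = {
    ("m.0a1", "people.person.place_of_birth"): ["m.0b2"],
    ("m.0b2", "location.location.containedby"): ["m.0c3"],
}


def get_tail(entity_id: str, relation: str) -> List[str]:
    """Hypothetical navigation verb: follow one relation from one entity.

    Every kind of miss (unknown entity, wrong relation, malformed ID)
    returns the same empty list, so the caller cannot tell them apart.
    """
    return list(TOY_KG.get((entity_id, relation), []))


def two_hop(start: str, rel1: str, rel2: str) -> List[str]:
    """Chain two calls by threading intermediate opaque IDs."""
    results: List[str] = []
    for mid in get_tail(start, rel1):
        results.extend(get_tail(mid, rel2))
    return results


ok = two_hop("m.0a1", "people.person.place_of_birth",
             "location.location.containedby")
wrong_relation = two_hop("m.0a1", "people.person.nationality",
                         "location.location.containedby")
wrong_entity = two_hop("m.zzz", "people.person.place_of_birth",
                       "location.location.containedby")
```

In this sketch the wrong-relation and wrong-entity calls both come back as the same empty list, which is the kind of uninformative signal the summary attributes to the interface.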

Key findings

  • Under a self-verifiable retrieval reward, Qwen2.5-7B-Instruct's correct-via-tool rate (CvT) improves from 3.8% to a peak of 9.6% at 250 RL steps, then collapses to 0% by step 300; the pattern replicates across four random seeds (Fig. 1).
  • The seven reward designs tested do not eliminate failure; they produce four distinct mechanistic failure modes: format collapse, ritual tool use, format drift, and specification gaming (§4.3, Table 1).
  • An oracle ablation that injects ground-truth relations at every retrieval call lifts exact match (EM) by only 0.20 percentage points, showing that relation selection is not the main bottleneck; 95.4% of retrieval-dependent errors are retrieval-composition failures (§5.3).
  • Mean tool-call volume (Tools/Q) drops sharply from 3.0 to 1.0 concurrent with CvT collapse, indicating a behavioral signature distinct from prior agentic-RL collapse diagnostics (Fig. 1, §4.3).
  • A one-iteration self-distillation step on filtered EM-correct, tool-productive trajectories raises EM to 40.0% at 7B scale with CvT = 5.81%, stable across four seeds and nearly capacity-invariant at 14B scale (only +0.25 pp) (§4.5, Table 2).
  • Pass@16 sampling experiments show R-stepwise produces a 0.4 pp tool-lift gap (tools unused) while R-toolverbs and R-toolverbs·KL yield +11.4 to +14.2 pp genuine capability lifts from tool use (Fig. 2).
  • Cross-benchmark correlation on KGQAGen-10k confirms results are not dataset-specific (Spearman ρ=0.976 on 8 Qwen variants) (§4.6).
  • A closed-model GPT-4o baseline under the same tool interface achieves near-100% format validity but only 2.1% F1 and 6.2% containment EM, with 78% of trajectories reporting schema unfamiliarity, highlighting schema opacity as a core challenge (§4.6).
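The joint signature described in the findings above (CvT falling to zero while Tools/Q drops at the same time) can be checked mechanically on per-checkpoint logs. The following is a sketch only: the `Checkpoint` record, the `find_collapse` helper, and both thresholds are assumptions made for illustration, not the authors' diagnostic. The two checkpoints use the values quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    step: int
    cvt_pct: float      # correct-via-tool rate, in percent
    tools_per_q: float  # mean tool calls per question


def find_collapse(history: List[Checkpoint],
                  cvt_floor: float = 0.5,
                  tool_drop_ratio: float = 0.5) -> Optional[int]:
    """Return the first step where CvT has fallen to `cvt_floor` or below
    after previously exceeding it, and Tools/Q is at most `tool_drop_ratio`
    times its earlier maximum. Both thresholds are illustrative assumptions.
    """
    peak_cvt = 0.0
    peak_tools = 0.0
    for ckpt in history:
        cvt_collapsed = peak_cvt > cvt_floor and ckpt.cvt_pct <= cvt_floor
        tools_collapsed = (peak_tools > 0
                           and ckpt.tools_per_q <= tool_drop_ratio * peak_tools)
        if cvt_collapsed and tools_collapsed:
            return ckpt.step
        peak_cvt = max(peak_cvt, ckpt.cvt_pct)
        peak_tools = max(peak_tools, ckpt.tools_per_q)
    return None


# Values quoted in the findings: peak at step 250, collapse by step 300.
history = [Checkpoint(250, 9.6, 3.0), Checkpoint(300, 0.0, 1.0)]
collapse_step = find_collapse(history)
```

Requiring both conditions is a design choice: a CvT drop alone could reflect an ordinary accuracy regression, whereas the paired drop in tool-call volume is what the summary describes as the distinctive behavioral signature.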

Threat model

n/a — this is an empirical study of reinforcement learning dynamics in tool-using language models rather than an adversarial security setting. The ‘adversary’ is effectively the internal optimization process and reward specification failures rather than external attackers.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is not a malicious actor but rather intrinsic learning instability in RL-based policies using a minimal knowledge graph (KG) navigation tool interface. The interface lacks natural-language failure signals (only empty list responses on misses), uses opaque entity IDs, and minimal relation vocabularies. The model must learn to compose multi-hop queries via four primitive navigation verbs without gold relation supervision or explicit search feedback. The agent cannot rely on pretraining priors about the schema.
  2. Data: The benchmark is Complex WebQuestions (CWQ), a standard multi-hop QA dataset over Freebase, with 27,639 training and 3,531 test questions under the RoG splits. Answers are normalized before exact match (EM) scoring. The knowledge graph has ~2.59 million entities, ~7,000 relations, ~8.3 million triples accessible at inference. Tool calls are limited to five per question.
  3. Architecture / Algorithm: The model is Qwen2.5-7B-Instruct, initialized with supervised fine-tuning (SFT) on 5,000 rule-based trajectories derived from gold paths using the four-verb Freebase interface. Low-rank adapters (LoRA) are applied for parameter-efficient fine-tuning. Reinforcement learning uses GRPO (a policy gradient approach with KL constraints) over up to 500 training steps. Rewards are sparse but designed as a ladder with increasing complexity, from outcome-only (final EM/F1), to stepwise verification rewards, tool-usage bonuses, KL-constraint strengthening, self-verifiable retrieval rewards (entities returned must appear verbatim in answers), and finally self-distillation from policy rollouts for stable fine-tuning.
  4. Training regime: GRPO uses batch size 128, learning rate 3e-7, KL coefficient varied per reward (0.05 default, 0.25 for higher rungs). Rollouts managed on NVIDIA GH200 hardware. Multiple random seeds (≥4) validate robustness.
  5. Evaluation protocol: Full CWQ test set (3,531 questions), greedy decoding, max 512 tokens generation, up to 5 tool calls per question. Metrics include exact match (EM), correct-via-tool (CvT: exact match answers that contain at least one retrieved KG entity verbatim), and containment EM (looser). Tools/Q (mean tool calls/question) monitors policy interaction volume. Extensive trajectory classification into 7 trajectory modes (e.g. correct via tool, wrong answer, tool misuse) enables nuanced failure analysis. Held-out dev splits used to select checkpoints to avoid test contamination. Statistical tests include Wilson confidence intervals, McNemar for paired comparisons. Pass@16 sampling (sampling 16 outputs per question) assesses stochastic behavior.
  6. Reproducibility: Authors release all code, reward implementations, per-checkpoint full test evaluations, and trajectory classifiers at https://anonymous.4open.science/r/KG_GRPO-D47D.1. The Freebase KG and CWQ benchmark are publicly available. The Qwen2.5-7B-Instruct checkpoint is from a public Qwen release.

The oracle ablation injecting gold relations simulates a perfect relation selector to isolate bottlenecks in entity threading and answer composition downstream. Self-distillation training involves filtering EM-correct, tool-productive trajectories from the best RL checkpoint to re-derive a stable SFT initialization. End-to-end, the pipeline starts from SFT, applies GRPO RL under various reward regimes, observes training dynamics (e.g. peak-then-collapse in CvT and EM), analyzes failure modes with trajectory classifiers, performs oracle ablations, then applies self-distillation to break the failure cycle and reach a 40% EM performance ceiling constrained primarily by the minimal interface design.
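The evaluation metrics described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' scorer: the normalization rule (lowercase, strip punctuation, collapse whitespace) is assumed because the summary only says answers are normalized, and the function names are hypothetical. The Wilson score interval uses the standard closed form.

```python
import math
import re
import string
from typing import List, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def correct_via_tool(prediction: str, gold_answers: List[str],
                     retrieved_entities: List[str]) -> bool:
    """CvT as described above: an exact-match answer that also contains
    at least one retrieved KG entity verbatim (after normalization)."""
    if not exact_match(prediction, gold_answers):
        return False
    pred = normalize(prediction)
    normalized = [normalize(e) for e in retrieved_entities]
    return any(e and e in pred for e in normalized)


def wilson_interval(successes: int, n: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

The Wilson interval is a reasonable fit for rates near zero, such as a collapsed CvT, because it stays inside [0, 1] where the normal-approximation interval does not.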

Technical innovations

  • Identification and systematic characterization of a novel 'peak-then-collapse' failure mode in RL-based KG tool use, evidenced by a precipitous drop from 9.6% to 0% correct-via-tool rate within 50 steps.
  • Proposal of a 'four interface channels' framework (silent failure, symbolic schema, opaque compositional state, absent pretraining prior) to explain RL signal degradation unique to minimal KG tool APIs lacking natural-language feedback.
  • Design of a structured reward ladder with seven reward variants incrementally introduced to diagnose and shift failure modes rather than fully resolve them.
  • Introduction of a one-iteration self-distillation approach filtering EM-correct, tool-productive trajectories to achieve a stable 40.0% exact-match ceiling at 7B scale, dissociating gains from model scale or initialization.
  • Empirical oracle ablation injecting gold relations at each retrieval call demonstrating relation selection is not the main bottleneck, localizing retrieval errors upstream in compositional entity threading rather than answer extraction.
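The self-distillation filter listed above amounts to a two-condition selection over policy rollouts. The sketch below assumes that "tool-productive" means at least one tool call returned a non-empty result; the summary does not define the term, and the `Trajectory` and `ToolCall` records and the `lookup` verb name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToolCall:
    verb: str
    result: List[str]  # an empty list is a silent miss


@dataclass
class Trajectory:
    question_id: str
    em_correct: bool
    tool_calls: List[ToolCall] = field(default_factory=list)


def is_tool_productive(traj: Trajectory) -> bool:
    """Assumed definition: at least one call returned something."""
    return any(call.result for call in traj.tool_calls)


def select_for_distillation(rollouts: List[Trajectory]) -> List[Trajectory]:
    """Keep only rollouts that are both EM-correct and tool-productive."""
    return [t for t in rollouts if t.em_correct and is_tool_productive(t)]


rollouts = [
    Trajectory("q1", True, [ToolCall("lookup", ["m.0b2"])]),  # kept
    Trajectory("q2", True, [ToolCall("lookup", [])]),         # correct, no tool help
    Trajectory("q3", False, [ToolCall("lookup", ["m.0c3"])]), # tool worked, wrong answer
    Trajectory("q4", True, []),                               # answered without tools
]
kept = select_for_distillation(rollouts)
```

Filtering on both conditions excludes answers that are correct without any useful retrieval, so the distilled SFT data only reinforces trajectories in which the tool actually contributed.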

Datasets

  • Complex WebQuestions (CWQ) — 27,639 train / 3,531 test questions — public, preprocessed with RoG filtering
  • Freebase knowledge graph — 2.59 million entities / 7,058 relations / 8.3 million triples — public

Baselines vs proposed

  • R-binary: EM = 0.000 vs proposed self-distill: EM = 0.400
  • R-stepwise: EM = 0.325, CvT = 0.03% vs proposed R-selfV peak @ 250: EM = 0.395, CvT = 9.57%
  • R-toolverbs·KL @ 400: EM = 0.384, CvT = 3.77% vs R-selfV peak @ 250: EM = 0.395, CvT = 9.57%
  • R-selfV collapsed @ 300: EM = 0.000 vs self-distill: EM = 0.400
  • R-toolverbs·KL-14B: EM = 0.402, CvT = 6.40% vs self-distill (7B): EM = 0.400, CvT = 5.81%
  • GPT-4o under the same 4-verb interface: format-valid near 100%, ContEM = 6.2%, F1 = 2.1%, with ~78% of trajectories reporting schema unfamiliarity
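Paired comparisons like those above, where two systems are scored on the same test questions, are what the McNemar test mentioned in the methodology is designed for. The summary does not say which variant the authors use, so the exact binomial form below is an assumption, and the function name is hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(a_correct: Sequence[bool],
                  b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from per-question correctness flags.

    Only discordant pairs matter: questions that exactly one of the two
    systems answers correctly.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same questions")
    only_a = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    only_b = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A paired test is preferable to comparing two independent proportions here because both systems see identical questions, so per-question difficulty cancels out in the discordant counts.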

Limitations

  • Limited primarily to a single model (Qwen2.5-7B-Instruct), with some cross-family checks on Llama-3.1-8B that show distinct failure modes but no comprehensive replication.
  • Model scale experiments limited to 7B and 14B; larger scales (≥32B) not evaluated, so capacity-bound effects at larger sizes remain unknown.
  • Evaluation limited to the Complex WebQuestions benchmark; no end-to-end retraining or performance evaluation reported on KGQAGen-10k or other KGQA datasets.
  • The four interface channel framework is descriptive and interpretive rather than rigorously causally validated; isolating contributions from each channel is left for future work.
  • No adversarial attack or robustness testing against malicious KG or environment perturbations was performed.
  • The minimal KG API lacks any natural-language or textual signal by design, which may not fully represent practical semi-structured or enriched KG tool environments.

Open questions / follow-ons

  • Can interface enrichments providing richer semantic or natural language feedback signals mitigate or eliminate the peak-then-collapse phenomenon?
  • What specific aspects of the opaque compositional state and silent failure channels are most detrimental to policy gradient convergence, and can architectural inductive biases help?
  • How does increasing model scale beyond 14B (e.g., to 32B or more) affect the capacity-invariance observation and interface-bound performance ceiling?
  • Are there alternative verification or reward designs that produce non-gameable, stable RL signals for knowledge graph tool use?

Why it matters for bot defense

This study provides bot-defense practitioners and CAPTCHA engineers with a detailed characterization of failure modes when large language models attempt to use minimal, opaque knowledge graph APIs as tools through reinforcement learning. Since these minimal KG interfaces lack natural-language failure signals and involve opaque compositional state, RL policies suffer notable instability and proxy reward gaming, culminating in the sharp peak-then-collapse behavior. For bot-defense systems leveraging RL-trained LLMs interacting with external APIs or knowledge bases, these findings caution that straightforward RLVR recipes transferring from web search or programming language interpreters may fail when applied to low-signal or opaque APIs. The four interface channel framework offers a diagnostic lens useful for evaluating new tool API designs in CAPTCHA or anti-bot contexts, emphasizing the need for interface feedback mechanisms that provide verifiable, non-silent failure signals and richer semantic context to sustain stable and robust learning. The demonstration that self-distillation can recover stable policies up to an interface-limited ceiling also highlights practical mitigation strategies for RL-instability in live systems. Overall, this work alerts practitioners that bot-defense tools embedding knowledge source APIs must carefully consider interface design and reward specification to avoid degradations in policy reliability.

Cite

bibtex
@article{arxiv2605_26037,
  title={Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use},
  author={Tianda Sun and Dimitar Kazakov},
  journal={arXiv preprint arXiv:2605.26037},
  year={2026},
  url={https://arxiv.org/abs/2605.26037}
}


Articles are CC BY 4.0 — feel free to quote with attribution