Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Source: arXiv:2605.30335 · Published 2026-05-28 · By Anany Kotawala

TL;DR

This paper addresses a coherence failure mode in multi-component large language model (LLM) agents that aggregate probabilistic claims from specialists, each of which sees only part of a joint question. Even when every component is locally coherent on its own subset, the naive assembly can violate the probability axioms globally: a "locally coherent, globally incoherent" phenomenon. The paper formalizes this with the compositional residual ε⋆, the L2 distance from the composed quote to the joint coherent polytope, which is computable at runtime from system outputs and the declared cross-component coupling constraints. A product-structure dichotomy theorem shows that local coherence suffices if and only if the joint polytope factorizes as a Cartesian product; when it does not, ε⋆ certifies the failure. A hierarchical Boyle-Dykstra projection repair algorithm deterministically enforces joint coherence, and an anytime-valid e-process provides sequential online coherence monitoring. Empirically, on 1,876 multi-LLM ensemble cliques across four relation types, ε⋆ was positive in 33–94% of cases, depending on relation complexity and routing. The Rayleigh-quotient magnitude prediction matched observed residuals within 7% for three relation classes. Notably, intuitive LLM-side mitigations such as retrieval grounding, partition-aware prompting, and aggregator-LLM repair failed to reduce incoherence consistently and in some cases worsened it. The work thus offers a theoretical foundation, a practical runtime metric, and a repair method for compositional coherence failures in multi-component LLM agent systems.

Key findings

Compositional residual ε⋆ was strictly positive on 33% (conjunction) to 94% (partition) of 1,876 multi-LLM ensemble cliques, indicating widespread cross-component incoherence even with locally coherent specialists.
The product-structure dichotomy theorem (Thm 3.3) shows local coherence composes to global coherence if and only if the joint polytope factorizes as a Cartesian product; otherwise positive ε⋆ failures exist.
Rayleigh-quotient closed-form magnitude prediction of E[(ε⋆)^2] matched observed residuals within 7% on negation, partition, and disjunction relation classes, and within 17% (0.83 factor) on conjunction.
Hierarchical Boyle-Dykstra projection reduces ε⋆ to a numeric floor of ≤1.5×10⁻¹⁶ in all relation classes, repairing incoherence deterministically at low runtime cost.
Exposure bound √m⋆ε⋆ links the compositional residual to Dutch-book risk per the Fundamental Theorem of Asset Pricing, translating to approximately +0.115 nats regret per bet on 1,770 resolved bets.
LLM-side mitigation attempts (retrieval grounding, partition-aware prompting, aggregator-LLM repair) mostly failed to reduce residuals and sometimes worsened incoherence.
Cross-model heterogeneity amplifies the residual ε⋆ by factors of 1.7–4.5 relative to reruns with a single model, showing that source-model disagreement dominates residual size.
A sequential anytime-valid e-process test controls false positives when monitoring long-run residual streams, allowing runtime detection and escalation of persistent incoherence.
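
The dichotomy and the exposure bound can be illustrated with the smallest non-product case, a negation pair split across two specialists. This is a minimal sketch, not the paper's code: the function name `negation_residual` and the example quotes are hypothetical, and m⋆ is assumed here to be the number of joint coordinates (2).

```python
import math

def negation_residual(p_a: float, p_not_a: float) -> float:
    """L2 distance from (p_a, p_not_a) to the line p_a + p_not_a = 1.

    For quotes inside [0, 1]^2 the projection onto this line stays inside
    the box, so the closed form equals the distance to the coherent set.
    """
    return abs(p_a + p_not_a - 1.0) / math.sqrt(2.0)

# Each specialist is locally coherent: a single probability in [0, 1].
p_a = 0.6       # hypothetical quote from specialist 1 on "A"
p_not_a = 0.6   # hypothetical quote from specialist 2 on "not A"

eps = negation_residual(p_a, p_not_a)   # compositional residual
exposure = math.sqrt(2) * eps           # sqrt(m) * eps, assuming m = 2

print(f"residual = {eps:.4f}, exposure bound = {exposure:.4f}")
```

The joint set here is a line segment, not a Cartesian product of two intervals, which is why two individually valid quotes can still leave a positive residual.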

Threat model

The "adversary" is the multi-component LLM agent itself: it assembles probabilistic claims from multiple specialists, each holding a partial view of a joint probability space subject to explicit cross-component logical constraints. The attack surface is the agent's naive composition of locally coherent specialists, which can produce globally incoherent aggregate forecasts and expose the system to Dutch books. Coupling constraints cannot be altered post hoc, and the components cannot be jointly reconditioned. In other words, the specialists are honest but contextually disjoint, so incoherence arises from composition rather than from malicious interference.

Methodology — deep read

The threat model treats the composed multi-component LLM agent as the source of failure: each subcomponent issues calibrated, locally coherent marginal probability forecasts on a subset of a joint set of Bernoulli questions constrained by logical relations. The failure arises when composition violates the constraints of the joint coherent polytope, producing globally incoherent quotes that are exposed to Dutch books.

Data provenance involves 1,876 ensemble "cliques" formed from Paleka (cross-question logical consistency benchmark) and Polymarket (partition simplex constraints) datasets. Each clique consists of m Bernoulli questions and associated logical relations (negation, conjunction, disjunction, partition). Labels come from real resolutions with strict leakage control (events resolved after model snapshots). Splits include random assignment of questions to one of 4 LLM specialists and structured routing variants. Six models from Anthropic, OpenAI, and Meta via public APIs generate 8 verbalized probability samples per question at temperature 0.7.

The core architecture is a multi-component agent with k submodels, each owning a subset Q_a of the questions and emitting an empirical marginal vector p̂(a). Each specialist applies Joint-Coherent Decoding (JCD) to its marginals, projecting them onto its local coherent polytope M_a. An owner-selected aggregator A then assembles the global quote by taking each joint coordinate from the component that owns it. This locally repaired quote is then projected onto the joint global polytope M⋆, defined by lifting the local constraints and enforcing the cross-component coupling C (shared questions, logical relations, partitions). The compositional residual ε⋆ is the L2 distance between the aggregated, locally repaired quote and its global projection.
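
The aggregation-then-projection pipeline can be sketched for a single partition clique, where the joint polytope is the probability simplex. This is an illustrative sketch under stated assumptions: `project_to_simplex` uses the standard sort-based simplex projection rather than the paper's OSQP setup, and the names `compositional_residual`, `quotes`, and `owner` and all numbers are hypothetical.

```python
import numpy as np

def project_to_simplex(q: np.ndarray) -> np.ndarray:
    """Euclidean projection of q onto {p : p >= 0, sum(p) = 1} (sort-based)."""
    u = np.sort(q)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0                   # cumulative excess mass
    k = np.arange(1, len(q) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index that stays positive
    tau = css[rho] / (rho + 1)                 # uniform shift
    return np.maximum(q - tau, 0.0)

def compositional_residual(quotes, owner):
    """Owner-selected aggregation, then projection onto the partition simplex."""
    # Take coordinate j from the specialist that owns it.
    composed = np.array([quotes[owner[j]][j] for j in range(len(owner))])
    repaired = project_to_simplex(composed)
    return composed, repaired, float(np.linalg.norm(composed - repaired))

# Hypothetical 4-outcome partition split across two specialists.
# Each local quote is coherent on its own (mass <= 1 on its sub-partition).
quotes = [{0: 0.5, 1: 0.3}, {2: 0.3, 3: 0.2}]
owner = [0, 0, 1, 1]

composed, repaired, eps = compositional_residual(quotes, owner)
print(composed, repaired, round(eps, 4))
```

The composed quote sums to 1.3, so no single specialist is at fault; the excess only appears once the unit-mass coupling is enforced across owners.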

The procedure in this setting is empirical: submodel probability quotes are gathered, projected locally and globally using a quadratic program solver (OSQP), and the residuals are measured. The hierarchical Boyle-Dykstra cyclic projection algorithm iterates local and coupling projections to compute the global joint projection Π⋆. Evaluation metrics include the compositional residual ε⋆, the exposure bound √m⋆ε⋆ (interpreted as Dutch-book risk), and Brier score differences against resolved labels, compared in pairs via Diebold-Mariano tests. Baseline regimes include naive composition (local JCD only), hierarchical JCD repair, a single-LLM reference (no composition), and LLM-side mitigations.
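
The cyclic projection idea can be sketched with the generic Dykstra scheme on two convex sets. This is a textbook version under stated assumptions, not the paper's hierarchical variant: the two sets (a unit box standing in for local constraints, a unit-sum hyperplane standing in for a coupling constraint), the helper names, and the example quote are all hypothetical.

```python
import numpy as np

def dykstra(x0, projections, iters=500):
    """Dykstra's alternating projections onto an intersection of convex sets.

    Unlike plain alternating projection, the per-set correction terms make
    the iterates converge to the Euclidean projection of x0 itself.
    """
    x = np.asarray(x0, dtype=float).copy()
    corrections = [np.zeros_like(x) for _ in projections]
    for _ in range(iters):
        for i, proj in enumerate(projections):
            y = proj(x + corrections[i])             # project the corrected point
            corrections[i] = x + corrections[i] - y  # remember what was removed
            x = y
    return x

def project_box(x):
    """Stand-in for local constraints: each coordinate in [0, 1]."""
    return np.clip(x, 0.0, 1.0)

def project_unit_sum(x):
    """Stand-in for a coupling constraint: coordinates sum to 1."""
    return x - (x.sum() - 1.0) / x.size

quote = np.array([0.9, 0.3, 0.0])   # hypothetical composed quote, sums to 1.2
repaired = dykstra(quote, [project_box, project_unit_sum])
residual = float(np.linalg.norm(quote - repaired))
print(repaired, round(residual, 4))
```

A single pass through both projections would not give the nearest coherent point here, because removing mass uniformly pushes the third coordinate below zero; the correction terms are what recover the exact projection.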

The evaluation protocol isolates the residual due to cross-component incoherence by ensuring zero local residual per specialist, uses random and structured routing topologies to assess the impact of routing, controls for leakage of future event information, and confirms the presence of the residual via same-model decoupling controls. Statistical significance is reported via p-values and bootstrapped confidence intervals. Correlation and regression (R²) analyses show a strong linear relationship between residual size and disagreement with a global reference prediction.

Reproducibility: full raw responses, prompts, and per-clique residuals are released with the supplementary materials. Several closed-form projections and theoretical bounds are verified empirically. A comprehensive appendix documents solver tolerances, leakage protocols, and detailed proofs. However, raw training code for JCD and the planner LLMs does not appear to be released, and the Polymarket dataset is publicly known but indirect.

Technical innovations

Definition and operationalization of compositional residual ε⋆ as an online, distribution-free certificate of system-level coherence failure in composed LLM ensembles.
Product-structure dichotomy theorem (Thm 3.3) precisely characterizing when local coherence suffices for global coherence in owner-selected aggregation, i.e., iff the global polytope factorizes as a Cartesian product.
Closed-form Rayleigh-quotient magnitude prediction (Cor. 3.9) of expected squared residual from specialist covariance restricted to constraint normals, enabling predictive hardness ordering.
Application of hierarchical Boyle-Dykstra cyclic projection (Thm 3.10) for efficient deterministic global repair of incoherent composed forecasts, converging to the exact projection onto the joint coherent polytope.
Formulation of an anytime-valid e-process coherence monitoring test (Thm D.2) supporting sequential false-positive control without fixed stopping times.
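
To make the Rayleigh-quotient prediction concrete, here is a sketch of the simplest case: a single linear coupling constraint with the box constraints inactive. The symbols a, b, p_0, η, and Σ are assumptions introduced for illustration; Cor. 3.9 in the paper may be stated more generally.

```latex
% Assumed setting: coherent set H = \{ p : a^\top p = b \},
% composed quote q = p_0 + \eta with p_0 \in H,
% E[\eta] = 0 and Cov(\eta) = \Sigma.
\varepsilon^\star
  = \frac{|a^\top q - b|}{\|a\|_2}
  = \frac{|a^\top \eta|}{\|a\|_2},
\qquad
\mathbb{E}\big[(\varepsilon^\star)^2\big]
  = \frac{a^\top \Sigma\, a}{a^\top a}.

% Negation pair: a = (1, 1)^\top, b = 1,
% \Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2):
\mathbb{E}\big[(\varepsilon^\star)^2\big]
  = \frac{\sigma_1^2 + \sigma_2^2}{2}.
```

Under these assumptions the expected squared residual is the specialist noise covariance measured along the constraint normal, which is why relation classes with different normals would be expected to differ in residual size.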

Datasets

Paleka — hundreds of cliques targeting logical consistency with negation, conjunction, disjunction, paraphrase relation classes — public benchmark from Paleka et al. (2025).
Polymarket — 67 partition constraint cliques filtered for leakage, expanded to 268 ensemble instances — derived from real forecast markets with unit-mass constraints.

Baselines vs proposed

Naive local-then-global composition has a mean exposure bound √m⋆ε⋆ of 0.137; hierarchical JCD repair drives the residual to 1.7×10⁻¹⁴ (QP solver tolerance floor).
Same-model rerun baseline reduces ε⋆ from 0.058–0.118 across relations to 0.019–0.058, showing that 22–60% of the residual persists within a single model (cross-model heterogeneity amplifies the residual by 1.7–4.5×).
LLM-side mitigations: retrieval grounding raises mean ε⋆ from 0.260 to 0.283 (a regression); partition-aware prompting reduces mean ε⋆ to 0.066, but 53% of partitions still have a residual above 0.05; aggregator-LLM repair reduces the mean to 0.028, but 15% of partitions remain above 0.05 and 7% regress.
Paired Brier improvements under hierarchical JCD repair: NEG (−0.0137, p<10⁻⁴³), AND (−0.0076, p<10⁻¹⁶), PARTITION (−0.0048, p<10⁻²³), OR marginal (+0.0027, p=0.07; predicted conditional reversal).
Rayleigh-quotient prediction accuracy (obs./pred. ratio): Negation 1.054 (95%), Partition 1.069 (93%), Disjunction 1.026 (97%), Conjunction 0.830 (83%).

Limitations

Evaluation restricted to relatively small clique sizes (max m=100) and limited number of specialists (4 LLMs), possibly limiting scalability to very large multi-component agents.
The explicit cross-component coupling constraints C are assumed known and declared; real-world agent transcripts often contain implicit coupling, which is left for future work.
LLM-side mitigation methods tested are limited to retrieval grounding, partition-aware prompting, and aggregator-LLM; more sophisticated joint prompting or training-based methods are unexplored.
The residual ε⋆ is sensitive to logical-relation assumptions; label noise or true label incoherence can cause reversals in projection benefits (e.g., disjunction relation class).
No adversarial evaluation against maliciously constructed specialist forecasts designed to maximize incoherence or exploit the repair mechanism.
The hierarchical JCD projection and monitoring rely on quadratic program solvers subject to numerical tolerances, which may affect practical runtime and robustness in large-scale applications.

Open questions / follow-ons

How to infer or learn implicit cross-component coupling constraints C directly from unstructured agent transcripts or natural language context?
Can joint training or prompting strategies be developed to produce globally coherent multi-component forecasts without explicit post-hoc geometric projection repairs?
How robust are these coherence certificates and repairs under adversarial specialists deliberately engineered to maximize incoherence or exploit the online e-process monitor?
Can these methods scale efficiently and with low latency to very large multi-agent LLM ensembles operating in real-time interactive workflows?

Why it matters for bot defense

Bot-defense engineers building multi-component LLM agents for complex decision pipelines or ensemble forecasting should be alert to the compositional incoherence failure mode isolated here, in which specialists individually produce well-calibrated outputs but naive aggregation violates the probability axioms. The compositional residual ε⋆ and its decomposition-theoretic characterization offer a runtime, distribution-free metric for detecting and diagnosing global incoherence and the resulting exploitable exposure (e.g., Dutch-book risk). The hierarchical Boyle-Dykstra projection provides a practical repair that enforces coherent published forecasts without retraining specialists or making prohibitively expensive cross-model calls. The anytime-valid e-process enables continuous monitoring that can trigger safety escalations or abstentions when incoherence persists, which is valuable in high-stakes bot-defense and CAPTCHA contexts where inconsistent multi-LLM beliefs might open vulnerabilities. The failure of natural LLM-side mitigations highlights the importance of formal geometric coherence tools over prompt engineering alone. Although the work focuses on joint probabilities consistent with logical relations rather than on adversarial manipulation, its diagnostic and repair infrastructure can improve trust and robustness in the multi-LLM agent architectures used in CAPTCHA or automated bot-defense pipelines with layered specialist models.
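
The monitoring idea can be sketched with a generic betting-style e-process over a stream of residuals. This is not the construction from Thm D.2, which is not reproduced in this summary: it is a standard test supermartingale, and the null hypothesis, the tolerated mean `mu0`, the bet size `lam`, and the example streams are all assumptions chosen for illustration.

```python
def e_process(residuals, mu0=0.05, lam=2.0, alpha=0.05):
    """Betting-style e-process for H0: E[X_t | past] <= mu0, with X_t in [0, 1].

    Wealth is a nonnegative supermartingale under H0, so by Ville's
    inequality it exceeds 1/alpha at any time with probability <= alpha.
    Returns the wealth path and the first alarm step (or None).
    """
    if not 0.0 <= lam <= 1.0 / mu0:
        raise ValueError("lam must lie in [0, 1/mu0] to keep wealth nonnegative")
    wealth, path, alarm_at = 1.0, [], None
    for t, x in enumerate(residuals, start=1):
        x = min(max(x, 0.0), 1.0)              # clip to [0, 1]; conservative
        wealth *= 1.0 + lam * (x - mu0)        # bet against "residual is small"
        path.append(wealth)
        if alarm_at is None and wealth >= 1.0 / alpha:
            alarm_at = t                       # evidence threshold crossed
    return path, alarm_at

# Hypothetical streams: persistently incoherent vs. fully repaired.
_, alarm_bad = e_process([0.26] * 20)
_, alarm_ok = e_process([0.0] * 20)
print(alarm_bad, alarm_ok)
```

Because the threshold holds at every step simultaneously, the stream can be checked after each composed quote and monitoring can stop at the first alarm without inflating the false-positive rate.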

Cite

@article{arxiv2605_30335,
  title={Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents},
  author={Anany Kotawala},
  journal={arXiv preprint arXiv:2605.30335},
  year={2026},
  url={https://arxiv.org/abs/2605.30335}
}
