Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Source: arXiv:2606.13649 · Published 2026-06-11 · By Nathaniel Bottman, Yinhong Liu, Kyle Richardson

TL;DR

This paper addresses the problem of detecting reasoning failures in large language models (LLMs) at inference time without relying on ground-truth labels. Existing confidence metrics such as self-consistency and semantic entropy measure sample variability or rely on self-evaluation, but do not explicitly check internal compositional coherence. The authors introduce "operadic consistency" (OC), a per-question diagnostic derived from operad theory, which formalizes compositional structure by treating questions as operations composed from sub-questions. OC measures agreement between a model’s direct answer to a compositional query and the answer obtained by composing separately answered sub-queries. Evaluated on twelve instruction-tuned LLMs (4B–671B parameters) and five “thinking” models across four multi-hop QA datasets, OC correlates strongly with true accuracy on every dataset (Pearson r between 0.86 and 0.94), outperforming chain-of-thought self-consistency, which degrades on some datasets. OC also provides information complementary to existing confidence signals, improves selective prediction at equal inference cost, and generalizes to decompositions extracted from thinking models' own generated reasoning. Overall, operadic consistency provides a theoretically principled, label-free signal for compositional reasoning failures, enabling better uncertainty quantification in multi-step LLM reasoning.

Key findings

Operadic consistency (OC) correlates strongly with LLM per-question accuracy on four multi-hop QA datasets with Pearson r in [0.86, 0.94], all p ≤ 0.0004.
OC is the only evaluated metric with correlation r ≥ 0.85 uniformly across all four datasets; chain-of-thought self-consistency (CoT-SC) matches OC only on HotpotQA (r=0.93) and DROP (r=0.87) but drops to ~0.45 on MuSiQue and StrategyQA.
Leave-one-model-out prediction of accuracy using OC-rate achieves a mean absolute error of 3.0 percentage points, better than 4.6 pp for CoT-SC with 10 samples (Fig. 1).
At the per-question level, OC adds statistically significant predictive value beyond CoT-SC and semantic entropy (p ≤ 10^-16), and this is confirmed by logistic regressions controlling for decomposition-aware baselines (p ≤ 10^-13).
Adding OC to a tuned CoT-SC baseline at equal inference cost (K=3 model calls) improves selective prediction metrics by AUARC +0.086 to +0.096 and AUROC +0.092 to +0.164 with 95% confidence intervals excluding zero on all datasets.
In five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, OC combined with CoT-SC yields positive selective prediction lifts on all 16 tested dataset-budget-metric cells, with 95% CIs excluding zero on 12/16.
Decomposition-aware baselines such as decomposed self-consistency and process reward models do not outperform OC, and OC remains significant when controlling for these baselines, arguing against OC simply capturing sample stability or known process rewards.
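
The leave-one-model-out result above can be illustrated with a minimal sketch. It assumes a simple linear fit from a model's OC rate to its accuracy; the summary does not state the exact regression form, so the linear predictor, the function names, and the toy numbers are all assumptions made for illustration and are not the paper's data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_out_mae(oc_rates, accuracies):
    """Hold out each model in turn, fit on the rest, and average the
    absolute error of the held-out accuracy prediction."""
    errors = []
    for held in range(len(oc_rates)):
        xs = [x for i, x in enumerate(oc_rates) if i != held]
        ys = [y for i, y in enumerate(accuracies) if i != held]
        slope, intercept = fit_line(xs, ys)
        predicted = slope * oc_rates[held] + intercept
        errors.append(abs(predicted - accuracies[held]))
    return mean(errors)


# Hypothetical per-model values, invented purely for illustration.
toy_oc_rates = [0.55, 0.62, 0.70, 0.78, 0.85]
toy_accuracies = [0.48, 0.57, 0.61, 0.72, 0.80]
toy_mae = leave_one_out_mae(toy_oc_rates, toy_accuracies)
print(f"toy leave-one-model-out MAE: {100 * toy_mae:.1f} pp")
```

Holding out one model at a time mimics predicting the accuracy of an unseen model from its OC rate alone, which is the transfer setting the reported MAE describes.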

Threat model

The setting is the diagnosis of untrusted LLM outputs at inference time without access to ground-truth labels. The 'adversary' here is effectively the uncertainty-estimation system or user trying to detect when an LLM’s compositional reasoning fails internally. This party can issue multiple queries (direct and decomposed) to the model and observe the outputs, but cannot alter the model or access internal training information. Its capability is limited to black-box probing of the model’s reasoning consistency via operadic decompositions; it cannot inject external supervision or manipulate decomposition annotations.

Methodology — deep read

Threat Model and Assumptions: The work assumes an adversary perspective of detecting reliability failures of LLMs during inference on compositional multi-step queries. The adversary (an analyst or system) does not have access to ground-truth labels but does have access to the model’s outputs including decomposed sub-answers. The threat model focuses on identifying if and when the model’s reasoning internally composes correctly, rather than modifying or attacking the model.
Data: Evaluation is conducted on four standard multi-hop and compositional question answering (QA) datasets: HotpotQA, MuSiQue, StrategyQA, and DROP. Each dataset contributes between 403 and 625 annotated questions, with two-step question decompositions taken either from human annotations (Break QDMR for HotpotQA and DROP) or from the datasets' native decomposition annotations (MuSiQue, StrategyQA). Additionally, five frontier "thinking" models are evaluated on these datasets plus GSM8K, with decompositions extracted from the model's own chain of thought. No closed datasets are used.
Architecture/Algorithm: The core novel algorithm is "operadic consistency" (OC). It views questions as algebraic operations with blanks, and decompositions as operadic compositions (as per operad theory). For a depth-2 decomposition tree, the model is queried two ways: (a) directly on the complex question; (b) successively on each decomposed sub-question, substituting intermediate answers. The OC signal is computed as a semantic agreement score (dataset-specific metrics like SQuAD-F1, yes/no extraction, or numeric equality) between the final direct answer and the composed answer from sub-steps. The score ranges from 0 to 1, with 1 indicating perfect agreement.
Training Regime: Models are evaluated as-is, with no additional training. For generating CoT self-consistency baselines, K=3 and K=10 temperature samples are drawn at T=0.7, with short answers extracted. Logistic models are fit on combined signals for selective prediction experiments, using leave-one-model-out protocols.
Evaluation Protocol: Correlations between OC and accuracy are computed per model and dataset. Per-question logistic regressions control for CoT-SC, semantic entropy, and decomposition-aware baselines, with cluster-robust standard errors clustered by question. Leave-one-model-out calibration simulates transfer to new, unseen models and compares OC's prediction error against that of CoT-SC. Selective prediction metrics (AUARC, AUROC) evaluate how much combining OC with CoT-SC improves accuracy at fixed coverage and cost. Bootstrap confidence intervals assess statistical significance. The ablations include constructed baselines based on decomposed self-consistency and off-label process reward models, and show that OC adds complementary predictive power beyond them.
Reproducibility: The authors release code at a public GitHub repository and provide detailed appendix documentation of scorers, prompt formulas, and extractor details. The datasets are public multi-hop QA benchmarks. Model weights range from open source (LLaMA variants, Mistral) to closed-source frontier models. Decompositions are from established annotation sources and the authors' own pipeline.
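
The selective prediction metric named in the evaluation protocol can be sketched as follows. This is one common discretization of AUARC (area under the accuracy-rejection curve), assumed here for illustration; the paper's exact definition, tie handling, and the function name `auarc` are not taken from the source.

```python
def auarc(confidences, correct):
    """Mean accuracy over the k most confident answers, averaged over
    k = 1..n. Higher is better: a good confidence signal ranks correct
    answers ahead of incorrect ones."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    # Rank questions from most to least confident (stable on ties).
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    hits = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        hits += int(correct[i])
        total += hits / k  # accuracy when only the top-k answers are kept
    return total / len(order)


# Hypothetical confidences and correctness flags, for illustration only.
well_ranked = auarc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
badly_ranked = auarc([0.1, 0.3, 0.8, 0.9], [1, 1, 0, 0])
print(well_ranked, badly_ranked)
```

A lift in AUARC from adding OC to CoT-SC means the combined confidence orders questions so that abstaining on the least confident ones removes more of the wrong answers.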

Concrete example: For the question "Who was First Lady when World War 2 ended?", the OC check decomposes it into "When did World War 2 end?" followed by "Who was First Lady at [answer to Q1]?". The model is asked the full question directly and is also asked the decomposed steps in sequence. If the direct answer is "Roosevelt" but the decomposed path yields "Truman", the OC score is low, indicating an inconsistency and a potential reasoning failure. The same test applies across datasets and models to flag compositional errors.
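
The check in this example can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the `ask` callable stands in for a model call, the canned answers and the second-hop template string are hypothetical, and the scorer is a simplified SQuAD-style token F1 (the summary says the actual scorer is dataset-specific).

```python
import re
from collections import Counter
from typing import Callable


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]


def token_f1(pred: str, ref: str) -> float:
    """Simplified SQuAD-style token-overlap F1 between two short answers."""
    p, r = normalize(pred), normalize(ref)
    if not p or not r:
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def operadic_consistency(ask: Callable[[str], str], question: str,
                         sub_q1: str, sub_q2_template: str) -> float:
    """Agreement in [0, 1] between the direct answer and the answer
    obtained by answering sub_q1, substituting it into sub_q2, and
    answering that."""
    direct = ask(question)
    intermediate = ask(sub_q1)
    composed = ask(sub_q2_template.format(intermediate))
    return token_f1(direct, composed)


# Hypothetical canned responses standing in for a real model.
canned = {
    "Who was First Lady when World War 2 ended?": "Roosevelt",
    "When did World War 2 end?": "1945",
    "Who was First Lady in 1945?": "Truman",
}
score = operadic_consistency(
    canned.__getitem__,
    "Who was First Lady when World War 2 ended?",
    "When did World War 2 end?",
    "Who was First Lady in {}?",
)
print(score)  # 0.0: the direct and composed answers share no tokens
```

Note that the check needs no gold answer: it only compares two of the model's own outputs, which is what makes the signal label-free.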

Technical innovations

Introduction of operadic consistency (OC), a label-free, per-question signal measuring agreement between a model’s direct answer and its composed answer from sub-questions, grounded in formal operad theory for question decomposition.
Development of the 'questions operad' framework to algebraically formalize compositional structures in natural language question answering, modeling LLMs as algebras over the operad.
Demonstration that OC uniformly correlates highly with accuracy across diverse datasets and models, outperforming established uncertainty baselines like chain-of-thought self-consistency.
Extraction of implicit decompositions from the model’s own chain-of-thought reasoning (thinking models) to extend OC without relying on external annotations, enabling practical deployment.
Empirical evidence that OC signals add complementary predictive power beyond sample diversity, semantic entropy, and decomposition-aware process reward models under rigorous statistical controls.

Datasets

HotpotQA — 403-625 questions — public multi-hop QA with Break QDMR decompositions
MuSiQue — 403-625 questions — public multi-hop QA with native decompositions
StrategyQA — 403-625 questions — public multi-hop QA with native decompositions
DROP — 403-625 questions — public multi-hop QA with Break QDMR decompositions
GSM8K — used only for thinking models — public math reasoning benchmark without two-step decompositions

Baselines vs proposed

Chain-of-thought self-consistency K=10: Pearson r = 0.93 on HotpotQA vs OC r = 0.92
Chain-of-thought self-consistency K=10: Pearson r = 0.87 on DROP vs OC r = 0.86
Chain-of-thought self-consistency K=10: Pearson r ≈ 0.45 on MuSiQue and StrategyQA vs OC r ≈ 0.94 on both
Semantic entropy K=10: Max r = 0.80 on StrategyQA but drops to 0.44 on MuSiQue vs OC > 0.85 on all
P(True) self-evaluation: no significant correlation on any dataset vs OC r ≥ 0.86 consistently
Leave-one-model-out accuracy prediction MAE: OC 3.0 pp vs CoT-SC (K=10) 4.6 pp
Selective prediction AUARC lift over CoT-SC K=3: OC +0.086 to +0.096 across datasets (95% CI excludes zero)
Selective prediction AUROC lift over CoT-SC K=3: OC +0.092 to +0.164 across datasets (95% CI excludes zero)
On thinking models, OC + CoT-SC improves AUARC and AUROC with 95% CIs excluding zero on 12/16 dataset-budget-metric cells
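
For reference, the CoT-SC baseline in the comparisons above turns K sampled answers into a confidence. A common way to do this is majority-vote agreement, sketched below; whether the paper uses exactly this form and this normalization is an assumption, and the sampled answers are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer among K sampled short answers and the
    fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivially different strings count as one vote.
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    return top_answer, votes / len(normalized)


# Hypothetical K=3 samples for one question.
answer, confidence = self_consistency(["Truman", "truman ", "Roosevelt"])
print(answer, confidence)
```

This signal measures stability across samples of the same prompt, whereas OC compares two different routes to the answer, which is why the two can carry complementary information.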

Limitations

Operadic consistency is only instantiated for depth-2 decompositions; generalization to deeper or complex trees is untested in this paper.
Evaluation is limited to multi-hop QA and math reasoning tasks; applicability to other compositional reasoning domains or generative tasks is unexplored.
OC requires either external question decompositions or reliable decomposition extractors, which may not always be available or accurate.
Some datasets/models (e.g., DROP) show weaker or non-significant OC complementarity beyond sample diversity in thinking model evaluations.
Operational cost is higher than single-answer baselines due to multiple model calls needed for the decomposed reasoning path.
The semantic equivalence scorer depends on dataset-specific metrics that may imperfectly capture true reasoning correctness or partial credit.

Open questions / follow-ons

How does operadic consistency scale to deeper or more complex decomposition trees beyond depth-2 chains?
Can automatic decomposition extraction methods be further improved to enhance OC application to arbitrary reasoning tasks without manual annotation?
How well does OC perform on open-domain generative reasoning tasks or in settings with less deterministic answer equivalences?
Would combining OC with calibration or external verification methods further improve reasoning failure detection robustness?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, operadic consistency offers a novel mechanism to detect when compositional reasoning fails in large language models without requiring ground-truth labels. This is valuable since many security-critical verification or challenge-response systems rely on multi-step reasoning to distinguish humans from bots. OC can serve as a per-query confidence signal that complements traditional uncertainty measures based on answer sampling diversity or self-evaluation. Its strong and uniform correlation with accuracy across models and datasets suggests it could reliably flag suspicious or low-confidence LLM responses in multi-hop challenge scenarios. Moreover, the formal operadic framework provides a principled foundation for designing decomposable challenge questions whose internal consistency can be automatically checked at inference time. Though requiring multiple model calls per query, OC-based checks could be integrated with selective prediction or adaptive challenge difficulty tailoring to improve bot detection accuracy with minimal user friction. However, practical deployment would require accessible decomposition extraction pipelines and cost tradeoffs to be balanced carefully.

Cite

@article{arxiv2606_13649,
  title={Operadic consistency: a label-free signal for compositional reasoning failures in LLMs},
  author={Nathaniel Bottman and Yinhong Liu and Kyle Richardson},
  journal={arXiv preprint arXiv:2606.13649},
  year={2026},
  url={https://arxiv.org/abs/2606.13649}
}
