Operads for compositional reasoning in LLMs

Source: arXiv:2606.13634 · Published 2026-06-11 · By Nathaniel Bottman, Kyle Richardson

TL;DR

This paper introduces operads, well-studied mathematical structures from algebraic topology and category theory, as a rigorous framework for modeling question decomposition in large language models (LLMs). Question decomposition, a key strategy to improve multi-step reasoning in LLMs, involves breaking complex queries into simpler sub-queries whose answers are composed to form the final output. Despite its widespread use, this process lacked a precise formal basis. The authors define the questions operad Q, where k-ary operations correspond to question templates with k blanks, and composition corresponds to substituting sub-answers into blanks. They show that a question-answering model can be interpreted as an algebra over Q, mapping composed questions to answers consistently with operadic composition. Beyond framing existing methods, this operadic viewpoint enables new concepts such as operadic consistency—a measure of whether a model’s answers agree across different partial compositions of the question decomposition tree. Their companion paper empirically validates this concept, showing operadic consistency correlates strongly with accuracy across 12 LLMs and 4 multi-hop QA datasets, outperforming standard temperature-based self-consistency baselines. The paper thus provides a mathematically principled language to reason about and improve compositional reasoning in LLMs.

Key findings

Question decomposition can be formalized as a colored operad Q, with operations as question templates and composition as substitution of answers into blanks.
QA models can be interpreted as algebras over the questions operad Q, mapping composed questions to values consistently.
Operadic consistency is defined as agreement of a QA model’s answers across all partial collapses (partial compositions) of a question decomposition tree.
The companion empirical paper shows operadic consistency is strongly correlated with accuracy across 12 instruction-tuned LLMs and 4 multi-hop QA datasets (exact quantitative correlation not provided here).
Operadic consistency outperforms standard temperature-based self-consistency methods for selective prediction at equal inference cost.
The associativity and commutativity axioms of operads ensure that the order in which compositions are carried out does not affect the final composed question, a property used in reasoning about correctness and consistency.
The paper connects operadic ambiguity and algebraic structures to classical formal language theory, showing that operads subsume CFG derivations and enable richer algebraic treatment of ambiguity.
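
For reference, the partial composition, associativity, and algebra conditions mentioned in the findings above can be written out in their standard uncolored form. This is the textbook formulation of operads via partial compositions, not a transcription of the paper's definitions: the symbols \alpha, V, p, q, r are notation chosen here, and the paper's colored version additionally requires the answer type of the plugged-in question to match the type of the blank.

```latex
% Partial composition: plug an l-ary question into the i-th blank of a k-ary one.
\[
  \circ_i : Q(k) \times Q(l) \to Q(k + l - 1), \qquad 1 \le i \le k .
\]

% Associativity, for p \in Q(k), q \in Q(l), r \in Q(m).
\begin{align*}
  (p \circ_i q) \circ_{i + j - 1} r &= p \circ_i (q \circ_j r),
    && 1 \le i \le k,\ 1 \le j \le l, \\
  (p \circ_j r) \circ_i q &= (p \circ_i q) \circ_{j + l - 1} r,
    && 1 \le i < j \le k .
\end{align*}

% Algebra compatibility for evaluation maps \alpha : Q(k) \times V^k \to V.
\begin{align*}
  &\alpha(p \circ_i q)(v_1, \dots, v_{i-1}, w_1, \dots, w_l, v_{i+1}, \dots, v_k) \\
  &\qquad = \alpha(p)\bigl(v_1, \dots, v_{i-1}, \alpha(q)(w_1, \dots, w_l), v_{i+1}, \dots, v_k\bigr).
\end{align*}
```

The first associativity law covers nested substitution (r goes into a blank of q, which goes into a blank of p); the second covers substitution into two different blanks of p. The last equation says that answering a composed question should give the same value as answering the inner question first and substituting its answer.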

Methodology — deep read

The paper approaches question decomposition through the formalism of operads, a mathematical structure originally from algebraic topology that models operations with multiple inputs and a single output and their compositions.

Threat Model & Assumptions: The paper assumes a setting where complex questions are decomposed into simpler sub-questions arranged in a tree structure. It does not model an adversary or specific ML capabilities, focusing on formal correctness and consistency rather than security threat modeling.
Data: No new dataset is introduced. The paper models questions and decompositions abstractly as elements of sets Q(k) of k-ary question templates with typed answer slots (colors). Examples are drawn from multi-hop QA tasks.
Architecture / Algorithm:

Defines the questions operad Q as a colored operad where Q(k) is the set of questions with k blanks. Operations compose by plugging a question with l blanks into one blank of a question with k blanks, which yields a question of arity k + l - 1.
Provides examples like composing "Who was -'s wife?" with "Who was President at -?" to form nested questions.
A QA model m induces a Q-algebra Vm: given a question q with k blanks and k values v1,...,vk, it fills the blanks with the values to produce a rendered question and runs m on that question to obtain the answer.
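
The composition rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's construction: the `Template` class, its `compose` and `fill` methods, and the choice to render a plugged-in question inside square brackets are all hypothetical, and the typed blanks (colors) are ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Template:
    """Hypothetical question template, stored as the text pieces around its blanks.

    A template with k blanks has k + 1 pieces, so ("Who was ", "'s wife?")
    stands for the unary template "Who was -'s wife?".
    """
    parts: Tuple[str, ...]

    @property
    def arity(self) -> int:
        return len(self.parts) - 1

    def compose(self, i: int, other: "Template") -> "Template":
        """Partial composition: plug `other` into the i-th blank (1-indexed)."""
        if not 1 <= i <= self.arity:
            raise ValueError(f"blank index {i} out of range 1..{self.arity}")
        inner = list(other.parts)
        # Assumption of this sketch: the nested question is shown in brackets.
        inner[0] = self.parts[i - 1] + "[" + inner[0]
        inner[-1] = inner[-1] + "]" + self.parts[i]
        return Template(self.parts[: i - 1] + tuple(inner) + self.parts[i + 1:])

    def fill(self, *values: str) -> str:
        """Substitute one value per blank and return the rendered question."""
        if len(values) != self.arity:
            raise ValueError(f"expected {self.arity} values, got {len(values)}")
        text = self.parts[0]
        for value, piece in zip(values, self.parts[1:]):
            text += value + piece
        return text


wife = Template(("Who was ", "'s wife?"))               # arity 1
president = Template(("Who was President at ", "?"))    # arity 1
ww2_end = Template(("When did World War 2 end?",))      # arity 0

composed = wife.compose(1, president)                   # arity 1 + 1 - 1 = 1
print(composed.fill("the end of World War 2"))
```

Storing the pieces around the blanks, rather than a string with a placeholder character, keeps the arity bookkeeping exact: a k-ary template composed with an l-ary one always has k + l - 1 blanks, and nested composition gives the same template in either bracketing order.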

Training Regime: Not applicable since this is a mathematical formalism paper. The companion empirical paper trains/evaluates LLMs.
Evaluation Protocol: The key concept introduced is operadic consistency — the QA model's agreement across partial collapses (subsets of composed sub-questions) of a question tree. The companion paper measures correlation of operadic consistency with accuracy across 12 LLMs and 4 multi-hop QA datasets, comparing to baselines including temperature-based self-consistency.
Reproducibility: The formal definitions and examples are given in detail, but no code or datasets are released with this paper. The empirical validation is in their companion paper.

For a concrete example, consider the question “Who was First Lady when World War 2 ended?”, decomposed into a chain of three questions: Q1: “When did World War 2 end?”; Q2: “Who was President at [A1]?”; Q3: “Who was [A2]’s wife?”, where [A1] and [A2] stand for the answers to Q1 and Q2. The composed question is Q3 ◦1 Q2 ◦1 Q1. The QA model is run on partial collapses, for example with only Q1 answered, or with Q1 and Q2 composed into a single question. Operadic consistency measures whether the final answers agree regardless of the order and degree of partial composition.

This example illustrates operad composition, QA model algebra, and the definition of operadic consistency as answer agreement across partial question compositions.
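
The following Python sketch works through this example with a toy QA model. It rests on several assumptions that are not taken from the paper: a "partial collapse" is read as a choice, for each edge of the chain, between keeping the lower question nested in the prompt and answering it first; the score is the fraction of collapses that agree with the most common final answer; and `toy_model`, `TOY_ANSWERS`, and the bracket rendering are hypothetical. The toy model is deliberately wrong on the fully composed question so that the score is below 1.

```python
from collections import Counter
from itertools import product

# Chain Q1, Q2, Q3 from the example; "{}" marks the single blank.
CHAIN = [
    "When did World War 2 end?",
    "Who was President at {}?",
    "Who was {}'s wife?",
]

# Hypothetical lookup table standing in for an LLM.
TOY_ANSWERS = {
    "When did World War 2 end?": "1945",
    "Who was President at 1945?": "Harry Truman",
    "Who was President at [When did World War 2 end?]?": "Harry Truman",
    "Who was Harry Truman's wife?": "Bess Truman",
    "Who was [Who was President at 1945?]'s wife?": "Bess Truman",
    # Deliberate error on the fully composed question.
    "Who was [Who was President at [When did World War 2 end?]?]'s wife?":
        "Eleanor Roosevelt",
}


def toy_model(prompt: str) -> str:
    return TOY_ANSWERS[prompt]


def answer_under_collapse(chain, merged, model):
    """Final answer for one collapse pattern.

    merged[j] is True if the edge between chain[j] and chain[j + 1] is kept
    as a nested question, and False if the lower part is answered first and
    its answer is substituted into the blank.
    """
    current = chain[0]
    for template, keep_nested in zip(chain[1:], merged):
        filler = "[" + current + "]" if keep_nested else model(current)
        current = template.format(filler)
    return model(current)


def operadic_consistency(chain, model):
    """Fraction of collapse patterns agreeing with the most common answer."""
    patterns = product([False, True], repeat=len(chain) - 1)
    answers = {m: answer_under_collapse(chain, m, model) for m in patterns}
    top_answer, top_count = Counter(answers.values()).most_common(1)[0]
    return top_count / len(answers), top_answer, answers


score, majority, by_pattern = operadic_consistency(CHAIN, toy_model)
print(score, majority)
```

With two edges there are four collapse patterns. Three of them reach "Bess Truman" and the fully composed prompt does not, so the toy score is 0.75. A chain with n questions has 2^(n-1) patterns under this reading, which is one reason a practical metric might sample patterns rather than enumerate them.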

Technical innovations

Formalization of question decomposition as a colored operad Q, capturing typed question templates and composition as substitution.
Interpretation of QA models as algebras over the questions operad, linking model answer computation to operadic composition.
Definition of operadic consistency, a novel consistency metric describing when model answers agree across all partial compositions of question decomposition trees.
Connection and extension of classical formal language theory (CFGs, derivation trees) to operadic frameworks and their algebras, enabling algebraic treatment of ambiguity and compositionality.
Identification of operadic structures as the natural mathematical home for multi-step reasoning and compositional question answering.

Baselines vs proposed

Operadic consistency vs temperature-based self-consistency: Operadic consistency yields better selective-prediction performance at the same inference cost on multi-hop QA datasets (companion paper results).
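
Selective prediction, the setting of this comparison, can be sketched as follows: the model answers only when a confidence score clears a threshold, and one reports coverage (the fraction of questions answered) alongside accuracy on the answered subset. The function name `selective_accuracy` and the records below are hypothetical and invented purely for illustration; they are not results from the companion paper.

```python
from typing import List, Optional, Tuple


def selective_accuracy(
    records: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, Optional[float]]:
    """Coverage and accuracy when abstaining below `threshold`.

    Each record is (confidence score, whether the answer was correct).
    The score could be an operadic-consistency score or a self-consistency
    vote share; this function does not care which.
    """
    kept = [correct for score, correct in records if score >= threshold]
    coverage = len(kept) / len(records)
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


# Invented toy records, for illustration only.
toy_records = [
    (1.0, True), (1.0, True), (0.75, True),
    (0.75, False), (0.5, False), (0.25, False),
]

for threshold in (0.0, 0.75, 1.0):
    print(threshold, selective_accuracy(toy_records, threshold))
```

A better confidence signal is one whose accuracy rises faster as the threshold increases and coverage falls. Comparing two signals at equal inference cost means giving each the same number of model calls per question.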

Limitations

The main paper is a theoretical and mathematical formalization without empirical evaluation; empirical claims are deferred to a companion paper.
Operadic consistency relies on explicit question decomposition trees whose extraction or annotation can be challenging and requires further automation.
The paper abstracts away real-world noise, partial observability, and ambiguity in natural language question decomposition; semantic equivalences in composition are assumed but not algorithmically guaranteed.
Operadic consistency may not capture all forms of reasoning errors, especially those unrelated to composition order or partial collapses.
The framework assumes typed blanks and compositional structure that may not straightforwardly map to all question answering tasks or open-domain queries.
No adversarial or robustness evaluation is performed; operadic consistency is not yet validated on adversarial or out-of-distribution queries.

Open questions / follow-ons

How to automatically extract operadic question decomposition trees from raw natural language inputs and model reasoning traces?
Can operadic consistency be integrated into training or prompting to actively improve model compositional reliability?
What other algebraic invariants over the questions operad (e.g., cohomological measures) can characterize model reasoning failure modes?
How well does operadic consistency extend to more general reasoning formalisms beyond tree-like question decompositions, such as DAGs or programs?

Why it matters for bot defense

Question decomposition is a foundational technique in improving multi-step reasoning with LLMs, a capability closely related to detecting and resisting automated misuse in bot defense. This paper provides a rigorous algebraic foundation for question decomposition, facilitating principled analysis of how complex queries are broken down and composed. For CAPTCHA and bot-defense practitioners, understanding operadic consistency offers a novel, model-agnostic signal that could audit or improve the reliability of multi-step query answering — reducing errors in challenging multi-hop settings where bots may rely on language models.

Moreover, the formalism may inspire new defense mechanisms by measuring internal consistency of sub-steps in challenge-response scenarios, or identifying inconsistencies symptomatic of automation rather than human reasoning. While the paper is theoretical, its companion empirical validation demonstrates the practical utility of operadic consistency metrics. Overall, the operadic framework enriches the theoretical toolkit available for rigorous compositional evaluations relevant to bot and CAPTCHA challenges involving language understanding and reasoning.

Cite

bibtex

@article{arxiv2606_13634,
  title={Operads for compositional reasoning in LLMs},
  author={Nathaniel Bottman and Kyle Richardson},
  journal={arXiv preprint arXiv:2606.13634},
  year={2026},
  url={https://arxiv.org/abs/2606.13634}
}
