COMPOSE: Composing Future Theorems from Citations and Formal Structure

Source: arXiv:2605.30333 · Published 2026-05-28 · By David Busbib, Michael Werman

TL;DR

This paper addresses the challenge of generating plausible future mathematical claims that are both scientifically motivated and logically valid. The key insight is that future mathematical theorems should be grounded in two complementary contexts: the scientific citation graph that captures research progress and directions, and the formal dependency graph of aligned formal theorems that encodes valid logical derivations. Previous approaches model only one source at a time, resulting in either weakly grounded or insufficiently motivated claims. To resolve this, the authors propose COMPOSE, a dual-graph framework that jointly conditions a language model on both the scientific citation graph and the formal theorem dependency graph. The input consists of an anchor paper’s citation context paired with aligned formal theorems extracted from Lean’s Mathlib. This is encoded through dedicated graph neural networks followed by a fusion module which then conditions a math-specialized LLM decoder to generate future theorem-like claims. To support this task, the authors build a large-scale dataset linking 108K scientific-formal graph pairs from arXiv and Mathlib, along with a benchmark of 47K future papers from 2024–2025 for evaluation.

COMPOSE outperforms multiple strong baselines, including text-only and citation-only models, by a significant margin on metrics evaluating retrieval of relevant future papers and theorem grounding. Its generated outputs exhibit higher precision and alignment to formal and informal future theorems, and receive the best scores in human-like LLM-judge evaluations assessing content, depth, precision, and specificity. Ablations confirm that both scientific and formal graph components contribute complementary information that leads to improved generation quality. The work demonstrates that combining scientific citation structure with formal mathematical dependencies yields more mathematically grounded, plausible, and contextually relevant future claims than prior single-source approaches.

Key findings

COMPOSE achieves the largest retrieval Gap (Tgt-Sim minus Neg-Sim) of 0.240, outperforming all baselines on a 47K future paper benchmark.
COMPOSE attains higher Hits@10 (0.508) and Hits@100 (0.808) than the strongest baselines, meaning it ranks the correct future paper higher.
Dual-graph conditioning leads to NMCP Precision of 0.560 and Match of 0.730, showing strong formal and informal alignment to future theorems.
Ablation removing the scientific graph drops Hits@10 to 0.135; removing the formal graph drops Hits@10 to 0.390, confirming complementarity.
COMPOSE scores 3.36/5 overall on LLM-judge evaluation, leading in content (3.45), depth (3.52), precision (3.23), and specificity (3.52).
Text-only and prompt-only baselines have high target similarity (~0.47) but also high negative similarity (~0.29), yielding smaller Gaps (~0.17) than COMPOSE.
Dual-graph fusion via bidirectional cross-attention improves retrieval compared to encoding graphs independently without fusion.
Performance gains hold consistently across different decoders: DeepSeek-Math 7B and Mistral 7B backbones.

Threat model

The implicit adversary is a mathematical researcher or automated system attempting to generate novel conjectures that follow logically from existing formal results and align with promising scientific directions. The adversary’s capabilities include access to prior papers, formal theorem libraries, and the dual graph structures. The adversary, however, cannot violate formal theorem dependencies by generating logically invalid claims, nor disregard the research trajectory captured by the scientific citation graph. The model aims to mitigate generation of unsound or unmotivated theorems by grounding across both contexts simultaneously.

Methodology — deep read

The authors formulate grounded future mathematical generation as producing a plausible future theorem-like claim for a given anchor paper, conditioned jointly on its scientific citation context and aligned formal mathematical dependencies. The threat model implicitly assumes an adversary would seek future claims that reflect valid logical extensions and research directions; the model aims to prevent generating claims that are either logically invalid or unrelated to prior context.

Data is curated by linking approximately 108K scientific papers from arXiv (2000–2023) with formal theorems from Lean's Mathlib library via an informal-to-informal alignment approach using the FrenzyMath corpus. For each anchor paper, a scientific citation graph is constructed up to two hops, keeping up to 5 cited papers at the first hop and up to 3 at the second, selected by citation-context relevance rather than taken from the full citation lists. The graph nodes include paper abstracts and extracted informal theorems, connected by typed edges: citation, paper-to-theorem, and theorem-to-theorem links. The formal graph consists of the corresponding Mathlib theorems, aligned via dense retrieval (thresholded on similarity), together with their directed dependency edges from Mathlib. Root nodes in the formal graph correspond to matched theorems. The authors also build a temporally held-out benchmark of 47K future papers (2024–2025) for evaluation; the anchor papers cited by these future papers form a 2,000-sample test set.
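The two-hop construction described above can be sketched as a small typed graph. This is a minimal illustration, not the authors' code: the `TypedGraph` class, the `build_citation_graph` function, and the shape of `cited_by_relevance` are hypothetical, and applying the cap to each expanded paper (rather than to the hop as a whole) is an assumption, since the text does not say which is meant.

```python
from dataclasses import dataclass, field

# Edge types named in the text; the string labels themselves are illustrative.
EDGE_TYPES = ("citation", "paper_to_theorem", "theorem_to_theorem")


@dataclass
class TypedGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> {"kind": ..., "text": ...}
    edges: list = field(default_factory=list)  # (src, dst, edge_type) triples

    def add_node(self, node_id, kind, text=""):
        # Keep the first registration of a node; later calls are no-ops.
        self.nodes.setdefault(node_id, {"kind": kind, "text": text})

    def add_edge(self, src, dst, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges.append((src, dst, edge_type))


def build_citation_graph(anchor, cited_by_relevance, per_hop=(5, 3)):
    """Expand `anchor` hop by hop, keeping the top-k most relevant citations.

    cited_by_relevance: hypothetical dict paper_id -> list of (cited_id, relevance).
    per_hop: cap for each hop; (5, 3) mirrors the caps of 5 and 3 quoted in the text.
    Assumption: the cap applies to each expanded paper.
    """
    graph = TypedGraph()
    graph.add_node(anchor, "paper")
    frontier = [anchor]
    for cap in per_hop:
        next_frontier = []
        for paper in frontier:
            ranked = sorted(cited_by_relevance.get(paper, []), key=lambda x: -x[1])
            for cited, _relevance in ranked[:cap]:
                if cited not in graph.nodes:
                    next_frontier.append(cited)
                graph.add_node(cited, "paper")
                graph.add_edge(paper, cited, "citation")
        frontier = next_frontier
    return graph
```

Theorem nodes and the paper-to-theorem and theorem-to-theorem edges would be added with the same `add_node` and `add_edge` calls once theorems have been extracted.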

COMPOSE uses two dedicated graph neural networks with edge-type-specific message passing to encode the scientific graph and the formal graph, respectively. Paper abstract and informal theorem nodes in the scientific graph are initialized with E5 embeddings, while formal theorem nodes are initialized with stronger DeepSeek-Math embeddings derived from theorem signatures. Each GNN applies directional message passing separately for incoming and outgoing edges, combined with gated residual connections and layer normalization to mitigate oversmoothing. The encoder outputs are projected to a shared latent space and combined with learned type embeddings indicating graph origin. A bidirectional cross-attention fusion layer then exchanges information between the scientific and formal graph node embeddings, fusing them into a single set of contextualized node representations.
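A single encoder layer of the kind described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: learned weight matrices are replaced by one scalar per edge type and direction, the gate is a single fixed scalar rather than a learned function of the node state, and the function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    # Normalize one feature vector to zero mean and (approximately) unit variance.
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def mean_vectors(vectors, dim):
    # Element-wise mean; a node with no neighbors receives a zero message.
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def message_passing_layer(h, edges, w_in, w_out, gate_logit=0.0):
    """One directional, edge-typed message-passing step.

    h: dict node -> feature list; edges: list of (src, dst, edge_type).
    w_in / w_out: dict edge_type -> scalar weight (stand-ins for learned matrices),
    applied to messages along incoming and outgoing edges respectively.
    """
    dim = len(next(iter(h.values())))
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid gate in (0, 1)
    incoming = {n: [] for n in h}
    outgoing = {n: [] for n in h}
    for src, dst, etype in edges:
        incoming[dst].append([w_in[etype] * x for x in h[src]])
        outgoing[src].append([w_out[etype] * x for x in h[dst]])
    new_h = {}
    for node, feat in h.items():
        m_in = mean_vectors(incoming[node], dim)
        m_out = mean_vectors(outgoing[node], dim)
        msg = [a + b for a, b in zip(m_in, m_out)]
        # Gated residual: blend the aggregated message with the node's own state.
        mixed = [gate * m + (1.0 - gate) * x for m, x in zip(msg, feat)]
        new_h[node] = layer_norm(mixed)
    return new_h
```

Keeping part of the node's previous state through the gate is what limits oversmoothing: stacked layers cannot fully overwrite a node with the average of its neighborhood.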

A math-specialized pretrained LLM decoder is conditioned on these fused graph embeddings via cross-attention layers inserted periodically (in 20% of layers), with LoRA adaptation for efficient tuning. This allows the decoder to generate theorem-like statements grounded in the joint dual-graph context.
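The summary gives the 20% figure but not the exact layer positions. One plausible reading, sketched here as an assumption, is a fixed stride of one cross-attention block after every fifth decoder layer; the function name is hypothetical and the 30-layer depth in the usage note is only an example.

```python
def cross_attention_layer_indices(num_layers, fraction=0.2):
    """Return 0-based indices of decoder layers that would receive cross-attention.

    Assumption: "periodically" means a constant stride of round(1 / fraction).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    stride = round(1 / fraction)  # fraction 0.2 -> every 5th layer
    return [i for i in range(num_layers) if (i + 1) % stride == 0]
```

For a hypothetical 30-layer decoder, `cross_attention_layer_indices(30)` selects layers 4, 9, 14, 19, 24, and 29, which is six of thirty layers.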

Training proceeds in two stages. Stage 1 optimizes the graph encoders and fusion module with three objectives: (1) link prediction within each graph to capture graph structure, (2) contrastive alignment loss pulling fused graph representations close to the matched target paper's embeddings, and (3) cross-modal alignment between matched informal and formal theorem nodes. Text embeddings are frozen during this stage.
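The contrastive alignment objective in Stage 1 is not spelled out in this summary. A common choice for such objectives is an InfoNCE loss with in-batch negatives, and the sketch below assumes that form; the temperature value and function names are illustrative, not taken from the paper.

```python
import math


def cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(graph_embs, target_embs, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    graph_embs[i] is the fused graph representation whose matched target paper
    embedding is target_embs[i]; all other targets in the batch act as negatives.
    """
    losses = []
    for i, g in enumerate(graph_embs):
        logits = [cosine(g, t) / temperature for t in target_embs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each graph representation is closest to its own target and large when it is closer to another paper's target, which is the "pulling close" behavior the text describes.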

Stage 2 adds the LLM decoder and fine-tunes the full system with an autoregressive generation loss to maximize likelihood of the target future claim, plus a graph margin loss to encourage conditioning on the correct graph rather than a mismatched one, preventing the decoder from ignoring graph inputs.
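The exact form of the graph margin loss is not given here. A minimal sketch, assuming a hinge on the difference between the target's negative log-likelihood under the correct graph and under a mismatched graph, looks as follows; the margin, the weighting, and both function names are hypothetical.

```python
def graph_margin_loss(nll_correct, nll_mismatched, margin=1.0):
    """Hinge penalty that is zero once the correct graph helps by at least `margin`.

    nll_correct: negative log-likelihood of the target claim given its own graph.
    nll_mismatched: the same quantity when the decoder sees another paper's graph.
    """
    return max(0.0, margin + nll_correct - nll_mismatched)


def stage2_loss(nll_correct, nll_mismatched, margin=1.0, weight=0.1):
    # Autoregressive generation loss plus the weighted margin term.
    return nll_correct + weight * graph_margin_loss(nll_correct, nll_mismatched, margin)
```

If the decoder ignored its graph input, both likelihoods would be equal and the penalty would stay at the full margin, so minimizing this term pushes the decoder to actually use the graph.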

Evaluation is multi-faceted: retrieval metrics measure cosine similarity between generated claims and target future papers using a fine-tuned embedding evaluator. Metrics include Tgt-Sim (similarity to the correct future paper), Neg-Sim (to unrelated papers), Gap (Tgt-Sim minus Neg-Sim), Hits@k for recall at rank k, and Exp-Sim for similarity to the broader cluster of related papers. Novel Mathematical Claim Prediction (NMCP) metrics measure alignment precision against formal and informal future theorem targets, as well as the fraction of mathematical tokens generated. An LLM-judge (a GPT model frozen before 2024) provides qualitative scoring of content, depth, novelty, precision, and specificity.
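The retrieval metrics defined above reduce to a few lines of arithmetic. The sketch below follows the definitions in the text (Gap as Tgt-Sim minus Neg-Sim, Hits@k as the fraction of cases where the correct future paper ranks within the top k); the function names and the tie-handling rule are assumptions.

```python
def gap(tgt_sim, neg_sim):
    # Gap = similarity to the correct future paper minus similarity to unrelated papers.
    return tgt_sim - neg_sim


def rank_of_target(sims, target_index):
    """1-based rank of the target among all candidates, by descending similarity.

    Assumption: ties are resolved in favor of the target.
    """
    target = sims[target_index]
    return 1 + sum(1 for s in sims if s > target)


def hits_at_k(ranks, k):
    # Fraction of generated claims whose correct future paper ranks within the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

As a sanity check on the rounded numbers quoted earlier, a target similarity of about 0.47 and a negative similarity of about 0.29 give a Gap of about 0.18, which agrees with the quoted value of roughly 0.17 up to rounding of the inputs.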

Baselines include text-only fine-tuned decoders, models that use only scientific or formal graphs, ablations of cross-attention fusion or stage-1 pretraining, as well as several state-of-the-art published systems for future paper and research direction generation like GIANTS, FutureGen, GoAI, KG-CoI, and ResearchAgent, including GPT-4 API variants.

Reproducibility is partially supported by a public project page, but no explicit code release or frozen weights are mentioned. The dataset construction details and training procedure are extensively documented in the appendix, including graph construction, alignment thresholds, and hyperparameter values.

Concrete example end-to-end: Given an anchor paper, the model retrieves its most relevant cited papers (up to 2 hops), extracts abstracts and theorems, and aligns these to formal Mathlib theorems via FrenzyMath retrieval. The two resulting graphs are encoded independently by GNNs, then fused by bidirectional cross-attention. The fused embeddings condition a math-specialized LLM decoder that autoregressively generates a candidate theorem claim, which is evaluated for similarity to later papers that actually cite the anchor. The pipeline is thus assessed on whether its future claims are simultaneously scientifically plausible and formally grounded.
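The alignment step in this walkthrough, thresholded dense retrieval from informal theorems to formal ones, can be sketched as a nearest-neighbor search with a similarity cutoff. The embeddings, the 0.8 default threshold, and the function names below are placeholders, not values from the paper.

```python
import math


def cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align_to_formal(informal_embs, formal_embs, threshold=0.8):
    """Map each informal theorem to its best formal match, if similar enough.

    informal_embs / formal_embs: dict id -> embedding vector (toy inputs here).
    threshold: placeholder cutoff; pairs below it are left unaligned.
    """
    matches = {}
    for informal_id, informal_vec in informal_embs.items():
        best_id, best_sim = None, -1.0
        for formal_id, formal_vec in formal_embs.items():
            sim = cosine(informal_vec, formal_vec)
            if sim > best_sim:
                best_id, best_sim = formal_id, sim
        if best_id is not None and best_sim >= threshold:
            matches[informal_id] = best_id
    return matches
```

Matched formal theorems would then serve as the root nodes of the formal graph, with Mathlib dependency edges attached from there.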

Technical innovations

Dual-graph encoding that simultaneously models scientific citation context and aligned formal theorem dependencies as complementary sources for future mathematical claim generation.
Bidirectional cross-attention fusion layer that integrates heterogeneous embeddings from scientific and formal graphs into a shared latent representation conditioning the LLM decoder.
Two-stage training combining graph encoder pretraining on link prediction and contrastive alignment with joint graph-conditioned LLM fine-tuning including a graph margin loss.
Leveraging informal-to-informal theorem alignment via FrenzyMath to bridge scientific papers and formal Mathlib theorem dependency graphs for large-scale paired dataset construction.
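The bidirectional cross-attention fusion listed above can be illustrated with a stripped-down version in plain Python. This sketch omits the learned query, key, and value projections, multiple heads, and the type embeddings, so it shows only the information flow: scientific nodes attend over formal nodes, formal nodes attend over scientific nodes, and each side keeps a residual of its own state. All names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same vectors."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(len(q))]
        )
    return attended


def bidirectional_fusion(sci_nodes, formal_nodes):
    """Return one list of fused node vectors: scientific nodes first, then formal."""
    sci_ctx = cross_attend(sci_nodes, formal_nodes)      # science reads formal
    formal_ctx = cross_attend(formal_nodes, sci_nodes)   # formal reads science
    fused_sci = [[x + c for x, c in zip(n, ctx)] for n, ctx in zip(sci_nodes, sci_ctx)]
    fused_formal = [
        [x + c for x, c in zip(n, ctx)] for n, ctx in zip(formal_nodes, formal_ctx)
    ]
    return fused_sci + fused_formal
```

The returned list plays the role of the single set of contextualized node representations that the decoder's cross-attention layers would read.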

Datasets

Paired arXiv-Mathlib dataset — 108K paired scientific-formal graph examples — constructed from arXiv mathematics papers (2000–2023) aligned to Lean's Mathlib via FrenzyMath.
Future paper benchmark — 47K mathematical papers — from late 2024–2025 arXiv subset for temporally held-out evaluation.

Baselines vs proposed

Paper-graph-only: Gap = 0.164 vs COMPOSE: Gap = 0.240 (DeepSeek-Math 7B decoder)
Bag-of-Papers (no graph): Gap = 0.128 vs COMPOSE: Gap = 0.240
Text-only (LoRA decoder): Gap = 0.176 vs COMPOSE: Gap = 0.240
Prompt-only: Gap = 0.174 vs COMPOSE: Gap = 0.240
Fixed NN retrieval baseline: Gap = 0.108 vs COMPOSE: Gap = 0.240
GIANTS: Tgt-Sim = 0.489 vs COMPOSE: Tgt-Sim = 0.525 (47K future paper benchmark)
GoAI: Gap = 0.202 vs COMPOSE: Gap = 0.240
CoI-GPT: Gap = 0.176 vs COMPOSE: Gap = 0.240
Ablation removing scientific graph drops Hits@10 to 0.135 vs COMPOSE 0.508
Ablation removing formal graph drops Hits@10 to 0.390 vs COMPOSE 0.508
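To make the comparisons above easier to read, the margins can be computed directly from the listed values. The numbers in this sketch are copied from the list; the helper names are illustrative.

```python
# Values copied from the comparison list above.
COMPOSE_GAP = 0.240
BASELINE_GAPS = {
    "Paper-graph-only": 0.164,
    "Bag-of-Papers (no graph)": 0.128,
    "Text-only (LoRA decoder)": 0.176,
    "Prompt-only": 0.174,
    "Fixed NN retrieval": 0.108,
    "GoAI": 0.202,
    "CoI-GPT": 0.176,
}
COMPOSE_HITS10 = 0.508
ABLATION_HITS10 = {"no scientific graph": 0.135, "no formal graph": 0.390}


def gap_margins(compose_gap, baseline_gaps):
    # Absolute Gap advantage of COMPOSE over each baseline, rounded to 3 decimals.
    return {name: round(compose_gap - g, 3) for name, g in baseline_gaps.items()}


def relative_drop(full, ablated):
    # Fraction of the full model's score that is lost in the ablation.
    return (full - ablated) / full


if __name__ == "__main__":
    for name, margin in gap_margins(COMPOSE_GAP, BASELINE_GAPS).items():
        print(f"{name}: +{margin:.3f} Gap")
    for name, score in ABLATION_HITS10.items():
        print(f"{name}: {relative_drop(COMPOSE_HITS10, score):.1%} relative drop in Hits@10")
```

On these figures, removing the scientific graph costs roughly 73% of Hits@10 in relative terms, while removing the formal graph costs roughly 23%, so the citation context carries most of the retrieval signal and the formal graph adds a smaller but clear gain.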

Limitations

Informal-to-formal theorem alignment is approximate, introducing noise into the formal graph encoder and impacting grounding precision.
Generated future claims are not formally verified by a proof assistant, so correctness relies on proxy metrics like retrieval and LLM-judge evaluation rather than formal proof.
Benchmark relies on automatic main-claim extraction from future papers, which may be noisy for papers with multiple or nonstandard theorem statements.
The system currently generates single-step future claims and does not perform iterative conjecture generation or proof attempts.
Heavy dependence on Lean's Mathlib and FrenzyMath limits applicability to other theorem libraries or domains without similar formal datasets.
Reproducibility is limited by the lack of an explicit public release of the full code and pretrained weights.

Open questions / follow-ons

How to integrate formal verification and proof assistant feedback directly into the generation loop to guarantee formal correctness of generated claims?
Can the approach be extended to iterative multi-step conjecture generation coupled with interactive proof attempts?
What is the effect of scaling up graph context beyond two citation hops or incorporating broader mathematical knowledge graphs?
Could the informal-to-formal alignment methods be improved to cover more diverse and complex theorems, reducing noise in formal graphs?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective focused on robustness and advanced reasoning tasks, COMPOSE exemplifies a high-complexity dual-modality conditioning approach that integrates structured domain knowledge (formal theorem dependencies) with contextual scientific evolution (citation graphs) to generate content that is both plausible and logically consistent. For bot-defense practitioners, this highlights the emerging need for defenses that can understand and verify not only superficial language patterns but also underlying logical and relational structures representing domain knowledge. The approach of dual-graph conditioning and fusion with LLM decoders could inspire advanced challenge designs where bots must reason about structured knowledge graphs and logical constraints simultaneously.

Conversely, CAPTCHA or bot-detection systems might leverage insights from COMPOSE’s fusion of structural and contextual signals to better spot AI-generated content that lacks logical grounding or domain coherence. The evaluation protocols measuring alignment to formal structure and citation context suggest new metrics for verifying content authenticity. While COMPOSE operates in mathematics, the principles of formally constraining language generation via aligned knowledge graphs are broadly applicable for designing bot defenses involving domain-expert level logical consistency.

Cite

bibtex

@article{arxiv2605_30333,
  title={COMPOSE: Composing Future Theorems from Citations and Formal Structure},
  author={David Busbib and Michael Werman},
  journal={arXiv preprint arXiv:2605.30333},
  year={2026},
  url={https://arxiv.org/abs/2605.30333}
}
