Doc-to-Atom: Learning to Compile and Compose Memory Atoms
Source: arXiv:2606.12400 · Published 2026-06-10 · By Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi
TL;DR
Doc-to-Atom (Doc2Atom) targets the limitations of prior document internalization approaches such as Doc-to-LoRA, which compress an entire document into a single monolithic low-rank adapter. Because all content must fit into fixed-size parameters that are applied uniformly to every query, these methods suffer from irrelevant-query interference, limited compositional recall of heterogeneous facts, and poor scalability to long documents. Doc2Atom instead decomposes each document into semantically typed "knowledge atoms," each compiled into an independent micro-LoRA adapter with a retrieval key. At inference, a lightweight learned query router selects the relevant atoms and composes a query-specific adapter that is injected into a frozen base language model. This compositional parametric memory, trained end-to-end via multi-objective distillation, improves accuracy, refuses out-of-scope queries more reliably, and reduces memory overhead for document internalization. Experiments on six QA benchmarks show Doc2Atom outperforming the Doc-to-LoRA baselines on overall retrieval and reasoning quality, especially on long and multi-hop QA datasets, while making access to the stored knowledge more interpretable and query-adaptive. The approach sits between in-context learning and parameter compression, using modular, scalable, and selectively activated memory units.
Key findings
- Doc2Atom improves overall F1 by 8.58 points over the retrained Doc-to-LoRA baseline (D2Latom) on Gemma-2-2B-It across six QA datasets (37.99 vs 29.41).
- On the Qwen3-4B-Instruct base model, Doc2Atom outperforms D2Latom by 6.42 F1 points overall (35.72 vs 29.30).
- Atomization alone improves Doc-to-LoRA training (D2Lraw → D2Latom) by +0.86 F1 on Gemma and +0.45 F1 on Qwen3, a small but consistent gain attributable to document decomposition.
- Doc2Atom refusal F1 scores on irrelevant queries exceed 85% on Gemma-2-2B-It and remain highest across all datasets tested, compared to less than 40% for D2L baselines.
- Doc2Atom requires on average 44% less additional GPU memory than the D2L baseline on Gemma-2-2B-It (13.43 GB vs 24.02 GB) and 85% less on Qwen3-4B-Instruct (5.02 GB vs 34.62 GB) when compiling a document into memory.
- Compile/update latency for Doc2Atom is higher than for D2L (2.10s vs 0.31s on Gemma), but this step is offline and decoupled from inference.
- Doc2Atom’s two-stage learned query router with provenance keys enables selective retrieval and dynamic composition of relevant atoms per query, reducing interference.
- Multi-objective training with routing supervision, irrelevant query suppression, and knowledge protection losses improves robustness and prevents overwriting base model knowledge.
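The headline deltas in the list above follow directly from the reported raw numbers. A minimal arithmetic check, using only the figures quoted in this summary (the dictionary layout and function names are illustrative, not from the paper):

```python
# Overall F1 and additional GPU memory (GB) as quoted in this summary.
f1 = {
    "gemma": {"d2l_raw": 28.55, "d2l_atom": 29.41, "doc2atom": 37.99},
    "qwen3": {"d2l_raw": 28.85, "d2l_atom": 29.30, "doc2atom": 35.72},
}
mem_gb = {
    "gemma": {"d2l": 24.02, "doc2atom": 13.43},
    "qwen3": {"d2l": 34.62, "doc2atom": 5.02},
}


def f1_gain(model: str, baseline: str, system: str) -> float:
    """Absolute F1 difference (system minus baseline), rounded to 2 decimals."""
    return round(f1[model][system] - f1[model][baseline], 2)


def memory_reduction_pct(model: str) -> int:
    """Relative memory saving of Doc2Atom over D2L, as a whole percent."""
    d2l = mem_gb[model]["d2l"]
    d2a = mem_gb[model]["doc2atom"]
    return round(100 * (d2l - d2a) / d2l)


print(f1_gain("gemma", "d2l_atom", "doc2atom"))  # Doc2Atom vs D2Latom, Gemma
print(f1_gain("qwen3", "d2l_atom", "doc2atom"))  # Doc2Atom vs D2Latom, Qwen3
print(f1_gain("gemma", "d2l_raw", "d2l_atom"))   # atomization alone, Gemma
print(f1_gain("qwen3", "d2l_raw", "d2l_atom"))   # atomization alone, Qwen3
print(memory_reduction_pct("gemma"), memory_reduction_pct("qwen3"))
```

The same numbers put the Gemma compile latency gap at roughly 6.8x (2.10s / 0.31s), which is the cost paid offline for the memory and accuracy gains.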
Threat model
The adversary is a user issuing arbitrary queries to the internalized document memory system who cannot modify the frozen base model or the compiled atom representations. The model assumes no adversarial parameter tampering or poisoning. The goal is to minimize irrelevant-query interference and hallucination by routing only relevant atoms, and to reliably refuse queries outside the document scope.
Methodology — deep read
The system aims to produce query-specific parametric memory that activates only the relevant document knowledge, without interference or hallucination. As in the threat model, the adversary can query the internalized document knowledge but cannot alter the frozen base model or the atom compilation process.
Data comes from six diverse QA benchmarks including SQuAD, DROP, ROPES, 2WikiMultiHopQA, QASPER, and multiple LongBench zero-shot task subsets. Each document is atomized offline via an LLM-driven semantic annotation process dividing the context into minimal, semantically typed knowledge atoms labeled as fact claims, entity attributes, event relations, process steps, or evidence fragments. Each atom is enriched with metadata such as provenance spans, answer-bearing flags, abstraction levels, conflict groups, and confidence scores. The question pool is augmented with synthetically generated irrelevant probes to train refusal behavior.
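To make the atom annotation concrete, here is a minimal sketch of what one atom record could look like. The five type labels and the metadata fields come from the description above; the class names, field names, and the choice of character offsets for provenance spans are assumptions made for illustration, not the paper's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class AtomType(Enum):
    # The five semantic types listed in the summary.
    FACT_CLAIM = "fact_claim"
    ENTITY_ATTRIBUTE = "entity_attribute"
    EVENT_RELATION = "event_relation"
    PROCESS_STEP = "process_step"
    EVIDENCE_FRAGMENT = "evidence_fragment"


@dataclass(frozen=True)
class KnowledgeAtom:
    atom_id: str
    text: str
    atom_type: AtomType
    provenance_span: Tuple[int, int]  # assumed: [start, end) character offsets
    answer_bearing: bool
    abstraction_level: int            # assumed: 0 = most literal
    conflict_group: Optional[str]     # atoms sharing a group are alternatives
    confidence: float                 # annotation confidence in [0, 1]

    def __post_init__(self) -> None:
        start, end = self.provenance_span
        if not 0 <= start < end:
            raise ValueError("provenance_span must satisfy 0 <= start < end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def span_text(document: str, atom: KnowledgeAtom) -> str:
    """Return the source text that the atom's provenance span points to."""
    start, end = atom.provenance_span
    return document[start:end]


document = "Marie Curie won the Nobel Prize in Physics in 1903. She was born in Warsaw."
claim = "Marie Curie won the Nobel Prize in Physics in 1903."
start = document.index(claim)
atom = KnowledgeAtom(
    atom_id="a1",
    text=claim,
    atom_type=AtomType.FACT_CLAIM,
    provenance_span=(start, start + len(claim)),
    answer_bearing=True,
    abstraction_level=0,
    conflict_group=None,
    confidence=0.95,
)
print(span_text(document, atom))
```

Keeping the provenance span on the record means any routed atom can be traced back to the exact passage it was compiled from, which is what makes the selection inspectable.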
The core architecture consists of: (1) a shared atom encoder reusing early layers of the frozen base LLM to produce atom embeddings; (2) an atom-level memory compiler neural network that generates per-atom micro-LoRA low-rank adapter factors, provenance keys for retrieval, optional memory key-value prototypes, and sparse masks controlling per-layer write gates; (3) a two-stage learned query router that first retrieves candidate atoms via cosine similarity on provenance keys then optionally reranks them using a frozen cross-encoder incorporating metadata biases; (4) a memory composer that aggregates weighted micro-LoRA factors via routing weights and applies sparse gating to produce a query-specific LoRA adapter injected into the frozen base LLM's designated last layers at runtime.
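The two-stage router in component (3) can be sketched as follows. This is a toy illustration under stated assumptions: provenance keys are plain float vectors, the frozen cross-encoder is replaced by a caller-supplied scoring function, the metadata bias is a simple additive term, and routing weights are obtained with a softmax. None of these details are confirmed by the source beyond "cosine similarity, then optional reranking with metadata biases".

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Scored = List[Tuple[str, float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_key: Sequence[float],
             atom_keys: Dict[str, Sequence[float]],
             top_k: int) -> Scored:
    """Stage 1: rank atoms by cosine similarity between query and provenance key."""
    scored = [(atom_id, cosine(query_key, key)) for atom_id, key in atom_keys.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(candidates: Scored,
           rerank_score: Callable[[str], float],
           metadata_bias: Dict[str, float],
           keep: int) -> Scored:
    """Stage 2: rescore candidates with a stand-in reranker plus an additive bias."""
    rescored = [(atom_id, rerank_score(atom_id) + metadata_bias.get(atom_id, 0.0))
                for atom_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:keep]


def routing_weights(scored: Scored) -> Dict[str, float]:
    """Softmax over final scores (an assumed normalisation)."""
    peak = max(score for _, score in scored)
    exps = {atom_id: math.exp(score - peak) for atom_id, score in scored}
    total = sum(exps.values())
    return {atom_id: value / total for atom_id, value in exps.items()}


# Hypothetical keys and reranker scores, for illustration only.
atom_keys = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.9, 0.1, 0.0],
    "a3": [0.0, 1.0, 0.0],
    "a4": [0.0, 0.0, 1.0],
}
stand_in_scores = {"a1": 0.2, "a2": 0.9, "a3": 0.1}

candidates = retrieve([1.0, 0.05, 0.0], atom_keys, top_k=3)
selected = rerank(candidates, stand_in_scores.__getitem__, {"a1": 0.1}, keep=2)
weights = routing_weights(selected)
print([atom_id for atom_id, _ in candidates])
print(weights)
```

The cheap first stage keeps the candidate set small, so the more expensive reranker only sees a handful of atoms per query.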
Training uses a multi-objective end-to-end distillation scheme. The frozen teacher model processes the full context and query, producing target logits. The student model sees only the query plus the injected composed LoRA adapter and aims to reproduce the teacher logits. Losses include cross entropy and KL distillation on answer tokens; weighted binary cross-entropy routing loss supervised by gold/supporting/distractor atoms; norm penalties on adapter factors for irrelevant queries to suppress activation; knowledge protection L2 or null-space constraints preventing interference with pre-existing base LLM capabilities; sparse regularization encouraging selective gating; symmetric KL divergence loss enforcing composition consistency when combining multiple atoms; and auxiliary losses enforcing conflict group exclusivity and alignment with annotation confidence. A curriculum training schedule gradually increases the number of retrieved atoms and progressively activates the robustness and regularization losses.
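Three of the loss terms listed above can be sketched in plain Python: KL distillation from teacher to student, the weighted binary cross-entropy routing loss, and the adapter norm penalty used on irrelevant queries. The KL direction (teacher relative to student), the per-atom weighting scheme, and the lambda coefficients are assumptions for illustration; the remaining terms (knowledge protection, sparsity, composition consistency, conflict exclusivity) are omitted.

```python
import math
from typing import Dict, Sequence


def softmax(logits: Sequence[float]) -> list:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits: Sequence[float],
                      student_logits: Sequence[float]) -> float:
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


def routing_bce(probs: Sequence[float], labels: Sequence[int],
                weights: Sequence[float], eps: float = 1e-12) -> float:
    """Weighted BCE: label 1 for gold/supporting atoms, 0 for distractors."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


def adapter_norm_penalty(factors: Sequence[Sequence[float]]) -> float:
    """Squared Frobenius norm; applied only when the query is irrelevant."""
    return sum(x * x for row in factors for x in row)


def total_loss(terms: Dict[str, float], lambdas: Dict[str, float]) -> float:
    return sum(lambdas[name] * value for name, value in terms.items())


# Toy values; the lambda coefficients are hypothetical.
terms = {
    "distill": distillation_loss([2.0, 0.0, -1.0], [0.0, 2.0, -1.0]),
    "routing": routing_bce([0.9, 0.8, 0.1], [1, 1, 0], [2.0, 1.0, 1.0]),
    "suppress": adapter_norm_penalty([[3.0, 4.0]]),
}
lambdas = {"distill": 1.0, "routing": 0.5, "suppress": 0.01}
loss = total_loss(terms, lambdas)
print(terms, loss)
```

Under a curriculum like the one described, the non-distillation coefficients would start at zero and be raised over training, though the actual schedule is not given in this summary.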
At inference, query tokens are encoded, top-K relevant atoms retrieved and reranked, micro-LoRA factors composed, sparsely gated, and injected. Memory key-value prototypes are optionally provided as prefixes. The base model then generates the answer without reprocessing the full document text.
One concrete example: for a multi-hop question over a long document, the document is atomized into 20 self-contained fact and relation atoms. The query router retrieves 3 relevant atoms based on provenance keys and ontology priors. The memory composer sums their micro-LoRA factors weighted by routing scores and applies sparse gating. This query-specific adapter is injected into the frozen base model’s last layers. The base model thus produces an answer consistent with the document without attending to all original tokens, achieving better accuracy and lower overhead than a monolithic adapter covering the entire text.
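The composition step in this example can be written as a weighted sum of low-rank updates, delta_W = sum_i w_i * g_i * (B_i A_i). The sketch below assumes each atom contributes one (B, A) factor pair for a single weight matrix and that the gate is a per-atom binary value; the source describes gating per layer and module, so this is a simplification with hypothetical names.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def compose_delta(atoms: Sequence[Tuple[Matrix, Matrix]],
                  weights: Sequence[float],
                  gates: Sequence[int]) -> Matrix:
    """Sum w_i * g_i * (B_i @ A_i) over the routed atoms."""
    rows = len(atoms[0][0])       # output dimension (rows of B)
    cols = len(atoms[0][1][0])    # input dimension (columns of A)
    delta = [[0.0] * cols for _ in range(rows)]
    for (b_factor, a_factor), weight, gate in zip(atoms, weights, gates):
        if gate == 0:
            continue  # a closed gate means this atom writes nothing here
        update = matmul(b_factor, a_factor)
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += weight * gate * update[i][j]
    return delta


# Three rank-1 atoms acting on a 2x2 weight matrix (toy values).
atoms = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),
    ([[0.0], [1.0]], [[0.0, 2.0]]),
    ([[1.0], [1.0]], [[1.0, 1.0]]),
]
routing = [0.5, 0.25, 0.25]
gates = [1, 1, 0]  # the third atom is gated off for this layer

delta = compose_delta(atoms, routing, gates)
print(delta)
```

Because the adapter is assembled per query from a few small factors, unrelated atoms contribute nothing to the update, which is the mechanism the summary credits for reduced interference.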
Code release and full training details are referenced in appendices but not explicitly confirmed as public at time of writing. Data treatment involves multiple public QA datasets plus synthetic irrelevant probes. Evaluation includes zero-shot testing on out-of-domain tasks.
Technical innovations
- Decomposition of documents into semantically typed atomic knowledge units internally compiled into micro-LoRA adapters allowing fine-grained compositional parametric memory.
- Two-stage learned query routing that combines scalable approximate key retrieval and cross-encoder reranking with annotation-informed metadata biases for relevant atom selection.
- Multi-objective end-to-end distillation training with objectives for routing accuracy, irrelevant query suppression, knowledge protection, sparse gating, and multi-atom composition consistency.
- Sparse memory injection with per-atom gating controlling which base model layers and modules receive updates, reducing interference and improving modularity.
- Optional micro-KV prototypes per atom preserving local sequential evidence as compact prefixes augmenting the parametric memory.
Datasets
- SQuAD — 100k+ QA pairs — public
- DROP — ~96k QA pairs — public
- ROPES — ~2.4k QA pairs — public
- 2WikiMultiHopQA — 200k+ samples — public
- QASPER — ~4.9k QA samples — public
- LongBench — multiple zero-shot evaluation subsets — public
Baselines vs proposed
- D2Lckpt on Gemma-2-2B-It: Overall F1 = 37.93% vs Doc2Atom = 37.99%
- D2Lraw on Gemma-2-2B-It: Overall F1 = 28.55% vs Doc2Atom = 37.99%
- D2Latom on Gemma-2-2B-It: Overall F1 = 29.41% vs Doc2Atom = 37.99%
- D2Lraw on Qwen3-4B-Instruct: Overall F1 = 28.85% vs Doc2Atom = 35.72%
- D2Latom on Qwen3-4B-Instruct: Overall F1 = 29.30% vs Doc2Atom = 35.72%
- Refusal on irrelevant queries (Gemma-2-2B-It): D2L baselines below 40% F1 vs Doc2Atom 85%+ F1
- Memory usage for internalization compile (Gemma-2-2B-It): D2L 24.02 GB vs Doc2Atom 13.43 GB
- Compile latency (Gemma-2-2B-It): D2L 0.31s vs Doc2Atom 2.10s
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12400.

Fig 1: Doc-to-LoRA vs. Doc-to-Atom.

Fig 2: Doc2Atom data atomization and processing pipeline.

Fig 3: Doc2Atom pipeline. (a) Memory build (offline).
Limitations
- Fixed semantic taxonomy and predefined atomization strategy may not generalize across domains or tasks requiring adaptive granularity.
- Current work evaluates only on moderately sized LLMs (2B–4B param range); scalability to larger models remains untested.
- Limited exploration of adversarial or distribution-shifted queries beyond irrelevant probes created heuristically.
- Additional compilation latency overhead from atom encoding and routing steps compared to monolithic adapters.
- No public code or pretrained weights released yet, which limits reproducibility and independent verification.
- Evaluation focuses on QA benchmarks; generalization to other tasks like generation or reasoning outside retrieval is unclear.
Open questions / follow-ons
- Can atom decomposition be made adaptive or learnable rather than fixed by semantic taxonomy?
- How does Doc2Atom scale with much larger base LLMs or extremely large corpora of documents?
- What are the robustness properties under adversarially crafted or out-of-domain queries?
- Can the framework support continual updates or incremental addition/removal of atoms post-compilation?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, Doc2Atom offers a novel way to encode long, heterogeneous document knowledge into modular, query-specific parametric memory accessed efficiently at inference. This compositional memory reduces the risk of irrelevant information leaking through noisy or off-topic queries, which is crucial in adversarial settings where bots may probe for obscure or off-scope knowledge. The selective routing and refusal robustness mechanisms can inspire CAPTCHA systems that dynamically adapt challenge complexity based on relevant user context while curbing unnecessary computational overhead. Additionally, the memory-efficient internalization approach reduces system resource needs compared to naive long-context attention or monolithic adapters, aiding deployment at scale where latency and GPU memory are bottlenecks. Understanding this compositional distillation framework can help defense engineers conceptualize how parametric knowledge representation might enable next-generation user interaction models that balance specificity, interpretability, and computational efficiency under potential malicious querying.
Cite
@article{arxiv2606_12400,
title={Doc-to-Atom: Learning to Compile and Compose Memory Atoms},
author={Xingjian Diao and Wenbo Li and Yashas Malur Saidutta and Avinash Amballa and Lazar Valkov and Srinivas Chappidi},
journal={arXiv preprint arXiv:2606.12400},
year={2026},
url={https://arxiv.org/abs/2606.12400}
}