Solve the Loop: Attractor Models for Language and Reasoning

Source: arXiv:2605.12466 · Published 2026-05-12 · By Jacob Fein-Ashley, Paria Rashidinejad

TL;DR

Attractor Models are proposed as an alternative to standard looped Transformers for iterative refinement in language modeling and reasoning. The core idea is to separate computation into a backbone that makes an initial output-space prediction and an attractor module that refines that prediction to a fixed point, with training done through implicit differentiation rather than backpropagating through every recurrent step. This is meant to keep training memory constant in the number of refinement steps, make inference depth adaptive to convergence, and avoid the train-test mismatch of fixed unrolling.

The paper’s empirical claim is that this fixed-point formulation is not just cleaner, but materially better across two very different regimes. On large-scale language modeling, Attractor Models reportedly improve perplexity and downstream accuracy over matched Transformers and looped baselines while using less training compute; on tiny hard-reasoning tasks, the same framework scales to strong Sudoku and maze performance with only ~27M parameters and about 1,000 examples. A notable additional finding is “equilibrium internalization”: after training, the backbone’s initial proposal moves close to the eventual equilibrium, so the solver can sometimes be removed at inference with little loss.

Key findings

On the 140M language-modeling setup, Attractor improves validation perplexity from 21.48 (Transformer) and 19.06 (Parcae) to 18.30, and Lambada perplexity from 127.39 and 80.64 to 68.02 (Table 1).
On the 370M setup, Attractor reaches 14.03 validation PPL and 27.14 Lambada PPL, versus 15.79 / 40.77 for the Transformer and 14.49 / 32.74 for Parcae (Table 1).
On the 770M setup, Attractor gets 12.09 validation PPL and 15.21 Lambada PPL, compared with 13.08 / 22.37 for the Transformer and 12.49 / 19.71 for Parcae (Table 1).
Downstream CORE accuracy improves over the matched Transformer from 13.00 ± 0.15 to 14.59 ± 0.11 at 140M, from 17.46 ± 0.03 to 20.24 ± 0.09 at 370M, and from 22.42 ± 0.20 to 26.83 ± 0.29 at 770M (Table 1).
The paper claims 25–31% lower training FLOPs than Parcae because the solver often converges before Tmax and the backward pass uses a one-step implicit-gradient approximation (Figure 4).
Peak training memory stays near 4.18 GB for Attractor while Parcae grows with loop count and hits OOM at high depths in the reported comparison (Figure 5).
On Sudoku-Extreme, the 27M Attractor Model reaches 91.4% accuracy, while the 7M version gets 54.3%; the 27M TRM baseline collapses to 0.0% (Table 2).
On Maze-Hard, the 27M Attractor Model reaches 93.1% accuracy, while the 7M version gets 46.7%; the 27M TRM baseline again collapses to 0.0% (Table 2).

Threat model

The relevant setting is a model that must iteratively refine its internal representation to produce language or structured outputs, under constraints on training stability, memory, and compute. The adversary is effectively the optimization and deployment regime itself: fixed-depth unrolling, unstable recurrent dynamics, and train-test mismatch. In the reasoning setting, the model must generalize from very small data to hard algorithmic tasks, but there is no explicit adversary with query access, model access, or a perturbation budget. The method assumes it cannot rely on arbitrarily large unrolled depth during training, or on a learned halting mechanism that decouples training and inference behavior.

Methodology — deep read

Threat model and framing: this is not a security paper in the usual adversarial sense, but the relevant “adversary” is any setting where iterative refinement is useful and where fixed unrolling is brittle or expensive. The paper assumes the model must produce a next-token distribution or a full structured output by refining a latent representation, and it compares against standard Transformers and looped/recurrent baselines. For the reasoning tasks, the setting is extremely data-limited (~1,000 examples) and the model must solve Sudoku-Extreme and Maze-Hard in a single forward pass, so the challenge is not an active attacker but scaling stable recurrence without collapsing performance. The paper explicitly assumes that the backbone can produce a meaningful initial output embedding and that the attractor module can be made convergent enough for root finding.

Data and experimental setups: for language modeling, the authors say they follow the nanochat recipe used by Parcae and train on FineWeb-Edu. They compare matched models at 140M, 370M, and 770M parameters, using the same data budget, optimizer, and learning-rate schedule as the Parcae baselines; the only architectural change is the recurrent block. Evaluation includes validation perplexity, Lambada perplexity, and CORE / CORE-Ext accuracy. For hard reasoning, they train on Sudoku-Extreme and Maze-Hard with approximately 1,000 examples each, following the TRM protocol with deep supervision, and require predicting the full output grid in a single direct forward pass rather than autoregressive decoding. The paper does not provide full dataset preprocessing details in the excerpt, but it does state that the language-modeling data comes from FineWeb-Edu and that the reasoning tasks are those introduced in Wang et al. (2025) and Jolicoeur-Martineau (2025).

Architecture and algorithm: Attractor Models consist of a backbone module T_{θb} and an attractor module T_{θa}. The input sequence x is embedded with tied input/output embeddings, E(x) = \tilde x. The backbone maps \tilde x to an initial output-space proposal \tilde y_0 = T_{θb}(\tilde x). The attractor then repeatedly refines this proposal by \tilde y_{t+1} = T_{θa}(\tilde y_t, \tilde y_0), where the initial proposal is injected at every step to keep the attractor proposal-dependent. Instead of choosing a fixed loop count T, the forward pass solves the fixed-point residual A_{θa}(\tilde y; \tilde y_0) = T_{θa}(\tilde y, \tilde y_0) − \tilde y = 0 using a root finder initialized at \tilde y_0; the implementation uses Anderson acceleration. The solver stops when the normalized residual \|A_{θa}(\tilde y_t, \tilde y_0)\|_2 / \|\tilde y_t\|_2 < ε or Tmax is reached. The equilibrium \tilde y* is then decoded with the tied unembedding \tilde y* E^T. The main novelty relative to DEQ is that the fixed point lives directly in the output embedding space rather than in a hidden-state equilibrium, and the solver starts from a semantically meaningful proposal rather than zero or noise.
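The forward pass described above can be sketched as a small fixed-point solve. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_attractor` is a hypothetical stand-in for the attractor block, chosen to be a contraction so that convergence is guaranteed, and plain fixed-point (Picard) iteration is used in place of the Anderson acceleration the paper reports. The values of `eps` and `t_max` are also assumptions.

```python
import math


def toy_attractor(y, y0):
    """Hypothetical stand-in for the attractor block T(y, y0).

    Elementwise contraction in y (Lipschitz constant <= 0.5); the proposal
    y0 is injected on every call, mirroring the description above.
    """
    return [0.5 * math.tanh(yi) + 0.5 * y0i for yi, y0i in zip(y, y0)]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def solve_fixed_point(y0, eps=1e-6, t_max=50):
    """Iterate y <- T(y, y0), starting from y = y0.

    Stops when ||T(y, y0) - y||_2 / ||y||_2 < eps or after t_max steps.
    Returns (y, steps_used, last_measured_normalized_residual).
    """
    y = list(y0)
    rel = float("inf")
    for step in range(1, t_max + 1):
        y_next = toy_attractor(y, y0)
        residual = [a - b for a, b in zip(y_next, y)]
        rel = l2_norm(residual) / max(l2_norm(y), 1e-12)
        if rel < eps:
            return y, step, rel  # converged early: depth adapts to the input
        y = y_next
    return y, t_max, rel  # budget exhausted without meeting the tolerance


# Made-up proposal vector, for illustration only.
proposal = [1.0, -2.0, 0.5]
y_star, steps, rel = solve_fixed_point(proposal)
print(steps, rel)
```

The number of steps is decided by the residual test rather than by a preset loop count, which is the property the section attributes to the solver-based forward pass.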

Training regime and optimization: training uses standard next-token cross-entropy on the equilibrium output. The key technical choice is implicit differentiation through the fixed point rather than backpropagating through the full iterative trajectory. The exact gradient involves u=(I−J^T_{\tilde y})^{-1}v, where v=∂L/∂\tilde y*, but the authors adopt the one-step approximation u≈v for the language-modeling experiments to avoid the extra linear solve; this means the backward pass only needs one vector-Jacobian product through the attractor block, so memory does not grow with the number of solver iterations. They explicitly note that this is in the spirit of prior implicit models. For the reasoning tasks, they switch to the phantom-gradient scheme from Geng et al. (2022) instead of the one-step approximation, because the small-data, small-model regime is more sensitive and the one-step surrogate was found to be too crude in TRM-like settings. The excerpt does not report optimizer hyperparameters, batch size, epoch count, random-seed strategy, or hardware, beyond the fact that the same optimizer and LR schedule as Parcae were used for the language-modeling runs.
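To make the difference between the exact implicit gradient and the one-step approximation concrete, here is a scalar sketch. Everything in it is an assumption made for illustration: the map `f(y) = theta * tanh(y) + y0`, the squared-error loss, and the parameter values are hypothetical and do not come from the paper. In the scalar case the Jacobian J is a single number, so u = (I − J^T)^{-1} v reduces to v / (1 − J).

```python
import math


def f(y, theta, y0):
    """Hypothetical scalar attractor map: theta * tanh(y) + y0."""
    return theta * math.tanh(y) + y0


def fixed_point(theta, y0, iters=200):
    """Solve y = f(y) by plain iteration (a contraction when |theta| < 1)."""
    y = y0
    for _ in range(iters):
        y = f(y, theta, y0)
    return y


def loss(theta, y0, target):
    """Toy loss on the equilibrium: 0.5 * (y* - target)^2."""
    y_star = fixed_point(theta, y0)
    return 0.5 * (y_star - target) ** 2


def gradients(theta, y0, target):
    """Return (exact implicit gradient, one-step approximation) of dL/dtheta."""
    y_star = fixed_point(theta, y0)
    v = y_star - target                             # v = dL/dy*
    df_dy = theta * (1.0 - math.tanh(y_star) ** 2)  # scalar Jacobian J
    df_dtheta = math.tanh(y_star)                   # partial of f w.r.t. theta
    u_exact = v / (1.0 - df_dy)                     # u = (I - J^T)^{-1} v
    u_one_step = v                                  # one-step surrogate: u ~ v
    return u_exact * df_dtheta, u_one_step * df_dtheta


theta, y0, target = 0.5, 0.5, 0.0
exact, one_step = gradients(theta, y0, target)

# Central finite difference through the full solve, as an independent check.
h = 1e-5
finite_diff = (loss(theta + h, y0, target) - loss(theta - h, y0, target)) / (2 * h)
print(exact, one_step, finite_diff)
```

In this toy case the one-step surrogate keeps the sign of the exact gradient but underestimates its magnitude by the factor 1 − J; neither gradient path stores the solver trajectory, which is why memory does not grow with the number of iterations.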

Evaluation protocol, concrete example, and reproducibility: the language-modeling experiments compare against parameter-matched Transformers and Parcae at three scales, plus a 1.3B Transformer reference in one comparison. Metrics are validation PPL, Lambada PPL, CORE, and CORE-Ext; results are reported as single values with small error bars on CORE metrics (e.g., 14.59 ± 0.11). Figure 4 and Figure 5 evaluate efficiency in training FLOPs and peak memory as the number of loops increases. The reasoning experiments compare against Transformer, HRM, TRM at 7M and 27M, and frontier models (DeepSeek R1, Claude 3.7, o3-mini-high) that score 0% on both tasks in the table. A concrete end-to-end example is the next-token prediction of “The quick brown fox ____”: the backbone first produces a proposal embedding over candidate continuations, the attractor iterates until the residual is small, and the equilibrium is decoded into the final distribution. Reproducibility is reasonably strong on the surface: the paper gives a website and a GitHub repository, and the architecture is spelled out mathematically, but the excerpt does not include exact hyperparameter tables, number of training steps, seeds, or whether all checkpoints and datasets are fully frozen and public. The paper claims detailed hyperparameters are in Appendix C, but those specifics are not present in the provided text.
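The final decoding step in the "quick brown fox" example can be sketched as a tied unembedding followed by a softmax. The three-word vocabulary, the embedding matrix `E`, and the equilibrium vector `y_star` below are all made up for illustration; only the operation itself (logits computed as the equilibrium times E^T) follows the description above.

```python
import math

# Hypothetical toy vocabulary and 3-dimensional tied embedding matrix E.
VOCAB = ["jumps", "sleeps", "runs"]
E = [
    [1.0, 0.2, 0.0],  # embedding of "jumps"
    [0.0, 1.0, 0.1],  # embedding of "sleeps"
    [0.6, 0.0, 0.8],  # embedding of "runs"
]


def decode(y_star, embeddings):
    """Tied unembedding: logits = y* E^T, followed by a softmax."""
    logits = [sum(a * b for a, b in zip(y_star, row)) for row in embeddings]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A made-up equilibrium embedding that is most aligned with "jumps".
y_star = [2.0, 0.3, 0.1]
probs = decode(y_star, E)
prediction = VOCAB[max(range(len(probs)), key=probs.__getitem__)]
print(prediction, probs)
```

Because the same matrix embeds inputs and decodes outputs, the equilibrium lives in the space the vocabulary is scored in, which is what lets the backbone's proposal be read as an initial prediction.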

Technical innovations

Uses a two-stage backbone-plus-attractor design where the recurrent block solves for an equilibrium in output-embedding space rather than being unrolled for a preset depth.
Applies implicit differentiation to the fixed point so training memory is constant in effective recurrence depth, with a one-step gradient approximation used for language modeling.
Injects the backbone’s initial proposal into every attractor iteration, preventing the refinement dynamics from collapsing to a proposal-independent fixed point.
Introduces and empirically characterizes “equilibrium internalization,” where the backbone learns to land near the eventual fixed point so the solver can often be removed at inference.

Datasets

FineWeb-Edu — size not stated in excerpt — public web dataset used via the nanochat/Parcae pretraining recipe
Lambada — size not stated in excerpt — public benchmark
CORE — size not stated in excerpt — benchmark used for downstream accuracy
CORE-Ext — size not stated in excerpt — benchmark used for downstream accuracy
Sudoku-Extreme — approximately 1,000 training examples — source described via prior work (Wang et al., 2025; Jolicoeur-Martineau, 2025)
Maze-Hard — approximately 1,000 training examples — source described via prior work (Wang et al., 2025; Jolicoeur-Martineau, 2025)

Baselines vs proposed

Transformer (140M): val PPL = 21.48 vs proposed: 18.30
Parcae (140M): Lambada PPL = 80.64 vs proposed: 68.02
Transformer (370M): val PPL = 15.79 vs proposed: 14.03
Parcae (370M): Lambada PPL = 32.74 vs proposed: 27.14
Transformer (770M): val PPL = 13.08 vs proposed: 12.09
Parcae (770M): Lambada PPL = 19.71 vs proposed: 15.21
Transformer (140M): CORE = 13.00 ± 0.15 vs proposed: 14.59 ± 0.11
Transformer (370M): CORE = 17.46 ± 0.03 vs proposed: 20.24 ± 0.09
Transformer (770M): CORE = 22.42 ± 0.20 vs proposed: 26.83 ± 0.29
HRM (27M): Sudoku-Extreme = 55.0% vs proposed (27M): 91.4%
HRM (27M): Maze-Hard = 74.5% vs proposed (27M): 93.1%
TRM (7M): Sudoku-Extreme = 74.7% vs proposed (7M): 54.3%
TRM (7M): Maze-Hard = 85.3% vs proposed (7M): 46.7%
TRM (27M): Sudoku-Extreme = 0.0% vs proposed (27M): 91.4%
TRM (27M): Maze-Hard = 0.0% vs proposed (27M): 93.1%
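The relative perplexity reductions implied by the rows above can be computed directly. The sketch below only copies the perplexity pairs listed in this section and applies the usual (baseline − proposed) / baseline formula; the dictionary keys are labels chosen here for readability.

```python
# (baseline, proposed) perplexities copied from the rows above; lower is better.
ROWS = {
    "Transformer 140M, val PPL": (21.48, 18.30),
    "Parcae 140M, Lambada PPL": (80.64, 68.02),
    "Transformer 370M, val PPL": (15.79, 14.03),
    "Parcae 370M, Lambada PPL": (32.74, 27.14),
    "Transformer 770M, val PPL": (13.08, 12.09),
    "Parcae 770M, Lambada PPL": (19.71, 15.21),
}


def relative_reduction(baseline, proposed):
    """Fraction by which the proposed model lowers the baseline perplexity."""
    return (baseline - proposed) / baseline


for name, (baseline, proposed) in ROWS.items():
    pct = 100 * relative_reduction(baseline, proposed)
    print(f"{name}: {pct:.1f}% lower")
```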

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12466.

Fig 2: Comparison of looped language models vs. Attractor Models. Looped language models repeatedly apply a

Fig 3: Overview of Attractor Models. The backbone maps the input embeddings to an output-side proposal

Figs 3–8 (page 4).

Limitations

The excerpt does not provide full hyperparameter details, batch sizes, training steps, seed counts, or hardware, so exact reproducibility cannot be audited from the text alone.
The one-step implicit-gradient approximation is used for language modeling, which is computationally attractive but may bias gradients relative to exact implicit differentiation; the paper only says Anderson gives marginal quality gains.
The small reasoning benchmark regime is extremely data-limited (~1,000 examples), so the gains may depend heavily on task structure and deep supervision rather than general problem-solving ability.
For the reasoning tasks, the paper swaps to phantom gradients instead of the simpler one-step approximation, which means the best-performing settings are not uniform across regimes.
The excerpt does not show full ablations for all architectural choices, so the relative importance of backbone size, proposal conditioning, solver choice, and equilibrium decoding is only partially supported here.
The claim that the solver can be removed at inference with little degradation is intriguing but, from the excerpt, appears to be demonstrated on a subset of examples rather than as a universal guarantee.

Open questions / follow-ons

How far can equilibrium internalization be pushed before the solver becomes fully unnecessary without losing accuracy on harder distributions?
Would exact implicit differentiation, rather than the one-step approximation, materially improve language-model quality at acceptable cost for larger models?
How does the method behave under distribution shift, longer contexts, or adversarially perturbed reasoning instances where the fixed-point dynamics may be less contractive?
Can the attractor formulation be combined with token-level chain-of-thought or speculative decoding without reintroducing instability or train-test mismatch?

Why it matters for bot defense

For bot defense and CAPTCHA-like systems, the main takeaway is architectural rather than domain-specific: if a task benefits from iterative refinement, you do not necessarily need to pay for explicit unrolling at training or inference. The paper shows a way to keep recurrence adaptive while bounding memory, which is relevant when you want a verifier, solver, or policy module to refine its guess under a tight compute budget.

A bot-defense engineer should also notice the equilibrium-internalization effect. If a model learns to place its initial proposal near the fixed point, then some of the iterative machinery may become unnecessary at deployment time. In practice, that means you would want to monitor whether your “thinking” module is actually doing work or whether the backbone has absorbed it, because that changes both latency and how robust the system is when the refinement budget is reduced. The other practical lesson is cautionary: the strongest gains in the paper come from task structure that seems highly amenable to fixed-point computation, so one should not assume the same pattern will transfer to noisy, open-ended adversarial settings without dedicated evaluation.
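One way to monitor whether the refinement module is still doing work is to log, per input, how far the equilibrium sits from the backbone's proposal and how many solver steps were used. The sketch below is a hypothetical monitor: the gap metric, the `gap_threshold` value, and the logged records are assumptions for illustration and are not taken from the paper.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def internalization_gap(proposal, equilibrium):
    """Relative distance ||y* - y0||_2 / ||y*||_2 between proposal y0 and
    equilibrium y*. Values near 0 suggest the backbone has absorbed most
    of the refinement."""
    diff = [a - b for a, b in zip(equilibrium, proposal)]
    return l2_norm(diff) / max(l2_norm(equilibrium), 1e-12)


def summarize(records, gap_threshold=0.05):
    """records: iterable of (proposal, equilibrium, solver_steps) tuples.

    Returns the mean gap, the mean solver steps, and the fraction of inputs
    whose gap falls below the (hypothetical) threshold.
    """
    records = list(records)
    gaps = [internalization_gap(p, e) for p, e, _ in records]
    steps = [s for _, _, s in records]
    n = len(records)
    return {
        "mean_gap": sum(gaps) / n,
        "mean_steps": sum(steps) / n,
        "frac_internalized": sum(g < gap_threshold for g in gaps) / n,
    }


# Made-up logged values, for illustration only.
logged = [
    ([1.00, 2.00], [1.01, 2.00], 2),   # proposal already near equilibrium
    ([0.50, 0.50], [0.90, 0.10], 14),  # solver moved the proposal substantially
]
report = summarize(logged)
print(report)
```

Tracking the fraction of inputs below the threshold over time indicates when the solver budget could be reduced, and the inputs that stay above it are the ones to check before removing refinement at deployment.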

Cite

@article{arxiv2605_12466,
  title={Solve the Loop: Attractor Models for Language and Reasoning},
  author={Jacob Fein-Ashley and Paria Rashidinejad},
  journal={arXiv preprint arXiv:2605.12466},
  year={2026},
  url={https://arxiv.org/abs/2605.12466}
}
