Skip to content

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Source: arXiv:2605.30337 · Published 2026-05-28 · By Alaa Khamis, Alaa Maalouf

TL;DR

This paper addresses the challenge of performing test-time finetuning (TTFT) of large language models (LLMs) in a way that balances quality and low latency, critical for real-time inference. Existing TTFT approaches involve retrieving relevant examples per query and finetuning on them, but this can be slow due to redundant retrieval and expensive selection methods that try to account for diversity. The authors introduce HullFT, a new geometric approach that models the prompt embedding as a sparse convex combination of a small set of training examples selected via Frank-Wolfe optimization. This convex reconstruction inherently balances relevance and diversity without explicit redundancy penalties. They convert fractional weights into integer multisets to form exact finite training batches, which enables repeated examples to be exploited by a gradient reuse scheme that amortizes finetuning cost. Experiments on 12 subsets of The Pile with GPT-2 show HullFT Pareto-dominates prior methods (kNN and SIFT) across all total runtime budgets ≤4s, especially in latency-critical settings, reducing bits-per-byte (BPB) by 3–6% at given time constraints while also speeding up selection by ~9–12× and finetuning by ~1.5×. Overall, HullFT improves the quality-efficiency tradeoff for TTFT by integrating convex geometry-based data selection with integerization and efficient gradient caching.

Key findings

  • HullFT outperforms state-of-the-art TTFT baselines kNN [5] and SIFT [8] on average BPB% across 12 Pile subsets, achieving 3.83% and 3.44% lower BPB at total runtime budgets T=1.75s and T=2.0s, respectively (Table 1).
  • HullFT is Pareto-dominant for all total runtimes T ≤ 4 seconds, with the BPB advantage growing as budgets tighten; 6.4% lower BPB at T=0.75s (Fig 2, right).
  • HullFT’s selection stage runs 8.8× faster than SIFT at N=50 (0.059s vs 0.524s) and over 12× faster on average across N ∈ [1,50], drastically reducing selection latency.
  • Gradient Reuse technique reduces finetuning forward-backward passes by 1.48× on average at N=50, with only 0.64% average BPB degradations (Table 2).
  • Integerization converts fractional convex weights into multisets with repeated examples, enabling Gradient Reuse and contiguous training on example blocks to amortize gradient computations (Alg. 1, 4).
  • HullFT’s convex hull-based selection produces a support set that is inherently both relevant and diverse without explicit redundancy penalties or expensive greedy heuristics (Fig 7 visualizes diverse support).
  • Applying Gradient Reuse naively to SIFT causes increases in BPB due to reordering; HullFT avoids this penalty as integerization fixes multiplicities in advance (Table 3).
  • On CPU, HullFT’s selection is 25.8× faster than SIFT with substantial end-to-end runtime savings (~89s), while BPB is still competitive (72.05% vs 69.78% for SIFT) (Fig 6).

Threat model

The adversary is an inference-time scenario seeking to improve model quality per test prompt by accessing embeddings and retrieving candidate sequences from a fixed large corpus. They cannot alter model internals or training corpus; their capability is limited to selecting and finetuning on retrieved examples per query. The goal is to select relevant, diverse examples efficiently without redundant computation to optimize user-facing latency and prediction quality. The method assumes honest but latency-constrained use, not malicious exploits or poisoning.

Methodology — deep read

The threat model assumes an adversarial scenario where a base LLM is adapted at test time to each prompt by retrieving relevant examples from a large corpus and finetuning, aiming to improve predictive quality per query. The adversary can only access embeddings and retrieved candidates but cannot directly manipulate or compromise model internals.

Data comes from 12 diverse subsets of The Pile corpus, with candidate pools consisting of the 200 nearest neighbors retrieved per prompt using a precomputed FAISS index and normalized RoBERTa embeddings. For each prompt q, the selection aims to find a small subset of candidates to finetune on, improving BPB.

The core algorithmic contribution frames data selection as an approximate Carathéodory problem: given the prompt embedding q in Rd and candidate pool P in Rd×K, solve a sparse convex approximation min_{w∈Δ_K} ||q - Pw||^2_2 via the projection-free Frank-Wolfe algorithm (Alg. 3). Starting with the point pi maximizing ⟨q,pi⟩, FW iteratively adds points reducing the residual, producing a sparse convex combination w with support size ≤ m and tolerance ε. This approach produces a support set S with fractional weights w, inherently diverse and relevant through geometric approximation rather than explicit redundancy penalties.

Because fractional weights cannot be used directly for finetuning, the authors introduce a geometric integerization procedure (Alg. 1) converting fractional weights w into integer multiplicities c summing to training budget N. Starting with flooring (floor(N·w_j)), they perform a greedy fill to allocate leftover examples to minimize Euclidean reconstruction error ||q - (1/N) ∑ c_j s_j||^2, followed by local swap refinements to improve approximation quality. This establishes an exact training multiset of examples with repeated sequences.

Finetuning exploits the repeated examples by Gradient Reuse (Alg. 4). Gradients for repeated identical examples are cached and reused across consecutive optimizer steps, recalculated every r steps (default r=2) instead of every step, reducing forward-backward passes from N to ⌈N/r⌉ with minimal loss in convergence or quality.

Training is performed with Adam optimizer (lr=5×10^−5) on GPT-2 base, with evaluation using bits-per-byte (BPB%) relative to no adaptation. Experiments sweep over multiset sizes N ∈ [1,50] to explore speed-quality tradeoffs.

Evaluation compares HullFT against two baselines: kNN retrieval (top N nearest neighbors) and SIFT (diversity-aware information-theoretic selection). Metrics emphasize BPB% vs total inference wall-clock time (selection + finetuning), critical for latency-sensitive scenarios. Experiments include 12 Pile subsets with 150 test queries each, cross-validated over subsets but no adversarial or distribution-shift tested. Ablations explore gradient reuse frequency r, integerization passes T, and selection parameters m, ε.

The method is fully reproducible with code released. The data subsets are public (The Pile), though embedding details and FAISS indexes are precomputed and not described in full detail. The integerization and gradient reuse algorithms are described explicitly. Some implementation details (hardware seeds, number epochs) are not fully detailed but training is done on a single NVIDIA A100 GPU.

Example end-to-end: Given a prompt q, retrieve 200 kNN candidates, run Frank-Wolfe to find weights w representing q as a sparse convex combo of ≤ m points, convert w→c integer counts forming repeated training examples, then finetune GPT-2 model with gradient reuse caching across repeated samples before evaluating prompt performance in BPB.

Technical innovations

  • Framing test-time finetuning data selection as a sparse convex approximation problem solvable via Frank-Wolfe, which naturally balances relevance and diversity without explicit redundancy penalty.
  • Geometric integerization converting fractional convex weights into an exact integer multiset preserving geometric approximation quality, enabling precise control over training batch composition.
  • Introducing Gradient Reuse to amortize gradient computations over repeated examples in the training multiset, reducing finetuning computational cost by caching gradients across repeated steps.
  • Demonstrating that combining convex geometry-based selection with integerization and gradient caching yields a two-stage speedup that improves quality-latency tradeoffs beyond prior methods (kNN, SIFT).

Datasets

Baselines vs proposed

  • kNN retrieval: BPB% at T=1.75s = 78.42% avg; HullFT: 74.48% (3.83% lower)
  • SIFT selection: BPB% at T=1.75s = 78.31% avg; HullFT: 74.48% (3.83% lower)
  • kNN retrieval: BPB% at T=2.0s = 77.63% avg; HullFT: 73.12% (3.44% lower)
  • SIFT selection: BPB% at T=2.0s = 76.56%; HullFT: 73.12% (3.44% lower)
  • HullFT selection stage runtime: 0.059s at N=50 vs SIFT: 0.524s (8.8× faster)
  • HullFT finetuning runtime with Gradient Reuse (r=2): 1.48× speedup over baseline at cost of 0.64% BPB increase
  • SIFT with gradient reuse variants: runtime reduced but BPB increased by up to +2.5%, while HullFT avoids this degradation

Limitations

  • HullFT depends on a fixed kNN pre-retrieved candidate pool of 200 sequences; quality is bounded by the recall and coverage of this initial retrieval and embedding model.
  • Selection quality inherits limitations of embedding space geometry; if candidate pool does not contain directions matching query semantics, approximation quality may suffer.
  • The gradient reuse technique assumes repeated identical examples appear consecutively; does not directly extend to methods producing non-contiguous duplicates without reordering.
  • Experiments evaluate only on GPT-2 medium and subsets of The Pile; no evaluation on larger LLMs, diverse domains, or adversarial prompt distributions.
  • No analysis of robustness to adversarial attacks or distribution shifts was performed; primarily focuses on efficiency and reconstruction quality.
  • Training hyperparameters and hardware details not exhaustively documented; potential variability in real deployments.
  • Integerization rounding introduces a small geometric approximation error, which prior work did not address explicitly.

Open questions / follow-ons

  • Can the convex-geometric integerization and gradient reuse techniques scale to much larger models (e.g., GPT-3 or GPT-4) or multimodal models?
  • How sensitive is convex hull-based selection to embedding model quality and different embedding spaces? Could learned or adaptive embeddings improve support sets?
  • Could gradient reuse be extended to handle non-contiguous repeated examples or more complex dependency graph ordering?
  • What is the effect of adversarial or distribution-shifted query prompts on HullFT’s convex approximation and integerization quality?

Why it matters for bot defense

For bot-defense engineers running latency-critical LLM inference with test-time finetuning, HullFT offers a practical approach to speed up per-query adaptation while improving quality. Its convex-geometric formulation automatically balances diversity and relevance, making selected examples more informative and less redundant, which is important to avoid wasting compute on near-duplicate adaptation data. The integerization step creates training batches with repeated examples in contiguous blocks, enabling gradient caching to reduce finetuning overhead substantially. This tandem approach addresses two main latency bottlenecks in TTFT: fast selection and efficient finetuning.

In captcha or bot detection pipelines where prompt-specific language adaptation is helpful (e.g., personalize text-based verification or response filtering), HullFT’s techniques can help maintain overall system responsiveness by minimizing adaptation runtime while maximizing perplexity reduction per query. Its ability to consistently outperform kNN and SIFT-based selection under strict latency budgets is especially relevant where marginal latency increments translate to worse user experience or bot evasion risk. However, practitioners should consider the method’s reliance on a fixed candidate embedding pool and evaluate robustness to prompt distribution shifts typical in adversarial bot inputs.

Cite

bibtex
@article{arxiv2605_30337,
  title={ Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching },
  author={ Alaa Khamis and Alaa Maalouf },
  journal={arXiv preprint arXiv:2605.30337},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.30337}
}

Read the full paper

Last updated:

Articles are CC BY 4.0 — feel free to quote with attribution