UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Source: arXiv:2605.06597 · Published 2026-05-07 · By Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu et al.

TL;DR

UniSD addresses a core tension in LLM post-training: how to improve a model using only its own outputs as supervision, without access to a stronger external teacher. The fundamental difficulty is that self-generated trajectories are open-ended, correctness is task-dependent, and plausible but wrong rationales can corrupt the training signal. Prior self-distillation methods (SDFT, GKD, SSD, OPSD) each tackle isolated aspects of this problem — on-policy sampling, distribution matching, or output filtering — but no prior work systematically studies which components matter, why, and how they interact. UniSD is proposed as a modular, extensible framework that unifies five complementary mechanisms under three axes: supervision reliability (multi-teacher agreement, token-level contrastive learning), representation alignment (feature matching), and training stability (EMA teacher, divergence clipping). Each mechanism targets a distinct failure mode of naive self-distillation.

The paper evaluates UniSD variants and a fully integrated pipeline (UniSD*) across six benchmarks — ScienceQA, GPQA, MBPP, HumanEval, CoS-E, ToolAlpaca — spanning science QA, commonsense reasoning, code generation, and tool use. Six models from three families (Qwen2.5 at 0.5B/1.5B/3B/7B, Llama-3.1-8B, Gemma-3-4B) are tested. The study goes beyond accuracy to also measure gold-completion perplexity fit and base-distribution retention (PPLret), revealing that SFT causes significant distributional drift while reliability-aware self-distillation preserves base model behavior.

UniSD* achieves a +5.4 point overall improvement over the raw Qwen2.5-7B base model and a +2.8 point improvement over the strongest baseline (GKD) with an overall score of 73.3 vs. 70.5. The paper's central empirical finding is that no single component dominates across all tasks: EMA and agreement provide the largest standalone gains, contrastive learning is the most uniformly positive, and divergence clipping acts as a lightweight stabilizer. The interaction study reveals that combining complementary mechanisms consistently outperforms any single mechanism in isolation.

Key findings

UniSD* achieves an overall score of 73.3 on Qwen2.5-7B across six benchmarks, a +5.4 improvement over the raw model (67.9) and +2.8 over the strongest baseline GKD (70.5).
EMA teacher is the strongest single component overall (72.5), with a +16.1 gain on ToolAlpaca (61.8 → 77.9) and +2.4 gains on both MBPP and HumanEval for Qwen2.5-7B.
SFT causes measurable distributional drift: on Qwen2.5-7B, base-scored retention perplexity rises from 1.14 (raw) to 1.68 (SFT), while EMA reduces this to ~1.13, a 33.9% relative reduction vs. SFT.
Token-level contrastive learning is the most consistently positive single component, improving all six benchmarks for Qwen2.5-7B, while other individual components show regressions on at least one benchmark.
Across 18 model-dataset pairs spanning three model families (Qwen2.5, Llama-3.1, Gemma-3), UniSD* improves over raw models in 15 settings, ties in 2, and regresses in only 1 OOD setting.
UniSD* achieves the largest scale-specific gain of +7.06 on Qwen2.5-3B on ScienceQA (Fig 3a), suggesting mid-scale models benefit most from reliability-aware self-distillation.
UniSD* reduces mean token-level JSD to the base model from 0.054 (SFT) to 0.041, and UniSD* completions have lower JSD than SFT on 70.3% of examples; the base model assigns higher log-probability to UniSD* completions on 60.6% of examples.
Adding more auxiliary teacher contexts does not monotonically improve agreement quality: performance peaks at K=3 (sequence-level, ScienceQA) or K=7 (token-level, ScienceQA) depending on granularity and task, and redundant contexts can dilute the reliability signal.

Threat model

n/a — This is an LLM training methodology paper, not a security paper. The implicit adversary is the noise and unreliability inherent in self-generated supervision signals: the model's own incorrect completions, overconfident predictions on rare tokens, and plausible-but-wrong rationales that can corrupt the training loop. The model is assumed to have no access to a stronger external teacher, and the goal is to improve task performance without that access. No external adversarial attacker, no model extraction threat, and no privacy threat are modeled.

Methodology — deep read

Threat model and assumptions (self-improvement setting, not adversarial security): The adversary in the learning-theoretic sense is the noise inherent in self-generated supervision. The model has no access to a stronger external teacher at any point. It is assumed the base model has already been instruction-tuned and produces plausible but imperfect completions. The student and teacher are the same model (or an EMA copy), and correctness signals come from task-specific evaluation (unit tests for code, answer matching for QA, format adherence for tool use) embedded in the data, not from an external oracle.

Data provenance, size, and splits: Six benchmarks are used. ScienceQA (~13K train, multi-modal QA, but presumably text-only subset used) and CoS-E (commonsense with human rationales, ~9K train) are in-domain training + evaluation sets. MBPP (~374 training problems, Python coding with unit tests) is also used for training. ToolAlpaca (multi-step tool-calling, ~3.9K train) is the fourth in-domain set. GPQA (448 expert-level biology/chemistry/physics MCQs) and HumanEval (164 function-completion problems) are held-out OOD test sets — models trained on ScienceQA are tested on GPQA, and models trained on MBPP are tested on HumanEval. Exact train/validation/test splits and total token counts are not fully specified in the truncated text; dataset statistics are said to be in Table 4. For negative examples used in contrastive learning, incorrect alternatives are constructed by prompting an LLM to generate a plausible wrong answer, corrupting the reasoning in the positive example, or applying lexical perturbations via WordNet, PPDB, and TextAttack.

Architecture and novel components: All experiments use instruction-tuned autoregressive LLMs (Qwen2.5-7B-Instruct as primary). The UniSD framework wraps the standard on-policy training loop with five modular components:
(A) Multi-Teacher Agreement: Given a student-sampled completion ŷ, the same trajectory is scored under K auxiliary teacher views π_k(ŷ_t | x, c_k, ŷ_{<t}), where contexts c_k are constructed via retrieval (nearest-neighbor few-shot), random sampling, or induced abstract instructions. Disagreement is measured at token level (variance/range of per-token log-likelihoods across K views) or sequence level (variance of aggregated per-view log-likelihoods). The resulting δ_t or δ_seq is used to compute reliability weights w_t that down-weight high-disagreement positions in the distillation loss. Crucially, all K views share one teacher model and are batched across contexts, so no extra model copies are loaded.
(B) EMA Teacher: A temporally smoothed teacher is maintained via θ̄_n = β·θ̄_{n-1} + (1-β)·θ_n with β ∈ [0,1]. The EMA teacher defines the target distribution replacing the primary teacher in Eq. 1, preventing transient overconfident predictions from propagating. This addresses temporal drift, complementing agreement's within-step reliability estimation.
(C) Token-Level Contrastive Learning: A margin-based auxiliary loss encourages on-policy student completions to be closer (in per-token log-likelihood distance) to correct teacher-conditioned trajectories (d+_t) than to incorrect alternatives (d-_t): L_aux = Σ_t m_t · max(0, γ + d+_t - d-_t). This is a hinge-style margin loss applied token-by-token on completion positions.
(D) Feature Matching: An L2 regularization term between student and teacher final-layer hidden states on completion tokens: L_feat = Σ_t m_t ||f^θ_t - f*_t||^2. Two variants are tested: representation-only matching and joint logit+representation matching. This addresses the gap between output distribution alignment and internal representational structure.
(E) Divergence Clipping: Token-level KL/JSD divergences are capped at scalar threshold κ before applying reliability weights: D̃_t = min(D_t, κ). This prevents rare high-divergence tokens (e.g., stylistic outliers) from dominating the gradient. When κ is unspecified, the objective reduces to unclipped distillation. Forward-KL, reverse-KL, and weighted JSD are all supported as the base divergence.
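
To make the interaction between agreement weighting (A) and divergence clipping (E) concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the mapping w_t = exp(-δ_t / τ), the temperature τ, and all numbers are assumptions made for illustration; the summary only states that high-disagreement positions are down-weighted and that clipping is applied before the weights.

```python
import math

def token_disagreement(view_logps):
    """Variance of each token's log-likelihood across K teacher views.

    view_logps[k][t] = log pi_k(y_t | x, c_k, y_<t) for view k, token t.
    """
    num_views = len(view_logps)
    num_tokens = len(view_logps[0])
    deltas = []
    for t in range(num_tokens):
        column = [view_logps[k][t] for k in range(num_views)]
        mean = sum(column) / num_views
        deltas.append(sum((v - mean) ** 2 for v in column) / num_views)
    return deltas

def reliability_weights(deltas, tau=1.0):
    # ASSUMPTION: w_t = exp(-delta_t / tau). The source does not give the mapping.
    return [math.exp(-d / tau) for d in deltas]

def weighted_clipped_loss(divs, weights, mask, kappa=None):
    """sum_t m_t * w_t * min(D_t, kappa); kappa=None leaves D_t unclipped."""
    total = 0.0
    for d, w, m in zip(divs, weights, mask):
        clipped = d if kappa is None else min(d, kappa)  # clip before weighting
        total += m * w * clipped
    return total

# Hypothetical inputs: K=3 views, T=3 completion tokens.
view_logps = [[-0.1, -2.0, -0.3],
              [-0.1, -0.5, -0.3],
              [-0.1, -3.5, -0.3]]
divs = [0.2, 5.0, 0.1]  # made-up per-token divergences D_t
mask = [1, 1, 1]        # all positions are completion tokens

deltas = token_disagreement(view_logps)
weights = reliability_weights(deltas)
loss_clipped = weighted_clipped_loss(divs, weights, mask, kappa=1.0)
loss_unclipped = weighted_clipped_loss(divs, weights, mask)
```

In this toy input the second token is both the one the views disagree on and the one with an outlier divergence, so it is down-weighted and capped, which is the failure mode the two mechanisms are described as targeting.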

Training regime: The primary model is Qwen2.5-7B-Instruct. Scale ablations use Qwen2.5 at 0.5B, 1.5B, 3B. Cross-family experiments use Llama-3.1-8B-Instruct and Gemma-3-4B-it. Specific hyperparameters (learning rate, batch size, number of epochs, EMA β, clipping κ, margin γ, number of auxiliary views K) are partially described: sensitivity analyses over K (shown in Appendix Figs 9 and 10) and γ (γ ∈ {0.01, 1.0} mentioned) are conducted. Training hardware and wall-clock times are partially shown in Fig 5 (left), which plots training time in minutes vs. accuracy on ScienceQA; agreement variants are slower than clipping, which is the most runtime-efficient. Full hyperparameter tables appear to be in the appendix, not included in the truncated text. Seed strategy is not described.

Evaluation protocol: The primary metric is task accuracy (exact match or pass@1 for code) averaged across benchmarks into an 'Overall' score. Secondary metrics include: (1) Gold-completion fit — teacher-forced perplexity PPL_fit on reference completions, measuring whether adaptation makes correct outputs more likely; (2) Base-distribution retention — PPL_ret, computed by scoring adapted model completions under the original base model π_0, measuring distributional drift; (3) Token-level JSD between adapted and base next-token distributions. Baselines are SFT, SDFT, GKD, SSD, and OPSD. Ablations test each UniSD component in isolation (Table 1) and in combination (UniSD*). Cross-scale ablation (Fig 3a) and cross-family ablation (Fig 7) are conducted. No statistical significance tests (e.g., bootstrap confidence intervals) are reported in the visible text. The OOD evaluation (GPQA for science, HumanEval for code) specifically tests transfer without any training on those distributions.
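
The two perplexity metrics differ only in which model scores which text. The sketch below assumes the usual definition of perplexity as the exponential of the negative mean per-token log-probability; whether the paper averages per token or per sequence is not stated in the visible text, and the log-probabilities here are hypothetical.

```python
import math

def perplexity(token_logps):
    """exp(-mean log p) over a list of natural-log token probabilities."""
    return math.exp(-sum(token_logps) / len(token_logps))

# PPL_fit: the ADAPTED model scores the GOLD reference completion.
# Hypothetical log-probs: each gold token gets probability 0.5.
adapted_logps_on_gold = [math.log(0.5)] * 4
ppl_fit = perplexity(adapted_logps_on_gold)

# PPL_ret: the BASE model pi_0 scores a completion SAMPLED FROM the adapted model.
# Hypothetical log-probs assigned by the base model to those sampled tokens.
base_logps_on_adapted = [math.log(0.9), math.log(0.8), math.log(0.95)]
ppl_ret = perplexity(base_logps_on_adapted)
```

A PPL_ret close to 1 means the base model still finds the adapted model's outputs highly likely, which is how the metric is used to indicate low drift.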

Concrete end-to-end example (ToolAlpaca, Qwen2.5-7B with EMA): The raw model scores 61.8 on ToolAlpaca. At each training step, the student samples a multi-step tool-calling trajectory ŷ from its current policy. The EMA teacher (a smoothed copy of the model) scores the same trajectory as a target distribution. The student minimizes the token-level JSD between its next-token distribution and the EMA teacher's, masked to completion tokens. The EMA teacher updates slowly (θ̄ = β·θ̄ + (1-β)·θ), preventing the teacher from abruptly incorporating a mistake the student just made. After training, EMA achieves 77.9 on ToolAlpaca (+16.1 over raw), suggesting that the strict output-format requirements of tool use benefit strongly from a temporally stable teacher target.
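
The two operations in this example, the EMA parameter update and the masked token-level JSD, can be sketched in a few lines of pure Python. This is an illustrative sketch under assumptions: parameters are plain lists rather than tensors, the JSD is the symmetric (equal-weight) form rather than the weighted variant the paper also supports, the loss is averaged over masked tokens, and all values are made up.

```python
import math

def ema_update(teacher, student, beta=0.99):
    """theta_bar <- beta * theta_bar + (1 - beta) * theta, per parameter."""
    return {
        name: [beta * t + (1.0 - beta) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }

def jsd(p, q):
    """Symmetric Jensen-Shannon divergence between two probability lists."""
    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def masked_mean_jsd(student_probs, teacher_probs, mask):
    """Average per-token JSD over positions where mask == 1."""
    total = sum(mk * jsd(p, q)
                for p, q, mk in zip(student_probs, teacher_probs, mask))
    return total / max(sum(mask), 1)

# Hypothetical two-parameter model and a slow teacher update.
teacher = {"w": [1.0, 1.0]}
student = {"w": [0.0, 2.0]}
new_teacher = ema_update(teacher, student, beta=0.9)

# Hypothetical next-token distributions over a 2-token vocabulary.
student_probs = [[0.7, 0.3], [1.0, 0.0], [0.5, 0.5]]
teacher_probs = [[0.7, 0.3], [0.0, 1.0], [0.9, 0.1]]
mask = [1, 1, 0]  # last position is not a completion token
loss = masked_mean_jsd(student_probs, teacher_probs, mask)
```

With β = 0.9 the teacher moves only a tenth of the way toward the student per step, which illustrates why a single bad student update has limited influence on the target.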

Reproducibility: Code release is not mentioned in the visible text. No frozen weights or model checkpoints are described. The paper uses public benchmark datasets. Negative example construction for contrastive learning involves prompting an LLM (which LLM is not specified in the visible text), introducing a potential reproducibility gap.

Technical innovations

Multi-teacher agreement as a reliability probe: instead of adding more distillation teachers, auxiliary views (retrieved/random/induced contexts) are used solely to estimate cross-view consistency of the same student trajectory, and this disagreement signal is used to weight the primary distillation loss — distinguishing it from ensemble distillation approaches like Born-Again Networks.
Token-level contrastive learning integrated into the on-policy self-distillation loop: a hinge margin loss d+_t vs. d-_t is computed over per-token log-likelihood distances to positive and negative teacher-conditioned trajectories, extending sequence-level contrastive approaches (e.g., DPO, RLHF-style preference learning) to the token level without reward models.
Unified modular formulation of self-distillation under three axes (supervision reliability, representation alignment, training stability) that makes component interactions empirically analyzable, contrasting with prior work (SDFT, GKD, SSD, OPSD) that each address a single axis in isolation.
Base-distribution retention (PPL_ret) as an evaluation metric for self-distillation: scoring adapted-model completions under the original base model quantifies catastrophic forgetting of base behavior, complementing task accuracy — this framing is not standard in prior KD/SFT evaluations.
Divergence clipping at the scalar token level (before reliability weighting) as a minimal, runtime-efficient stabilization mechanism, which the paper shows is orthogonal to and composable with EMA and agreement without requiring additional model copies or forward passes.
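
The token-level margin loss and the feature-matching term described above are both simple masked sums. The sketch below evaluates them on made-up inputs. It takes the distances d+_t and d-_t as given, since the visible text does not spell out how they are computed, and the function names and values are hypothetical.

```python
def token_margin_loss(d_pos, d_neg, mask, gamma=1.0):
    """L_aux = sum_t m_t * max(0, gamma + d_pos_t - d_neg_t)."""
    return sum(m * max(0.0, gamma + dp - dn)
               for dp, dn, m in zip(d_pos, d_neg, mask))

def feature_matching_loss(h_student, h_teacher, mask):
    """L_feat = sum_t m_t * ||f_student_t - f_teacher_t||^2."""
    total = 0.0
    for hs, ht, m in zip(h_student, h_teacher, mask):
        total += m * sum((a - b) ** 2 for a, b in zip(hs, ht))
    return total

# Hypothetical per-token distances to the correct (d_pos) and
# incorrect (d_neg) teacher-conditioned trajectories.
d_pos = [0.2, 1.5, 0.1]
d_neg = [2.0, 1.0, 0.3]
mask = [1, 1, 0]  # third position is outside the completion

aux = token_margin_loss(d_pos, d_neg, mask, gamma=1.0)

# Hypothetical final-layer hidden states with hidden size 2.
h_student = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
h_teacher = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
feat = feature_matching_loss(h_student, h_teacher, mask)
```

The first token already satisfies the margin and contributes nothing, the second violates it and contributes 1.5, and the masked third token is ignored in both losses.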

Datasets

ScienceQA — ~13K examples (train+test, exact split not specified in visible text) — public benchmark (Lu et al., 2022)
GPQA — 448 questions — public, test-only OOD benchmark (Rein et al., 2023)
CoS-E (CommonsenseQA + explanations) — ~9K examples (exact split not specified) — public benchmark (Rajani et al., 2019)
MBPP — ~374 training problems + test set — public benchmark (Austin et al., 2021)
HumanEval — 164 function-completion problems — public, test-only OOD benchmark (Chen et al., 2021)
ToolAlpaca — ~3.9K examples (exact split not specified in visible text) — public benchmark (Tang et al., 2023)

Baselines vs proposed

Raw (no fine-tuning): Overall = 67.9 vs. UniSD*: 73.3 (Qwen2.5-7B)
SFT: Overall = 68.3 vs. UniSD*: 73.3 (Qwen2.5-7B)
SDFT: Overall = 70.1 vs. UniSD*: 73.3 (Qwen2.5-7B)
GKD: Overall = 70.5 vs. UniSD*: 73.3 (Qwen2.5-7B)
SSD: Overall = 67.3 vs. UniSD*: 73.3 (Qwen2.5-7B)
OPSD: Overall = 68.2 vs. UniSD*: 73.3 (Qwen2.5-7B)
SFT retention perplexity (PPLret, Qwen2.5-7B): 1.68 vs. EMA: ~1.13 (raw baseline: 1.14)
SFT retention perplexity (PPLret, Gemma-3-4B): 3.02 vs. agreement/EMA/contrast variants: 10.57–11.24 (raw gold-completion PPL: 1.27; PPLfit and PPLret are different metrics and are not directly comparable)
Gold-completion fit (PPLfit, Qwen2.5-7B): raw 20.74 vs. agreement/EMA/contrast variants: 5.7–6.1
UniSD* gain over raw on Llama-3.1-8B: +3.1 overall; on Gemma-3-4B: +2.2 overall (Fig 7)
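
The headline deltas can be rechecked directly from the Overall scores listed above. The short script below does only that arithmetic, using the figures in this summary. The relative-reduction helper is an assumption about how the 33.9% figure was derived; with the rounded values quoted here (1.68 and ~1.13) it gives roughly 32.7%, so the reported number presumably uses unrounded inputs.

```python
# Overall scores on Qwen2.5-7B as listed in this summary.
overall = {"Raw": 67.9, "SFT": 68.3, "SDFT": 70.1,
           "GKD": 70.5, "SSD": 67.3, "OPSD": 68.2}
unisd_star = 73.3

# Absolute gain of UniSD* over each baseline, in points.
gains = {name: round(unisd_star - score, 1) for name, score in overall.items()}

def relative_reduction(before, after):
    """Fractional reduction of `after` relative to `before`."""
    return (before - after) / before

# Retention perplexity: SFT vs. EMA, using the rounded values quoted above.
ppl_ret_reduction = relative_reduction(1.68, 1.13)

for name, gain in gains.items():
    print(f"UniSD* vs {name}: +{gain}")
print(f"PPLret reduction (rounded inputs): {ppl_ret_reduction:.1%}")
```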

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06597.

Fig 1: Overview of UniSD, a unified framework for self-distillation in LLMs. UniSD integrates agreement,

Fig 2: UniSD is a Unified framework for systematically studying Self-Distillation in autoregressive LLMs.

Fig 3 (page 1).

Fig 7: Gains over the original model across Qwen2.5, Llama-3.1, and Gemma-3 on ScienceQA (SQA),

Limitations

No statistical significance testing is reported in the visible text — improvements of ~1-3 points on small benchmark test sets (e.g., GPQA at 448 questions) may not be robust to random seeds or evaluation variance.
Negative example construction for contrastive learning relies on prompting an unspecified LLM to generate plausible incorrect alternatives, introducing an implicit dependency on an external model and making exact reproduction difficult.
The ablation study evaluates components on Qwen2.5-7B-Instruct as the primary model; interaction effects (e.g., does EMA+Agreement always dominate?) may not generalize identically across the smaller 0.5B/1.5B models or different families, and not all component combinations are tested for all model sizes.
Hyperparameters for agreement (K), EMA (β), clipping (κ), and the contrastive margin (γ) are sensitive to task and model scale (the paper shows non-monotonic behavior with K), but the paper does not provide a principled selection strategy, limiting out-of-the-box applicability.
OOD evaluation is limited to one held-out benchmark per training domain (GPQA for science, HumanEval for code); broader distribution shift (e.g., domain changes in tool use or commonsense, adversarial prompting, length distribution shift) is not tested.
Code, model weights, and detailed hyperparameter configurations are not confirmed as publicly released in the visible text, limiting reproducibility for practitioners wanting to replicate or extend the framework.
The paper focuses on instruction-tuned base models (already post-trained); behavior of UniSD applied to raw pretrained models or in RLHF/RLAIF pipelines is not studied.

Open questions / follow-ons

Can the multi-teacher agreement mechanism be extended to preference-based or reward-model-free RLHF settings, where correctness signals are binary or sparse rather than token-level log-likelihoods?
The paper shows non-monotonic gains with number of auxiliary contexts K — is there a principled information-theoretic criterion (e.g., mutual information between contexts and trajectory likelihood) for selecting K and context construction strategy without per-task tuning?
UniSD* is evaluated only on models that are already instruction-tuned; how does reliability-aware self-distillation interact with continued pretraining or domain-adaptive pretraining where ground-truth correctness signals are unavailable?
The base-distribution retention metric (PPLret) shows that SFT causes significant drift while self-distillation preserves base behavior — does this retention advantage translate to better multi-task performance or reduced forgetting when UniSD is applied sequentially across multiple domains?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, UniSD is most directly relevant in the context of training behavioral classifiers or challenge-response scoring models where labeled adversarial data is scarce or expensive to obtain from stronger external annotators. The core insight — that reliability-weighted, on-policy self-distillation can outperform static imitation learning while preserving base model distribution — has practical implications for fine-tuning bot-detection models on self-generated hard negatives without requiring a more capable external oracle. The token-level contrastive learning component is particularly relevant: bot-defense models often need to distinguish between near-identical human and bot trajectories (e.g., mouse movement sequences, typing patterns, CAPTCHA solving traces) that share surface structure but differ in key behavioral signals, which is exactly the regime where margin-based contrastive supervision over fine-grained positions is expected to help.

The base-distribution retention finding is also operationally important: standard SFT is shown to cause distributional drift on the paper's benchmarks (PPLret rising from 1.14 to 1.68 on Qwen2.5-7B). In a bot-defense context, analogous drift from SFT on bot-labeled data would plausibly manifest as a model that overfits to labeled attack patterns seen in training while degrading on novel bot behaviors. The EMA teacher mechanism, which reduces this drift by 33.9% relative to SFT, suggests a practical training recipe for iteratively updating detection models as bot behavior evolves without catastrophic forgetting of previously learned patterns. However, practitioners should note that all evaluations are on NLP benchmarks (QA, code, tool use), not behavioral bot-detection tasks, so direct performance transfer claims cannot be made without domain-specific validation.

Cite

bibtex

@article{arxiv2605_06597,
  title={UniSD: Towards a Unified Self-Distillation Framework for Large Language Models},
  author={Yiqiao Jin and Yiyang Wang and Lucheng Fu and Yijia Xiao and Yinyi Luo and Haoxin Liu and B. Aditya Prakash and Josiah Hester and Jindong Wang and Srijan Kumar},
  journal={arXiv preprint arXiv:2605.06597},
  year={2026},
  url={https://arxiv.org/abs/2605.06597}
}
