On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
By Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville
"On-policy self-distillation with sampled demonstrations boosts pass@1 but reduces diversity and flattens pass@k curves by amplifying biases via a conditional mutual information term."
Abstract
On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike the ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and science question-answering benchmarks, self-distilled models match or exceed RL on average performance but exhibit substantially lower functional and semantic diversity, failing on out-of-distribution settings that require diverse strategies.
Technical Analysis & Implementation
Overview
This paper identifies a hidden cost of on-policy self-distillation with sampled demonstrations: reduced output diversity and flattened pass@k curves, despite improved pass@1 accuracy. The authors trace this to compounding biases in the feedback loop and provide a theoretical analysis showing that the optimal self-distillation policy distorts the base distribution by a pointwise conditional mutual information term.
Methodology
Self-Distillation with Sampled Demonstrations
The training alternates between generating rollouts and distilling feedback. The student generates a set of rollouts $\{y_i\}_{i=1}^N$ given input $x$. The teacher (same model, but conditioned on a correct demonstration $y^*$) scores each rollout using token-level log-probabilities: $\text{score}(y_i) = \frac{1}{|y_i|} \sum_{t} \log p_\theta(y_{i,t} | x, y^*, y_{i,<t})$. The student is then fine-tuned to maximize the score-weighted likelihood:
$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i=1}^N \text{score}(y_i) \cdot \log p_\theta(y_i | x) \right]$$
Theoretical Analysis
The optimal self-distillation policy $\pi_{\text{SD}}$ tilts the base policy $\pi_{\text{base}}$ by:
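$$\pi_{\text{SD}}(y|x) \propto \pi_{\text{base}}(y|x) \cdot \exp\left( \mathbb{E}_{y^* \sim \pi_{\text{base}}(\cdot|x)} [ \text{PMI}(y; y^*|x) ] \right)$$
where $\text{PMI}(y; y^*|x) = \log \frac{\pi_{\text{base}}(y|x, y^*)}{\pi_{\text{base}}(y|x)}$ is the pointwise conditional mutual information between the student's rollout $y$ and the demonstration $y^*$ used as context. This tilt can amplify existing probability gaps and concentrate mass on already-dominant modes, which reduces diversity. In contrast, the optimal on-policy RL policy with reward $R(y)$ preserves the probability ratios among equally correct rollouts.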
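The contrast between the two tilts can be illustrated on a toy distribution. The sketch below is not taken from the paper: the three-rollout base policy, the mixing weight `alpha`, the "copy the demonstration" teacher in `teacher_prob`, and the KL-regularized form of the RL optimum with temperature `beta` are all assumptions chosen to keep the arithmetic transparent.

```python
import math

# Hypothetical toy setup: three equally correct rollouts for a single prompt x.
base = [0.6, 0.3, 0.1]  # pi_base(y | x)
alpha = 0.5             # assumed strength of the teacher's pull toward its demonstration


def teacher_prob(y, demo):
    """Assumed teacher p(y | x, y*): the base policy mixed with a copy of the demo."""
    return (1 - alpha) * base[y] + alpha * (1.0 if y == demo else 0.0)


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def expected_pmi(y):
    """E over y* ~ pi_base of log p(y | x, y*) - log pi_base(y | x)."""
    return sum(
        base[demo] * (math.log(teacher_prob(y, demo)) - math.log(base[y]))
        for demo in range(len(base))
    )


# Self-distillation tilt: pi_SD(y) proportional to pi_base(y) * exp(E[PMI]).
sd = normalize([base[y] * math.exp(expected_pmi(y)) for y in range(len(base))])

# Assumed KL-regularized RL optimum: pi_RL(y) proportional to pi_base(y) * exp(R(y) / beta).
# Every rollout is correct, so R(y) = 1 for all y and the factor cancels on normalization.
beta = 0.1
rl = normalize([p * math.exp(1.0 / beta) for p in base])

print("base:", [round(p, 3) for p in base], "entropy:", round(entropy(base), 3))
print("SD:  ", [round(p, 3) for p in sd], "entropy:", round(entropy(sd), 3))
print("RL:  ", [round(p, 3) for p in rl], "entropy:", round(entropy(rl), 3))
```

Under these assumptions, the self-distillation tilt widens the gap between the two most likely rollouts and lowers entropy, while the RL solution leaves the base ratios unchanged because all three rollouts receive the same reward.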
Code Snippet (Illustrative)
```python
import torch
import torch.nn.functional as F

# Assumed interface (illustrative, not a real library API):
#   model.generate(x, num_return_sequences=n) -> list of 1-D LongTensors of token ids
#   model(x, y, context=c) -> logits of shape (len(y), vocab_size), where row t
#   predicts y[t] from x, the optional context c, and the prefix y[:t]


def sequence_log_probs(model, x, y, context=None):
    """Per-token log-probabilities of rollout y under the model."""
    logits = model(x, y, context=context)
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)


def self_distill_step(model, x, correct_demo, n_rollouts=8):
    # Generate student rollouts and score them with the teacher, without gradients
    with torch.no_grad():
        rollouts = model.generate(x, num_return_sequences=n_rollouts)
        # Teacher score: mean token log-prob of each rollout, conditioned on correct_demo
        scores = torch.stack([
            sequence_log_probs(model, x, y, context=correct_demo).mean()
            for y in rollouts
        ])

    # Student log-likelihood of each rollout, without the demonstration
    student_ll = torch.stack([
        sequence_log_probs(model, x, y).sum() for y in rollouts
    ])

    # Negative score-weighted log-likelihood. The softmax turns the (negative)
    # scores into positive weights; the objective above uses the raw scores.
    loss = -(scores.softmax(dim=0) * student_ll).sum()
    return loss
```
Key Results
- Self-distillation matches or exceeds RL on pass@1 but shows significantly lower diversity (measured by distinct n-grams and task-specific functional diversity).
- Pass@k curves saturate: generating more rollouts yields little or no gain in pass@k.
- On out-of-distribution tasks requiring diverse strategies, self-distilled models fail while RL models succeed.
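The two quantities behind these results can be sketched in a few lines. This is an illustration under assumptions, not the paper's evaluation code: `pass_at_k` uses the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ (the source does not state which estimator it uses), `distinct_n` is one common way to compute a distinct n-gram ratio, and the per-problem correct counts are made up to show what a flat pass@k curve looks like.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled rollouts of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def distinct_n(rollouts, n=2):
    """Fraction of n-grams that are unique across a set of tokenized rollouts."""
    grams = [
        tuple(tokens[i:i + n])
        for tokens in rollouts
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(grams)) / len(grams) if grams else 0.0


# Hypothetical correct counts out of n = 16 rollouts on four problems.
n = 16
low_diversity = [16, 16, 0, 0]  # always right or always wrong
high_diversity = [8, 8, 4, 4]   # less reliable, but covers every problem

for k in (1, 2, 4, 8):
    print(
        k,
        round(mean_pass_at_k(low_diversity, n, k), 3),
        round(mean_pass_at_k(high_diversity, n, k), 3),
    )
```

In this made-up example the low-diversity model wins at pass@1 but its curve stays flat as `k` grows, because extra rollouts on a problem it always fails never help, while the more diverse model overtakes it at larger `k`.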
Implications
Practitioners should be cautious of using on-policy self-distillation when output diversity is critical (e.g., exploration, creative generation). The paper suggests that on-policy RL with a separate reward model preserves diversity better.