On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Abstract

On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike the ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and science question-answering benchmarks, self-distilled models match or exceed RL on average performance but exhibit substantially lower functional and semantic diversity, failing on out-of-distribution settings that require diverse strategies.

Technical Analysis & Implementation

Overview

This paper identifies a hidden cost of on-policy self-distillation with sampled demonstrations: reduced output diversity and flattened pass@k curves, despite improved pass@1 accuracy. The authors trace this to compounding biases in the feedback loop and provide a theoretical analysis showing that the optimal self-distillation policy distorts the base distribution by a pointwise conditional mutual information term.

Methodology

Self-Distillation with Sampled Demonstrations

Training alternates between generating rollouts and distilling feedback. Given an input $x$, the student generates a set of rollouts $\{y_i\}_{i=1}^N$. The teacher (the same model, but conditioned on a correct demonstration $y^*$) scores each rollout by its length-normalized token-level log-probability: $\text{score}(y_i) = \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log p_\theta(y_{i,t} \mid x, y^*, y_{i,<t})$. The student is then fine-tuned to maximize the score-weighted likelihood:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i=1}^N \text{score}(y_i) \cdot \log p_\theta(y_i | x) \right]$$

Theoretical Analysis

The optimal self-distillation policy $\pi_{\text{SD}}$ tilts the base policy $\pi_{\text{base}}$ by:

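$$\pi_{\text{SD}}(y|x) \propto \pi_{\text{base}}(y|x) \cdot \exp\left( \mathbb{E}_{y^* \sim \pi_{\text{base}}(\cdot|x)} [ \text{PMI}(y; y^*|x) ] \right)$$

where $\text{PMI}(y; y^*|x)$ is the pointwise conditional mutual information between the student's rollout $y$ and the correct rollout $y^*$ used as context, given $x$. This tilt can amplify existing probability gaps, concentrating mass on already-dominant modes and reducing diversity. In contrast, ideal on-policy RL with reward $R(y)$ preserves probability ratios among equally correct rollouts.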
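The difference between the two tilts can be illustrated with a small numerical sketch. This is not the paper's code: `tilt` is a hypothetical helper that reweights a base distribution by `exp(score)` and renormalizes, the RL case assumes the standard exponential-reward form of a KL-regularized optimum, and the `pmi_like` scores are made-up values chosen only to favor the already-dominant rollout.

```python
import math

def tilt(base, score):
    """Return the distribution proportional to base[i] * exp(score[i])."""
    weights = [p * math.exp(s) for p, s in zip(base, score)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy base policy over three rollouts; the first two are correct.
base = [0.6, 0.3, 0.1]

# RL-style tilt (assumed form): equally correct rollouts get the same reward,
# so the common factor exp(reward) cancels in their probability ratio.
reward = [1.0, 1.0, 0.0]
rl_policy = tilt(base, reward)

# Hypothetical PMI-like scores (illustrative only): the dominant rollout
# receives a higher score than the other correct rollout.
pmi_like = [0.5, -0.2, -1.0]
sd_policy = tilt(base, pmi_like)

base_ratio = base[0] / base[1]
rl_ratio = rl_policy[0] / rl_policy[1]
sd_ratio = sd_policy[0] / sd_policy[1]

print(f"base ratio: {base_ratio:.2f}")
print(f"RL ratio:   {rl_ratio:.2f}")
print(f"SD ratio:   {sd_ratio:.2f}")
```

Under these assumed scores, the ratio between the two correct rollouts is unchanged by the RL-style tilt and grows under the score-dependent tilt, which is the qualitative effect described above.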

Code Snippet (Illustrative)

import torch
import torch.nn.functional as F

# Assumption: model(x, y, context=...) is a teacher-forced forward pass that
# returns logits of shape (len(y), vocab_size), where row t predicts token y[t].
# Assumption: model.generate returns a list of 1-D LongTensors of token ids.
def sequence_log_probs(model, x, y, context=None):
    # Per-token log-probabilities of rollout y under the model
    logits = model(x, y, context=context)
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)

def self_distill_step(model, x, correct_demo, n_rollouts=8):
    # Generate student rollouts (no gradient through sampling)
    with torch.no_grad():
        rollouts = model.generate(x, num_return_sequences=n_rollouts)

    # Teacher: the same model conditioned on correct_demo.
    # Each score is the mean token log-probability of one rollout.
    with torch.no_grad():
        scores = torch.stack([
            sequence_log_probs(model, x, y, context=correct_demo).mean()
            for y in rollouts
        ])

    # Student: log-likelihood of each rollout without the demonstration
    student_ll = torch.stack([
        sequence_log_probs(model, x, y).sum() for y in rollouts
    ])

    # Loss: negative score-weighted log-likelihood over all rollouts.
    # The softmax over scores is a normalization choice of this sketch;
    # the objective above uses the raw scores as weights.
    weights = scores.softmax(dim=0)
    return -(weights * student_ll).sum()

Key Results

Self-distillation matches or exceeds RL on pass@1 but shows significantly lower diversity (measured by distinct n-grams and task-specific functional diversity).
Pass@k curves flatten: generating more rollouts yields little or no further gain in pass@k accuracy.
On out-of-distribution tasks requiring diverse strategies, self-distilled models fail while RL models succeed.
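The two kinds of measurement mentioned above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `pass_at_k` uses the unbiased estimator commonly used in code-generation evaluation, 1 - C(n-c, k) / C(n, k), and `distinct_n` uses the usual ratio of unique to total n-grams. The source does not state which exact estimators the authors used, so both function names and definitions are assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled rollouts, of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Every size-k subset contains at least one correct rollout.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def distinct_n(rollouts, n):
    """Fraction of unique n-grams among all n-grams in tokenized rollouts."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in rollouts
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Example: 10 rollouts, 5 correct
curve = [pass_at_k(10, 5, k) for k in (1, 2, 4)]

# Example: two identical rollouts versus two disjoint rollouts
low_diversity = distinct_n([[1, 2, 3], [1, 2, 3]], 2)
high_diversity = distinct_n([[1, 2, 3], [4, 5, 6]], 2)
```

A flattened pass@k curve corresponds to `pass_at_k` changing very little as `k` grows, and reduced lexical diversity corresponds to a lower `distinct_n` across rollouts for the same input.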

Implications

Practitioners should be cautious about using on-policy self-distillation when output diversity is critical (e.g., exploration, creative generation). The paper suggests that on-policy RL with a separate reward model preserves diversity better.
