Published: June 24, 2026

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

By Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville

Research TL;DR

"On-policy self-distillation with sampled demonstrations boosts pass@1 but reduces diversity and flattens pass@k curves by amplifying biases via a conditional mutual information term."

Abstract

On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike the ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and science question-answering benchmarks, self-distilled models match or exceed RL on average performance but exhibit substantially lower functional and semantic diversity, failing on out-of-distribution settings that require diverse strategies.

Technical Analysis & Implementation

Overview

This paper identifies a hidden cost of on-policy self-distillation with sampled demonstrations: reduced output diversity and flattened pass@k curves, despite improved pass@1 accuracy. The authors trace this to compounding biases in the feedback loop and provide a theoretical analysis showing that the optimal self-distillation policy distorts the base distribution by a pointwise conditional mutual information term.

Methodology

Self-Distillation with Sampled Demonstrations

The training alternates between generating rollouts and distilling feedback. The student generates a set of rollouts $\{y_i\}_{i=1}^N$ given input $x$. The teacher (same model, but conditioned on a correct demonstration $y^*$) scores each rollout using token-level log-probabilities: $\text{score}(y_i) = \frac{1}{|y_i|} \sum_{t} \log p_\theta(y_{i,t} | x, y^*, y_{i,<t})$. The student is then fine-tuned to maximize the score-weighted likelihood:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i=1}^N \text{score}(y_i) \cdot \log p_\theta(y_i | x) \right]$$

Theoretical Analysis

The optimal self-distillation policy $\pi_{\text{SD}}$ tilts the base policy $\pi_{\text{base}}$ by:

$$\pi_{\text{SD}}(y|x) \propto \pi_{\text{base}}(y|x) \cdot \exp\left( \mathbb{E}_{y^* \sim \pi_{\text{base}}(\cdot|x)} [ \text{PMI}(y; y^*|x) ] \right)$$

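where PMI is the pointwise conditional mutual information between the student's rollout $y$ and the correct rollout $y^*$ used as context, given $x$. In its standard form this is the log-ratio of the probability of $y$ with and without $y^*$ in context. Because the demonstrations are sampled from the model itself, this term can amplify existing probability gaps, concentrating mass on already-dominant modes and reducing diversity. In contrast, the ideal optimal on-policy RL policy with reward $R(y)$ preserves probability ratios among equally correct rollouts.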
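The contrast between the two tilts can be checked on a toy example. The sketch below is not from the paper: the three-rollout base policy, the `teacher` mixture form (copy the demonstration with weight `alpha`, otherwise fall back to the base policy), and the KL-regularized RL optimum $\pi_{\text{base}} \cdot \exp(R/\beta)$ are all assumptions chosen only to make the mechanism visible.

```python
import math

# Toy base policy over three rollouts for one prompt x (hypothetical numbers).
base = {"A": 0.6, "B": 0.3, "C": 0.1}
correct = {"A", "B"}  # A and B are equally correct; C is wrong

# Demonstrations y* are correct rollouts sampled from the base policy.
z = sum(base[y] for y in correct)
demo_dist = {y: base[y] / z for y in correct}

def teacher(y, demo, alpha=0.5):
    """Assumed teacher p(y | x, y*): copy the demonstration with weight alpha,
    otherwise fall back to the base policy. Illustrative form only."""
    return alpha * (1.0 if y == demo else 0.0) + (1.0 - alpha) * base[y]

def expected_pmi(y):
    # E_{y*}[ log p(y | x, y*) - log p(y | x) ]
    return sum(w * math.log(teacher(y, d) / base[y]) for d, w in demo_dist.items())

def normalize(weights):
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Self-distillation tilt: base probability times exp(expected PMI).
pi_sd = normalize({y: base[y] * math.exp(expected_pmi(y)) for y in base})

# Assumed KL-regularized RL tilt: base probability times exp(R / beta), R = 1 if correct.
beta = 0.5
pi_rl = normalize(
    {y: base[y] * math.exp((1.0 if y in correct else 0.0) / beta) for y in base}
)

ratio_base = base["A"] / base["B"]
ratio_sd = pi_sd["A"] / pi_sd["B"]
ratio_rl = pi_rl["A"] / pi_rl["B"]
print(f"A:B ratio  base={ratio_base:.2f}  SD={ratio_sd:.2f}  RL={ratio_rl:.2f}")
```

Under these assumed numbers, the RL tilt leaves the A:B ratio at 2.0 because both rollouts receive the same reward, while the self-distillation tilt raises it to about 2.36: the dominant rollout A is sampled as the demonstration more often, so it collects a larger expected PMI bonus.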

Code Snippet (Illustrative)

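import torch
import torch.nn.functional as F

# Assumptions (illustrative interface, not taken from the paper):
#   model(input_ids) returns logits of shape (1, T, V) for a (1, T) LongTensor
#   model.generate(x, num_return_sequences=n) returns n 1-D LongTensors (continuations only)
#   x and correct_demo are non-empty 1-D LongTensors of token ids
def token_log_probs(model, prefix, y):
    # Log-probability of each token of y given prefix and the preceding tokens of y
    input_ids = torch.cat([prefix, y]).unsqueeze(0)
    logits = model(input_ids)[0, len(prefix) - 1 : -1]  # positions that predict y
    return F.log_softmax(logits, dim=-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)

def self_distill_step(model, x, correct_demo, n_rollouts=8):
    # Generate student rollouts and score them with the teacher, without gradients
    with torch.no_grad():
        rollouts = list(model.generate(x, num_return_sequences=n_rollouts))
        # Teacher: same model, additionally conditioned on the correct demonstration
        teacher_prefix = torch.cat([x, correct_demo])
        scores = torch.stack(
            [token_log_probs(model, teacher_prefix, y).mean() for y in rollouts]
        )

    # Student log-likelihood of each rollout given x only (gradients flow here)
    student_ll = torch.stack([token_log_probs(model, x, y).sum() for y in rollouts])

    # Student loss: negative score-weighted log-likelihood.
    # The softmax turns the (negative) mean log-prob scores into positive weights.
    weights = scores.softmax(dim=0)
    return -(weights * student_ll).sum()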

Key Results

  • Self-distillation matches or exceeds RL on pass@1 but shows significantly lower diversity (measured by distinct n-grams and task-specific functional diversity).
  • Pass@k curves flatten: generating more rollouts yields little or no further gain in pass@k.
  • On out-of-distribution tasks requiring diverse strategies, self-distilled models fail while RL models succeed.
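The two quantities behind these results, pass@k and n-gram diversity, can be sketched as follows. The summary does not give the paper's exact metric definitions, so this uses the widely used unbiased pass@k estimator and a simple distinct-n ratio over whitespace tokens as stand-ins; the per-problem correct counts are hypothetical and exist only to show what a flat versus a rising curve looks like.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled rollouts of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)

def distinct_n(texts, n=2):
    """Fraction of unique n-grams among all n-grams in whitespace-tokenized texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count for each problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical correct counts out of n = 16 rollouts on four problems.
# A collapsed model always solves the same problems; a diverse one spreads its successes.
collapsed = [16, 16, 0, 0]
diverse = [8, 8, 4, 4]

for k in (1, 4, 16):
    print(k, mean_pass_at_k(collapsed, 16, k), mean_pass_at_k(diverse, 16, k))
```

In this made-up setting the collapsed model has the higher pass@1 (0.5 versus 0.375) but stays at 0.5 for every k, while the diverse model reaches 1.0 at k = 16, which is the flattening pattern described above.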

Implications

Practitioners should be cautious about using on-policy self-distillation when output diversity is critical (e.g., exploration, creative generation). The paper suggests that on-policy RL with a separate reward model preserves diversity better.
