DemoPSD: Disagreement-Modulated Policy Self-Distillation
By Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Shuang Qiu, Linqi Song
"Selectively blends teacher and student distributions via a reverse-KL barycenter target, mitigating privileged info leakage and preserving exploration in LLM self-distillation."
Abstract
On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce **DemoPSD**, a novel framework that resolves such problems through the idea of *selective adoption of teacher guidance*. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a *reverse-KL barycenter target*, a weighted geometric combination of the teacher and student distributions, that naturally balances learning from the teacher with preserving the student's own reasoning capacity. We measure the difference between their distributions and use such a discrepancy to adaptively control the blending at each token position. We provably show that DemoPSD achieves **(1)** *leakage attenuation*, i.e., effective mitigation of privileged information leakage; and **(2)** *exploration preservation*, i.e., preservation of exploration capacity under dense token-level distillation. Extensive experiments on SciKnowEval across four scientific fields show that DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and robustly generalizing to out-of-distribution GPQA benchmarks.
Technical Analysis & Implementation
Overview
DemoPSD addresses two key issues in on-policy self-distillation (OPSD) for LLMs: privileged information leakage and suppression of exploration. The core innovation is a token-level adaptive blending of teacher and student distributions using a reverse-KL barycenter target, which balances learning from the teacher with preserving the student's own reasoning capacity.
Methodology
Reverse-KL Barycenter Target
Let $p_t$ and $p_s$ denote the teacher and student distributions over tokens. At each position $i$, DemoPSD targets a weighted geometric combination:
$$ q_i \propto p_{t,i}^{\lambda_i} \cdot p_{s,i}^{1-\lambda_i} $$
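where $\lambda_i \in [0,1]$ controls the blending: $\lambda_i = 1$ recovers the teacher and $\lambda_i = 0$ recovers the student. This normalized geometric mixture is the minimizer of the weighted reverse-KL objective $\lambda_i\,\text{KL}(q \,||\, p_{t,i}) + (1-\lambda_i)\,\text{KL}(q \,||\, p_{s,i})$ over distributions $q$, which is why it is called a reverse-KL barycenter. Varying $\lambda_i$ from 0 to 1 traces the trade-off between the two divergences.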
Disagreement-Modulated Blending
$\lambda_i$ is set based on the disagreement between $p_t$ and $p_s$, measured via Jensen-Shannon divergence:
$$ d_i = \text{JSD}(p_{t,i} \,||\, p_{s,i}) $$
A high $d_i$ is treated as a sign that the teacher may be overconfident or leaking privileged information, so $\lambda_i$ is reduced to rely more on the student. Concretely,
$$ \lambda_i = \sigma\left( \frac{\tau - d_i}{\gamma} \right) $$
where $\sigma$ is the sigmoid, $\tau$ a threshold, and $\gamma$ a temperature, so that $\lambda_i$ decreases as $d_i$ grows. This allows selective adoption: only tokens where the teacher and student agree (low disagreement) are heavily distilled.
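To make the gating concrete, the sketch below computes the per-token JSD and the resulting blending weight for one token where teacher and student roughly agree and one where they clash. It assumes natural-log JSD and a gate that decreases with disagreement, $\sigma((\tau - d_i)/\gamma)$, as the prose describes. The toy distributions and function names are illustrative only.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats; bounded between 0 and ln 2."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def blend_weight(d, tau=0.5, gamma=0.1):
    """Assumed gate: sigmoid((tau - d) / gamma), decreasing in disagreement d."""
    return 1.0 / (1.0 + math.exp(-(tau - d) / gamma))


# Hypothetical token where teacher and student roughly agree.
agree_teacher, agree_student = [0.6, 0.3, 0.1], [0.5, 0.35, 0.15]
# Hypothetical token where the teacher is confident in a choice the student rejects.
clash_teacher, clash_student = [0.98, 0.01, 0.01], [0.01, 0.01, 0.98]

d_agree = jsd(agree_teacher, agree_student)
d_clash = jsd(clash_teacher, clash_student)
lam_agree = blend_weight(d_agree)
lam_clash = blend_weight(d_clash)

print("agree: d =", d_agree, "lambda =", lam_agree)
print("clash: d =", d_clash, "lambda =", lam_clash)
```

Under these assumptions the agreeing token receives a weight close to 1 (mostly teacher), while the clashing token falls below 0.5 (mostly student).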
Training Objective
The student is trained to minimize:
$$ \mathcal{L} = -\sum_i \sum_{y} q_i(y) \log p_{s,i}(y) $$
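which is the cross-entropy against the barycenter target. Note that $q_i$ depends on $p_s$ both directly, through the factor $p_{s,i}^{1-\lambda_i}$, and indirectly, through $\lambda_i$, so the gradient flows through the target as well as through the student's log-probabilities.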
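The per-token loss decomposes as $H(q_i) + \text{KL}(q_i \,||\, p_{s,i})$, which makes the limiting cases easy to read off. The sketch below evaluates it for a few fixed values of $\lambda$ on toy distributions. It treats the target as a plain vector of numbers and makes no claim about how the paper handles gradients; all names and values are illustrative.

```python
import math


def barycenter(p_t, p_s, lam):
    """Normalized weighted geometric mean of teacher and student."""
    unnorm = [a ** lam * b ** (1.0 - lam) for a, b in zip(p_t, p_s)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def cross_entropy(q, p):
    """-sum_y q(y) log p(y): the per-token loss with target q and student p."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)


def entropy(q):
    return cross_entropy(q, q)


def kl(q, p):
    return cross_entropy(q, p) - entropy(q)


p_t = [0.7, 0.2, 0.1]  # toy teacher distribution (hypothetical)
p_s = [0.3, 0.4, 0.3]  # toy student distribution (hypothetical)

losses = {}
for lam in (0.0, 0.5, 1.0):
    q = barycenter(p_t, p_s, lam)
    losses[lam] = cross_entropy(q, p_s)
    print("lambda =", lam, "loss =", losses[lam], "KL(q || p_s) =", kl(q, p_s))
```

At $\lambda = 0$ the target coincides with the student, so the KL term is zero and the loss equals the student's own entropy. At $\lambda = 1$ the loss is ordinary cross-entropy distillation against the teacher.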
Theoretical Guarantees
DemoPSD provably attenuates privileged information leakage because the target distribution is anchored to the student's own distribution wherever $d_i$ is high, preventing the student from encoding answer-dependent shortcuts. Furthermore, because $q$ retains a fraction of the student's original distribution, exploration capacity is preserved, as quantified by the entropy of the student's training distribution.
Implementation Details
- The teacher and student are the same model with different conditioning: the teacher is conditioned on privileged information that is unavailable at test time, while the student sees only the test-time input.
- Disagreement is computed per token from the logits, i.e., the final-layer activations before the softmax.
- The hyperparameters $\tau$ and $\gamma$ control the sharpness of blending; typical values: $\tau=0.5$, $\gamma=0.1$.
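One consequence of these values is worth checking. With natural logarithms the JSD is bounded above by $\ln 2 \approx 0.693$, so the gate can only reach part of the $(0, 1)$ interval. The sketch below assumes natural-log JSD and a gate of the form $\sigma((\tau - d)/\gamma)$; the sweep over $\gamma$ is illustrative and not a setting reported in the paper.

```python
import math


def blend_weight(d, tau=0.5, gamma=0.1):
    """Assumed gate: sigmoid((tau - d) / gamma), decreasing in disagreement d."""
    return 1.0 / (1.0 + math.exp(-(tau - d) / gamma))


JSD_MAX = math.log(2.0)  # upper bound of JSD measured in nats

# Range of the gate for the typical values tau = 0.5, gamma = 0.1.
lam_max = blend_weight(0.0)      # perfect agreement
lam_min = blend_weight(JSD_MAX)  # maximal disagreement
print("typical range:", lam_min, "to", lam_max)

# A smaller gamma sharpens the transition around tau; a larger one flattens it.
for gamma in (0.05, 0.1, 0.2):
    lo = blend_weight(JSD_MAX, gamma=gamma)
    hi = blend_weight(0.0, gamma=gamma)
    print("gamma =", gamma, "range:", lo, "to", hi)
```

Under these assumptions the weight never reaches zero, so even maximally disagreeing tokens retain a small teacher contribution unless $\tau$ is lowered or $\gamma$ is reduced.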
Code Snippet
```python
import math

import torch
import torch.nn.functional as F


def compute_barycenter_target(teacher_logits, student_logits, tau=0.5, gamma=0.1):
    # teacher_logits, student_logits: (batch, seq_len, vocab)
    # Work in log space to avoid log(0) when the softmax underflows.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # Jensen-Shannon divergence per token: 0.5 * [KL(p_t || m) + KL(p_s || m)]
    log_m = torch.logaddexp(teacher_logp, student_logp) - math.log(2.0)
    jsd = 0.5 * (
        F.kl_div(log_m, teacher_logp, reduction='none', log_target=True).sum(-1)
        + F.kl_div(log_m, student_logp, reduction='none', log_target=True).sum(-1)
    )

    # Blending coefficient: high disagreement -> small lambda (rely on the student)
    lambda_i = torch.sigmoid((tau - jsd) / gamma).unsqueeze(-1)

    # Reverse-KL barycenter: normalized weighted geometric mean of the two distributions
    log_q = lambda_i * teacher_logp + (1 - lambda_i) * student_logp
    return F.softmax(log_q, dim=-1)
```
Results
On SciKnowEval (four scientific fields), DemoPSD outperforms GRPO and SDPO in both in-domain accuracy and cross-domain generalization to GPQA. Training entropy remains higher than baseline OPSD, confirming preserved exploration. Ablations show that fixing $\lambda=0.5$ degrades performance, validating the advantage of adaptive blending.
Conclusion
DemoPSD offers a principled way to mitigate overfitting and information leakage in self-distillation for LLMs, with theoretical guarantees and empirical gains.