DanceOPD: On-Policy Generative Field Distillation
By Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua
"DanceOPD uses on-policy distillation to compose multiple image generation capabilities (T2I, local/global editing) by routing samples to expert velocity fields trained on student rollouts, avoiding interference while leveraging flow-matching."
Abstract
Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
Technical Analysis & Implementation
Overview§
DanceOPD is an on-policy generative field distillation framework for flow-matching models. A single student model learns several image generation capabilities (text-to-image, local editing, global editing, realism enhancement) by distilling from separate expert velocity fields, and it can also absorb operator-defined fields such as classifier-free guidance. The key idea is that the student is trained on states from its own rollouts (on-policy) rather than on fixed data, so each expert field is queried where the student actually goes. The reported effect is stronger target capabilities with anchor generation quality preserved.
Core Methodology§
Flow-Matching Preliminaries§
Flow-matching models define a velocity field $v(x_t, t)$ that transports samples from a noise distribution $p_1$ to a data distribution $p_0$ along a probability flow. The training objective for a single capability is: $$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1} \| v_{\theta}(x_t, t) - u_t(x_t | x_0, x_1) \|^2$$ where $x_0 \sim p_0$ is a data sample, $x_1 \sim p_1$ is a noise sample, and $u_t$ is the target velocity. For the common linear interpolation $x_t = (1 - t)\,x_0 + t\,x_1$, the target is $u_t = x_1 - x_0$.
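To make the objective concrete, the dependency-free sketch below assumes the linear path $x_t = (1 - t)\,x_0 + t\,x_1$, whose conditional target is $u_t = x_1 - x_0$, and a toy 1-D dataset in which every data sample equals `MU`. The names `MU`, `oracle_velocity`, and `fm_loss` are illustrative assumptions and not part of DanceOPD.

```python
import random

MU = 2.0  # toy "dataset": every data sample x_0 equals MU (assumption for illustration)


def interpolate(x0, x1, t):
    """Linear path between data x0 (t = 0) and noise x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Conditional target u_t for the linear path: d/dt x_t = x1 - x0."""
    return x1 - x0


def oracle_velocity(x_t, t):
    """Exact velocity field for the point-mass dataset: (x_t - MU) / t."""
    return (x_t - MU) / t


def fm_loss(velocity_fn, n_samples=1000, seed=0):
    """Monte Carlo estimate of E || v(x_t, t) - u_t ||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.05, 1.0)   # stay away from t = 0, where the oracle divides by zero
        x0 = MU                      # data sample from p_0
        x1 = rng.gauss(0.0, 1.0)     # noise sample from p_1
        x_t = interpolate(x0, x1, t)
        total += (velocity_fn(x_t, t) - target_velocity(x0, x1)) ** 2
    return total / n_samples


print(fm_loss(oracle_velocity))        # close to zero (floating-point error only)
print(fm_loss(lambda x_t, t: 0.0))     # a field that ignores its input scores much worse
```

In this toy the oracle matches the conditional target exactly because the data distribution is a single point. With real data the minimizer is the marginal velocity and the loss does not reach zero.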
On-Policy Distillation§
DanceOPD treats each capability (e.g., T2I, editing) as a separate velocity field $v_{\text{cap}}$ over the shared flow state space. For each sample, a routing mechanism selects one capability field to distill from (e.g., based on a task label or a learned router); call the selected field $v_{\text{expert}}$. The student $v_{\theta}$ then minimizes: $$\mathcal{L}_{\text{OPD}} = \mathbb{E}_{t, x_t^{\text{rollout}}} \| v_{\theta}(x_t^{\text{rollout}}, t) - v_{\text{expert}}(x_t^{\text{rollout}}, t) \|^2$$ where $x_t^{\text{rollout}}$ is a state reached on the student's own generation trajectory (on-policy) and queried at a low noise level, and the expert is held fixed, so gradients flow only through $v_{\theta}$.
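The order of operations in this loss (roll out the student, then query the expert on the resulting state) can be sketched with scalar toy fields. Everything here is a simplifying assumption: `make_field` builds a 1-D field of the form $(x - c)/t$, the rollout uses plain Euler steps, and `t_query=0.2` stands in for a "low-noise" query time. None of these names or values come from the DanceOPD implementation.

```python
import random


def make_field(target):
    """Toy velocity field whose flow carries noise at t = 1 to `target` at t = 0."""
    return lambda x, t: (x - target) / t


def student_rollout(student, x, t_start, t_query, n_steps):
    """Integrate the student's own field with Euler steps from t_start down to t_query."""
    dt = (t_start - t_query) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x - dt * student(x, t)   # moving toward data means decreasing t
        t -= dt
    return x


def opd_loss(student, expert, t_query=0.2, n_steps=4, batch=256, seed=0):
    """Velocity MSE between student and expert on student-induced states."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(batch):
        x_noise = rng.gauss(0.0, 1.0)                          # start from p_1
        x_roll = student_rollout(student, x_noise, 1.0, t_query, n_steps)
        target = expert(x_roll, t_query)                       # expert acts as a fixed target
        total += (student(x_roll, t_query) - target) ** 2
    return total / batch


student = make_field(0.0)   # student currently transports noise to 0
expert = make_field(1.0)    # expert transports noise to 1
print(opd_loss(student, expert))   # positive: the fields disagree on the rollout states
print(opd_loss(expert, expert))    # zero: identical fields
```

For these particular toy fields the velocity gap happens to be the same at every state, so the sketch shows only the structure of the computation. It does not demonstrate why on-policy states matter for real models.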
Multi-Capability Composition§
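To compose multiple fields, DanceOPD uses a simple linear combination rule, with every expert evaluated at the same state: $$v_{\text{composite}}(x_t, t) = w_1 v_{\text{T2I}}(x_t, t) + w_2 v_{\text{edit}}(x_t, t) + \dots$$ where the weights $w_i$ can be fixed or dynamically adjusted. Operator-defined fields fit the same template. Classifier-free guidance (CFG), for example, is itself a linear combination of a conditional and an unconditional velocity, so the student can absorb it by treating the guided field as one more expert.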
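As a small worked example of a linear combination of fields, the sketch below uses the standard classifier-free guidance form $v_{\text{uncond}} + s\,(v_{\text{cond}} - v_{\text{uncond}})$. The exact operator and guidance scale that DanceOPD distills are not stated here, so `cfg_velocity`, `combine_fields`, and the toy vectors are assumptions for illustration only.

```python
def combine_fields(velocities, weights):
    """Weighted sum of velocity vectors evaluated at the same (x_t, t)."""
    if len(velocities) != len(weights):
        raise ValueError("need one weight per velocity field")
    dim = len(velocities[0])
    return [sum(w * v[i] for v, w in zip(velocities, weights)) for i in range(dim)]


def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance as a linear operator on two fields:
    v_uncond + s * (v_cond - v_uncond) = (1 - s) * v_uncond + s * v_cond."""
    s = guidance_scale
    return combine_fields([v_uncond, v_cond], [1.0 - s, s])


v_cond = [1.0, -2.0]     # toy conditional velocity at some (x_t, t)
v_uncond = [0.5, 0.0]    # toy unconditional velocity at the same point
print(cfg_velocity(v_cond, v_uncond, 1.0))   # scale 1 recovers the conditional field
print(cfg_velocity(v_cond, v_uncond, 3.0))   # scale > 1 extrapolates past it
```

Computing the guided field this way costs two forward passes per step. A student that has absorbed the guided field as a distillation target needs only one.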
Implementation Details§
- Base model: Flow-matching architecture (e.g., a DiT-style transformer or a U-Net) with sinusoidal timestep conditioning.
- Experts: Pre-trained velocity fields for each capability. T2I expert is a standard text-conditioned flow model; editing experts are fine-tuned on paired edit data.
- Training: The student is initialized from a pretrained T2I model. In each training iteration the student generates its own rollouts with a few Euler integration steps, and the distillation loss is computed between the student and the expert that each sample is routed to.
- Routing: A simple classifier (e.g., learned from a small amount of labeled data) predicts which expert to use for each sample during training. At inference, the user provides explicit task flags.
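The routing step in the list above can be sketched as a dispatch from per-sample task flags to expert callables. This is a minimal illustration under stated assumptions: the expert names (`"t2i"`, `"local_edit"`, `"global_edit"`), the scalar toy fields, and `route_batch` itself are hypothetical, and a learned router would simply replace the explicit flags with predicted ones.

```python
def route_batch(samples, task_flags, experts):
    """Query, for each sample, only the expert named by its task flag.

    samples:    list of (x, t) states
    task_flags: list of expert names, one per sample
    experts:    dict mapping name -> callable(x, t) returning a velocity
    """
    # Group indices by expert so each expert handles its own sub-batch.
    groups = {}
    for idx, flag in enumerate(task_flags):
        if flag not in experts:
            raise KeyError(f"no expert registered for task {flag!r}")
        groups.setdefault(flag, []).append(idx)

    targets = [None] * len(samples)
    for flag, indices in groups.items():
        expert = experts[flag]
        for idx in indices:
            x, t = samples[idx]
            targets[idx] = expert(x, t)   # exactly one expert query per sample
    return targets


# Toy experts (hypothetical): each pulls the state toward a different point.
experts = {
    "t2i": lambda x, t: (x - 0.0) / t,
    "local_edit": lambda x, t: (x - 1.0) / t,
    "global_edit": lambda x, t: (x + 1.0) / t,
}
samples = [(0.5, 0.5), (0.5, 0.5), (0.5, 0.5)]
flags = ["t2i", "global_edit", "local_edit"]
print(route_batch(samples, flags, experts))
```

Because each sample is sent to a single expert, the number of expert queries per batch does not grow with the number of capabilities.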
Code Snippet (PyTorch-like)§
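```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DanceOPD(nn.Module):
    def __init__(self, student, experts, router):
        super().__init__()
        self.student = student                 # trainable velocity model v_theta
        self.experts = nn.ModuleList(experts)  # expert velocity fields
        self.router = router                   # light classifier (assumed to return per-sample logits)
        self.experts.requires_grad_(False)     # experts stay frozen

    def forward(self, x_0, condition, dt=1e-3):
        # x_0 is a data sample; noise sits at t = 1 (p_0 = data, p_1 = noise).
        # `condition` is assumed to be a tensor with a leading batch dimension.
        b = x_0.shape[0]
        t = torch.rand(b, device=x_0.device)
        t_b = t.view(b, *([1] * (x_0.dim() - 1)))            # broadcast over non-batch dims
        x_t = (1 - t_b) * x_0 + t_b * torch.randn_like(x_0)  # linear interpolation

        with torch.no_grad():
            # Student rollout (simplified: one Euler step toward data, t -> t - dt)
            x_next = x_t - dt * self.student(x_t, t, condition)
            t_next = (t - dt).clamp(min=0.0)
            # Route each sample to one expert and query it on the student state
            task_idx = self.router(condition).argmax(dim=-1)
            v_expert = torch.zeros_like(x_next)
            for k, expert in enumerate(self.experts):
                mask = task_idx == k
                if mask.any():
                    v_expert[mask] = expert(x_next[mask], t_next[mask], condition[mask])

        # Velocity MSE between the student and the routed expert on the rollout state
        return F.mse_loss(self.student(x_next, t_next, condition), v_expert)
```
Experiments§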
DanceOPD is evaluated on T2I (MS-COCO, FID), local editing (edit success rate), global editing (style transfer, CLIP score), and CFG absorption (performance with versus without CFG). The student trained with on-policy distillation is reported to maintain T2I quality while achieving strong editing capabilities, outperforming multi-task training and off-policy distillation baselines.
Key Takeaways§
- Sampling training states from the student's own trajectories (on-policy) is crucial for stability and performance when distilling multiple velocity fields.
- The framework can absorb additional operator-defined fields (e.g., CFG) without retraining experts.
- Simple linear combination of expert velocities works well for composing capabilities, but routing requires task labels during training.