LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training
By Shun Lei, Huaicheng Zhang, Dapeng Wu, Yaoxun Xu, Lishi Zuo, Wei Tan, Hangting Chen, Guangzheng Li, Jianwei Yu, Zhiyong Wu, Dong Yu
"Proposes a hybrid LLM-Diffusion framework with hierarchical token modeling and multi-stage, preference-aligned training (SFT, offline DPO, and semi-online DPO) to improve both audio quality and alignment in full-length song generation."
Abstract
Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. We present LeVo 2, a hybrid LLM-Diffusion framework for controllable full-length song generation. LeVo 2 formulates this trade-off as hierarchical modeling: LeLM first predicts mixed tokens for semantic planning, then predicts vocal and accompaniment tokens in parallel for track-specific refinement, while a diffusion-based Music Codec reconstructs full-length waveforms. A central contribution of this extended version is an aesthetics-guided training schedule for alignment. During pre-training, an automated music aesthetic evaluation framework assigns musicality-tier conditions to large-scale data, providing musicality priors before preference alignment. Progressive post-training applies SFT, large-scale offline DPO, and closed-loop semi-online DPO to separately improve generation quality, controllability, and musicality. Modular extension then trains the Track-Specific LM for acoustic refinement while preserving the aligned semantic planner. This schedule separates musicality learning, controllability alignment, and acoustic refinement, mitigating optimization conflict and the limitations of static offline preference pairs. Expert listening tests and objective evaluations show that LeVo 2 outperforms open-source baselines across six subjective dimensions, and approaches leading commercial systems on several listening metrics. Ablations validate the effects of the training strategy, aesthetics guidance, scaling, and hierarchical architecture.
Technical Analysis & Implementation
Technical Breakdown§
Core Architecture: LeVo 2§
LeVo 2 comprises three main components:
1. LeLM (Language Model for Lyrics & Music): Predicts mixed tokens (combining vocal and accompaniment discrete codes) for global semantic planning, ensuring long-term coherence.
2. Track-Specific LM (TS-LM): Takes the mixed-token prediction as context and predicts separate vocal and accompaniment tokens in parallel, refining acoustic details.
3. Music Codec: A diffusion-based decoder that reconstructs the full-length waveform from the discrete tokens.
Hierarchical Token Modeling§
The key innovation addresses the trade-off between mixed-token representation (coarser, but better at preserving global structure) and dual-track representation (finer, but requiring longer sequences). LeVo 2 first predicts mixed tokens for planning, then predicts track-specific tokens in parallel for detail. Schematically, this can be written as: \[P(\text{audio}|\text{lyrics}, \text{prompt}) = \sum_{\text{mixed}} P(\text{mixed}|\text{lyrics}, \text{prompt}) \prod_{t} P(\text{vocal}_t, \text{acc}_t | \text{mixed}, \text{context})\]
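The two-stage factorization can be illustrated with a toy sampler. This is a minimal sketch, not the authors' implementation: `sample_mixed`, `sample_tracks`, `generate`, and `VOCAB_SIZE` are hypothetical names, and random draws stand in for the learned distributions. It shows only the control flow: mixed tokens are sampled first, then each frame receives a vocal and an accompaniment token in a single step.

```python
import random

VOCAB_SIZE = 1024  # hypothetical codebook size, used only for this toy example


def sample_mixed(num_frames, rng):
    """Stage 1: sample one mixed token per frame (toy stand-in for the planner)."""
    mixed = []
    for _ in range(num_frames):
        # A real planner would condition on lyrics, prompt, and earlier tokens;
        # a uniform draw stands in for the learned distribution here.
        mixed.append(rng.randrange(VOCAB_SIZE))
    return mixed


def sample_tracks(mixed, rng):
    """Stage 2: emit a (vocal, accompaniment) token pair for every frame."""
    vocal, acc = [], []
    for t in range(len(mixed)):
        # Both tracks are produced in the same step, so the number of steps
        # equals the number of frames rather than twice that number.
        # The offsets are arbitrary placeholders for a learned refinement.
        vocal.append((mixed[t] + rng.randrange(8)) % VOCAB_SIZE)
        acc.append((mixed[t] + rng.randrange(8, 16)) % VOCAB_SIZE)
    return vocal, acc


def generate(num_frames, seed=0):
    """Run planning, then track-specific refinement, with a fixed seed."""
    rng = random.Random(seed)
    mixed = sample_mixed(num_frames, rng)
    vocal, acc = sample_tracks(mixed, rng)
    return mixed, vocal, acc


mixed, vocal, acc = generate(num_frames=6)
print(len(mixed), len(vocal), len(acc))
```

In this sketch the three token streams have the same length, which is the property that lets a downstream decoder consume them frame by frame.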
Progressive Post-Training with Aesthetic Guidance§
The training schedule applies aesthetics-guided alignment to improve musicality without conflicting with acoustic refinement.
1. Pre-training: An automated music aesthetic evaluation framework assigns musicality-tier conditions (low/medium/high) to data. This conditions LeLM during pre-training, giving it musicality priors.
2. Progressive Alignment:
- SFT: Supervised fine-tuning on high-quality data.
- Offline DPO: Large-scale direct preference optimization using static preference pairs (preferred vs. dispreferred samples). Loss:
\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log\sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]
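- Semi-online DPO: The model generates candidate outputs, which are automatically scored by a learned aesthetic reward model; top and bottom halves form online preference pairs. Because the pairs are rebuilt from the current model's own outputs, the preference data keeps pace with the model as it improves, unlike static offline pairs.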
3. Modular Fine-Tuning: After alignment, the TS-LM is fine-tuned for track-specific refinement while the LeLM stays frozen to preserve alignment benefits.
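One round of the semi-online DPO step in the schedule above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: `build_pairs` and `dpo_pair_loss` are hypothetical helper names, the scores and log-probabilities are made-up numbers, and matching the i-th best candidate with the i-th candidate of the bottom half is one possible pairing rule that the source does not specify.

```python
import math


def build_pairs(candidates, scores):
    """Rank candidates by reward score and pair the top half with the bottom half."""
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    half = len(ranked) // 2
    top, bottom = ranked[:half], ranked[len(ranked) - half:]
    # Assumed pairing rule: i-th best candidate with i-th of the bottom half.
    return [(win[0], lose[0]) for win, lose in zip(top, bottom)]


def dpo_pair_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair of sequence log-probs."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Made-up candidates and reward scores for one generation round.
candidates = ["song_a", "song_b", "song_c", "song_d"]
scores = [0.2, 0.9, 0.5, 0.7]
pairs = build_pairs(candidates, scores)
print(pairs)

# Made-up log-probabilities: the policy already favors the preferred sample
# slightly more than the reference does, so the loss falls below log(2).
loss = dpo_pair_loss(policy_w=-10.0, policy_l=-12.0, ref_w=-11.0, ref_l=-11.0)
print(loss)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the per-pair loss equals log 2; training pushes the margin up and the loss down.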
Implementation Details§
- Tokens are 1024-codebook vectors from a pretrained audio codec.
- LeLM is based on a decoder-only Transformer (1.3B parameters).
- TS-LM uses cross-attention to the LeLM's outputs.
- Diffusion Music Codec uses a U-Net with conditioning from the discrete tokens.
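The cross-attention link between the TS-LM and the LeLM outputs can be pictured with a generic single-head scaled dot-product attention. The sketch below is only that generic mechanism, assuming no learned projections; `cross_attention`, `planner_states`, and `track_states` are hypothetical names with toy values, and nothing here reflects the actual layer configuration of LeVo 2.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy stand-ins: planner outputs act as keys and values, track states as queries.
planner_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
track_states = [[1.0, 0.0], [0.0, 2.0]]
attended = cross_attention(track_states, planner_states, planner_states)
print(len(attended), len(attended[0]))
```

Each track-side position receives a weighted mix of planner states, so the refinement stage can read the semantic plan without the planner's parameters being modified.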
Code Snippet (PyTorch-style DPO loss)§
```python
import torch.nn.functional as F

def dpo_loss(policy_logps, ref_logps, preferred_ids, beta=0.1):
    # policy_logps: sequence log-probabilities from the policy model, shape (N,)
    # ref_logps: sequence log-probabilities from the frozen reference model, shape (N,)
    # preferred_ids: mask with 1 for preferred and 0 for dispreferred samples;
    #   the k-th preferred and k-th dispreferred entries are assumed to form a pair
    mask = preferred_ids.bool()  # boolean mask, so ~mask is a logical NOT
    log_ratio = policy_logps - ref_logps
    loss = -F.logsigmoid(beta * (log_ratio[mask] - log_ratio[~mask])).mean()
    return loss
```
Results§
Expert listening tests show LeVo 2 outperforms open-source baselines on melody, vocal clarity, accompaniment quality, and overall musicality. Ablations confirm the effectiveness of each training stage, aesthetic conditioning, and hierarchical modeling.