LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

Abstract

Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. We present LeVo 2, a hybrid LLM-Diffusion framework for controllable full-length song generation. LeVo 2 formulates this trade-off as hierarchical modeling: LeLM first predicts mixed tokens for semantic planning, then predicts vocal and accompaniment tokens in parallel for track-specific refinement, while a diffusion-based Music Codec reconstructs full-length waveforms. A central contribution of this extended version is an aesthetics-guided training schedule for alignment. During pre-training, an automated music aesthetic evaluation framework assigns musicality-tier conditions to large-scale data, providing musicality priors before preference alignment. Progressive post-training applies SFT, large-scale offline DPO, and closed-loop semi-online DPO to separately improve generation quality, controllability, and musicality. Modular extension then trains the Track-Specific LM for acoustic refinement while preserving the aligned semantic planner. This schedule separates musicality learning, controllability alignment, and acoustic refinement, mitigating optimization conflict and the limitations of static offline preference pairs. Expert listening tests and objective evaluations show that LeVo 2 outperforms open-source baselines across six subjective dimensions, and approaches leading commercial systems on several listening metrics. Ablations validate the effects of the training strategy, aesthetics guidance, scaling, and hierarchical architecture.

Technical Analysis & Implementation

Technical Breakdown

Core Architecture: LeVo 2

LeVo 2 comprises three main components:

1. LeLM (Language Model for Lyrics & Music): Predicts mixed tokens (combining vocal and accompaniment discrete codes) for global semantic planning, ensuring long-term coherence.
2. Track-Specific LM (TS-LM): Takes the mixed-token prediction as context and predicts separate vocal and accompaniment tokens in parallel, refining acoustic details.
3. Music Codec: A diffusion-based decoder that reconstructs the full-length waveform from the discrete tokens.

Hierarchical Token Modeling

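The key innovation is resolving the trade-off between mixed-token modeling (coarser, but better at preserving global structure and vocal-instrument coordination) and dual-track modeling (finer acoustic detail, but longer sequences and weaker global planning). LeVo 2 first predicts mixed tokens for planning, then predicts vocal and accompaniment tokens in parallel for detail; the Music Codec then turns the tokens into a waveform. At the token level, this can be written as the factorization: \[P(\text{vocal}, \text{acc} \mid \text{lyrics}, \text{prompt}) = \sum_{\text{mixed}} P(\text{mixed} \mid \text{lyrics}, \text{prompt}) \prod_{t} P(\text{vocal}_t, \text{acc}_t \mid \text{mixed}, \text{vocal}_{<t}, \text{acc}_{<t}, \text{lyrics}, \text{prompt})\]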
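The decoding order implied by this hierarchy can be illustrated with a small sketch. It uses toy samplers in place of the real models: the function names, the offset-based "conditioning", and the seed handling are all illustrative assumptions rather than the paper's implementation. Only the order of operations (a mixed-token plan first, then one vocal/accompaniment pair per step) follows the text.

```python
import random

VOCAB = 1024  # codebook size quoted in the implementation notes; all else is hypothetical


def sample_mixed(num_steps, rng):
    """Stage 1 stand-in: one mixed token per step, acting as the global plan."""
    return [rng.randrange(VOCAB) for _ in range(num_steps)]


def sample_tracks(mixed, rng):
    """Stage 2 stand-in: emit a (vocal, accompaniment) pair at every step.

    Both tokens are produced in the same step, so the track stage runs for
    len(mixed) steps instead of the 2 * len(mixed) an interleaved layout needs.
    """
    vocal, acc = [], []
    for m in mixed:
        # Toy conditioning: each track token is a small random offset from the plan token.
        vocal.append((m + rng.randrange(8)) % VOCAB)
        acc.append((m + 512 + rng.randrange(8)) % VOCAB)
    return vocal, acc


def generate(num_steps, seed=0):
    """Run planning, then track-specific refinement, and return all three streams."""
    rng = random.Random(seed)
    mixed = sample_mixed(num_steps, rng)
    vocal, acc = sample_tracks(mixed, rng)
    return mixed, vocal, acc


mixed, vocal, acc = generate(16)
print(len(mixed), len(vocal), len(acc))
```

All three streams have the same length, which is the practical benefit of predicting the two tracks in parallel: the refinement stage adds detail without doubling the number of decoding steps.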

Progressive Post-Training with Aesthetic Guidance

The training schedule applies aesthetics-guided alignment to improve musicality without conflicting with acoustic refinement.

1. Pre-training: An automated music aesthetic evaluation framework assigns musicality-tier conditions (low/medium/high) to data. This conditions LeLM during pre-training, giving it musicality priors.

2. Progressive Alignment:

SFT: Supervised fine-tuning on high-quality data.
Offline DPO: Large-scale direct preference optimization using static preference pairs (preferred vs. dispreferred samples). Loss:

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log\sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]

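Semi-online DPO: The model generates candidate outputs, which are automatically scored by a learned aesthetic reward model; the top and bottom halves form online preference pairs. Because the pairs are regenerated from the current policy, the preference data tracks the model as it improves, which static offline pairs cannot do.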

3. Modular Fine-Tuning: After alignment, the TS-LM is fine-tuned for track-specific refinement while the LeLM stays frozen to preserve alignment benefits.
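Two of the data-side steps above, tier assignment and semi-online pair construction, can be sketched in a few lines. This is a hedged illustration only: the score range of [0, 1], the cut points, and the best-with-worst pairing rule are assumptions made for the example, and `assign_tier` and `build_preference_pairs` are hypothetical helper names, not functions from the paper.

```python
def assign_tier(score, low_cut=0.33, high_cut=0.66):
    """Map an aesthetic score in [0, 1] to a musicality tier (cut points are hypothetical)."""
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"


def build_preference_pairs(candidates, scores):
    """Build (chosen, rejected) pairs for one prompt from reward-scored candidates.

    Candidates are ranked by score; the top half supplies chosen samples and the
    bottom half supplies rejected ones. With an odd count the middle item is dropped.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
    half = len(ranked) // 2
    top = ranked[:half]
    bottom = ranked[len(ranked) - half:]
    # Assumed pairing rule: best with worst, second best with second worst, and so on.
    return list(zip(top, reversed(bottom)))


pairs = build_preference_pairs(["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.7])
print(pairs)  # [('a', 'b'), ('d', 'c')]
print(assign_tier(0.8))  # high
```

Pairing best with worst maximizes the reward gap inside each pair; other pairings (for example, random matching between the two halves) are equally compatible with the description above.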

Implementation Details

Tokens are discrete codes drawn from a 1024-entry codebook of a pretrained audio codec.
LeLM is based on a decoder-only Transformer (1.3B parameters).
TS-LM uses cross-attention to the LeLM's outputs.
Diffusion Music Codec uses a U-Net with conditioning from the discrete tokens.

Code Snippet (PyTorch-style DPO loss)

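import torch.nn.functional as F

def dpo_loss(policy_logps, ref_logps, preferred_ids, beta=0.1):
    # policy_logps: summed sequence log-probabilities under the policy, shape (2N,)
    # ref_logps: the same quantities under the frozen reference model, shape (2N,)
    # preferred_ids: mask with 1 for preferred and 0 for dispreferred samples; the i-th
    #   preferred and i-th dispreferred entries must come from the same prompt
    mask = preferred_ids.bool()  # an integer mask would index positions instead of selecting
    log_ratio = policy_logps - ref_logps.detach()  # no gradient flows into the reference
    margin = log_ratio[mask] - log_ratio[~mask]
    return -F.logsigmoid(beta * margin).mean()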
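
As a quick sanity check on the loss, the following dependency-free sketch evaluates it for a single preference pair. The log-probability values are made up for illustration, and `dpo_pair_loss` is a hypothetical helper that mirrors the formula rather than any code from the paper.

```python
import math


def dpo_pair_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of the chosen (w)
    and rejected (l) samples under the policy and the reference model."""
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))


# Policy prefers the chosen sample more than the reference does: margin = 2.
print(round(dpo_pair_loss(-10.0, -12.0, -11.0, -11.0), 4))  # 0.5981
# Policy identical to the reference: margin = 0, so the loss is log(2).
print(round(dpo_pair_loss(-11.0, -11.0, -11.0, -11.0), 4))  # 0.6931
```

The loss starts at log 2 when the policy equals the reference and decreases as the policy raises the chosen sample's likelihood relative to the rejected one, with beta controlling how quickly.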

Results

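Expert listening tests and objective evaluations show that LeVo 2 outperforms open-source baselines across six subjective dimensions, with gains reported on melody, vocal clarity, accompaniment quality, and overall musicality, and that it approaches leading commercial systems on several listening metrics. Ablations validate the effects of the training strategy, aesthetics guidance, scaling, and the hierarchical architecture.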
