Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Abstract

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on the Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER improves Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones on exact-solution benchmarks such as LiveCodeBench and USACO, with absolute average improvements of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.

Technical Analysis & Implementation

Overview

RiVER (Ranking-induced VERifiable framework) trains LLMs on score-based optimization tasks where ground-truth solutions are unavailable. It uses deterministic execution feedback as continuous-valued supervision and addresses two key challenges in applying group-relative RL to continuous rewards: scale dominance and frequency dominance.

Methodology

Group-Relative Policy Optimization with Continuous Rewards

The base RL algorithm is a variant of GRPO (Group Relative Policy Optimization). For each input instance $x$, the policy $\pi_\theta$ generates $G$ candidate solutions $\{y_i\}_{i=1}^G$. Each solution receives a continuous score $s_i = f(y_i)$ from an execution environment (e.g., runtime, memory usage, or heuristic score). The standard GRPO advantage is computed as:

$$A_i = \frac{s_i - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the scores within the group. With uncalibrated scores, however, this advantage suffers from two problems: scale dominance (score magnitudes that vary across instances distort the policy update) and frequency dominance (frequently sampled suboptimal solutions outweigh rare but stronger ones).
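
As a minimal sketch of the standard group-relative advantage above, the following computes $A_i$ for one group of scores using only the Python standard library. The function name `grpo_advantages`, the use of the population standard deviation, and the small `eps` guard in the denominator are assumptions made for illustration; the formula in the text does not specify them.

```python
import statistics

def grpo_advantages(scores, eps=1e-8):
    """Z-score each score against its own group (one group = one input x)."""
    mu = statistics.fmean(scores)
    # Population standard deviation; eps (an assumption) avoids dividing by
    # zero when every candidate in the group receives the same score.
    sigma = statistics.pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

# Three identical mediocre candidates and one stronger candidate.
advantages = grpo_advantages([120.0, 120.0, 120.0, 300.0])
print(advantages)
```

Within a group the advantages sum to roughly zero, so the single strong candidate receives a positive advantage and the repeated weaker ones share the negative mass.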

Calibrated Reward Shaping

RiVER introduces two calibrations:

1. Scale Calibration: For each instance, scores are normalized using instance-wise statistics. Let $s_{\text{min}}$, $s_{\text{max}}$, and $s_{\text{med}}$ be the minimum, maximum, and median of scores for that instance. The calibrated reward $\tilde{s}_i$ is:

$$\tilde{s}_i = \frac{s_i - s_{\text{med}}}{s_{\text{max}} - s_{\text{min}} + \epsilon}$$

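Here $\epsilon$ is a small constant that guards against division by zero. Because $|s_i - s_{\text{med}}| \le s_{\text{max}} - s_{\text{min}}$, this maps scores into the $[-1, 1]$ range, mitigating scale dominance.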

2. Rank Emphasis: To combat frequency dominance, RiVER emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. The final advantage for solution $i$ is:

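$$A_i^{\text{RiVER}} = \frac{\exp(-\tau \cdot r_i)}{\sum_j \exp(-\tau \cdot r_j)} \cdot \tilde{s}_i$$

where $r_i$ is the rank of solution $i$ within its group (1 for best) and $\tau > 0$ is a temperature parameter that controls emphasis strength. The negative sign in the exponent gives the best-ranked solution the largest weight; a larger $\tau$ concentrates more of the weight on the top ranks.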
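
The two calibrations can be sketched separately in plain Python. This is an illustrative sketch, not the authors' implementation: the helper names `scale_calibrate` and `rank_weights` are hypothetical, ties are broken by position (an assumption), and the weights use $\exp(-\tau \cdot \text{rank})$ so that rank 1, the best solution, receives the largest weight.

```python
import math
import statistics

def scale_calibrate(scores, eps=1e-8):
    """Center on the median and divide by the score range of the instance."""
    s_min, s_max = min(scores), max(scores)
    # statistics.median averages the two middle values for even-sized groups.
    s_med = statistics.median(scores)
    return [(s - s_med) / (s_max - s_min + eps) for s in scores]

def rank_weights(scores, tau=2.0):
    """Softmax over -tau * rank, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    exps = [math.exp(-tau * r) for r in ranks]
    total = sum(exps)
    return [e / total for e in exps]

# The same relative ordering at two very different score magnitudes.
small = scale_calibrate([10.0, 12.0, 15.0, 20.0])
large = scale_calibrate([10_000.0, 12_000.0, 15_000.0, 20_000.0])
weights = rank_weights([10.0, 12.0, 15.0, 20.0])
print(small, large, weights)
```

Both calls to `scale_calibrate` return nearly identical values inside $[-1, 1]$, which is the sense in which the calibration removes the dependence on raw score magnitude.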

Training Objective

The policy gradient loss is:

$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i=1}^G A_i^{\text{RiVER}} \log \pi_\theta(y_i | x) \right]$$

A KL divergence penalty against a reference model $\pi_{\text{ref}}$, weighted by a coefficient $\beta$, is added to prevent reward hacking:

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$
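
To make the arithmetic of the objective concrete, here is a small value-only sketch for a single instance. It assumes sequence-level log-probabilities are already available and estimates the KL term as the average of $\log \pi_\theta(y_i|x) - \log \pi_{\text{ref}}(y_i|x)$ over the sampled candidates; that estimator, the function name `river_loss`, and the default `beta=0.01` are assumptions for illustration, since the text does not say how the KL term is computed. No gradients are involved here.

```python
def river_loss(advantages, log_probs, ref_log_probs, beta=0.01):
    """Policy-gradient term plus a sampled KL penalty for one group."""
    # L_RL: negative advantage-weighted sum of log-probabilities.
    pg_loss = -sum(a * lp for a, lp in zip(advantages, log_probs))
    # Monte Carlo KL estimate from candidates sampled under the policy
    # (an assumed estimator, not taken from the source).
    kl_est = sum(lp - ref for lp, ref in zip(log_probs, ref_log_probs)) / len(log_probs)
    return pg_loss + beta * kl_est

advantages = [0.5, -0.1]
log_probs = [-2.0, -3.0]
loss_no_drift = river_loss(advantages, log_probs, log_probs)
loss_with_drift = river_loss(advantages, log_probs, [-2.5, -3.5])
print(loss_no_drift, loss_with_drift)
```

When the policy and reference assign the same log-probabilities the penalty vanishes, and the loss reduces to the policy-gradient term alone.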

Implementation Sketch

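# Simplified training loop with RiVER reward calibration.
# model, dataloader, optimizer, G, beta, gen_kwargs, evaluate,
# compute_log_probs, and kl_divergence are placeholders that this sketch
# leaves undefined.
import torch

def river_advantage(scores, tau=2.0, eps=1e-8):
    """Return RiVER advantages for one instance; scores has shape (G,)."""
    scores = scores.float()
    # Scale calibration: center on the median, divide by the score range.
    # Note: torch.median returns the lower middle value when G is even.
    calibrated = (scores - scores.median()) / (scores.max() - scores.min() + eps)
    # Rank emphasis: rank 1 is the best (highest) score.
    ranks = scores.argsort(descending=True).argsort() + 1
    # Softmax over -tau * rank puts most of the weight on top-ranked solutions.
    weights = torch.softmax(-tau * ranks.float(), dim=0)
    return weights * calibrated

# Inside the training loop
for batch in dataloader:
    input_ids = batch["input_ids"]  # assumes dict-style batches
    # gen_kwargs stands in for the generation arguments elided in the sketch.
    outputs = model.generate(input_ids, num_return_sequences=G, **gen_kwargs)
    scores = evaluate(outputs)  # deterministic execution feedback, one row per instance
    advantages = torch.stack([river_advantage(s) for s in scores])  # per instance
    log_probs = compute_log_probs(outputs)  # same shape as advantages
    # .mean() differs from the summed loss in the text only by a constant factor.
    loss = -(advantages * log_probs).mean() + beta * kl_divergence(...)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()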
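
As a toy illustration of what the rank weighting in the sketch above does, the following standard-library comparison measures how much of the total absolute advantage in a group goes to the single best candidate, first under plain z-score advantages and then under the rank-weighted calibrated advantage. The function names, the example scores, and the "top share" metric are all assumptions made for this illustration and are not taken from the source.

```python
import math
import statistics

def zscore_advantages(scores, eps=1e-8):
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

def river_advantages(scores, tau=2.0, eps=1e-8):
    s_med = statistics.median(scores)
    span = max(scores) - min(scores) + eps
    # Rank 1 is the highest score; ties are broken by position.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    exps = [math.exp(-tau * rank[i]) for i in range(len(scores))]
    total = sum(exps)
    return [(e / total) * (s - s_med) / span for e, s in zip(exps, scores)]

def top_share(advantages, scores):
    """Fraction of total |advantage| assigned to the best-scoring candidate."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return abs(advantages[best]) / sum(abs(a) for a in advantages)

# Many similar mediocre candidates and one rare strong one.
scores = [50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 60.0, 90.0]
share_z = top_share(zscore_advantages(scores), scores)
share_river = top_share(river_advantages(scores), scores)
print(f"z-score top share: {share_z:.3f}, rank-weighted top share: {share_river:.3f}")
```

In this example the best candidate receives less than half of the absolute advantage mass under z-scoring and most of it under rank weighting, which is one way to read the claim that rank emphasis counteracts frequency dominance.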

Key Contributions

1. Introduces RiVER, a framework that trains LLMs on score-based optimization tasks from execution feedback alone, without ground-truth solutions.
2. Identifies scale dominance and frequency dominance in group-relative RL with continuous rewards, and addresses both with calibrated reward shaping.
3. Demonstrates transfer from heuristic optimization tasks to exact-solution coding benchmarks (LiveCodeBench, USACO).

Results

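RiVER improves Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank, and yields absolute average improvements of 2.4% and 3.5% on the exact-solution benchmarks LiveCodeBench and USACO. Baselines trained with raw execution scores improve ALE rating but fail to transfer to the exact-solution benchmarks.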
