Alignment · Published: June 25, 2026

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

By Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou, Tongtong Liang, Zhewei Yao, Yi-An Ma, Yuxiong He

Research TL;DR

"RiVER trains LLMs on score-based tasks without ground-truth solutions via calibrated reward shaping that mitigates scale and frequency dominance, improving general coding ability on exact-solution benchmarks."

Abstract

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on the Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.

Technical Analysis & Implementation

Overview

RiVER (Ranking-induced VERifiable framework) trains LLMs on score-based optimization tasks where ground-truth solutions are unavailable. It uses deterministic execution feedback as continuous-valued supervision and addresses two key challenges in applying group-relative RL to continuous rewards: scale dominance and frequency dominance.

Methodology

Group-Relative Policy Optimization with Continuous Rewards

The base RL algorithm is a variant of GRPO (Group Relative Policy Optimization). For each input instance $x$, the policy $\pi_\theta$ generates $G$ candidate solutions $\{y_i\}_{i=1}^G$. Each solution receives a continuous score $s_i = f(y_i)$ from an execution environment (e.g., runtime, memory usage, or heuristic score). The standard GRPO advantage is computed as:

$$A_i = \frac{s_i - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the scores within the group. With uncalibrated continuous scores, two problems arise: scale dominance, where score magnitudes that differ across instances distort policy updates, and frequency dominance, where frequently sampled suboptimal solutions outweigh rare but stronger ones.
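To make the group-relative advantage concrete, the following minimal sketch computes $A_i$ for one hypothetical group using only the Python standard library. The scores are invented for illustration, and the population standard deviation and the small `eps` in the denominator are assumptions (the source does not say which estimator is used). It also shows one plausible reading of frequency dominance: six copies of a mediocre solution together receive more positive advantage than the single best candidate.

```python
from statistics import mean, pstdev

def grpo_advantages(scores, eps=1e-8):
    """Group-relative advantage A_i = (s_i - mu) / sigma for one group."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population std; an assumption for this sketch
    # eps guards against a zero std when all scores are equal (also an assumption)
    return [(s - mu) / (sigma + eps) for s in scores]

# Hypothetical group: six copies of a mediocre solution, one failure, one best.
scores = [50.0] * 6 + [0.0, 55.0]
adv = grpo_advantages(scores)

common_mass = sum(adv[:6])  # total advantage assigned to the repeated solution
best_adv = adv[7]           # advantage of the single best solution

print([round(a, 3) for a in adv])
print("repeated-solution mass:", round(common_mass, 3))
print("best-solution advantage:", round(best_adv, 3))
```

In this toy group the best solution has the largest individual advantage, but the repeated solution collects more total positive advantage simply because it was sampled more often.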

Calibrated Reward Shaping

RiVER introduces two calibrations:

1. Scale Calibration: For each instance, scores are normalized using instance-wise statistics. Let $s_{\text{min}}$, $s_{\text{max}}$, and $s_{\text{med}}$ be the minimum, maximum, and median of scores for that instance. The calibrated reward $\tilde{s}_i$ is:

$$\tilde{s}_i = \frac{s_i - s_{\text{med}}}{s_{\text{max}} - s_{\text{min}} + \epsilon}$$

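Because $|s_i - s_{\text{med}}| \le s_{\text{max}} - s_{\text{min}}$, the calibrated scores lie in $[-1, 1]$ regardless of the raw score magnitude, which mitigates scale dominance.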

2. Rank Emphasis: To combat frequency dominance, RiVER emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. The final advantage for solution $i$ is:

$$A_i^{\text{RiVER}} = \frac{\exp(-\tau \cdot r_i)}{\sum_j \exp(-\tau \cdot r_j)} \cdot \tilde{s}_i$$

where $r_i$ is the rank of solution $i$ (1 for best) and $\tau > 0$ is a temperature parameter that controls emphasis strength. The negative sign in the exponent ensures that better-ranked solutions receive larger weights.
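As a worked example of the two calibrations, the sketch below applies them to one hypothetical group in plain Python. It assumes that higher scores are better, that rank weights decay as $\exp(-\tau r_i)$ so rank 1 receives the largest weight (matching the sign used in the implementation sketch), and that ties are broken by position. None of these details are confirmed by the source, and the function names are illustrative.

```python
import math
from statistics import median

def scale_calibrate(scores, eps=1e-8):
    """Center on the median and divide by the score range."""
    lo, hi, med = min(scores), max(scores), median(scores)
    return [(s - med) / (hi - lo + eps) for s in scores]

def rank_weights(scores, tau=2.0):
    """Softmax over -tau * rank, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = position + 1  # ties broken by original position
    unnorm = [math.exp(-tau * r) for r in ranks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def river_advantages(scores, tau=2.0):
    """Rank weight times calibrated score, per solution."""
    calibrated = scale_calibrate(scores)
    weights = rank_weights(scores, tau)
    return [w * c for w, c in zip(weights, calibrated)]

# The same hypothetical group at two score scales.
small = [1.0, 2.0, 3.0, 4.0, 5.0]
large = [1000.0 * s for s in small]

cal_small = scale_calibrate(small)
cal_large = scale_calibrate(large)
weights = rank_weights(small)
adv = river_advantages(small)

print(cal_small)
print(cal_large)
print(weights)
print(adv)
```

With these numbers the calibrated scores agree at both scales up to the `eps` term, and the rank weights concentrate most of the update on the top-ranked solution.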

Training Objective

The policy gradient loss is:

$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i=1}^G A_i^{\text{RiVER}} \log \pi_\theta(y_i | x) \right]$$

A KL divergence penalty against a reference model $\pi_{\text{ref}}$, weighted by $\beta$, is added to prevent reward hacking:

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$
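The objective can be sketched numerically for a single instance as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `river_loss`, the per-sequence log-probabilities, the value of `beta`, and the sample-based KL estimate (the mean of $\log \pi_\theta - \log \pi_{\text{ref}}$ over sampled sequences) are all assumptions.

```python
def river_loss(advantages, logp_policy, logp_ref, beta=0.01):
    """Loss for one instance: -sum_i A_i * log pi(y_i|x) + beta * KL estimate.

    advantages:  per-solution advantages, treated as constants
    logp_policy: log pi_theta(y_i | x) for each sampled solution
    logp_ref:    log pi_ref(y_i | x) for the same solutions
    """
    pg_term = -sum(a * lp for a, lp in zip(advantages, logp_policy))
    # Monte Carlo estimate of KL(pi_theta || pi_ref) from samples of pi_theta.
    kl_est = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref)) / len(logp_policy)
    return pg_term + beta * kl_est

# Hypothetical values for a group of three sampled solutions.
advantages = [0.4, 0.0, -0.1]
logp_policy = [-12.0, -15.0, -9.0]
logp_ref = [-12.5, -15.0, -9.5]

loss = river_loss(advantages, logp_policy, logp_ref, beta=0.01)
print(loss)
```

Minimizing the first term raises the log-probability of solutions with positive advantage and lowers it for those with negative advantage; the second term grows as the policy assigns its own samples more probability than the reference does.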

Implementation Sketch

# Simplified training loop with RiVER reward calibration
import torch

def river_advantage(scores, tau=2.0, eps=1e-8):
    # scores: float tensor of shape (G,) for one instance; higher is better
    min_s = scores.min()
    max_s = scores.max()
    med_s = scores.median()  # torch.median returns the lower middle value for even G
    # scale calibration: center on the median, divide by the score range
    calibrated = (scores - med_s) / (max_s - min_s + eps)
    # rank emphasis: rank 1 is the best (highest) score
    ranks = scores.argsort(descending=True).argsort() + 1
    # softmax over -tau * rank, so better ranks get larger weights
    weights = torch.softmax(-tau * ranks.float(), dim=0)
    return weights * calibrated

# Inside training loop. G, beta, evaluate, compute_log_probs, and kl_divergence
# are placeholders for task-specific code; batch["input_ids"] is an assumed key.
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients from the previous step
    outputs = model.generate(batch["input_ids"], num_return_sequences=G, ...)
    scores = evaluate(outputs)  # deterministic execution feedback, shape (B, G)
    advantages = torch.stack([river_advantage(s) for s in scores])  # per instance
    log_probs = compute_log_probs(outputs)  # sequence log-probs, shape (B, G)
    # sum over the group and average over instances, as in the loss above
    loss = -(advantages * log_probs).sum(dim=1).mean() + beta * kl_divergence(...)
    loss.backward()
    optimizer.step()
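One detail the sketch leaves open is tie handling: the double `argsort` assigns distinct ranks to equal scores, so identical solutions sampled several times receive different weights in an unspecified order. If tied scores should share a rank, a dense ranking such as the one below is one option. This is an illustrative assumption, not something the source specifies, and `dense_ranks` is a hypothetical helper name.

```python
def dense_ranks(scores):
    """Rank scores so that 1 is best and equal scores share a rank."""
    distinct = sorted(set(scores), reverse=True)
    rank_of = {s: r for r, s in enumerate(distinct, start=1)}
    return [rank_of[s] for s in scores]

ranks = dense_ranks([5.0, 5.0, 3.0, 9.0])
print(ranks)  # [2, 2, 3, 1]
```

With shared ranks, duplicated solutions get equal rank weights, which keeps the weighting independent of sampling order.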

Key Contributions

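  • Trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as the only supervision
  • Identifies scale dominance and frequency dominance in group-relative RL with continuous rewards, and mitigates both with calibrated reward shaping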
  • Demonstrates transfer from heuristic optimization tasks to exact-solution coding benchmarks (LiveCodeBench, USACO)

Results

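RiVER improves Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank, respectively, and yields absolute average improvements of 2.4% and 3.5% on the exact-solution benchmarks LiveCodeBench and USACO. Baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks.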
