Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Abstract

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term progress advantage: the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three different applications: test-time scaling, uncertainty quantification, and failure attribution on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses on characteristics of progress advantage, offering practical guidance for adoption in real-world agentic systems.

Technical Analysis & Implementation

Summary§

This paper introduces progress advantage, a step-level score for LLM agents derived directly from the RL post-training pipeline without needing a separate reward model. The key insight: under a stochastic Markov decision process, the log-probability ratio between the RL-trained policy $\pi_{\theta}$ and the reference policy $\pi_{\text{ref}}$ recovers the optimal advantage function $A^*(s_t, a_t)$. This signal is annotation-free, domain-agnostic, and available as a byproduct of standard RL fine-tuning.

Core Methodology§

Mathematical Derivation§

In a Markov decision process (MDP) with stochastic transitions, the optimal Q-function satisfies

$$Q^*(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1}}\left[\max_{a'} Q^*(s_{t+1}, a')\right]$$

The optimal advantage is $A^*(s_t, a_t) = Q^*(s_t, a_t) - V^*(s_t)$. The authors show that, for the optimal policy, the log-probability ratio between the policy and a reference policy equals this advantage:

$$\log \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)} = A^*(s_t, a_t)$$

This holds when the RL objective maximizes expected cumulative reward with KL regularization against the reference policy, i.e., $\max_\pi \mathbb{E}[\sum_t (r_t - \beta \log \frac{\pi(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)})]$. With the coefficient $\beta$ kept explicit, the standard KL-regularized result reads $\beta \log \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)} = A^*(s_t, a_t)$, so the unscaled form above corresponds to $\beta = 1$ (equivalently, rewards measured in units of $\beta$). In practice, $\pi_{\theta}$ is the fine-tuned policy after RL training (e.g., PPO), and $\pi_{\text{ref}}$ is the initial policy.
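The relation between the log-ratio and the advantage can be checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes the KL-regularized ("soft") form of the backup, $V(s) = \beta \log \sum_a \pi_{\text{ref}}(a|s) \exp(Q(s,a)/\beta)$, which approaches the max form as $\beta \to 0$, and it uses a made-up two-stage MDP with stochastic transitions. All numbers (`BETA`, `PI_REF`, `R`, `P_NEXT`) are hypothetical. The policy is built from the closed form, so the identity holds by construction; the informative check is that this policy attains the soft value and that no randomly sampled policy beats it.

```python
import math
import random

BETA = 0.5          # KL coefficient (hypothetical value)
ACTIONS = (0, 1)
# Hypothetical toy MDP: state 0 is the start; states 1 and 2 end after one action.
PI_REF = {0: [0.7, 0.3], 1: [0.5, 0.5], 2: [0.2, 0.8]}   # reference policy
R = {0: [0.0, 0.2], 1: [1.0, -1.0], 2: [-0.5, 0.3]}      # reward r(s, a)
# Stochastic transitions out of state 0: P_NEXT[a] = {next_state: probability}
P_NEXT = {0: {1: 0.8, 2: 0.2}, 1: {1: 0.3, 2: 0.7}}


def soft_value(q_values, pi_ref):
    """V(s) = beta * log sum_a pi_ref(a|s) * exp(Q(s, a) / beta)."""
    total = sum(p * math.exp(q / BETA) for p, q in zip(pi_ref, q_values))
    return BETA * math.log(total)


def solve():
    """Backward induction for the KL-regularized optimum (gamma = 1)."""
    Q, V = {}, {}
    for s in (1, 2):
        Q[s] = list(R[s])
        V[s] = soft_value(Q[s], PI_REF[s])
    Q[0] = [R[0][a] + sum(p * V[s2] for s2, p in P_NEXT[a].items())
            for a in ACTIONS]
    V[0] = soft_value(Q[0], PI_REF[0])
    # Closed-form optimum: pi*(a|s) = pi_ref(a|s) * exp((Q - V) / beta)
    pi_star = {s: [PI_REF[s][a] * math.exp((Q[s][a] - V[s]) / BETA)
                   for a in ACTIONS] for s in Q}
    return Q, V, pi_star


def objective(pi):
    """Expected sum of r - beta * log(pi / pi_ref), starting from state 0."""
    def stage(s, continuation):
        return sum(
            pi[s][a] * (R[s][a]
                        - BETA * math.log(pi[s][a] / PI_REF[s][a])
                        + continuation(a))
            for a in ACTIONS)
    J = {s: stage(s, lambda a: 0.0) for s in (1, 2)}
    return stage(0, lambda a: sum(p * J[s2] for s2, p in P_NEXT[a].items()))


def random_policy(rng):
    """Sample a strictly positive two-action policy for every state."""
    policy = {}
    for s in PI_REF:
        p = rng.uniform(0.01, 0.99)
        policy[s] = [p, 1.0 - p]
    return policy


Q, V, pi_star = solve()
# Largest deviation between beta * log-ratio and the advantage Q - V
max_gap = max(
    abs(BETA * math.log(pi_star[s][a] / PI_REF[s][a]) - (Q[s][a] - V[s]))
    for s in Q for a in ACTIONS)
rng = random.Random(0)
best_random = max(objective(random_policy(rng)) for _ in range(2000))

print(f"max |beta*log-ratio - advantage|: {max_gap:.2e}")
print(f"objective(pi_star) - V(0):        {objective(pi_star) - V[0]:.2e}")
print(f"best random policy - V(0):        {best_random - V[0]:.2e}")
```

The third printed quantity should never be positive beyond floating-point noise, since the closed-form policy maximizes the regularized objective.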

Applications§

Test-time scaling: score each of N sampled trajectories with progress advantage and keep the highest-scoring one (best-of-N selection).
Uncertainty quantification: a low progress advantage on a step is read as high uncertainty about that step.
Failure attribution: steps with negative progress advantage are flagged as likely failure points.
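Two of these applications can be sketched in a few lines once step-level scores are available. This is an illustration only: the helper names are hypothetical, and averaging step scores into a trajectory score is an assumption made here, not an aggregation rule stated in the source. The zero threshold for failure attribution follows the "negative advantage" reading above.

```python
def trajectory_score(step_scores):
    """Collapse step-level scores into one number (mean; an assumed choice)."""
    if not step_scores:
        return float("-inf")  # an empty trajectory is never preferred
    return sum(step_scores) / len(step_scores)


def best_of_n(candidates):
    """Return the index of the candidate trajectory with the highest score.

    candidates: list of trajectories, each a list of step-level scores.
    Raises ValueError if `candidates` is empty.
    """
    return max(range(len(candidates)),
               key=lambda i: trajectory_score(candidates[i]))


def likely_failure_steps(step_scores, threshold=0.0):
    """Indices of steps whose score falls below `threshold`."""
    return [t for t, score in enumerate(step_scores) if score < threshold]


# Made-up scores for three sampled trajectories
candidates = [[0.5, -2.0, 0.5], [0.25, 0.25], [1.0, -1.0]]
chosen = best_of_n(candidates)
flagged = likely_failure_steps(candidates[0])
print(chosen, flagged)
```

Other aggregations (sum, minimum over steps) drop in by changing `trajectory_score`; which one works best is an empirical question this sketch does not answer.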

Implementation§

Below is a simplified PyTorch snippet to compute progress advantage from a trained RL policy:

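import torch
import torch.nn.functional as F

@torch.no_grad()  # scoring only; no gradients are needed
def compute_progress_advantage(logits_theta, logits_ref, actions):
    """
    Per-token log-ratio between the RL-trained and reference policies.

    Logits must already be aligned with `actions`: position t scores
    actions[:, t]. For a causal LM this means pairing logits[:, :-1]
    with token ids[:, 1:].

    Args:
        logits_theta: (batch, seq_len, vocab) from RL-trained policy
        logits_ref: (batch, seq_len, vocab) from reference policy
        actions: (batch, seq_len) token ids (dtype long)
    Returns:
        advantages: (batch, seq_len) per-token log-ratios; summing them over
            the tokens of one agent action gives that action's log-ratio
    """
    idx = actions.unsqueeze(-1)  # (batch, seq_len, 1) index for gather
    # Log-probability each policy assigns to the token actually taken;
    # cast to float32 so half-precision logits do not lose accuracy
    log_prob_theta = F.log_softmax(logits_theta.float(), dim=-1).gather(-1, idx).squeeze(-1)
    log_prob_ref = F.log_softmax(logits_ref.float(), dim=-1).gather(-1, idx).squeeze(-1)
    advantages = log_prob_theta - log_prob_ref  # log ratio
    return advantages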
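An agent action usually spans several tokens, and by the chain rule the log-probability of the whole action is the sum of its token log-probabilities. The sketch below turns per-token log-ratios into one score per step on that basis. It works on plain Python lists (for example, one row of the tensor above after `.tolist()`), and the `step_ids` convention, with `-1` marking tokens the agent did not generate such as prompts and environment observations, is a hypothetical bookkeeping choice rather than something specified in the source.

```python
def step_level_scores(token_log_ratios, step_ids):
    """Sum per-token log-ratios over the tokens of each agent action.

    token_log_ratios: list of floats, one per token of a single trajectory.
    step_ids: list of ints of the same length; the step index of each token,
        or -1 for tokens that were not generated by the agent.
    Returns a list with one summed log-ratio per step.
    """
    if len(token_log_ratios) != len(step_ids):
        raise ValueError("token_log_ratios and step_ids must have equal length")
    n_steps = max(step_ids, default=-1) + 1
    scores = [0.0] * n_steps
    for ratio, step in zip(token_log_ratios, step_ids):
        if step >= 0:          # skip prompt / observation tokens
            scores[step] += ratio
    return scores


# Made-up example: two action tokens, one observation token, two action tokens
ratios = [0.5, 0.25, 9.0, -1.0, -0.5]
steps = [0, 0, -1, 1, 1]
per_step = step_level_scores(ratios, steps)
print(per_step)
```

Masking out observation tokens matters because both policies merely condition on them; including their log-ratios would mix environment text into the action score.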

Experiments§

The evaluation covers five benchmarks (including WebShop and ALFWorld) and four model families (including Llama and Mistral). Progress advantage consistently outperforms confidence-based baselines (e.g., softmax probabilities) and, despite requiring no task-specific training, surpasses dedicated trained reward models on test-time scaling and failure attribution.

Takeaways§

No need for explicit reward model training; advantage is a free byproduct of RL fine-tuning.
Works across diverse agent tasks and model families (five benchmarks, four families).
Practical: easy to compute from standard policy checkpoints.


