agents · Published: June 24, 2026

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

By Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, Tanwi Mallick, Sharon Li

Research TL;DR

"Shows log-probability ratio between RL-trained and reference policies equals optimal advantage, providing annotation-free step-level scoring for LLM agents."

Abstract

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term progress advantage -- log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three different applications: test-time scaling, uncertainty quantification, and failure attribution on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses on characteristics of progress advantage, offering practical guidance for adoption in real-world agentic systems.

Technical Analysis & Implementation

Summary

This paper introduces progress advantage, a step-level score for LLM agents derived directly from the RL post-training pipeline without needing a separate reward model. The key insight: under a stochastic Markov decision process, the log-probability ratio between the RL-trained policy $\pi_{\theta}$ and the reference policy $\pi_{\text{ref}}$ recovers the optimal advantage function $A^*(s_t, a_t)$. This signal is annotation-free, domain-agnostic, and available as a byproduct of standard RL fine-tuning.

Core Methodology

Mathematical Derivation

In a Markov decision process (MDP) with stochastic transitions, the optimal Q-function satisfies $$Q^*(s_t, a_t) = r_t + \gamma \mathbb{E}_{s_{t+1}}[\max_{a'} Q^*(s_{t+1}, a')]$$ The advantage is $A^*(s_t, a_t) = Q^*(s_t, a_t) - V^*(s_t)$. The authors show that under the optimal policy, the log-probability ratio between the policy and a reference policy equals the advantage: $$\log \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)} = A^*(s_t, a_t)$$ This holds when the RL objective is to maximize expected cumulative reward with KL regularization against the reference policy, i.e., $\max_\pi \mathbb{E}[\sum_t (r_t - \beta \log \frac{\pi(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)})]$. In practice, $\pi_{\theta}$ is the fine-tuned policy after RL training (e.g., PPO), and $\pi_{\text{ref}}$ is the initial policy.
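
As a sanity check on the identity, the sketch below works through a one-step (bandit) case of the KL-regularized objective, where the optimum has a closed form. The setup is an assumption for illustration and is not taken from the paper: the rewards, reference probabilities, and $\beta$ are made-up numbers, and the value used is the regularized (soft) value $V = \beta \log \sum_a \pi_{\text{ref}}(a) e^{r(a)/\beta}$. In this toy case the log-ratio matches the advantage after scaling by $\beta$, so the two coincide when $\beta = 1$.

```python
import math

def kl_regularized_optimum(rewards, ref_probs, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) for a single step.

    Returns the optimal policy and the regularized (soft) value of the state.
    """
    weights = [q * math.exp(r / beta) for r, q in zip(rewards, ref_probs)]
    partition = sum(weights)
    policy = [w / partition for w in weights]
    soft_value = beta * math.log(partition)
    return policy, soft_value

# Hypothetical numbers, chosen only for illustration.
rewards = [1.0, 0.0, -0.5]
ref_probs = [0.2, 0.5, 0.3]
beta = 0.5

policy, soft_value = kl_regularized_optimum(rewards, ref_probs, beta)

# beta * log(pi* / pi_ref) for each action, and the advantage r(a) - V.
scaled_log_ratios = [beta * math.log(p / q) for p, q in zip(policy, ref_probs)]
advantages = [r - soft_value for r in rewards]

for a, (ratio, adv) in enumerate(zip(scaled_log_ratios, advantages)):
    print(f"action {a}: beta*log-ratio = {ratio:+.4f}, advantage = {adv:+.4f}")
```

The two printed columns agree for every action, which is the single-step version of the claim. Actions the tuned policy up-weights relative to the reference get a positive score, and down-weighted ones get a negative score.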

Applications

  • Test-time scaling: Use progress advantage to select best-of-N trajectories.
  • Uncertainty quantification: Low advantage indicates high uncertainty.
  • Failure attribution: Identify steps with negative advantage as likely failure points.
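
The first and third uses above can be sketched in a few lines, given per-step progress-advantage scores for each candidate trajectory. This is a minimal sketch under assumptions not stated in the source: trajectories are ranked by their mean step score (a sum or a minimum would be equally plausible aggregations), and the suspected failure point is taken to be the step with the most negative score. The function names and example numbers are hypothetical.

```python
from statistics import mean

def select_best_of_n(candidates):
    """Return the index of the trajectory with the highest mean step score.

    candidates: list of trajectories, each a non-empty list of step scores.
    Mean aggregation is an assumption; the source does not fix one.
    """
    return max(range(len(candidates)), key=lambda i: mean(candidates[i]))

def attribute_failure(step_scores):
    """Return the index of the most negative step, or None if no step is negative."""
    worst = min(range(len(step_scores)), key=lambda t: step_scores[t])
    return worst if step_scores[worst] < 0 else None

# Hypothetical step scores for three sampled trajectories.
candidates = [
    [0.4, -1.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.0, -0.3],
]

best = select_best_of_n(candidates)
print("selected trajectory:", best)
print("suspected failure step in trajectory 0:", attribute_failure(candidates[0]))
```

Here the second trajectory is selected because it has no negative steps, and the second step of the first trajectory is flagged as the likely failure point.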

Implementation

Below is a simplified PyTorch snippet to compute progress advantage from a trained RL policy:

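import torch
import torch.nn.functional as F

def compute_progress_advantage(logits_theta, logits_ref, actions):
    """
    Log-probability ratio between the RL-trained and reference policies,
    evaluated on the tokens that were actually generated.

    Args:
        logits_theta: (batch, seq_len, vocab) from RL-trained policy
        logits_ref: (batch, seq_len, vocab) from reference policy
        actions: (batch, seq_len) int64 token ids, aligned so that
            logits[:, t] scores actions[:, t]
    Returns:
        advantages: (batch, seq_len) token-level log-ratios; sum over the
            tokens of an action to obtain its step-level score
    """
    index = actions.unsqueeze(-1)  # (batch, seq_len, 1) for gather
    # Log-probability each policy assigns to the chosen token.
    log_prob_theta = F.log_softmax(logits_theta, dim=-1).gather(-1, index).squeeze(-1)
    log_prob_ref = F.log_softmax(logits_ref, dim=-1).gather(-1, index).squeeze(-1)
    advantages = log_prob_theta - log_prob_ref  # log ratio
    return advantages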
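
The function above returns one log-ratio per token, while an agent step usually spans several tokens. Because the log-probability of a multi-token action is the sum of its token log-probabilities, a step-level score can be obtained by summing token log-ratios within each step. The sketch below does this in plain Python for a single sequence; the `step_ids` layout (one integer step index per token, with -1 marking prompt or observation tokens to skip) is an assumed convention, not something the source specifies.

```python
def aggregate_to_steps(token_log_ratios, step_ids):
    """Sum token-level log-ratios into one score per agent step.

    token_log_ratios: list of floats, one per token.
    step_ids: list of ints of the same length; tokens sharing an id belong to
        the same step, and negative ids (assumed to mark non-action tokens)
        are ignored.
    Returns the step scores ordered by step id.
    """
    step_totals = {}
    for value, step in zip(token_log_ratios, step_ids):
        if step < 0:
            continue  # skip prompt / environment-observation tokens
        step_totals[step] = step_totals.get(step, 0.0) + value
    return [step_totals[s] for s in sorted(step_totals)]

# Hypothetical sequence: two action steps separated by observation tokens.
token_log_ratios = [0.0, 0.25, 0.5, 0.0, -0.5, -0.25]
step_ids = [-1, 0, 0, -1, 1, 1]

step_scores = aggregate_to_steps(token_log_ratios, step_ids)
print(step_scores)  # [0.75, -0.75]
```

Summation keeps the score equal to the log-ratio of the full action. Dividing by the number of tokens per step would give a length-normalized variant, which is a design choice rather than part of the identity.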

Experiments

The evaluation covers five benchmarks (WebShop, ALFWorld, etc.) and four model families (Llama, Mistral, etc.). Progress advantage consistently outperforms confidence-based baselines (e.g., softmax probabilities) and, despite requiring no task-specific training, surpasses dedicated trained reward models on test-time scaling and failure attribution.

Takeaways

  • No need for explicit reward model training; advantage is a free byproduct of RL fine-tuning.
  • Works across diverse agent tasks and model families.
  • Practical: easy to compute from standard policy checkpoints.