QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Abstract

LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide guidance that is too sparse, failing to inform the model about the goodness of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and makes methodological families that require distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.

Technical Analysis & Implementation

QVal: Training-Free Evaluation of Dense Supervision Signals

Problem

LLM agents in long-horizon tasks (e.g., web navigation, robotics) suffer from sparse outcome rewards. Dense supervision methods (e.g., confidence scores, embedding similarities) aim to score intermediate steps, but they are usually evaluated through downstream training, which is expensive and conflates signal quality with engineering choices.

Core Methodology: QVal Score

QVal measures Q-alignment: how well a dense supervision signal $\mathrm{score}(s,a)$ orders actions according to the Q-values $Q^*(s,a)$ of a strong reference policy.

1. Reference Policy Training: Train a policy (e.g., via PPO or behavior cloning) using outcome-only rewards to obtain Q-values for each state-action pair across collected trajectories. This is done once per environment.
2. Score Extraction: For each evaluated dense supervision method, extract scores for the same state-action pairs.
3. Rank Correlation: Compute Spearman's rank correlation coefficient $\rho$ between the method's scores and the reference Q-values:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)} $$

where $d_i$ is the difference between the rank of the method's score and the rank of the reference Q-value for pair $i$, and $n$ is the number of state-action pairs. This closed form assumes no tied ranks; with ties, $\rho$ is the Pearson correlation of the two rank vectors. Higher $\rho$ indicates closer agreement with the reference policy's ordering of actions.
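As a small worked check of the rank-correlation step, the sketch below evaluates the closed-form expression on five made-up state-action pairs and compares it with `scipy.stats.spearmanr`. The numbers are illustrative only and are not taken from QVal; the helper name `spearman_from_ranks` is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def spearman_from_ranks(x, y) -> float:
    """Closed-form Spearman rho; only valid when neither input has ties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    d = rankdata(x) - rankdata(y)  # per-pair rank differences d_i
    return float(1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1)))


# Illustrative values only (not results from the paper).
method_scores = [0.9, 0.1, 0.5, 0.7, 0.3]
q_values = [0.8, 0.2, 0.6, 0.4, 0.1]

# Rank differences are (0, -1, -1, 1, 1), so sum(d^2) = 4 and
# rho = 1 - 6*4 / (5*24) = 0.8.
rho_closed_form = spearman_from_ranks(method_scores, q_values)
rho_scipy = spearmanr(method_scores, q_values)[0]
print(rho_closed_form, rho_scipy)
```

With tied scores the two values can differ, because SciPy assigns average ranks and then computes a Pearson correlation on them.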

Implementation Details

Environments: 4 diverse tasks (WebShop, ALFWorld, etc.) with trajectory lengths up to hundreds of steps.
Methods: 21 methods spanning 7 families (e.g., prompting, self-distillation, embedding similarity, intrinsic confidence).
Backbones: 6 open-weight LLMs (e.g., Llama 2, Mistral).
Key Finding: Simple prompting baselines (e.g., "rate this action from 0 to 10") often outperform complex learned methods.
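To make the prompting baseline concrete, here is a minimal sketch of how a "rate this action from 0 to 10" signal could be turned into a numeric score. The prompt template, the function names `build_prompt` and `parse_rating`, and the parsing rule (take the first number in the reply and clamp it to the scale) are all assumptions for illustration; the source does not specify the exact prompts or parsing used in QVal.

```python
import re
from typing import Optional

# Hypothetical template; the actual QVal prompts are not given in the source.
PROMPT_TEMPLATE = (
    "Task: {task}\n"
    "Current observation: {observation}\n"
    "Proposed action: {action}\n"
    "Rate this action from 0 to 10. Answer with a single number."
)


def build_prompt(task: str, observation: str, action: str) -> str:
    """Fill the rating prompt for one state-action pair."""
    return PROMPT_TEMPLATE.format(task=task, observation=observation, action=action)


def parse_rating(response: str, low: float = 0.0, high: float = 10.0) -> Optional[float]:
    """Return the first number in the model reply, clamped to [low, high].

    Returns None when the reply contains no number, so the caller can
    decide how to treat unparseable outputs.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    if match is None:
        return None
    return min(max(float(match.group()), low), high)


prompt = build_prompt("buy a red mug", "search results page", "click[item 3]")
print(parse_rating("Rating: 7/10"))  # first number in the reply is 7
```

Taking the first number is a deliberate simplification: a reply that mentions another number before the rating would be misparsed, so a stricter output format may be preferable in practice.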

Code Snippet

import numpy as np
from scipy.stats import spearmanr

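def qval_score(method_scores: np.ndarray, q_values: np.ndarray) -> float:
    """
    Compute QVal score as Spearman correlation between method scores and Q-values.

    Both inputs must be aligned: entry i of each refers to the same
    state-action pair. Returns NaN if either input is constant.
    """
    method_scores = np.asarray(method_scores, dtype=float).ravel()
    q_values = np.asarray(q_values, dtype=float).ravel()
    if method_scores.shape != q_values.shape:
        raise ValueError("method_scores and q_values must have the same length")
    rho, _ = spearmanr(method_scores, q_values)  # ties receive average ranks
    return float(rho)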

# Example usage:
# reference_policy_q = get_reference_q_values(trajectories)  # precomputed
# method_scores = my_dense_supervision_method(trajectories)
# qval = qval_score(method_scores, reference_policy_q)
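Because Q-alignment concerns how a signal orders the candidate actions at a given state, one plausible variant is to compute the correlation separately per state and then average. The sketch below does this under stated assumptions: the grouping by a `state_ids` key, the rule of skipping states with fewer than two actions or constant values, and the unweighted mean are illustrative choices, not the aggregation confirmed by the source. The data is synthetic.

```python
from collections import defaultdict

import numpy as np
from scipy.stats import spearmanr


def per_state_qval(state_ids, method_scores, q_values) -> float:
    """Mean Spearman rho over states, each computed across that state's actions."""
    groups = defaultdict(list)
    for sid, score, q in zip(state_ids, method_scores, q_values):
        groups[sid].append((score, q))

    rhos = []
    for pairs in groups.values():
        scores, qs = (np.asarray(v, dtype=float) for v in zip(*pairs))
        # Rank correlation is undefined for a single action or constant values.
        if len(pairs) < 2 or scores.max() == scores.min() or qs.max() == qs.min():
            continue
        rhos.append(spearmanr(scores, qs)[0])
    return float(np.mean(rhos)) if rhos else float("nan")


# Synthetic example: the signal orders actions perfectly in state "s0"
# and in exactly reversed order in state "s1".
state_ids = ["s0", "s0", "s0", "s1", "s1", "s1"]
method_scores = [0.2, 0.9, 0.5, 3.0, 1.0, 2.0]
q_values = [0.1, 0.8, 0.4, 0.3, 0.9, 0.6]

result = per_state_qval(state_ids, method_scores, q_values)
print(result)  # rho is +1 for s0 and -1 for s1, so the mean is 0
```

Grouping by state keeps differences in score scale between states from affecting the result, which a single pooled correlation would not.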

Key Results

Performance clusters by method family (prompting > self-distillation > embedding similarity).
Findings are consistent across model sizes and environments.
QVal requires no training of the evaluated method, reducing cost ~100x compared to full pipeline evaluation.

