QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
By Sergio Hernández-Gutiérrez, Matteo Merler, Ilze Amanda Auzina, Joschka Strüber, Ameya Prabhu, Matthias Bethge
"Introduces QVal, a training-free benchmark to evaluate dense supervision signals for long-horizon LLM agents by measuring Q-alignment with a reference policy, enabling cheap comparison of methods."
Abstract
LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the goodness of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and renders different methodological families requiring distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference-policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.
Technical Analysis & Implementation
QVal: Training-Free Evaluation of Dense Supervision Signals
Problem
LLM agents in long-horizon tasks (e.g., web navigation, robotics) suffer from sparse outcome rewards. Dense supervision methods (e.g., confidence scores, embedding similarities) aim to supply intermediate rewards, but they are usually evaluated through the downstream performance of a full training pipeline. That is expensive and conflates signal quality with engineering choices.
Core Methodology: QVal Score
QVal measures Q-alignment: how well a dense supervision signal $\mathrm{score}(s,a)$ orders actions according to the Q-values $Q^{\pi_{\mathrm{ref}}}(s,a)$ of a strong reference policy $\pi_{\mathrm{ref}}$.
1. Reference Policy Training: Train a policy (e.g., via PPO or behavior cloning) using outcome-only rewards to obtain Q-values for each state-action pair across collected trajectories. This is done once per environment.
2. Score Extraction: For each evaluated dense supervision method, extract scores for the same state-action pairs.
3. Rank Correlation: Compute Spearman's rank correlation coefficient $\rho$ between the method's scores and the reference Q-values:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)} $$

where $n$ is the number of state-action pairs and $d_i$ is the difference between the rank of the method's score and the rank of the reference Q-value for pair $i$. This closed form is exact only when there are no tied ranks; with ties, $\rho$ is computed as the Pearson correlation of the average ranks. Higher $\rho$ indicates closer agreement with the reference policy's action ordering.
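The closed-form expression for $\rho$ assumes that no two scores share a rank, which may not hold for coarse signals such as integer ratings. The following is a minimal, dependency-free sketch that contrasts the closed form with the general definition (Pearson correlation of average ranks). The function names and the toy numbers are illustrative assumptions, not part of QVal.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman_general(scores: Sequence[float], q_values: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of average ranks (handles ties)."""
    return pearson(average_ranks(scores), average_ranks(q_values))


def spearman_closed_form(scores: Sequence[float], q_values: Sequence[float]) -> float:
    """Closed form 1 - 6*sum(d_i^2) / (n(n^2-1)); exact only without ties."""
    rx, ry = average_ranks(scores), average_ranks(q_values)
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    q_ref = [0.9, 0.1, 0.5, 0.7]      # toy reference Q-values (made up)
    distinct = [8.0, 2.0, 3.0, 9.0]   # toy scores without ties
    tied = [8.0, 2.0, 8.0, 8.0]       # toy scores with a three-way tie
    print(spearman_general(distinct, q_ref), spearman_closed_form(distinct, q_ref))
    print(spearman_general(tied, q_ref), spearman_closed_form(tied, q_ref))
```

On the tie-free input the two functions agree; on the tied input they differ, which is why library routines that rank with ties are the safer default.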
Implementation Details
- Environments: 4 diverse tasks (WebShop, ALFWorld, etc.) with trajectory lengths up to hundreds of steps.
- Methods: 21 methods spanning 7 families (e.g., prompting, self-distillation, embedding similarity, intrinsic confidence).
- Backbones: 6 open-weight LLMs (e.g., Llama 2, Mistral).
- Key Finding: Simple prompting baselines (e.g., "rate this action from 0 to 10") consistently outperform recent dense supervision methods from the literature.
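To make the prompting baseline concrete, here is a rough sketch of how a "rate this action from 0 to 10" signal could be turned into a numeric score. The prompt wording, the function names, and the `generate` callable (a stand-in for whatever LLM client is in use) are all assumptions made for illustration; the source does not specify the actual prompt or parsing rules.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the real benchmark prompt is not given in the text.
PROMPT_TEMPLATE = (
    "You are evaluating an agent.\n"
    "State:\n{state}\n\n"
    "Proposed action:\n{action}\n\n"
    "Rate this action from 0 to 10. Answer with a single number."
)


def parse_rating(reply: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 10], else None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 10.0 else None


def prompt_score(state: str, action: str, generate: Callable[[str], str]) -> Optional[float]:
    """Score one state-action pair by asking a model for a 0-10 rating.

    `generate` is an assumed interface: it takes a prompt string and returns
    the model's text reply.
    """
    reply = generate(PROMPT_TEMPLATE.format(state=state, action=action))
    return parse_rating(reply)


if __name__ == "__main__":
    # Stub model used only to demonstrate the call pattern.
    fake_model = lambda prompt: "Rating: 7"
    print(prompt_score("search results page", "click[item 3]", fake_model))
```

Integer ratings like these produce many tied scores, so a tie-aware rank correlation matters when comparing them with reference Q-values.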
Code Snippet
```python
import numpy as np
from scipy.stats import spearmanr


def qval_score(method_scores: np.ndarray, q_values: np.ndarray) -> float:
    """Compute the QVal score as the Spearman correlation between method scores and Q-values."""
    # spearmanr returns (correlation, p-value); only the correlation is needed.
    rho, _ = spearmanr(method_scores, q_values)
    return float(rho)


# Example usage (the helper names below are placeholders, not a real API):
# reference_policy_q = get_reference_q_values(trajectories)  # precomputed
# method_scores = my_dense_supervision_method(trajectories)
# qval = qval_score(method_scores, reference_policy_q)
```
Key Results
- Performance clusters by method family (prompting > self-distillation > embedding similarity).
- Findings are consistent across model sizes and environments.
- QVal requires no training of the evaluated method, reducing cost ~100x compared to full pipeline evaluation.
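One simple way to inspect clustering by family is to average per-method QVal scores within each family and sort the result. The sketch below assumes results are stored as `(family, method, score)` tuples; that layout, the function name, and the numbers in the demo are invented placeholders and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_by_family(
    results: Iterable[Tuple[str, str, float]],
) -> List[Tuple[str, float]]:
    """Average QVal scores per family, sorted from highest to lowest mean."""
    by_family: Dict[str, List[float]] = defaultdict(list)
    for family, _method, score in results:
        by_family[family].append(score)
    averaged = [(family, mean(scores)) for family, scores in by_family.items()]
    return sorted(averaged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    toy_results = [
        ("family_a", "method_1", 0.75),
        ("family_a", "method_2", 0.25),
        ("family_b", "method_3", 0.125),
    ]
    for family, avg in mean_score_by_family(toy_results):
        print(family, avg)
```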