When are likely answers right? On Sequence Probability and Correctness in LLMs

Abstract

Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.

Technical Analysis & Implementation

Overview

This paper investigates the relationship between the conditional sequence probability $P(y|x)$ of a text continuation $y$ given a prompt $x$ and its correctness across multiple axes: decoding methods, hyperparameters, different prompt-answer pairs, and repeated responses to the same prompt. The central question is whether maximizing sequence probability (e.g., via beam search or temperature scaling) reliably improves accuracy.

Key Findings

1. Across prompt-answer pairs: Higher $P(y|x)$ correlates with correctness within a fixed dataset. That is, prompt-answer pairs to which the model assigns high probability tend to be correct more often than pairs to which it assigns low probability. This is a statement about comparisons across different prompts, not about candidates for a single prompt.
2. Across decoding methods/hyperparameters: Changing hyperparameters or switching decoding methods to increase $P(y|x)$ does not reliably improve accuracy. For example, lowering temperature raises the probability of the generated sequences but may hurt accuracy.
3. Across repeated responses (same prompt): Sequence probability is a poor indicator of correctness among different stochastic samples from the same prompt. A high-probability sample is not necessarily more likely to be correct than a low-probability one.
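
The mechanical half of finding 2 (lower temperature concentrates sampling on tokens the model already rates as likely) can be checked on a toy distribution. The sketch below is illustrative only: `toy_logits` is a made-up 4-token vocabulary, not data from the paper, and the code says nothing about accuracy, which is exactly the part the paper finds unreliable.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_log_prob(logits, temperature):
    """Expected log-probability, under the unscaled model (T = 1),
    of a token sampled at the given temperature."""
    base = softmax(logits, 1.0)
    sampling = softmax(logits, temperature)
    return sum(q * math.log(p) for q, p in zip(sampling, base))

# Hypothetical next-token logits (assumption, for illustration only).
toy_logits = [2.0, 1.0, 0.5, -1.0]
for t in (1.5, 1.0, 0.5, 0.1):
    print(f"T={t}: expected log p = {expected_log_prob(toy_logits, t):.3f}")
```

The expected log-probability rises as the temperature falls, because sampling mass moves toward the argmax token. Whether the resulting outputs are more often correct is a separate empirical question.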

Methodology

Models: Llama-2, Mistral, and other open-weight LLMs of varying sizes.
Benchmarks: MMLU, HellaSwag, GSM8K, and other reasoning/QA tasks.
Decoding methods: greedy decoding, temperature sampling, top-k, top-p (nucleus), beam search.
Metric: For each prompt and response, compute sequence probability $P(y|x) = \prod_{t=1}^{T} P(y_t | y_{<t}, x)$. Correctness is defined as exact match or task-specific evaluation.
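
In practice the product in the metric above is usually evaluated in log space, $\log P(y|x) = \sum_{t=1}^{T} \log P(y_t | y_{<t}, x)$, because multiplying many probabilities underflows floating-point arithmetic. A minimal sketch follows; the per-token probabilities are hypothetical values chosen for illustration, and the function name is not from the paper.

```python
import math

def sequence_log_prob_from_token_probs(token_probs):
    """Return log P(y|x) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token conditionals P(y_t | y_<t, x) for a 4-token response.
short_response = [0.9, 0.8, 0.95, 0.6]
short_log_p = sequence_log_prob_from_token_probs(short_response)

# A long response: the raw product underflows to zero, the log-sum does not.
long_response = [0.5] * 2000
raw_product = math.prod(long_response)
long_log_p = sequence_log_prob_from_token_probs(long_response)

print(short_log_p, raw_product, long_log_p)
```

The long-response case shows why the raw product is unusable as a ranking signal for long generations: every sufficiently long sequence collapses to the same value, while the log-sum stays finite and comparable.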

Analysis: The authors compute Spearman rank correlation coefficients and fit regressions to quantify how well sequence probability aligns with correctness. They also measure calibration, that is, whether probability values match empirical accuracy.
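
As a rough illustration of the correlation analysis, the sketch below computes a Spearman rank correlation between log-probabilities and binary correctness labels using only the standard library. The data are invented for the example, the helper names are hypothetical, and the paper's exact procedure (tie handling, grouping, regression setup) is not specified here, so treat this as one plausible reading rather than the authors' pipeline.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical data: one (log-probability, correct?) pair per prompt.
log_probs = [-1.2, -0.4, -3.5, -2.0, -0.9, -4.1]
correct = [1, 1, 0, 0, 1, 0]
rho = spearman(log_probs, correct)
print(f"Spearman rho = {rho:.3f}")
```

Because correctness is binary, most ranks on that side are tied, which is why the helper assigns average ranks rather than arbitrary ones.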

Implications for Practitioners

Do not rely on increasing sequence probability (e.g., by lowering temperature) to boost accuracy; it may backfire.
For self-consistency or self-improvement without a verifier, sequence probability is not a reliable signal to select among multiple candidate responses.
Higher probability across different prompts can be a useful signal for dataset filtering or curriculum learning.
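
To make the selection point concrete, the sketch below contrasts two ways of picking one answer from several samples for the same prompt: taking the highest log-probability sample, and taking a majority vote over final answers (the usual self-consistency rule). The candidate list is invented for illustration and the function names are hypothetical. The example only shows that the two rules can disagree; it does not establish which answer is correct.

```python
from collections import Counter

def select_by_log_prob(candidates):
    """Return the answer of the candidate with the highest log-probability."""
    return max(candidates, key=lambda c: c["log_prob"])["answer"]

def select_by_majority(candidates):
    """Return the most frequent answer (self-consistency-style vote)."""
    return Counter(c["answer"] for c in candidates).most_common(1)[0][0]

# Hypothetical samples for a single prompt (assumption, not paper data).
candidates = [
    {"answer": "42", "log_prob": -12.3},
    {"answer": "42", "log_prob": -15.1},
    {"answer": "17", "log_prob": -9.8},
    {"answer": "42", "log_prob": -14.0},
]

print("highest log-prob:", select_by_log_prob(candidates))
print("majority vote:   ", select_by_majority(candidates))
```

Here the single most probable sample is an outlier answer, while the vote picks the answer that most samples agree on.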

Code Illustration: Computing Sequence Probability

Below is a minimal PyTorch snippet to compute the log-probability of a response given a prompt. It assumes a Hugging Face-style causal language model and tokenizer, with the model and inputs on the same device:

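import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokenizer, prompt, response):
    # Tokenize the prompt (with special tokens) and the response (without,
    # so that no extra start token is inserted mid-sequence).
    # Note: tokenizing the two parts separately can differ slightly from
    # tokenizing the concatenated string at the boundary.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids

    # The model must see prompt + response to score the response tokens
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits  # shape: (1, seq_len, vocab_size)

    # Logits at position t predict token t + 1, so the response tokens are
    # predicted by positions [prompt_len - 1, seq_len - 2]
    start = prompt_ids.shape[1] - 1
    log_probs = F.log_softmax(logits[:, start:-1, :], dim=-1)
    # Gather the log-probabilities of the actual response tokens
    token_log_probs = log_probs.gather(2, response_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()  # log P(y|x), always <= 0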

This captures the core idea of measuring $\log P(y|x)$.

Conclusion

The paper clarifies the nuanced relationship between sequence probability and correctness, providing actionable warnings for practitioners using likelihood-based decoding or self-consistency.
