llm · Published: June 25, 2026

When are likely answers right? On Sequence Probability and Correctness in LLMs

By Johannes Zenn, Jonas Geiping

Research TL;DR

"Quantifies when higher sequence probability aligns with correctness in LLM decoding; finds it predicts accuracy across prompts but not across decoding methods or repeated responses."

Abstract

Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.

Technical Analysis & Implementation

Overview

This paper investigates the relationship between the conditional sequence probability $P(y|x)$ of a text continuation $y$ given a prompt $x$ and its correctness across multiple axes: decoding methods, hyperparameters, different prompt-answer pairs, and repeated responses to the same prompt. The central question is whether maximizing sequence probability (e.g., via beam search or temperature scaling) reliably improves accuracy.
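For reference, temperature scaling as commonly defined can be written as below. The symbols $z_{t,v}$ (the logit of vocabulary item $v$ at step $t$) and $\tau$ (the temperature) are notation assumed here for illustration and are not taken from the paper.

```latex
\[
P_\tau(y_t = v \mid y_{<t}, x)
  = \frac{\exp(z_{t,v} / \tau)}{\sum_{v'} \exp(z_{t,v'} / \tau)},
  \qquad \tau > 0
\]
```

With $\tau = 1$ this is the model's own next-token distribution. Values $\tau < 1$ concentrate mass on the highest-logit tokens, so sampled continuations tend to have higher $P(y|x)$ under the unscaled model, and the limit $\tau \to 0$ recovers greedy decoding.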

Key Findings

  1. Across prompt-answer pairs: Higher $P(y|x)$ correlates with correctness within a fixed dataset. That is, when prompt-answer pairs from different prompts are compared, the higher-probability pairs tend to be correct more often. This is a statement about comparisons across prompts, not about competing candidates for a single prompt.
  2. Across decoding methods/hyperparameters: Changing hyperparameters or switching decoding methods to increase $P(y|x)$ does not reliably improve accuracy. For example, lowering the temperature raises the probability of the generated sequences but does not reliably raise accuracy, and may lower it.
  3. Across repeated responses (same prompt): Sequence probability is a poor indicator of correctness among different stochastic samples from the same prompt. A high-probability sample is not necessarily more likely to be correct than a low-probability one.

Methodology

  • Models: Llama-2, Mistral, and other open-weight LLMs of varying sizes.
  • Benchmarks: MMLU, HellaSwag, GSM8K, and other reasoning/QA tasks.
  • Decoding methods: greedy decoding, temperature sampling, top-k, top-p (nucleus), beam search.
  • Metric: For each prompt and response, compute sequence probability $P(y|x) = \prod_{t=1}^{T} P(y_t | y_{<t}, x)$. Correctness is defined as exact match or task-specific evaluation.
  • Analysis: They compute correlation coefficients (Spearman) and perform regression to quantify the alignment between probability and correctness. They also measure calibration: do probability values match empirical accuracy?
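The two granularities of analysis, across prompt-answer pairs and within a single prompt, can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the `records` values are invented toy numbers, and `within_prompt_win_rate` is a hypothetical helper chosen here as one simple way to ask whether probability separates correct from incorrect samples of the same prompt.

```python
from collections import defaultdict
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def within_prompt_win_rate(records):
    """Fraction of same-prompt (correct, incorrect) pairs in which the correct
    response has the strictly higher log-probability; 0.5 is chance level."""
    by_prompt = defaultdict(list)
    for prompt_id, log_prob, correct in records:
        by_prompt[prompt_id].append((log_prob, correct))
    wins = total = 0
    for samples in by_prompt.values():
        right = [lp for lp, c in samples if c]
        wrong = [lp for lp, c in samples if not c]
        for r in right:
            for w in wrong:
                wins += r > w
                total += 1
    return wins / total if total else float("nan")


# Invented toy records (NOT data from the paper): (prompt id, log P(y|x), correct?)
records = [
    ("p1", -2.0, 1), ("p1", -2.5, 1), ("p1", -2.2, 0),
    ("p2", -9.0, 0), ("p2", -8.0, 0), ("p2", -8.5, 1),
    ("p3", -3.0, 1), ("p3", -3.4, 1),
    ("p4", -7.5, 0), ("p4", -7.0, 0),
]

across_pairs_rho = spearman([lp for _, lp, _ in records], [c for _, _, c in records])
within_rate = within_prompt_win_rate(records)
print(f"across-pairs Spearman: {across_pairs_rho:.3f}")
print(f"within-prompt win rate: {within_rate:.2f}")
```

On this toy data the across-pairs correlation is positive while the within-prompt win rate is exactly 0.5, which shows that the two quantities can diverge. The numbers were constructed to make that point and say nothing about the magnitudes reported in the paper.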

Implications for Practitioners

  • Do not rely on increasing sequence probability (e.g., by lowering temperature) to boost accuracy; it may backfire.
  • For self-consistency or self-improvement without a verifier, sequence probability is not a reliable signal to select among multiple candidate responses.
  • Higher probability across different prompts can be a useful signal for dataset filtering or curriculum learning.
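To make the candidate-selection point concrete, the sketch below contrasts picking the highest-probability response with majority voting over final answers, the usual self-consistency rule. The candidate list, its answers, and its log-probabilities are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def select_by_log_prob(candidates):
    """Return the answer of the candidate with the highest sequence log-probability."""
    return max(candidates, key=lambda c: c["log_prob"])["answer"]


def select_by_majority(candidates):
    """Return the most frequent final answer (ties go to the first one seen)."""
    counts = Counter(c["answer"] for c in candidates)
    return counts.most_common(1)[0][0]


# Invented samples for one prompt: the single most likely response disagrees
# with the answer that most samples converge on.
candidates = [
    {"answer": "42", "log_prob": -14.2},
    {"answer": "42", "log_prob": -15.9},
    {"answer": "17", "log_prob": -11.3},
    {"answer": "42", "log_prob": -16.4},
]

print(select_by_log_prob(candidates))  # "17"
print(select_by_majority(candidates))  # "42"
```

The toy only shows that the two rules can disagree; which answer is right is not something the example can establish.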

Code Illustration: Computing Sequence Probability

Below is a minimal PyTorch snippet, assuming a Hugging Face-style causal language model and tokenizer, that computes the log-probability of a response given a prompt:

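import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokenizer, prompt, response):
    # Tokenize prompt and response separately so the boundary between them is known.
    # add_special_tokens=False keeps BOS/EOS tokens out of the response ids.
    # Note: tokenizing the concatenated string could split tokens differently at the boundary.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1).to(model.device)

    # Score prompt + response in a single forward pass
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits  # shape: (1, seq_len, vocab_size)

    # The logits at position t predict the token at position t + 1, so the
    # response tokens are predicted by positions n_prompt - 1 .. seq_len - 2.
    n_prompt = prompt_ids.shape[1]
    log_probs = F.log_softmax(logits[:, n_prompt - 1 : -1].float(), dim=-1)
    targets = input_ids[:, n_prompt:]
    # Gather the log-probabilities of the actual response tokens
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()  # log P(y|x), always <= 0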

This captures the core idea of measuring $\log P(y|x)$.
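As a small worked example of why the sum of log-probabilities is used rather than the raw product, the standard-library sketch below compares the two on made-up per-token probabilities. The function name and the numbers are illustrative assumptions, not values from the paper.

```python
import math


def log_prob_from_token_probs(token_probs):
    """Sum of per-token log-probabilities, i.e. log of their product."""
    return sum(math.log(p) for p in token_probs)


# Short sequence: the product and the log-sum agree.
short = [0.9, 0.5, 0.8]
short_log_prob = log_prob_from_token_probs(short)
print(math.exp(short_log_prob))  # approximately 0.36

# Long sequence: the raw product underflows to 0.0 in float64,
# while the log-sum stays finite and comparable across responses.
long = [0.01] * 200
long_product = math.prod(long)
long_log_prob = log_prob_from_token_probs(long)
print(long_product)   # 0.0
print(long_log_prob)  # about -921.03
```

Because the sum grows more negative with every token, longer responses score lower than shorter ones at the same per-token confidence. Dividing by the number of tokens is a common way to compare responses of different lengths; whether the paper applies such a normalization is not stated in this summary.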

Conclusion

The paper clarifies the nuanced relationship between sequence probability and correctness, providing actionable warnings for practitioners using likelihood-based decoding or self-consistency.
