Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Abstract

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-input decoder-noise floor in verified cells. Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials. In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning. These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches. We propose cross-ordering flip rate as a standard reporting axis for MLLMs.

Technical Analysis & Implementation

Overview

This paper identifies and quantifies order sensitivity in multimodal large language models (MLLMs) — the phenomenon that shuffling the presentation order of inputs (options, evidence chunks, document ranks, images, or mixed modalities) can change the model's answer. The authors propose Facet-Probe, a systematic audit framework covering five ordering facets, and analyze 18 frontier and open-weight MLLMs using a Bayesian item-response model to separate noise from systematic bias.

Facet-Probe Methodology

Each facet corresponds to a different type of input order:

1. Option ordering: Shuffling the order of answer choices (e.g., A/B/C to C/A/B).
2. Evidence-chunk ordering: Changing the order of text evidence passages.
3. Document-rank ordering: Permuting the ranking of retrieved documents.
4. Image-set ordering: Rearranging the order of images in a multimodal input.
5. Mixed-modality ordering: Simultaneously shuffling multiple modalities.

For each facet, the model is evaluated on a set of items (questions with multiple orderings). A flip occurs when the model's answer changes due to order variation.
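
For the option-ordering facet, a letter change alone is not a flip: the model may pick "B" instead of "A" and still select the same content. The sketch below shows one way to shuffle options and map the chosen letter back to the canonical option index before comparing. The helper names (`shuffle_options`, `to_canonical_index`, `is_flip`) are hypothetical and are not taken from the paper.

```python
import random

def shuffle_options(options, seed):
    """Return (shuffled, perm) such that shuffled[k] == options[perm[k]]."""
    rng = random.Random(seed)          # seeded so the shuffle is reproducible
    perm = list(range(len(options)))
    rng.shuffle(perm)
    return [options[i] for i in perm], perm

def to_canonical_index(chosen_label, perm):
    """Map a letter chosen in the shuffled prompt to the canonical option index."""
    position = ord(chosen_label.upper()) - ord("A")
    return perm[position]

def is_flip(canonical_choice, shuffled_label, perm):
    """Flip = the selected content changed, not merely the letter (assumed definition)."""
    return to_canonical_index(shuffled_label, perm) != canonical_choice
```

With `perm = [2, 0, 1]`, the letter "B" in the shuffled prompt refers to canonical option 0, so a model that answered option 0 on the canonical ordering and "B" on the shuffled one has not flipped.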

Bayesian Item-Response Model

To disentangle ordering noise from per-facet bias, the authors employ a Bayesian hierarchical model. For each item $i$ and ordering $j$, let $Y_{ij} = 1$ if the model's answer matches a reference (canonical) answer, and $0$ otherwise. The model is:

$$ \text{logit}(P(Y_{ij}=1)) = \theta_i - \beta_{f(j)} - \epsilon_{ij} $$

where:

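$\theta_i \sim \mathcal{N}(0, \sigma_\theta^2)$ is an item-level effect. Because it enters with a positive sign, larger values make a match with the canonical answer more likely, so it acts as an easiness term rather than a difficulty term.
$\beta_{f(j)}$ is the bias of the facet $f(j)$ to which ordering $j$ belongs (e.g., option shuffling); positive values lower the match probability.
$\epsilon_{ij} \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is random noise specific to item $i$ under ordering $j$.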

This model yields per-facet flip rates adjusted for item-level variability.
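
As a worked illustration of the formula above, the following sketch converts the linear predictor into a match probability with the inverse logit. The numeric values of `theta` and `beta` are made up for illustration and are not estimates from the paper; the function names are likewise hypothetical.

```python
import math

def inv_logit(x):
    """Inverse of the logit link: maps a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def match_probability(theta_i, beta_f, eps_ij=0.0):
    """P(Y_ij = 1) under logit(P) = theta_i - beta_f - eps_ij."""
    return inv_logit(theta_i - beta_f - eps_ij)

# Illustrative values only (assumptions, not fitted parameters).
theta = 1.5
for facet, beta in {"option": 0.2, "image_set": 1.0}.items():
    p = match_probability(theta, beta)
    print(f"{facet}: P(match) = {p:.3f}, implied mismatch = {1 - p:.3f}")
```

Holding the item term fixed, a larger facet bias gives a lower match probability, which is how the model attributes systematic disagreement to a facet rather than to the item.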

Key Results

None of the 18 audited MLLMs is order-invariant: screened per-facet panel-mean flip rates span 24% to 50%.
A same-ordering control (Gemini at temperature 0) estimates a substantial ordering excess, i.e., flips attributable to reordering beyond the same-input decoder-noise floor, in verified cells.
Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials.
In the Gemini mitigation tests, training-free prompt changes (e.g., an instruction to ignore order) have modality-conditional effects and do not transfer from text to visual reasoning.
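
The same-ordering control can be illustrated with a simple subtraction: the flip rate observed across shuffled orderings minus the flip rate observed when the identical input is repeated. This is a minimal sketch under that assumption; the paper may estimate the excess differently (for example, inside its Bayesian model), and the function names and toy data here are hypothetical.

```python
def flip_rate(answers, reference):
    """Fraction of trials whose answer differs from the reference answer."""
    if not answers:
        raise ValueError("need at least one trial")
    return sum(a != reference for a in answers) / len(answers)

def ordering_excess(shuffled_answers, repeated_answers, reference):
    """Return (excess, cross_ordering_rate, decoder_noise_floor)."""
    cross = flip_rate(shuffled_answers, reference)   # orderings differ
    floor = flip_rate(repeated_answers, reference)   # identical input repeated
    return cross - floor, cross, floor

# Toy answers for one item whose canonical answer is "A" (made-up data).
shuffled = ["B", "A", "A", "C", "A", "B", "A", "A", "C", "A"]
repeated = ["A", "A", "A", "A", "B", "A", "A", "A", "A", "A"]
excess, cross, floor = ordering_excess(shuffled, repeated, "A")
print(f"cross={cross:.2f} floor={floor:.2f} excess={excess:.2f}")
```

In this toy data 4 of 10 shuffled trials and 1 of 10 repeated trials differ from the canonical answer, so only the remaining gap is attributed to ordering rather than to decoder stochasticity.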

Implementation Sketch

Below is a simplified Python snippet illustrating the flip-detection logic:

import itertools

def compute_flip_rate(model, items, facet_orderings):
    """
    Pairwise cross-ordering flip rate.

    model: function that takes (question, ordering) and returns an answer string
    items: list of dicts with 'question' and 'facet' keys
           (other keys, such as 'reference_answer', are ignored here)
    facet_orderings: dict mapping each facet name to the list of orderings to test
    """
    total_flips = 0
    total_pairs = 0
    for item in items:
        # Query the model once per ordering of this item's facet
        answers = [model(item['question'], ordering)
                   for ordering in facet_orderings[item['facet']]]
        # Count disagreements over all unordered pairs of orderings
        for a, b in itertools.combinations(answers, 2):
            if a != b:
                total_flips += 1
            total_pairs += 1
    # Guard against division by zero when no item has two or more orderings
    return total_flips / total_pairs if total_pairs else 0.0
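
To see how the pairwise counting above behaves, the sketch below runs it against a toy position-biased model. Everything here is hypothetical: `toy_model`, the `OPTIONS` lookup, and the single item are invented for illustration, and a compact restatement of the pairwise count (`pairwise_flip_rate`) is included only so the block runs on its own.

```python
import itertools

# Hypothetical item: question -> (options in canonical order, gold option).
OPTIONS = {"capital of France?": (["Paris", "Rome", "Oslo"], "Paris")}

def toy_model(question, ordering):
    """Position-biased stand-in: answers correctly only if the gold option
    is shown first or second; otherwise it returns the first option shown."""
    options, gold = OPTIONS[question]
    shown = [options[i] for i in ordering]   # ordering is an index permutation
    return gold if shown.index(gold) < 2 else shown[0]

def pairwise_flip_rate(model, items, facet_orderings):
    """Share of ordering pairs (per item) on which the answers disagree."""
    flips = pairs = 0
    for item in items:
        answers = [model(item["question"], o)
                   for o in facet_orderings[item["facet"]]]
        for a, b in itertools.combinations(answers, 2):
            flips += a != b
            pairs += 1
    return flips / pairs if pairs else 0.0

items = [{"question": "capital of France?", "facet": "option"}]
facet_orderings = {"option": list(itertools.permutations(range(3)))}
rate = pairwise_flip_rate(toy_model, items, facet_orderings)
print(f"pairwise flip rate: {rate:.2f}")
```

Only 2 of the 6 orderings change the toy model's answer, yet 9 of the 15 ordering pairs disagree, giving a pairwise rate of 0.60. A pairwise flip rate and a flip rate measured against the canonical ordering are therefore different quantities, and a report should state which one it uses.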

Conclusion

The paper proposes cross-ordering flip rate as a standard reporting axis for MLLMs. Its results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, which motivates future work on training-time and architectural approaches.
