multimodal · Published: June 24, 2026

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

By Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli

Research TL;DR

"Introduces Facet-Probe, a five-facet audit revealing that all 18 tested MLLMs are order-sensitive with flip rates 24-50%. Bayesian item-response model quantifies per-facet bias; prompt mitigation is modality-conditional and insufficient."

Abstract

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-input decoder-noise floor in verified cells. Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials. In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning. These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches. We propose cross-ordering flip rate as a standard reporting axis for MLLMs.

Technical Analysis & Implementation

Overview

This paper identifies and quantifies order sensitivity in multimodal large language models (MLLMs): shuffling the presentation order of inputs (options, evidence chunks, document ranks, images, or mixed modalities) can change the model's answer even though the evidence itself is unchanged. The authors propose Facet-Probe, a systematic audit framework covering five ordering facets, and analyze 18 frontier and open-weight MLLMs using a Bayesian item-response model that separates ordering noise from per-facet bias.

Facet-Probe Methodology

Each facet corresponds to a different type of input order:

  1. Option ordering: Shuffling the order of answer choices (e.g., A/B/C to C/A/B).
  2. Evidence-chunk ordering: Changing the order of text evidence passages.
  3. Document-rank ordering: Permuting the ranking of retrieved documents.
  4. Image-set ordering: Rearranging the order of images in a multimodal input.
  5. Mixed-modality ordering: Simultaneously shuffling multiple modalities.

For each facet, the model is evaluated on a set of items (questions with multiple orderings). A flip occurs when the model's answer changes due to order variation.
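As a rough illustration of the option-ordering facet, the sketch below enumerates orderings of the answer choices and maps a returned letter back to the option text. The helper names (`option_orderings`, `render_prompt`, `label_to_content`) and the A/B/C labelling scheme are assumptions made for this example, not the paper's code. Comparing answers by content rather than by letter matters here, because the same option receives a different letter in each ordering.

```python
import itertools

LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def option_orderings(options, limit=None):
    """Return up to `limit` permutations of the options (all of them if limit is None)."""
    return list(itertools.islice(itertools.permutations(options), limit))

def render_prompt(question, ordered_options):
    """Label the options A, B, C, ... in the order they are presented."""
    lines = [f"{LABELS[k]}. {opt}" for k, opt in enumerate(ordered_options)]
    return question + "\n" + "\n".join(lines)

def label_to_content(label, ordered_options):
    """Map a returned letter back to option text so answers are comparable across orderings."""
    return ordered_options[LABELS.index(label)]

# Hypothetical usage: the letter "B" refers to different content in each ordering.
orderings = option_orderings(["red", "green", "blue"], limit=2)
for ordering in orderings:
    print(render_prompt("Which colour is the car?", ordering))
    print("B ->", label_to_content("B", ordering))
```

Under this scheme, a flip on the option facet would be recorded when the mapped-back option text differs between two orderings of the same item.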

Bayesian Item-Response Model

To disentangle ordering noise from per-facet bias, the authors employ a Bayesian hierarchical model. For each item $i$ and ordering $j$, let $Y_{ij} = 1$ if the model's answer matches a reference (canonical) answer, and $0$ otherwise. The model is:

$$ \text{logit}(P(Y_{ij}=1)) = \theta_i - \beta_{f(j)} - \epsilon_{ij} $$

where:

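  • $\theta_i \sim \mathcal{N}(0, \sigma_\theta^2)$ is the item-level effect. It is described as difficulty, but with the positive sign in the equation above a larger $\theta_i$ makes a match more likely, so it acts as item easiness.
  • $\beta_{f(j)}$ is the bias due to facet $f(j)$, the facet that ordering $j$ belongs to (e.g., option shuffling); a larger value lowers the probability of matching the reference answer.
  • $\epsilon_{ij} \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is random noise.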

This model yields per-facet flip rates adjusted for item-level variability.
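To make the roles of the terms concrete, the following sketch forward-simulates match outcomes $Y_{ij}$ from the equation above using only the standard library. It is a simulation of the generative model as written in this summary, not Bayesian inference and not the authors' fitting code; the function name `simulate_match_rate` and all default values (item count, number of orderings, standard deviations) are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_match_rate(beta, n_items=2000, n_orderings=4,
                        sigma_theta=1.0, sigma_eps=0.5, seed=0):
    """Simulate Y_ij with logit P(Y_ij = 1) = theta_i - beta - eps_ij; return the mean of Y."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_items):
        theta = rng.gauss(0.0, sigma_theta)       # item-level effect, shared by all orderings
        for _ in range(n_orderings):
            eps = rng.gauss(0.0, sigma_eps)       # per-trial noise
            p_match = sigmoid(theta - beta - eps)
            if rng.random() < p_match:
                matches += 1
    return matches / (n_items * n_orderings)

# A larger facet bias should lower the share of answers that match the reference.
for beta in (0.0, 1.0, 2.0):
    print(beta, simulate_match_rate(beta))
```

Fitting the model to real audit data would instead require a probabilistic programming tool or a custom sampler, which is outside the scope of this sketch.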

Key Results

  • No model is order-invariant: after screening, per-facet flip rates averaged over the 18-model panel span 24% to 50%.
  • A same-ordering control, run on Gemini at temperature 0, estimates the decoder-stochastic floor. In verified cells, flips under reordering substantially exceed that same-input floor.
  • Capability predicts but does not eliminate flips: the best model still flips on 13.4% of trials.
  • In the Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning.
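The same-ordering control can be read as a baseline subtraction: flips observed when the input is reordered, minus flips observed when the identical input is simply run again. The sketch below assumes that simple difference; the summary does not spell out the paper's exact estimator, and the names `flip_rate` and `ordering_excess` as well as the toy answer lists are hypothetical.

```python
def flip_rate(reference_answers, other_answers):
    """Fraction of items whose answer differs between two runs over the same items."""
    if len(reference_answers) != len(other_answers):
        raise ValueError("both runs must cover the same items")
    if not reference_answers:
        raise ValueError("at least one item is required")
    flips = sum(a != b for a, b in zip(reference_answers, other_answers))
    return flips / len(reference_answers)

def ordering_excess(canonical, shuffled, same_order_rerun):
    """Flip rate under reordering minus the flip rate of an identical-input rerun."""
    observed = flip_rate(canonical, shuffled)          # ordering effect plus decoder noise
    floor = flip_rate(canonical, same_order_rerun)     # decoder noise only
    return observed - floor

# Toy data: 2 of 4 answers change under shuffling, 1 of 4 changes on a plain rerun.
canonical = ["A", "B", "C", "D"]
shuffled = ["A", "C", "C", "A"]
rerun = ["A", "B", "C", "A"]
print(ordering_excess(canonical, shuffled, rerun))
```

An excess near zero would suggest the observed flips are mostly decoder noise, while a clearly positive excess points to the ordering itself.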

Implementation Sketch

Below is a simplified Python snippet illustrating flip detection as pairwise disagreement between the answers a model gives under different orderings of the same item:

import itertools

def compute_flip_rate(model, items, facet_orderings):
    """
    Pairwise flip rate across orderings, pooled over all items.

    model: callable that takes (question, ordering) and returns an answer string
    items: list of dicts, each with 'question' and 'facet' keys
    facet_orderings: dict mapping a facet name to the list of orderings to test
    """
    total_flips = 0
    total_pairs = 0
    for item in items:
        # Query the model once per ordering of this item's facet
        answers = [model(item['question'], ordering)
                   for ordering in facet_orderings[item['facet']]]
        # Count disagreements between every pair of orderings
        for a, b in itertools.combinations(answers, 2):
            if a != b:
                total_flips += 1
            total_pairs += 1
    # At least one item needs two or more orderings for the rate to be defined
    if total_pairs == 0:
        raise ValueError("no ordering pairs to compare")
    return total_flips / total_pairs
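To see what a pairwise flip rate looks like on known behaviour, the self-contained sketch below runs two toy stand-ins for a model over every permutation of a small option set. Both toy models and the helper `pairwise_flip_rate` are hypothetical and exist only for illustration: one always picks the first option shown, the other ignores position entirely.

```python
import itertools

def first_position_model(question, ordered_options):
    """Toy position-biased model: always picks whichever option is shown first."""
    return ordered_options[0]

def content_model(question, ordered_options):
    """Toy order-invariant model: picks the alphabetically smallest option."""
    return min(ordered_options)

def pairwise_flip_rate(model, question, options):
    """Share of ordering pairs on which the model's chosen option differs."""
    if len(options) < 2:
        raise ValueError("need at least two options to form distinct orderings")
    answers = [model(question, perm) for perm in itertools.permutations(options)]
    pairs = list(itertools.combinations(answers, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

options = ["cat", "dog", "eel"]
print(pairwise_flip_rate(first_position_model, "Which animal?", options))  # 12 of 15 pairs differ: 0.8
print(pairwise_flip_rate(content_model, "Which animal?", options))         # 0.0
```

With three options there are six orderings and fifteen pairs; the position-biased model agrees with itself only on the three pairs that share a first option, which gives the 0.8 above.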

Conclusion

The paper proposes cross-ordering flip rate as a standard reporting axis for MLLMs. Because its results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, the authors point to training-time and architectural approaches as directions for future work.
