Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
By Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli
"Introduces Facet-Probe, a five-facet audit revealing that all 18 tested MLLMs are order-sensitive with flip rates 24-50%. Bayesian item-response model quantifies per-facet bias; prompt mitigation is modality-conditional and insufficient."
Abstract
Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-input decoder-noise floor in verified cells. Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials. In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning. These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches. We propose cross-ordering flip rate as a standard reporting axis for MLLMs.
Technical Analysis & Implementation
Overview
This paper identifies and quantifies order sensitivity in multimodal large language models (MLLMs): shuffling the presentation order of inputs (options, evidence chunks, document ranks, images, or mixed modalities) can change the model's answer even though the shuffle is irrelevant to the task. The authors propose Facet-Probe, a systematic audit framework covering five ordering facets, and analyze 18 frontier and open-weight MLLMs using a Bayesian item-response model that separates ordering noise from per-facet bias.
Facet-Probe Methodology
Each facet corresponds to a different type of input order:
1. Option ordering: Shuffling the order of answer choices (e.g., A/B/C to C/A/B).
2. Evidence-chunk ordering: Changing the order of text evidence passages.
3. Document-rank ordering: Permuting the ranking of retrieved documents.
4. Image-set ordering: Rearranging the order of images in a multimodal input.
5. Mixed-modality ordering: Simultaneously shuffling multiple modalities.
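For each facet, the model is evaluated on a set of items, where each item is a question presented under multiple orderings. A flip occurs when the model's answer to the same item changes between orderings. Whether such a change is attributable to ordering rather than to decoder stochasticity is what the same-ordering control is designed to estimate.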
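The option-ordering facet can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the helper names `shuffled_option_orderings`, `render_prompt`, and `letter_to_content`, the letter labels, and the seeded sampling scheme are all assumptions introduced here. The point it shows is that labels move with position while option contents stay fixed, so predictions should be mapped back to content before two orderings are compared.

```python
import random

LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def shuffled_option_orderings(options, n_orderings, seed=0):
    """Return up to n_orderings distinct permutations, canonical order first.

    Hypothetical helper: fewer orderings are returned if the option set
    does not admit n_orderings distinct permutations.
    """
    rng = random.Random(seed)
    canonical = tuple(options)
    seen = {canonical}
    orderings = [canonical]
    # Cap attempts so small option sets cannot cause an endless loop.
    max_attempts = 100 * n_orderings
    attempts = 0
    while len(orderings) < n_orderings and attempts < max_attempts:
        attempts += 1
        perm = list(canonical)
        rng.shuffle(perm)
        perm = tuple(perm)
        if perm not in seen:
            seen.add(perm)
            orderings.append(perm)
    return orderings


def render_prompt(question, ordering):
    """Attach letter labels in presentation order."""
    lines = [f"{LABELS[k]}. {text}" for k, text in enumerate(ordering)]
    return question + "\n" + "\n".join(lines)


def letter_to_content(letter, ordering):
    """Map a predicted letter back to the option text it referred to."""
    return ordering[LABELS.index(letter)]
```

Comparing `letter_to_content` outputs rather than raw letters avoids counting a flip when the model picks the same option that merely moved from "A" to "C".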
Bayesian Item-Response Model
To disentangle ordering noise from per-facet bias, the authors employ a Bayesian hierarchical model. For each item $i$ and ordering $j$, let $Y_{ij} = 1$ if the model's answer matches a reference (canonical) answer, and $0$ otherwise. The model is:
$$ \text{logit}(P(Y_{ij}=1)) = \theta_i - \beta_{f(j)} - \epsilon_{ij} $$
where:
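- $\theta_i \sim \mathcal{N}(0, \sigma_\theta^2)$ is the item-level effect. Because it enters with a positive sign, larger values make a match with the reference answer more likely, so it acts as an easiness rather than a difficulty in the usual item-response sense.
- $\beta_{f(j)}$ is the bias of facet $f(j)$, the facet that ordering $j$ belongs to (e.g., option shuffling). Larger values lower the probability of matching the reference answer.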
- $\epsilon_{ij} \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is random noise.
Fitting this model yields per-facet bias estimates that are adjusted for item-level variability and separated from ordering noise.
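To make the roles of the terms concrete, the generative side of the equation above can be simulated forward. This is only a sketch under stated assumptions: the function name `simulate_match_rate` and every parameter value (`sigma_theta`, `sigma_eps`, the number of items and orderings) are illustrative choices made here and are not taken from the paper, and the code samples from the model rather than performing the Bayesian fit.

```python
import math
import random


def sigmoid(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_match_rate(beta, sigma_theta=1.0, sigma_eps=0.5,
                        n_items=2000, n_orderings=4, seed=0):
    """Simulate Y_ij with logit(P(Y_ij = 1)) = theta_i - beta - eps_ij.

    Returns the fraction of simulated trials that match the reference.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_items):
        # One item-level effect shared by all orderings of the item.
        theta = rng.gauss(0.0, sigma_theta)
        for _ in range(n_orderings):
            # Fresh noise for every (item, ordering) pair.
            eps = rng.gauss(0.0, sigma_eps)
            p_match = sigmoid(theta - beta - eps)
            if rng.random() < p_match:
                matches += 1
    return matches / (n_items * n_orderings)
```

With `beta = 0` the simulated match rate sits near one half by symmetry, and increasing `beta` pushes it down, which is the sense in which a larger facet bias makes agreement with the reference answer less likely.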
Key Results
- None of the 18 audited models is order-invariant: screened per-facet panel-mean flip rates span 24% to 50%.
- A same-ordering control (using Gemini at temperature 0) estimates a substantial ordering excess, meaning additional flips beyond the same-input decoder-noise floor, in verified cells.
- Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials.
- In the Gemini mitigation tests, training-free prompt changes (e.g., an instruction to ignore order) are modality-conditional: their effect does not transfer from text to visual reasoning.
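The same-ordering control in the results above can be read as a simple subtraction. The sketch below assumes, without confirmation from the source, that ordering excess is the cross-ordering pairwise flip rate minus the pairwise flip rate observed when the identical input is repeated. The function names `pairwise_flip_rate` and `ordering_excess` are hypothetical.

```python
from itertools import combinations


def pairwise_flip_rate(answer_sets):
    """Fraction of answer pairs that disagree, pooled over items.

    answer_sets: one list of answers per item (one answer per run).
    """
    flips = 0
    pairs = 0
    for answers in answer_sets:
        for a, b in combinations(answers, 2):
            pairs += 1
            if a != b:
                flips += 1
    if pairs == 0:
        raise ValueError("need at least one item with two or more answers")
    return flips / pairs


def ordering_excess(shuffled_answers, repeated_answers):
    """Assumed definition: shuffled flip rate minus the same-input floor.

    shuffled_answers: answers per item across different orderings.
    repeated_answers: answers per item across repeats of one ordering.
    """
    floor = pairwise_flip_rate(repeated_answers)
    return pairwise_flip_rate(shuffled_answers) - floor
```

A value near zero would suggest that observed flips are mostly decoder noise, while a clearly positive value points to ordering itself as the driver.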
Implementation Sketch
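Below is a simplified Python snippet illustrating the flip-detection logic:

import itertools

def compute_flip_rate(model, items, facet_orderings):
    """
    Pairwise cross-ordering flip rate, pooled over items.

    model: function that takes (question, ordering) and returns an answer string
           (assumed to be normalized to answer content, not a position label)
    items: list of dicts with 'question', 'reference_answer', and 'facet' keys
    facet_orderings: dict mapping each facet name to its list of orderings
    """
    total_flips = 0
    total_pairs = 0
    for item in items:
        # Query the model once per ordering of this item's facet.
        answers = []
        for ordering in facet_orderings[item['facet']]:
            ans = model(item['question'], ordering)
            answers.append(ans)
        # Count flips between all pairs of orderings.
        for a, b in itertools.combinations(answers, 2):
            if a != b:
                total_flips += 1
            total_pairs += 1
    # Without at least two orderings for some item there is nothing to compare.
    if total_pairs == 0:
        raise ValueError("no ordering pairs to compare")
    return total_flips / total_pairs

Conclusion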
The paper proposes cross-ordering flip rate as a standard reporting axis for MLLMs. Its finding that prompt-level mitigation alone is unlikely to provide general order robustness motivates future work on training-time and architectural approaches.