When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors
By Yuqing Yang, Qi Zhu, Zhen Han, Boran Han, Zhengyuan Shen, Shuai Wang, Vassilis N. Ioannidis, Huzefa Rangwala
"Systematically measures data referencing errors (DREs) in LLMs on table tasks; trains a lightweight critic model (4B parameters) to detect DREs, improving accuracy up to 12% via rejection sampling."
Abstract
While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have only offered limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur across all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that incorporating data referencing as a critic significantly improves answer accuracy up to 12.0%, through critic-based filtering and rejection sampling. Finally, we trained a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs, and effectively assists inference for larger models.
Technical Analysis & Implementation
Overview§
This paper presents the first systematic evaluation of Data Referencing Errors (DREs) in large language models (LLMs) when processing tabular data. DREs occur when an LLM incorrectly cites or omits a value from a table, even when it understands the table structure. The authors show that DREs are pervasive across models (1.7B–20B parameters) and tasks. They propose a critic-based approach to mitigate DREs: training a lightweight 4B-parameter classifier that detects DREs in generated reasoning steps, which is then used for filtering and rejection sampling to improve answer accuracy.
DRE Definition and Measurement§
A DRE is an incorrect reference to a table cell in the model's output, for example stating "The revenue in 2020 is $5.2M" when the table gives $6.1M. Formally, let $T$ be a table with rows $R$ and columns $C$, and let $T[r, c]$ denote the value in row $r \in R$ and column $c \in C$. A reasoning step $s$ generated by an LLM references a set of cells together with the values it claims for them, $\mathcal{V}_s = \{(r_1, c_1, v_1), (r_2, c_2, v_2), \dots\}$. The step contains a DRE if $\exists (r, c, v) \in \mathcal{V}_s$ such that $v \neq T[r, c]$. Omitted values, which the abstract also counts as DREs, fall outside this formalization. The authors evaluate DRE rates by extracting numerical claims from model outputs and comparing them against the table.
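The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' extraction pipeline: it assumes claims have already been extracted as (row, column, value) records, and the names `CellClaim` and `find_dres`, the tolerance, and the 2021 toy value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellClaim:
    """A value the model claims to have read from one cell (hypothetical format)."""
    row: str
    column: str
    value: float


def find_dres(table: dict[str, dict[str, float]], claims: list[CellClaim],
              tol: float = 1e-9) -> list[CellClaim]:
    """Return the claims whose value disagrees with the referenced table cell.

    A claim that points at a cell missing from the table is also flagged.
    """
    errors = []
    for claim in claims:
        truth = table.get(claim.row, {}).get(claim.column)
        if truth is None or abs(truth - claim.value) > tol:
            errors.append(claim)
    return errors


# Toy table keyed by row label, then column name (values in $M).
table = {"2020": {"revenue": 6.1}, "2021": {"revenue": 7.4}}
claims = [CellClaim("2020", "revenue", 5.2), CellClaim("2021", "revenue", 7.4)]
step_has_dre = len(find_dres(table, claims)) > 0
```

Here only the 2020 claim is flagged, so the step as a whole counts as containing a DRE.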
Critic Model for DRE Detection§
The key contribution is a trained critic model that takes as input the table, the question, and a candidate model output, and outputs a binary label (DRE present or not). The critic is based on a 4B-parameter language model fine-tuned on a synthetic dataset of correct and incorrect reasoning chains.
Training Data Generation§
For a given table $T$ and question $Q$, sample outputs from a base LLM (e.g., LLaMA-2 7B). For each output, automatically verify correctness by comparing each extracted claim against the table. This yields pairs $(X, y)$ where $X = (T, Q, \text{output})$ and $y \in \{0, 1\}$ (1 indicates a DRE).
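One way to assemble such pairs is sketched below. The pipe-separated serialization, the `build_example` helper, and the assumption that claims arrive as (row index, column name, claimed value) tuples from a separate extraction step are all illustrative choices, not details taken from the paper.

```python
def serialize_table(header: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into pipe-separated lines (an assumed format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)


def build_example(header, rows, question, output, claims):
    """Return (X, y): X is the critic input text, y is 1 if any claim is a DRE.

    `claims` is a list of (row_index, column_name, claimed_value) tuples that
    an extraction step (not shown) is assumed to have produced.
    """
    col = {name: i for i, name in enumerate(header)}
    y = 0
    for row_idx, column, claimed in claims:
        if rows[row_idx][col[column]] != claimed:
            y = 1  # at least one referenced value disagrees with the table
            break
    x = (f"Table:\n{serialize_table(header, rows)}\n"
         f"Question: {question}\nOutput: {output}")
    return x, y


# Toy data; the 2021 row is made up for illustration.
header = ["year", "revenue"]
rows = [["2020", "$6.1M"], ["2021", "$7.4M"]]
x, y = build_example(header, rows, "What was revenue in 2020?",
                     "The revenue in 2020 is $5.2M.", [(0, "revenue", "$5.2M")])
```

Comparing strings exactly is the simplest possible check; a practical pipeline would likely normalize units and number formats first.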
Critic Architecture§
The critic is a standard encoder-style classifier: a pretrained LLM with a linear head on the final hidden state. Let $h_{\text{CLS}}$ be the representation of a special token, then:
$$\hat{y} = \sigma(W h_{\text{CLS}} + b)$$
where $\sigma$ is the sigmoid function and $\hat{y}$ is the predicted probability that the output contains a DRE. The critic is trained with binary cross-entropy loss:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^N [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$
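As a small numeric check of the loss above, the sketch below evaluates it in plain Python. The helper names `sigmoid` and `bce_loss` and the toy logits are illustrative; a real training loop would more likely use a framework's built-in loss.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(logits: list[float], labels: list[int]) -> float:
    """Mean binary cross-entropy over a batch, matching the formula above."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)  # predicted DRE probability for this example
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(logits)


# A logit of 0 gives p = 0.5, so each example contributes log(2) to the loss.
loss = bce_loss([0.0, 0.0], [1, 0])
```

A confident correct prediction (large positive logit for a label of 1) drives the loss toward zero, while a confident wrong one is penalized heavily.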
Implementation Details§
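    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer


    class DRE_Critic(nn.Module):
        """Binary DRE classifier: a pretrained LM with a linear head."""

        # The default checkpoint is a placeholder; the paper's critic is a 4B-parameter model.
        def __init__(self, model_name='meta-llama/Llama-2-7b-hf', hidden_size=None):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            # Read the width from the model config unless it is given explicitly.
            hidden_size = hidden_size or self.encoder.config.hidden_size
            self.classifier = nn.Linear(hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            hidden = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_size)
            # In a causal LM the first token cannot attend to the rest of the input,
            # so pool at the last non-padding token of each sequence instead.
            seq_len = attention_mask.size(1)
            last_idx = seq_len - 1 - attention_mask.flip(dims=[1]).long().argmax(dim=1)
            batch_idx = torch.arange(hidden.size(0), device=hidden.device)
            pooled = hidden[batch_idx, last_idx]  # shape: (batch, hidden_size)
            logits = self.classifier(pooled).squeeze(-1)  # shape: (batch,)
            return torch.sigmoid(logits)  # probability that the output contains a DRE

Inference-time Mitigation§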
The critic is used in two ways:
1. Filtering: Generate multiple candidate outputs (e.g., $k=10$) from the LLM, score each with the critic, and select the one with the lowest DRE probability.
2. Rejection Sampling: Sample outputs until the critic predicts no DRE, then use that output.
Both methods improve final answer accuracy, by up to 12.0% on the evaluated benchmarks (e.g., TabFact and WikiSQL).
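The two strategies can be sketched as follows. The `generate` and `score` callables stand in for the LLM and the critic, and the 0.5 threshold, the retry cap, and the fallback to the best rejected candidate are assumptions for illustration rather than settings from the paper.

```python
import itertools
from typing import Callable


def filter_best(generate: Callable[[], str], score: Callable[[str], float],
                k: int = 10) -> str:
    """Filtering: draw k candidates and keep the one with the lowest DRE probability."""
    candidates = [generate() for _ in range(k)]
    return min(candidates, key=score)


def rejection_sample(generate: Callable[[], str], score: Callable[[str], float],
                     threshold: float = 0.5, max_tries: int = 10) -> str:
    """Rejection sampling: resample until the critic predicts no DRE.

    Falls back to the best candidate seen if every try is rejected
    (the retry cap and the fallback are assumptions).
    """
    best, best_score = None, float("inf")
    for _ in range(max_tries):
        candidate = generate()
        s = score(candidate)
        if s < threshold:
            return candidate
        if s < best_score:
            best, best_score = candidate, s
    return best


# Stand-ins for the LLM and the critic, for demonstration only.
toy_scores = {"revenue was $5.2M": 0.9, "revenue was $6.1M": 0.1}
pool = itertools.cycle(toy_scores)
generate = lambda: next(pool)
best = filter_best(generate, toy_scores.__getitem__, k=4)
accepted = rejection_sample(generate, toy_scores.__getitem__)
```

Filtering spends a fixed budget of `k` generations per question, whereas rejection sampling stops as soon as one candidate passes, so its cost depends on how often the base model produces a clean output.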
Results and Analysis§
- DREs occur in 20–40% of generated reasoning chains across models.
- The 4B critic achieves an average F1 of 78.2% on in-distribution and out-of-distribution DRE detection.
- Critic-based filtering is more effective than simple majority voting or self-consistency for table reasoning tasks.
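For readers less familiar with the F1 metric reported above, the sketch below computes it for binary DRE labels. It assumes F1 is taken on the positive class (1 = DRE present); the source does not say whether the paper averages it differently, and the labels here are toy data.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the positive class (1 = DRE present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
```

With precision and recall both at 2/3, the F1 score for this toy batch is also 2/3.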
Conclusion§
This work highlights the importance of data referencing fidelity in LLM table reasoning. The proposed lightweight critic offers a practical method to improve reliability without retraining the base model.