When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

Overview§

This paper presents the first systematic evaluation of Data Referencing Errors (DREs) in large language models (LLMs) when processing tabular data. DREs occur when an LLM incorrectly cites or omits a value from a table, even when it understands the table structure. The authors show that DREs are pervasive across models (1.7B–20B parameters) and tasks. They propose a critic-based approach to mitigate DREs: training a lightweight 4B-parameter classifier that detects DREs in generated reasoning steps, which is then used for filtering and rejection sampling to improve answer accuracy.

DRE Definition and Measurement§

A DRE is defined as any incorrect reference to a table cell in the model's output (e.g., "The revenue in 2020 is $5.2M" when the true value is $6.1M). Formally, let $T$ be a table with rows $R$ and columns $C$, and write $T[r, c]$ for the value of the cell at row $r \in R$ and column $c \in C$. A reasoning step $s$ generated by an LLM contains a set of referenced cell values $\mathcal{V}_s = \{v_1, v_2, \dots\}$, each of which refers to some cell $(r, c)$. A DRE occurs if $\exists v \in \mathcal{V}_s$ such that $v \neq T[r, c]$ for the cell it refers to. The authors evaluate DRE rates by extracting numerical claims from model outputs and comparing them against the table.
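As a rough illustration of the definition above, the check can be sketched in Python once claims have already been extracted into (row, column, value) triples. The `CellClaim` type, the `find_dres` helper, the dictionary table layout, and the 2019 figure are assumptions made for this example; the source does not describe the extraction or comparison code, and extracting claims from free text is the harder part that this sketch leaves out.

```python
from typing import Dict, List, NamedTuple


class CellClaim(NamedTuple):
    """Hypothetical container for one referenced cell value."""
    row: str
    column: str
    value: float


def find_dres(table: Dict[str, Dict[str, float]],
              claims: List[CellClaim],
              tol: float = 1e-9) -> List[CellClaim]:
    """Return the claims whose value differs from the referenced cell.

    A claim that points at a cell missing from the table is also flagged.
    """
    errors = []
    for claim in claims:
        truth = table.get(claim.row, {}).get(claim.column)
        if truth is None or abs(truth - claim.value) > tol:
            errors.append(claim)
    return errors


# Toy table (revenue in $M); the 2020 values mirror the example in the text.
table = {"2019": {"revenue": 4.8}, "2020": {"revenue": 6.1}}
claims = [
    CellClaim("2019", "revenue", 4.8),  # matches the table
    CellClaim("2020", "revenue", 5.2),  # table says 6.1, so this is a DRE
]

dres = find_dres(table, claims)
step_has_dre = len(dres) > 0
```

Here the step is flagged because at least one referenced value disagrees with its cell, which matches the existential form of the definition.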

Critic Model for DRE Detection§

The key contribution is a trained critic model that takes as input the table, the question, and a candidate model output, and outputs a binary label (DRE present or not). The critic is based on a 4B-parameter language model fine-tuned on a synthetic dataset of correct and incorrect reasoning chains.

Training Data Generation§

For a given table $T$ and question $Q$, sample outputs from a base LLM (e.g., LLaMA-2 7B). For each output, automatically verify correctness by comparing each extracted claim against the table. This yields pairs $(X, y)$ where $X = (T, Q, \text{output})$ and $y \in \{0, 1\}$ (1 indicates a DRE).
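The labeling step described above might look like the following sketch. The function names (`build_critic_examples`, `crude_has_dre`), the list-of-dicts table layout, and the sample outputs are hypothetical. The verifier here is deliberately crude: it flags an output if any number in it is absent from the table, which is only a stand-in for the claim-by-claim comparison the summary mentions.

```python
import re
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, float]]
Example = Tuple[Tuple[Table, str, str], int]


def crude_has_dre(table: Table, output: str) -> bool:
    """Assumed stand-in verifier: True if the output mentions a number
    that does not appear anywhere in the table."""
    table_numbers = {float(v) for row in table for v in row.values()}
    mentioned = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", output)]
    return any(n not in table_numbers for n in mentioned)


def build_critic_examples(table: Table,
                          question: str,
                          outputs: List[str],
                          has_dre: Callable[[Table, str], bool]) -> List[Example]:
    """Build (X, y) pairs with X = (T, Q, output) and y = 1 for a DRE."""
    return [((table, question, out), int(has_dre(table, out))) for out in outputs]


table = [{"year": 2019, "revenue": 4.8}, {"year": 2020, "revenue": 6.1}]
question = "How did revenue change from 2019 to 2020?"
# In practice these would be sampled from the base LLM.
outputs = [
    "Revenue in 2020 is 6.1, up from 4.8 in 2019.",
    "Revenue in 2020 is 5.2.",
]

examples = build_critic_examples(table, question, outputs, crude_has_dre)
labels = [y for _, y in examples]
```

Passing the verifier in as a callable keeps the pairing logic separate from whatever claim extraction is actually used.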

Critic Architecture§

The critic is a standard encoder-style classifier: a pretrained LLM with a linear head on the final hidden state. Let $h_{\text{CLS}}$ be the representation of a special token, then:

$$\hat{y} = \sigma(W h_{\text{CLS}} + b)$$

where $\sigma$ is the sigmoid function. The critic is trained with binary cross-entropy loss over $N$ labeled examples:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^N [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$
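For readers who want to check the arithmetic, the two formulas above can be evaluated by hand on a tiny batch. This is a plain-Python sketch with made-up logits and labels; the helper names `sigmoid` and `bce` are introduced only for this example.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function, mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(labels: List[int], probs: List[float]) -> float:
    """Mean binary cross-entropy, term by term as in the loss above."""
    n = len(labels)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / n


# Two toy examples: one with a DRE (y=1) and one without (y=0).
labels = [1, 0]
probs = [0.9, 0.2]          # critic outputs, i.e. sigmoid of the logits
loss = bce(labels, probs)   # -(ln 0.9 + ln 0.8) / 2

prob_at_zero_logit = sigmoid(0.0)
```

In a PyTorch training loop, it is common to return raw logits and use `torch.nn.BCEWithLogitsLoss`, which fuses the sigmoid and the loss for numerical stability; whether the authors did so is not stated.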

Implementation Details§

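import torch
import torch.nn as nn
from transformers import AutoModel

class DRE_Critic(nn.Module):
    def __init__(self, model_name='meta-llama/Llama-2-7b-hf', hidden_size=None):
        super().__init__()
        # Placeholder checkpoint: the critic described above is a 4B-parameter
        # model, so substitute the appropriate backbone here.
        self.encoder = AutoModel.from_pretrained(model_name)
        # Read the hidden width from the model config unless overridden.
        hidden_size = hidden_size or self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_size)
        # In a causal decoder, position 0 cannot attend to the rest of the input,
        # so pool at the last non-padding token, which plays the role of h_CLS.
        # This assumes right-padded batches.
        last_idx = attention_mask.sum(dim=1) - 1  # shape: (batch,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[batch_idx, last_idx]  # shape: (batch, hidden_size)
        logits = self.classifier(pooled).squeeze(-1)  # shape: (batch,)
        # Probability that the output contains a DRE.
        return torch.sigmoid(logits)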

Inference-time Mitigation§

The critic is used in two ways:

1. Filtering: Generate multiple candidate outputs (e.g., $k=10$) from the LLM, score each with the critic, and select the one with the lowest DRE probability.
2. Rejection Sampling: Sample outputs until the critic predicts no DRE, then use that output.
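The two strategies can be sketched as follows. The `generate` and `critic_score` callables are hypothetical stand-ins for the LLM sampler and the trained critic. The 0.5 decision threshold and the `max_tries` cap are assumptions added so the loop always terminates; the source does not specify either.

```python
import itertools
from typing import Callable, Optional


def filter_best_of_k(generate: Callable[[], str],
                     critic_score: Callable[[str], float],
                     k: int = 10) -> str:
    """Filtering: sample k candidates and keep the lowest DRE probability."""
    candidates = [generate() for _ in range(k)]
    return min(candidates, key=critic_score)


def rejection_sample(generate: Callable[[], str],
                     critic_score: Callable[[str], float],
                     threshold: float = 0.5,
                     max_tries: int = 20) -> Optional[str]:
    """Rejection sampling: return the first candidate the critic accepts,
    or None if none is accepted within max_tries attempts."""
    for _ in range(max_tries):
        candidate = generate()
        if critic_score(candidate) < threshold:
            return candidate
    return None


# Toy stand-ins: a fixed cycle of drafts and a lookup table of DRE scores.
scores = {"draft A": 0.9, "draft C": 0.4, "draft B": 0.1}

pool = itertools.cycle(scores)
best = filter_best_of_k(lambda: next(pool), scores.get, k=3)

pool = itertools.cycle(scores)
first_ok = rejection_sample(lambda: next(pool), scores.get)
```

On this toy data, filtering returns the draft with the lowest score, while rejection sampling stops at the first draft under the threshold, so the two can select different outputs at different sampling cost.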

Both methods improve final answer accuracy, with average gains of up to 12% across benchmarks (e.g., TabFact and WikiSQL).

Results and Analysis§

DREs occur in 20–40% of generated reasoning chains across models.
The 4B critic achieves an average F1 of 78.2% on in-distribution and out-of-distribution DRE detection.
Critic-based filtering is more effective than simple majority voting or self-consistency for table reasoning tasks.
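To make the difference between the two selection rules concrete, the sketch below applies majority voting and critic-based selection to the same candidates. The answers and DRE probabilities are toy values chosen for illustration and the helper names are hypothetical; the sketch shows only how the rules can disagree and does not reproduce the reported comparison.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Self-consistency style selection: the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def critic_select(candidates: List[Tuple[str, float]]) -> str:
    """Critic-based selection: the answer whose reasoning chain has the
    lowest predicted DRE probability."""
    answer, _ = min(candidates, key=lambda c: c[1])
    return answer


# (final answer, critic's DRE probability for the chain that produced it)
candidates = [("42", 0.8), ("42", 0.7), ("37", 0.1)]

voted = majority_vote([answer for answer, _ in candidates])
selected = critic_select(candidates)
```

Majority voting follows the frequency of final answers, so a misread cell that recurs across samples can win the vote; the critic instead scores each chain's references individually.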

Conclusion§

This work highlights the importance of data referencing fidelity in LLM table reasoning. The proposed lightweight critic offers a practical method to improve reliability without retraining the base model.
