Hallucination in World Models is Predictable and Preventable

Hallucination in World Models is Predictable and Preventable§

Core Methodology§

The paper identifies that hallucination in generative world models stems from insufficient data coverage in the state-action space. Three hallucination modes are characterized:

1. Perceptual hallucination: visual artifacts due to low coverage of action-free dynamics. 2. Action-marginalized hallucination: drift from true dynamics under low-coverage actions. 3. Scene-diverging hallucination: accumulation of errors in low-coverage regions over rollouts.

The authors propose coverage signals computed from the training dataset to predict where the model will fail. Let $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}$ be the training dataset of state-action-next state triplets. For a given transition $(s, a, s')$, the state-action coverage score is defined as:

$$ C(s, a) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[\|s - s_i\| < \epsilon_s \land \|a - a_i\| < \epsilon_a] $$

where $\epsilon_s, \epsilon_a$ are thresholds and $N$ is dataset size. This kernel density estimate is used to predict three signals:

Perceptual hallucination score: $H_{\text{perc}} = f_\theta(s)$ from a trained classifier.
Action-marginalized score: $H_{\text{act}} = \mathbb{E}_{s' \sim p_\phi(\cdot|s,a)}[C(s, a, s')]$.
Scene-diverging score: $H_{\text{div}} = \mathbb{E}_{\tau}[\sum_t \log(1 - C(s_t, a_t))]$.

Training Procedure§

A 350M-parameter variational world model (based on DreamerV2) is trained on MMBench2, a 427-hour, 210-task dataset with ground-truth actions and rewards. The model consists of:

An encoder $\text{Enc}: s \to z$
A transition model $\text{Trans}: (z, a) \to z'$
A decoder $\text{Dec}: z \to \hat{s}$

Coverage-aware sampling is used at training time: each transition is weighted by $w = \max(\delta, 1 - C(s,a))$, where $\delta$ is a small constant to avoid zero weight. This forces the model to focus on low-coverage transitions.

For online mitigation, the hallucination predictors are used as curiosity rewards during rollouts in a new environment. The reward is $r_{\text{curiosity}} = \alpha \cdot H(s,a)$. Data collected under high reward is added to the replay buffer for finetuning.

Code Snippet§

import torch
import torch.nn as nn

class CoverageAwareSampler:
    def __init__(self, dataset, eps_s=0.1, eps_a=0.1):
        self.states = torch.cat([s.unsqueeze(0) for s, _, _ in dataset])
        self.actions = torch.cat([a.unsqueeze(0) for _, a, _ in dataset])
        self.eps_s = eps_s
        self.eps_a = eps_a

    def coverage(self, s, a):
        # Compute kernel density estimate (simplified)
        dist_s = torch.cdist(s.unsqueeze(0), self.states)
        dist_a = torch.cdist(a.unsqueeze(0), self.actions)
        mask = (dist_s < self.eps_s) & (dist_a < self.eps_a)
        return mask.sum(dim=1).float() / len(self.states)

    def sample(self, batch_size):
        # Weighted sampling by inverse coverage
        weights = 1.0 - self.coverage(self.states, self.actions) + 1e-6
        indices = torch.multinomial(weights, batch_size, replacement=True)
        return [(self.states[i], self.actions[i], self.next_states[i]) for i in indices]

Key Results§

Hallucination predictors achieve high accuracy (AUC > 0.9) in identifying failure rollout segments.
Finetuning with as few as 50 real trajectories reduces hallucination by 40% in unseen environments.
The method outperforms standard data augmentation and ensemble-based uncertainty estimation.

Abstract

Technical Analysis & Implementation