multimodal | Published: July 2, 2026

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

By Liyan Tang, Fangcong Yin, Greg Durrett

Research TL;DR

"Proposes VRRL, an RL training framework for VLMs that uses random prefix masking and buffered roll-ins to enforce visually grounded self-reflection, improving OOD accuracy on grounding and navigation tasks."

Abstract

Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.

Technical Analysis & Implementation

Technical Overview

The paper addresses the failure of vision-language models (VLMs) to correct their own errors when reasoning over visual inputs, especially under distribution shift. The authors propose VRRL (Visually grounded Reflection via Reinforcement Learning), a two-component RL training framework.

Core Methodology

1. Masked Prefix Trajectories: During training, the model generates a chain-of-thought (CoT) sequence of tokens. With probability $p$, a random prefix of this trajectory (including the initial visual tokens) is masked out. The model must then continue from that point, so training emphasizes recovering from incorrect intermediate predictions rather than reproducing the early mistakes themselves. This encourages the model to revisit the visual evidence.

2. Buffered Roll-Ins: An experience replay buffer stores trajectories (states, actions, rewards) from earlier training phases. These are sampled and rolled into the current policy's training batch, exposing the model to diverse failure states it must learn to correct. This is analogous to imitation learning from past mistakes, but within an RL objective.
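The two components can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names `prefix_loss_mask` and `ReplayBuffer`, the 0/1 loss-mask representation, and the choice to cap the masked prefix at half the trajectory length are all assumptions made for the example.

```python
import random
from collections import deque


def prefix_loss_mask(length, p, rng):
    """Return a 0/1 mask over a trajectory of `length` steps.

    With probability p, a random prefix is zeroed so those steps are
    excluded from the loss. The cap of length // 2 is an assumption.
    """
    mask = [1] * length
    if length >= 2 and rng.random() < p:
        cut = rng.randint(1, length // 2)  # randint bounds are inclusive
        for i in range(cut):
            mask[i] = 0
    return mask


class ReplayBuffer:
    """Fixed-capacity FIFO store of past trajectories (hypothetical helper)."""

    def __init__(self, max_size):
        # A deque with maxlen silently drops the oldest items when full.
        self.items = deque(maxlen=max_size)

    def push(self, trajectories):
        self.items.extend(trajectories)

    def sample(self, k, rng):
        # Sample without replacement, never more than what is stored.
        return rng.sample(list(self.items), min(k, len(self.items)))

    def __len__(self):
        return len(self.items)
```

In a training loop, the mask would multiply the per-step loss, and `sample` would supply stored failure states to mix into the current batch.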

The training uses a standard PPO objective with a policy loss:

$$\mathcal{L}_{PPO} = -\mathbb{E}_{t}\left[\min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance ratio, and $\hat{A}_t$ is the advantage estimate. The value loss and entropy bonus are added as usual.
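To make the clipping behavior concrete, here is a small pure-Python sketch of the clipped surrogate for a handful of steps. The function name `ppo_clip_loss` and the default `eps=0.2` are assumptions for illustration; the summary does not state the clipping range used in the paper.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Mean clipped PPO policy loss over a list of steps."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                 # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)          # pessimistic bound
    return -total / len(advantages)


# Ratio 2.0 with a positive advantage is capped at 1 + eps.
loss_up = ppo_clip_loss([math.log(2.0)], [0.0], [1.0])
# Ratio 0.5 with a negative advantage is floored at 1 - eps.
loss_down = ppo_clip_loss([math.log(0.5)], [0.0], [-1.0])
```

The two calls show that the objective stops rewarding the policy once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction the advantage favors.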

Key modification: The reward signals the correctness of both the final answer and the intermediate grounding steps. For instance, in visual grounding tasks, reward is given for correctly localizing objects and for answering the question correctly.

Implementation Details

A high-level PyTorch-style training loop (conceptual):

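# Conceptual sketch only. `model`, `data_loader`, `ReplayBuffer`, `compute_gae`,
# `minibatches`, `MASK_TOKEN`, and the hyperparameters (num_epochs, batch_size,
# ppo_epochs, p, eps) are assumed to be defined elsewhere.
import random

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters())
buffer = ReplayBuffer(max_size=10000)

for epoch in range(num_epochs):
    for batch in data_loader:
        # Generate trajectories with the current policy. Each trajectory is
        # assumed to record states, actions, rewards, values, and old_log_probs.
        with torch.no_grad():
            trajs = model.generate(batch.images, batch.questions, max_length=128)

        for traj in trajs:
            # Compute advantages and returns via GAE, per trajectory
            traj.advantages, traj.returns = compute_gae(
                traj.rewards, traj.values, gamma=0.99, lam=0.95)
            # Apply prefix masking with probability p
            traj.loss_mask = torch.ones(len(traj))
            if len(traj) >= 2 and random.random() < p:
                mask_len = random.randint(1, len(traj) // 2)
                traj.states[:mask_len] = MASK_TOKEN  # mask prefix tokens
                traj.loss_mask[:mask_len] = 0.0      # ignore loss on masked part

        # Sample buffered roll-ins before pushing, so the new trajectories
        # are not drawn twice into the same batch
        rollins = buffer.sample(batch_size) if len(buffer) > batch_size else []
        buffer.push(trajs)
        trajs = trajs + rollins  # combine

        # PPO update on the combined trajectories
        for _ in range(ppo_epochs):
            for mb in minibatches(trajs):
                # The model is assumed to return per-token log-probs of the
                # taken actions, value estimates, and policy entropy
                log_probs, values, entropy = model(mb.states, mb.actions)
                ratios = torch.exp(log_probs - mb.old_log_probs)
                surrogate = torch.min(
                    ratios * mb.advantages,
                    torch.clamp(ratios, 1 - eps, 1 + eps) * mb.advantages)
                n = mb.loss_mask.sum().clamp(min=1)
                policy_loss = -(surrogate * mb.loss_mask).sum() / n
                value_loss = F.mse_loss(values, mb.returns)
                entropy_bonus = (entropy * mb.loss_mask).sum() / n
                loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    # old_log_probs are recorded at generation time, so no separate copy of the
    # old policy is kept. Call buffer.clear() here only if roll-ins should be
    # drawn from the current epoch alone.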

Experiments

VRRL is evaluated on TableVQA, ChartQA, and SpatialVLM (navigation). Baselines include:

  • Off-the-shelf BLIP-2, InstructBLIP
  • Fine-tuned via standard RL (PPO) and reflection-oriented FT (e.g., using GPT-4 generated corrections)

Results show that VRRL improves OOD accuracy by 8 to 15 percentage points (absolute) over the baselines, particularly on adversarial distribution shifts (different table layouts, unseen chart types, novel maze configurations). Ablations confirm that both components contribute.

Key Equations

Advantage estimation (GAE):

$$\hat{A}_t = \sum_{k=0}^{T-t} (\gamma \lambda)^k \delta_{t+k}$$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal-difference (TD) residual, $T$ is the index of the final step, and $V(s_{T+1}) = 0$.
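The sum is usually computed with a single backward pass rather than term by term. The following is a minimal sketch of that recursion in plain Python; the function name `compute_gae` and the convention that `values` carries one extra bootstrap entry (0 for a terminal state) are assumptions for the example.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: list of per-step rewards r_0 .. r_T.
    values:  list of V(s_0) .. V(s_{T+1}); the last entry is the bootstrap
             value after the final step (0.0 if the episode terminated).
    Returns (advantages, returns), each the same length as rewards.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    # Value targets: advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns


adv, ret = compute_gae([0.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

With a zero value function and $\lambda = 1$, the advantages reduce to discounted returns, which is a quick sanity check on the recursion.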

Reward shaping for grounding: $r = \mathbb{I}[\text{correct answer}] + \alpha \cdot \text{IoU}(\text{pred bbox}, \text{gt bbox})$.
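As a worked illustration of this reward, the sketch below computes IoU for axis-aligned boxes and combines it with an answer-correctness indicator. The box format `(x1, y1, x2, y2)`, exact-match answer checking, and the value `alpha=0.5` are assumptions; the summary does not specify them.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_answer, gt_answer, pred_box, gt_box, alpha=0.5):
    """r = 1[correct answer] + alpha * IoU(pred bbox, gt bbox)."""
    correct = 1.0 if pred_answer == gt_answer else 0.0
    return correct + alpha * iou(pred_box, gt_box)
```

A correct answer with a perfectly localized box yields $1 + \alpha$, while a wrong answer with a disjoint box yields 0, so partial localization still earns some reward.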

Conclusion

VRRL introduces a simple yet effective RL recipe that pushes VLMs to visually ground their self-reflection, leading to more robust generalization under domain shift.
