Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Technical Overview

The paper addresses the failure of vision-language models (VLMs) to correct their own errors when reasoning about visual inputs, especially under distribution shift. The authors propose VRRL (Visually grounded Reflection via Reinforcement Learning), a two-component RL training framework.

Core Methodology

1. Masked Prefix Trajectories: During training, the model generates a chain-of-thought (CoT) sequence of tokens. With probability $p$, a random prefix of this trajectory (including the initial visual tokens) is masked out. The model is then tasked with continuing from that point, which forces it to recover from incorrect intermediate predictions. This keeps the model from relying on early mistakes and encourages it to revisit the visual evidence.

2. Buffered Roll-Ins: An experience replay buffer stores trajectories (states, actions, rewards) from earlier training phases. These are sampled and rolled into the current policy's training batch, exposing the model to diverse failure states it must learn to correct. This is analogous to imitation learning from past mistakes, but within an RL objective.
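
The two components above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `MASK_TOKEN`, `mask_prefix`, and `RollInBuffer` are hypothetical names, trajectories are represented as simple token lists, and capping the masked prefix at half the trajectory length is an illustrative choice.

```python
import random
from collections import deque

MASK_TOKEN = "<mask>"  # hypothetical placeholder token


def mask_prefix(tokens, p, rng):
    """With probability p, replace a random prefix of `tokens` with MASK_TOKEN.

    Returns (masked_tokens, loss_mask), where loss_mask[i] is 0 for masked
    positions (excluded from the loss) and 1 otherwise.
    """
    masked = list(tokens)
    loss_mask = [1] * len(tokens)
    if len(tokens) >= 2 and rng.random() < p:
        # Assumption: mask between 1 token and half of the trajectory.
        mask_len = rng.randint(1, len(tokens) // 2)
        for i in range(mask_len):
            masked[i] = MASK_TOKEN
            loss_mask[i] = 0
    return masked, loss_mask


class RollInBuffer:
    """Fixed-size store of past trajectories; oldest entries are evicted first."""

    def __init__(self, max_size):
        self._items = deque(maxlen=max_size)

    def push(self, trajectories):
        self._items.extend(trajectories)

    def sample(self, k, rng):
        # Sample without replacement, capped at the current buffer size.
        return rng.sample(list(self._items), min(k, len(self._items)))

    def __len__(self):
        return len(self._items)


rng = random.Random(0)
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
masked, loss_mask = mask_prefix(tokens, p=1.0, rng=rng)

buffer = RollInBuffer(max_size=3)
buffer.push([["a"], ["b"], ["c"], ["d"]])  # ["a"] is evicted by the size limit
rollins = buffer.sample(2, rng)
```

Returning a separate `loss_mask` keeps the masked positions in the sequence while excluding them from the policy loss, which is one simple way to "ignore" the masked prefix during the update.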

The training uses a standard PPO objective with a policy loss:

$$\mathcal{L}_{PPO} = -\mathbb{E}_{t}\left[\min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance ratio, and $\hat{A}_t$ is the advantage estimate. The value loss and entropy bonus are added as usual.
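
To make the clipped objective concrete, here is a minimal pure-Python sketch of the policy term for a batch of samples. The function name `ppo_clip_loss` and the default `eps=0.2` are assumptions for illustration; the source does not state the clipping range used.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Mean clipped surrogate loss over a batch of (state, action) samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)         # pessimistic (lower) bound
    return -total / len(advantages)


# Sample 1: ratio 1.5 with advantage +1 is clipped to 1.2.
# Sample 2: ratio 0.5 with advantage -1 is clipped to 0.8.
loss = ppo_clip_loss(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
    eps=0.2,
)
# Surrogate terms are 1.2 and -0.8, so the loss is approximately -0.2.
```

In both samples the clipped term is the one selected, so neither contributes a gradient that would push the ratio further outside the trust region.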

Key modification: The reward is designed to signal the correctness of both the final answer and the intermediate grounding steps. For instance, in visual grounding tasks, reward is given for correctly localizing objects as well as for answering the question correctly.

Implementation Details

A high-level PyTorch-style training loop (conceptual):

# Conceptual sketch. Assumed to be defined elsewhere: a VLM `model` with policy
# and value heads, `data_loader`, `ReplayBuffer`, `compute_gae`, `minibatches`,
# `MASK_TOKEN`, and the hyperparameters (num_epochs, p, batch_size, ppo_epochs, eps).
import random

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters())
buffer = ReplayBuffer(max_size=10000)

for epoch in range(num_epochs):
    for batch in data_loader:
        # Generate trajectories with the current policy; each trajectory is
        # assumed to record its per-token log-probs as `old_log_probs`.
        with torch.no_grad():
            trajs = model.generate(batch.images, batch.questions, max_length=128)

        # Apply prefix masking with probability p
        for traj in trajs:
            if len(traj) >= 2 and random.random() < p:
                mask_len = random.randint(1, len(traj) // 2)
                traj.states[:mask_len] = MASK_TOKEN  # mask the prefix tokens
                traj.loss_mask[:mask_len] = 0        # ignore loss on the masked part (assumed field)

        # Add fresh trajectories to the buffer, then mix in buffered roll-ins
        buffer.push(trajs)
        if len(buffer) > batch_size:
            trajs = trajs + buffer.sample(batch_size)

        # Compute advantages and returns via GAE for the combined batch
        for traj in trajs:
            traj.advantages, traj.returns = compute_gae(
                traj.rewards, traj.values, gamma=0.99, lam=0.95)

        # PPO update on the combined trajectories
        for _ in range(ppo_epochs):
            for mb in minibatches(trajs):
                # The model is assumed to return per-token entropy as well
                log_probs, values, entropy = model(mb.states, mb.actions)
                ratios = torch.exp(log_probs - mb.old_log_probs)
                surr1 = ratios * mb.advantages
                surr2 = torch.clamp(ratios, 1 - eps, 1 + eps) * mb.advantages
                policy_loss = -(torch.min(surr1, surr2) * mb.loss_mask).sum() / mb.loss_mask.sum()
                value_loss = F.mse_loss(values, mb.returns)
                loss = policy_loss + 0.5 * value_loss - 0.01 * entropy.mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    buffer.clear()  # optional

Experiments

VRRL is evaluated on TableVQA, ChartQA, and SpatialVLM (navigation). Baselines include:

Off-the-shelf BLIP-2 and InstructBLIP
Models fine-tuned via standard RL (PPO) and via reflection-oriented fine-tuning (e.g., using GPT-4-generated corrections)

Results show that VRRL improves out-of-distribution (OOD) accuracy by 8 to 15 percentage points over the baselines, particularly under adversarial distribution shifts (different table layouts, unseen chart types, novel maze configurations). Ablations confirm that both components contribute.

Key Equations

Advantage estimation (GAE):

$$\hat{A}_t = \sum_{k=0}^{T-t-1} (\gamma \lambda)^k \delta_{t+k}$$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, with steps indexed $t = 0, \dots, T-1$ and $V(s_T) = 0$ at termination.
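
The sum is usually evaluated with a backward recursion, $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$. The following is a minimal sketch for a single trajectory, assuming the episode terminates after the last reward so that the value beyond the final step is zero; the function signature is illustrative.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory.

    `values[t]` is V(s_t) for each step; the value after the final step is
    taken to be 0 (terminating episode).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]    # value-function targets
    return advantages, returns


# With gamma = lam = 1 the advantage reduces to (sum of future rewards) - V(s_t).
advantages, returns = compute_gae(
    rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], gamma=1.0, lam=1.0
)
```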

Reward shaping for grounding: $r = \mathbb{I}[\text{correct answer}] + \alpha \cdot \text{IoU}(\text{pred bbox}, \text{gt bbox})$.
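
As a worked example of this reward, the sketch below computes IoU for axis-aligned boxes given as `(x1, y1, x2, y2)` and adds it to the answer indicator. Exact string match for correctness, the box format, and `alpha=0.5` are assumptions made for illustration; the source does not specify them.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_answer, gt_answer, pred_box, gt_box, alpha=0.5):
    """r = 1[correct answer] + alpha * IoU(pred bbox, gt bbox)."""
    correct = 1.0 if pred_answer == gt_answer else 0.0  # assumption: exact match
    return correct + alpha * iou(pred_box, gt_box)


# Boxes overlap on a 1x2 strip: intersection 2, union 6, IoU = 1/3.
r = grounding_reward("7", "7", (0, 0, 2, 2), (1, 0, 3, 2), alpha=0.5)
```

A correct answer with a poorly localized box still earns most of the reward here, so $\alpha$ controls how strongly grounding quality is weighted against answer correctness.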

Conclusion

VRRL introduces a simple yet effective RL recipe to force VLMs to visually ground their self-reflection, leading to robust generalization under domain shift.
