Towards Robustness against Typographic Attack with Training-free Concept Localization

Summary§

The paper addresses typographic attacks on CLIP-based vision encoders (ViTs), where irrelevant text in images biases representations toward lexical meaning. The authors propose a training-free method leveraging mechanistic interpretability to localize and mitigate lexical encoding.

Methodology§

Sampling-based Interpretation§

They interpret hidden states by sampling likely concepts (e.g., object classes) from the model's output distribution. For each attention head, they measure the contribution to semantic vs. lexical prediction using a probabilistic metric:

$$ \text{Attribution}(h) = \frac{1}{N} \sum_{i=1}^N \left( p_h(\text{concept}_i | x) - p_h(\text{lexical}_i | x) \right) $$

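where $p_h$ is the output probability when all heads except $h$ are masked, and $N$ is the number of sampled concept/lexical pairs. Heads with strongly negative attribution, i.e. those that favor the lexical prediction over the semantic one, are candidates for intervention.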
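As a minimal sketch of the metric above, assume the per-head probabilities $p_h(\text{concept}_i | x)$ and $p_h(\text{lexical}_i | x)$ have already been collected into plain lists. The function name `head_attribution` and all numbers are illustrative and do not come from the paper.

```python
def head_attribution(p_concept, p_lexical):
    """Mean of p_h(concept_i | x) - p_h(lexical_i | x) over N samples."""
    if len(p_concept) != len(p_lexical) or not p_concept:
        raise ValueError("need two non-empty lists of equal length")
    return sum(c - l for c, l in zip(p_concept, p_lexical)) / len(p_concept)


# Hypothetical probabilities for two heads over N = 4 samples.
semantic_head = head_attribution([0.75, 0.5, 0.75, 0.5], [0.25, 0.25, 0.25, 0.25])
lexical_head = head_attribution([0.25, 0.25, 0.25, 0.25], [0.75, 0.75, 0.5, 0.5])
```

With this sign convention, a negative score marks a head whose output leans toward the lexical reading of the image.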

Circuit Mining§

They identify critical heads by analyzing attention patterns and gradients. The circuit consists of a subset of heads that disproportionately activate on text tokens. Formally, they compute the importance score for each head $h$:

$$ I(h) = \mathbb{E}_x \left[ \left| \frac{\partial \mathcal{L}_{\text{lex}} }{\partial A_h} \right| \right] $$

where $A_h$ is the attention map of head $h$, and $\mathcal{L}_{\text{lex}}$ is a loss encouraging lexical prediction.
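The importance score can be sketched with PyTorch autograd. The summary does not specify $\mathcal{L}_{\text{lex}}$ or how the gradient matrix is reduced to a scalar, so the callable `lex_loss_fn` is a hypothetical stand-in and the mean over all entries of $|\partial \mathcal{L}_{\text{lex}} / \partial A_h|$ is an assumption made here.

```python
import torch


def head_importance(attn_maps, lex_loss_fn):
    """Mean absolute gradient of a lexical loss w.r.t. each head's attention map.

    attn_maps: tensor of shape (B, H, N, N) holding attention probabilities.
    lex_loss_fn: hypothetical callable mapping attn_maps to a scalar loss.
    Returns a tensor of shape (H,), one score per head.
    """
    attn_maps = attn_maps.detach().clone().requires_grad_(True)
    loss = lex_loss_fn(attn_maps)
    (grad,) = torch.autograd.grad(loss, attn_maps)
    # Average |dL/dA_h| over the batch and over every entry of the map.
    return grad.abs().mean(dim=(0, 2, 3))


def toy_loss(a):
    # Toy stand-in for L_lex: depends only on head 1.
    return -(2.0 * a[:, 1]).sum()


torch.manual_seed(0)
A = torch.softmax(torch.randn(2, 3, 4, 4), dim=-1)
scores = head_importance(A, toy_loss)
```

In this toy setup only head 1 receives a non-zero score, so ranking heads by `scores` would single it out as the candidate circuit member.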

Intervention§

Without retraining, they scale down the attention that the identified heads pay to text-related tokens. For a given head, with $A_h$ here denoting its pre-softmax attention logits, the modified logits become:

$$ A'_h = A_h - \lambda \cdot M \odot A_h $$

where $M$ is a binary mask indicating token positions with high lexical relevance (detected via OCR-like heuristics or attention entropy), and $\lambda$ is a scaling factor.
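A small worked example makes the effect of the update concrete. This is a sketch for a single query row of one head, using only the standard library; the logits and mask values are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intervene(logits, mask, lam=0.3):
    """Apply A' = A - lam * M * A element-wise to one row of logits."""
    return [a - lam * m * a for a, m in zip(logits, mask)]


logits = [1.0, 4.0, 0.5]  # hypothetical logits; index 1 is a text token
mask = [0, 1, 0]          # M: 1 marks the lexically relevant position
before = softmax(logits)
after = softmax(intervene(logits, mask))
```

Here the weight on the text token drops from `before[1]` to `after[1]` while the row still sums to one. Because the scaling acts on logits, it reduces attention only where the masked logit is positive; a negative logit would move toward zero and gain weight, so a practical implementation may need to account for the sign.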

Implementation Details§

All experiments use ViT-B/32 from CLIP (openai) and LVLMs (LLaVA, InstructBLIP).
The mask $M$ is generated by thresholding the entropy of the attention weights: tokens whose attention with respect to the classification token has low entropy (is sharply focused) are treated as lexical.
$\lambda$ is set to 0.3 in experiments.
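The entropy-based mask can be sketched as follows. The summary does not state the threshold or exactly which attention distribution is measured, so this example assumes one attention distribution per token and a hypothetical threshold of 0.5 nats; `lexical_mask` is an illustrative name.

```python
import math


def row_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lexical_mask(attn_rows, threshold):
    """Mark tokens whose attention entropy falls below `threshold`."""
    return [row_entropy(row) < threshold for row in attn_rows]


attn_rows = [
    [0.25, 0.25, 0.25, 0.25],  # diffuse attention, entropy = ln 4
    [0.97, 0.01, 0.01, 0.01],  # sharply focused attention, low entropy
    [0.4, 0.3, 0.2, 0.1],      # moderately spread attention
]
mask = lexical_mask(attn_rows, threshold=0.5)
```

Only the sharply focused second token falls below the threshold, so it is the only position that $M$ would flag for suppression.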

Code Snippet (Intervention Forward Pass)§

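import torch

def forward_with_intervention(model, image, text_mask, lambda_=0.3, head_ids=(5, 8, 11)):
    # model: timm-style ViT whose blocks expose norm1, attn.qkv, attn.proj, norm2, mlp (assumed layout)
    # text_mask: boolean tensor of shape (B, N) indicating text positions among the N tokens
    # head_ids: heads identified as lexical (example values)
    x = model.patch_embed(image)  # class token and position embeddings are omitted in this sketch
    for block in model.blocks:
        attn = block.attn
        B, N, C = x.shape
        head_dim = C // attn.num_heads
        # (B, N, 3*C) -> (3, B, heads, N, head_dim)
        qkv = attn.qkv(block.norm1(x)).reshape(B, N, 3, attn.num_heads, head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn_scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # (B, heads, N, N)
        # apply intervention: scale down attention logits toward text tokens (key axis)
        if text_mask is not None:
            key_mask = text_mask.to(attn_scores.dtype).unsqueeze(1)  # (B, 1, N)
            # only intervene on heads identified as lexical (head_ids)
            for h in head_ids:
                attn_scores[:, h] = attn_scores[:, h] * (1.0 - lambda_ * key_mask)
        attn_probs = attn_scores.softmax(dim=-1)
        out = (attn_probs @ v).transpose(1, 2).reshape(B, N, C)  # merge heads back to (B, N, C)
        x = x + attn.proj(out)             # attention residual connection
        x = x + block.mlp(block.norm2(x))  # MLP residual connection
    return x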

Results§

On object classification (ImageNet-10), the intervention improves accuracy under typographic attack (TA) from 43% to 67%.
On VQA (RIO-Bench), applying intervention to vision encoders of LLaVA and InstructBLIP boosts accuracy by 15-20% under typographic attack.
The method generalizes across architectures and tasks, requiring no additional training.
