arrow_backBack to research feed
visionPublished: July 2, 2026

Towards Robustness against Typographic Attack with Training-free Concept Localization

By Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang

Research TL;DR

"Training-free mechanistic interpretability method identifies lexical-biased attention heads in ViTs and adjusts weights to mitigate typographic attacks, outperforming supervised defenses."

Abstract

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.

Technical Analysis & Implementation

Summary§

The paper addresses typographic attacks on CLIP-based vision encoders (ViTs), where irrelevant text in images biases representations toward lexical meaning. The authors propose a training-free method leveraging mechanistic interpretability to localize and mitigate lexical encoding.

Methodology§

Sampling-based Interpretation§

They interpret hidden states by sampling likely concepts (e.g., object classes) from the model's output distribution. For each attention head, they measure the contribution to semantic vs. lexical prediction using a probabilistic metric:

$$ \text{Attribution}(h) = \frac{1}{N} \sum_{i=1}^N \left( p_h(\text{concept}_i | x) - p_h(\text{lexical}_i | x) \right) $$

where $p_h$ is the output probability after masking other heads. Heads with higher attribution to lexical meaning are candidates for intervention.

Circuit Mining§

They identify critical heads by analyzing attention patterns and gradients. The circuit consists of a subset of heads that disproportionately activate on text tokens. Formally, they compute the importance score for each head $h$:

$$ I(h) = \mathbb{E}_x \left[ \left| \frac{\partial \mathcal{L}_{\text{lex}} }{\partial A_h} \right| \right] $$

where $A_h$ is the attention map of head $h$, and $\mathcal{L}_{\text{lex}}$ is a loss encouraging lexical prediction.

Intervention§

Without retraining, they adjust attention weights for identified heads by scaling down contributions from text-related tokens. For a given head, the modified attention logits become:

$$ A'_h = A_h - \lambda \cdot M \odot A_h $$

where $M$ is a binary mask indicating token positions with high lexical relevance (detected via OCR-like heuristics or attention entropy), and $\lambda$ is a scaling factor.

Implementation Details§

  • All experiments use ViT-B/32 from CLIP (openai) and LVLMs (LLaVA, InstructBLIP).
  • The mask $M$ is generated by thresholding the entropy of attention weights: tokens with low entropy (high focus) on the classification token are considered lexical.
  • $\lambda$ is set to 0.3 in experiments.

Code Snippet (Intervention Forward Pass)§

def forward_with_intervention(model, image, text_mask, lambda_=0.3):
    # model: ViT transformer with attention heads
    x = model.patch_embed(image)
    for block in model.blocks:
        attn = block.attn
        B, N, C = x.shape
        qkv = attn.qkv(x).reshape(B, N, 3, attn.num_heads, C // attn.num_heads)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        attn_scores = (q @ k.transpose(-2, -1)) / (C // attn.num_heads)**0.5
        # apply intervention: scale down attention to text tokens
        # text_mask: boolean mask of shape (B, N) indicating text positions
        if text_mask is not None:
            # only intervene on heads identified as lexical (head_ids)
            head_ids = [5, 8, 11]  # example from paper
            for h in head_ids:
                attn_scores[:, h] = attn_scores[:, h] - lambda_ * text_mask.unsqueeze(1) * attn_scores[:, h]
        attn_probs = attn.softmax(attn_scores)
        x = attn_probs @ v
        # ... rest of block
    return x

Results§

  • On object classification (ImageNet-10), intervention improves accuracy under TA from 43% to 67%.
  • On VQA (RIO-Bench), applying intervention to vision encoders of LLaVA and InstructBLIP boosts accuracy by 15-20% under typographic attack.
  • The method generalizes across architectures and tasks, requiring no additional training.
SHARE RESEARCH: