Published: June 30, 2026

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

By Zifan Carl Guo, Laura Ruis, Jacob Andreas, Belinda Z. Li

Research TL;DR

"Training LMs to explain their predictions using fixed counterfactual supervision yields explanations that paradoxically track the model's own evolving behavior, enabling scalable introspection without updated labels."

Abstract

When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models' counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different families, frequently produce explanations more faithful to their own current behaviors than to those of their training targets. This "introspective" coupling between LM explanations and behaviors occurs when training explanations remain sufficiently correlated with current behaviors over the course of training, even as behaviors themselves shift. We also show that introspective coupling tracks behavior shifts: when explanation training is provided concurrently with other post-training objectives, explanations track those shifts without requiring updated supervision. This phenomenon appears in multiple tasks, including sycophancy and refusal, and is robust to label noise. Overall, our results show that even fixed datasets of counterfactual explanations can provide scalable and generalizable post-training signal for introspection.

Technical Analysis & Implementation

Overview

This paper studies introspective coupling in language models: when trained to generate counterfactual explanations using fixed supervision (e.g., from an earlier checkpoint), models frequently produce explanations that are more faithful to their own current behavior than to the training targets. As a result, explanations keep tracking the model as its behavior shifts, without new labels, which provides a scalable form of self-monitoring.

Core Methodology

The authors define a counterfactual explanation for a model $f_\theta$ on input $x$ with modification $x'$ as the difference in output: $$ \Delta_\theta(x, x') = f_\theta(x') - f_\theta(x). $$
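As a minimal sketch of this definition, the snippet below computes $\Delta_\theta(x, x')$ for a toy scoring function. The function `toy_model` and its keyword-based score are hypothetical stand-ins for $f_\theta$; any scalar-valued behavior (for example, the probability of agreeing with the user) could take its place.

```python
from typing import Callable

def counterfactual_effect(f: Callable[[str], float], x: str, x_mod: str) -> float:
    """Return Delta(x, x') = f(x') - f(x) for a scalar-valued model f."""
    return f(x_mod) - f(x)

def toy_model(prompt: str) -> float:
    # Hypothetical stand-in for f_theta: scores 1.0 if the prompt
    # contains a stated user opinion, else 0.0.
    return 1.0 if "I think" in prompt else 0.0

x = "I think the answer is B. What is the answer?"
x_mod = "What is the answer?"  # the user's opinion has been removed

delta = counterfactual_effect(toy_model, x, x_mod)
print(delta)  # -1.0: removing the opinion lowers the score
```

A negative value here means the removed feature was pushing the score up, which is the kind of influence the explanation is meant to report.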

They train an explanation model $h_\phi$ to predict this difference using a fixed target $\Delta_{\text{target}}$ obtained from an earlier checkpoint $\theta_0$ (or from a different but behaviorally similar model). The training loss is: $$ \mathcal{L}(\phi) = \mathbb{E}_{(x, x')} \left[ \| h_\phi(x, x') - \Delta_{\text{target}}(x, x') \|^2 \right]. $$

After training, they measure faithfulness as the correlation between $h_\phi(x, x')$ and the current model's true counterfactual effect $\Delta_\theta(x, x')$. Despite never seeing the current $\Delta_\theta$, the explanations become highly correlated with it; this is introspective coupling.
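The faithfulness comparison can be sketched as follows, assuming Pearson correlation as the measure. The three lists are made-up illustrative values, not results from the paper, and the name `coupling_gap` is introduced here only for illustration: it is the faithfulness to the current model minus the faithfulness to the fixed training target.

```python
import math
from typing import Sequence

def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical values for illustration only (not results from the paper).
explained = [0.9, -0.2, 0.4, -0.7, 0.1]       # explanation model outputs
delta_current = [1.0, -0.1, 0.5, -0.8, 0.0]   # Delta_theta from the current model
delta_target = [0.2, -0.6, 0.9, -0.1, 0.3]    # fixed Delta_target used for training

faith_self = pearson(explained, delta_current)
faith_target = pearson(explained, delta_target)
coupling_gap = faith_self - faith_target  # > 0 suggests introspective coupling
print(f"self={faith_self:.3f} target={faith_target:.3f} gap={coupling_gap:.3f}")
```

A positive gap indicates that the explanations describe the current model better than the model that produced the training labels.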

Key Insight

Introspective coupling occurs when the training signal $\Delta_{\text{target}}$ remains sufficiently correlated with the current $\Delta_\theta$ throughout training. Under that condition, even as the model's behavior shifts (for example, due to additional fine-tuning), the explanation model tracks those shifts without updated supervision. The effect is robust to label noise and appears across tasks such as sycophancy and refusal.

Implementation Details

  • Models: Decoder-only LMs (e.g., GPT-2, LLaMA) of varying sizes.
  • Counterfactual generation: For each input, a critical feature is removed (e.g., delete a sentence, mask a token) to produce $x'$.
  • Explanation model: A separate small LM or a linear probe on top of the base model's hidden states. It is trained on a fixed dataset of $(x, x', \Delta_{\text{target}})$ triples.
  • Training: Standard supervised learning with MSE loss; no RL or human feedback. The base model $f_\theta$ may be concurrently trained on another objective (e.g., standard cross-entropy or RLHF), causing $\Delta_\theta$ to drift.
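The counterfactual-generation and dataset steps above can be sketched as below. This is an illustration under stated assumptions: `drop_sentence`, `build_fixed_dataset`, and the word-count `earlier_checkpoint` are hypothetical helpers, and naive splitting on ". " stands in for whatever feature-removal procedure is actually used.

```python
from typing import Callable, List, Tuple

def drop_sentence(text: str, index: int) -> str:
    """Build x' by deleting the sentence at `index` (naive split on '. ')."""
    sentences = text.split(". ")
    kept = sentences[:index] + sentences[index + 1:]
    return ". ".join(kept)

def build_fixed_dataset(
    inputs: List[Tuple[str, int]],
    checkpoint_fn: Callable[[str], float],
) -> List[Tuple[str, str, float]]:
    """Return (x, x', Delta_target) triples computed once from checkpoint_fn."""
    dataset = []
    for x, idx in inputs:
        x_mod = drop_sentence(x, idx)
        delta_target = checkpoint_fn(x_mod) - checkpoint_fn(x)
        dataset.append((x, x_mod, delta_target))
    return dataset

def earlier_checkpoint(prompt: str) -> float:
    # Hypothetical stand-in for the earlier checkpoint: a word-count score.
    return float(len(prompt.split()))

data = build_fixed_dataset(
    [("I think it is B. What is the answer?", 0)],
    earlier_checkpoint,
)
print(data[0])
```

Because the targets are computed once and stored, the same triples can be reused while the base model continues to change.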

Code Snippet (PyTorch)

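import torch
import torch.nn as nn
import torch.nn.functional as F

# Base LM (frozen for explanation training). Toy stand-in so the snippet runs;
# a real setup would load a pretrained checkpoint here instead.
class BaseLM(nn.Module):
    def __init__(self, vocab_size=1000, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)

    def get_hidden(self, input_ids):
        # mean-pooled hidden state, shape (batch, hidden_dim)
        return self.embed(input_ids).mean(dim=1)

    def forward(self, input_ids):
        # scalar output per example, shape (batch,)
        return self.head(self.get_hidden(input_ids)).squeeze(-1)

# Explanation model (e.g., linear or small MLP)
class ExplanationModel(nn.Module):
    def __init__(self, hidden_dim, out_dim=1):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, hidden):
        return self.proj(hidden).squeeze(-1)

# Training loop
base_model = BaseLM().eval()
expl_model = ExplanationModel(hidden_dim=768)
optimizer = torch.optim.Adam(expl_model.parameters(), lr=1e-4)

# Placeholder batches of (x_orig, x_mod, delta_target) with random values;
# in practice delta_target is precomputed once from the earlier checkpoint.
dataloader = [
    (torch.randint(0, 1000, (8, 16)), torch.randint(0, 1000, (8, 16)), torch.randn(8))
    for _ in range(4)
]

# Fixed targets from earlier checkpoint
for x_orig, x_mod, delta_target in dataloader:
    with torch.no_grad():  # no gradients flow into the base LM
        h_orig = base_model.get_hidden(x_orig)
        h_mod = base_model.get_hidden(x_mod)
    delta_pred = expl_model(h_mod) - expl_model(h_orig)  # alternative: predict directly
    loss = F.mse_loss(delta_pred, delta_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()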

Results

  • On a sycophancy task (model learns to agree with user over time), explanations track the behavioral shift even when trained on pre-shift data.
  • On refusal (model learns to reject harmful requests), the same coupling is observed.
  • Faithfulness (Pearson correlation) increases over training time, reaching >0.8 in many settings.
  • The effect persists even if the training targets are corrupted with up to 50% noise.
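The summary does not say how the noise is applied, so the following is only one assumed reading of "50% noise": half of the fixed targets are overwritten with values resampled from the target list. The helper `corrupt_targets` and its parameters are hypothetical.

```python
import random
from typing import List

def corrupt_targets(targets: List[float], noise_rate: float, seed: int = 0) -> List[float]:
    """Overwrite a `noise_rate` fraction of targets with resampled values.

    Assumed noise model: chosen entries are replaced by values drawn
    uniformly from the original target list.
    """
    rng = random.Random(seed)
    n_corrupt = int(round(noise_rate * len(targets)))
    corrupt_idx = set(rng.sample(range(len(targets)), n_corrupt))
    return [
        rng.choice(targets) if i in corrupt_idx else t
        for i, t in enumerate(targets)
    ]

clean = [0.1 * i for i in range(10)]
noisy = corrupt_targets(clean, noise_rate=0.5)
unchanged = sum(1 for c, n in zip(clean, noisy) if c == n)
print(f"{unchanged} of {len(clean)} targets left unchanged")
```

Training on `noisy` in place of `clean` and then re-measuring faithfulness against the current model would be one way to probe the robustness claim.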

Significance

This work demonstrates a pathway to scalable introspection: fixed, once-collected explanation data can yield faithful explanations that automatically adapt to model changes. It suggests that self-explanation training can serve as a form of continuous monitoring without requiring human labels for each model update.
