LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

Abstract

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.

Technical Analysis & Implementation

Technical Breakdown

Problem & Motivation

Existing unlearning benchmarks evaluate only output-level erasure, leaving it unclear whether knowledge is truly removed from model parameters. LACUNA addresses this by providing ground-truth parameter-level localization of memorized PII.

Methodology: LACUNA Testbed

Synthetic PII Injection: LACUNA constructs synthetic individuals with associated PII (name, SSN, etc.) and injects this knowledge into specific, predefined parameters $\theta_{target}$ of OLMo-based models (1B and 7B) via masked continual pretraining. The injection objective is written here as a loss over the masked PII tokens:

$$\mathcal{L}_{inject} = -\mathbb{E}_{\mathbf{x} \sim \mathcal{D}_{PII}} \left[ \sum_{m \in M} \log p_\theta(x_m | \mathbf{x}_{\setminus m}) \right]$$

where $M$ is the set of masked tokens corresponding to PII values. Training is restricted to update only $\theta_{target}$, so the PII knowledge is localized there by construction. Precision is measured by how closely the parameters an unlearning method modifies match $\theta_{target}$.
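
The restriction of training to $\theta_{target}$ can be pictured as a gradient mask applied at each optimizer step. The following is a minimal sketch on a flat list of parameters, not the testbed's actual training code; the function name `masked_sgd_step` and all values are hypothetical and chosen only for illustration.

```python
def masked_sgd_step(params, grads, target_mask, lr):
    """Apply one SGD step only to entries where target_mask is True.

    Entries outside the target set are returned unchanged, which is what
    keeps the injected knowledge confined to the predefined parameters.
    """
    return [p - lr * g if m else p for p, g, m in zip(params, grads, target_mask)]


# Hypothetical flat parameter vector, gradient, and target mask.
params = [0.5, -1.0, 2.0, 0.25]
grads = [0.1, 0.4, -0.2, 0.3]
target_mask = [False, True, True, False]

updated = masked_sgd_step(params, grads, target_mask, lr=0.5)
print(updated)
```

Only the second and third entries move; the first and last keep their original values even though their gradients are nonzero.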

Evaluation: Unlearning methods (e.g., gradient ascent and NPO) are applied, and both output-level metrics (verbatim recall) and parameter-level metrics (L2 distance of $\theta_{target}$ from its injected values, consistency of gradient updates) are reported. Resurfacing attacks test whether the erased knowledge can be recovered via fine-tuning on auxiliary data.
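
To make the parameter-level view concrete, the sketch below computes two quantities on a toy flat parameter vector: the L2 distance of the target weights from their injected values, and the share of the total squared update that lands on the target weights. The second quantity, `update_precision`, is an illustrative assumption rather than a metric confirmed by the source, and all names and numbers are hypothetical.

```python
import math


def target_l2_distance(injected, unlearned, target_mask):
    """L2 distance between injected and unlearned values, over target entries only."""
    return math.sqrt(
        sum((u - i) ** 2 for i, u, m in zip(injected, unlearned, target_mask) if m)
    )


def update_precision(injected, unlearned, target_mask):
    """Fraction of the total squared update that falls on target entries.

    1.0 means the method changed only target weights; values near 0.0 mean
    most of the change went elsewhere. Returns 0.0 if nothing changed.
    """
    delta_sq = [(u - i) ** 2 for i, u in zip(injected, unlearned)]
    total = sum(delta_sq)
    if total == 0.0:
        return 0.0
    return sum(d for d, m in zip(delta_sq, target_mask) if m) / total


# Hypothetical weights after injection and after an unlearning method ran.
injected = [0.2, 1.5, -0.7, 0.9]
unlearned = [0.5, 1.5, -0.3, 0.9]
target_mask = [False, True, True, False]

dist = target_l2_distance(injected, unlearned, target_mask)
prec = update_precision(injected, unlearned, target_mask)
print(dist, prec)
```

In this toy case the method moved one target weight and one non-target weight, so a noticeable part of the update mass falls outside the target set even though the target weights did change.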

Key Findings

Current SOTA methods achieve low verbatim recall (output-level success) but are highly imprecise at the parameter level (target weights remain close to injected values).
Precise localization enables simple gradient-based unlearning (e.g., gradient ascent on $\theta_{target}$ only) to achieve both strong erasure and resistance to resurfacing.
Imprecise methods are vulnerable: even after unlearning, fine-tuning on auxiliary data can resurface the memorized PII.
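
The second finding refers to gradient ascent applied to $\theta_{target}$ only. A minimal sketch of that idea follows, using a toy linear model with squared-error loss as a stand-in for the language model; the model, the helper names, and the numbers are all assumptions made for illustration and do not reproduce the paper's setup.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def loss(w, x, y):
    """Squared-error loss of the toy linear model on one example."""
    r = dot(w, x) - y
    return 0.5 * r * r


def grad(w, x, y):
    """Gradient of the loss with respect to each weight."""
    r = dot(w, x) - y
    return [r * xi for xi in x]


def targeted_ascent_step(w, x, y, target_mask, lr):
    """One gradient ASCENT step, restricted to the target weights."""
    g = grad(w, x, y)
    return [wi + lr * gi if m else wi for wi, gi, m in zip(w, g, target_mask)]


# Hypothetical weights, one "memorized" example, and the ground-truth target mask.
w = [0.5, 1.0, -0.3, 0.8]
x = [1.0, 2.0, 0.0, 1.0]
y = 3.0
target_mask = [False, True, False, True]

loss_before = loss(w, x, y)
w_unlearned = targeted_ascent_step(w, x, y, target_mask, lr=0.5)
loss_after = loss(w_unlearned, x, y)
print(loss_before, loss_after)
```

The loss on the memorized example rises while the non-target weights stay exactly as they were, which is the property that ground-truth localization makes it possible to check.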

Code Snippet: Simulated Injection Process

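import torch
import torch.nn as nn
import torch.nn.functional as F

class OLMoBlock(nn.Module):
    """Toy residual block standing in for an OLMo layer (illustrative only)."""
    def __init__(self, d_model):
        super().__init__()
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + torch.relu(self.fc(x))

class ToyLM(nn.Module):
    """Minimal stand-in for the pretrained model: embedding, blocks, output head."""
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([OLMoBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        h = self.embed(input_ids)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)  # logits: (batch, seq, vocab)

vocab_size, d_model, batch_size, seq_len, mask_id = 1000, 64, 4, 32, 0
model = ToyLM(vocab_size, d_model, n_layers=8)

# Freeze everything except the predefined target: layers[5].fc.weight
model.requires_grad_(False)
target_params = model.layers[5].fc.weight.requires_grad_(True)  # (d_model, d_model)
optimizer = torch.optim.SGD([target_params], lr=1e-4)

# Synthetic PII batch (random tokens); positions 10..14 hold the PII tokens
input_ids = torch.randint(1, vocab_size, (batch_size, seq_len))
mask = torch.zeros_like(input_ids, dtype=torch.bool)
mask[:, 10:15] = True

# Hide the PII tokens at the input, then compute the loss only on masked positions
logits = model(input_ids.masked_fill(mask, mask_id))
loss = F.cross_entropy(logits[mask], input_ids[mask])

# Backward and update only target_params
optimizer.zero_grad()
loss.backward()
optimizer.step()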

Conclusion

LACUNA demonstrates that output-level evaluation alone is insufficient for assessing unlearning. When localization is precise, even a simple gradient-based method achieves strong erasure and robustness to resurfacing attacks, which highlights the importance of precise localization. The testbed provides a standardized way to benchmark and improve localization-based unlearning methods.
