alignment | Published: July 2, 2026

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

By Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, Verna Dankers

Research TL;DR

"Introduces LACUNA, a testbed that injects PII into known LLM parameters via masked continual pretraining. It reveals that current unlearning methods are imprecise and vulnerable to resurfacing attacks, while precise localization enables robust erasure."

Abstract

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.

Technical Analysis & Implementation

Technical Breakdown§

Problem & Motivation§

Existing unlearning benchmarks evaluate only output-level erasure, leaving it unclear whether knowledge is truly removed from model parameters. LACUNA addresses this by providing ground-truth parameter-level localization of memorized PII.

Methodology: LACUNA Testbed§

Synthetic PII Injection: LACUNA constructs synthetic individuals with associated PII (name, SSN, etc.) and injects this knowledge into specific, predefined parameters $\theta_{target}$ of OLMo-based models (1B and 7B) via masked continual pretraining. The injection process uses a masked language modeling objective:

$$\mathcal{L}_{inject} = -\mathbb{E}_{\mathbf{x} \sim \mathcal{D}_{PII}} \left[ \sum_{m \in M} \log p_\theta(x_m | \mathbf{x}_{\setminus m}) \right]$$

where $M$ is the set of masked token positions corresponding to PII values and $\mathbf{x}_{\setminus m}$ denotes the sequence with position $m$ hidden. Training updates only $\theta_{target}$, so the PII knowledge is localized there by construction. Precision is then measured by how closely the changes made by an unlearning method coincide with $\theta_{target}$.
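
One way to write the restricted update, as a sketch rather than the paper's stated formulation, is with a binary mask $\mathbf{m}$ over all parameters (notation introduced here for illustration) that equals 1 on $\theta_{target}$ and 0 elsewhere, and a learning rate $\eta$:

$$\theta \leftarrow \theta - \eta \, \mathbf{m} \odot \nabla_\theta \mathcal{L}_{inject}$$

Under this update, every parameter outside $\theta_{target}$ keeps its pretrained value, which is what makes the location of the injected knowledge known in advance.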

Evaluation: Unlearning methods (e.g., gradient ascent, NPO) are applied, and both output-level metrics (verbatim recall) and parameter-level metrics (L2 distance of $\theta_{target}$ from its injected values, consistency of gradient updates) are reported. Resurfacing attacks test whether the erased knowledge can be recovered via fine-tuning on auxiliary data.
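
The parameter-level view can be illustrated with a small dependency-free sketch. The L2 distance follows the description above; `localization_precision` (the share of the squared update norm that lands on the target indices) is an assumed definition chosen for illustration and may differ from the metric the paper uses. The weight values and index set are hypothetical.

```python
import math

def l2_distance(current, reference):
    """Euclidean distance between two flat weight vectors of equal length."""
    return math.sqrt(sum((c - r) ** 2 for c, r in zip(current, reference)))

def localization_precision(before, after, target_idx):
    """Share of the squared update norm that falls on the target indices.

    Assumed definition for illustration, not necessarily the paper's metric.
    """
    sq = [(a - b) ** 2 for a, b in zip(after, before)]
    total = sum(sq)
    if total == 0.0:
        return 0.0  # no parameter moved, so nothing was localized
    return sum(sq[i] for i in target_idx) / total

# Hypothetical flattened weights; indices 1 and 2 play the role of theta_target.
injected = [0.5, 1.0, -1.0, 0.25]
target_idx = [1, 2]

precise = [0.5, 0.0, 0.0, 0.25]      # only target weights moved
imprecise = [1.5, 1.0, -1.0, -0.75]  # only non-target weights moved

for name, w in [("precise", precise), ("imprecise", imprecise)]:
    dist = l2_distance([w[i] for i in target_idx],
                       [injected[i] for i in target_idx])
    prec = localization_precision(injected, w, target_idx)
    print(f"{name}: target L2 = {dist:.3f}, precision = {prec:.2f}")
```

In this toy case the imprecise update leaves the target weights exactly at their injected values (distance 0) even though the model as a whole has changed, which is the failure mode a purely output-level evaluation cannot detect.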

Key Findings§

  • Current SOTA methods achieve low verbatim recall (output-level success) but have high parameter-level imprecision (target weights remain close to injected values).
  • Precise localization enables simple gradient-based unlearning (e.g., gradient ascent on $\theta_{target}$ only) to achieve both strong erasure and resistance to resurfacing.
  • Imprecise methods are vulnerable: even after unlearning, fine-tuning on public data can reactivate memorized PII.
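
Gradient ascent restricted to the target parameters, as mentioned in the findings above, can be sketched on a toy softmax classifier with hand-written gradients. Everything here is an illustrative assumption: the model, the mask layout, the learning rate, and the helper names `nll_and_grad` and `masked_gradient_ascent` are not taken from LACUNA.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll_and_grad(W, x, y):
    """Negative log-likelihood of class y under softmax(W @ x), with dL/dW."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    loss = -math.log(p[y])
    grad = [[(p[i] - (1.0 if i == y else 0.0)) * xj for xj in x]
            for i in range(len(W))]
    return loss, grad

def masked_gradient_ascent(W, x, y, mask, lr=0.5, steps=20):
    """Raise the loss on (x, y), updating only entries where mask is 1."""
    W = [row[:] for row in W]  # work on a copy, leave the input untouched
    for _ in range(steps):
        _, grad = nll_and_grad(W, x, y)
        for i, row in enumerate(W):
            for j in range(len(row)):
                row[j] += lr * mask[i][j] * grad[i][j]  # ascent: +lr * grad
    return W

# Hypothetical toy: 3 output tokens, 2 input features; the "PII" is label y = 0.
W0 = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x, y = [1.0, 0.5], 0
mask = [[1, 1], [0, 0], [0, 0]]  # stands in for theta_target: row 0 only

W1 = masked_gradient_ascent(W0, x, y, mask)
loss_before, _ = nll_and_grad(W0, x, y)
loss_after, _ = nll_and_grad(W1, x, y)
print(f"loss on forget example: {loss_before:.3f} -> {loss_after:.3f}")
```

The mask plays the role of a known localization: the loss on the forget example rises while every weight outside the mask is left exactly as it was.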

Code Snippet: Simulated Injection Process§

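import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes for illustration only (not the paper's configuration)
vocab_size, d_model, batch_size, seq_len = 100, 32, 4, 20

class OLMoBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + torch.relu(self.fc(x))

class ToyModel(nn.Module):
    def __init__(self, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([OLMoBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        h = self.embed(input_ids)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)  # logits: (batch, seq, vocab)

# Stand-in for a pretrained model; we inject into layers[5].fc.weight only
model = ToyModel()
target_params = model.layers[5].fc.weight  # shape: (d_model, d_model)
optimizer = torch.optim.SGD([target_params], lr=1e-4)

# Synthetic PII batch: positions 10..14 hold the PII tokens
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
mask = torch.zeros_like(input_ids).bool()
mask[:, 10:15] = True  # mask PII tokens
masked_ids = input_ids.masked_fill(mask, 0)  # token 0 stands in for [MASK]

# Forward pass; cross-entropy loss only on the masked positions
logits = model(masked_ids)
loss = F.cross_entropy(logits[mask], input_ids[mask])

# Backward and update only target_params (the optimizer holds nothing else)
optimizer.zero_grad()
loss.backward()
optimizer.step()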

Conclusion§

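LACUNA demonstrates that output-level evaluation is insufficient for unlearning: methods that score well on outputs can still leave the responsible weights largely intact and remain open to resurfacing attacks. When localization is precise, even a simple gradient-based method achieves robust erasure. The testbed provides a standardized way to benchmark and improve localization-based unlearning methods.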
