Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

Technical Breakdown

Core Methodology

Sparse autoencoders (SAEs) decompose dense activations of vision foundation models into sparse, monosemantic latent features. The Top‑k SAE retains only the $k$ largest activations per input. This enforces hard sparsity, but the fixed budget $k$ may overfit to training conditions and limit robustness. This work introduces two soft sparsity regularizers, applied to the pre‑Top‑k activations, that complement the hard constraint.

Regularizers

Let $\mathbf{z} \in \mathbb{R}^d$ be the pre‑activation latent vector (before Top‑k). Define the set of batch‑active units $\mathcal{A}$ as those selected by Top‑k at least once in a batch. The two regularizers are:

1. ℓ₁ off‑support penalty: penalizes units that are batch‑active but fall outside the top‑k for a given sample: $$ \mathcal{L}_{\text{off}} = \beta \sum_{i \in \mathcal{A} \setminus \text{Top-k}(\mathbf{z}, k)} |z_i| $$ where $\text{Top-k}(\mathbf{z}, k)$ denotes the indices of the $k$ largest entries of $\mathbf{z}$. This encourages the model to concentrate each sample's important features within its top‑k.

2. ℓ₁/ℓ₂ ratio penalty: scale‑invariant, applied to the pre‑Top‑k vector $\mathbf{z}$ restricted to batch‑active units: $$ \mathcal{L}_{\text{ratio}} = \gamma \frac{\|\mathbf{z}_\mathcal{A}\|_1}{\|\mathbf{z}_\mathcal{A}\|_2} $$ where $\mathbf{z}_\mathcal{A}$ is the sub‑vector of $\mathbf{z}$ indexed by $\mathcal{A}$. Minimizing this ratio concentrates activation mass onto fewer latents, effectively reducing the number of active units.
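
As a quick check on the ratio penalty (these are standard norm inequalities, not a result taken from the source): for any nonzero vector $\mathbf{v} \in \mathbb{R}^m$ and any scalar $c \neq 0$,

$$ \frac{\|c\,\mathbf{v}\|_1}{\|c\,\mathbf{v}\|_2} = \frac{\|\mathbf{v}\|_1}{\|\mathbf{v}\|_2}, \qquad 1 \le \frac{\|\mathbf{v}\|_1}{\|\mathbf{v}\|_2} \le \sqrt{m}. $$

The lower bound is attained when exactly one entry is nonzero, and the upper bound when all entries have equal magnitude. For example, $(5, 0)$ gives $5/5 = 1$, $(3, 4)$ gives $7/5 = 1.4$, and $(1, 1)$ gives $2/\sqrt{2} = \sqrt{2} \approx 1.41$. Taking $m = |\mathcal{A}|$, this suggests why lowering the ratio moves mass onto fewer batch‑active latents rather than merely shrinking $\mathbf{z}$.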

Both regularizers are added to the standard reconstruction loss $\mathcal{L}_{\text{rec}}$ (e.g., MSE).
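
To make the definitions above concrete, here is a minimal pure‑Python sketch that evaluates both penalties on a toy batch. The helper names (`topk_indices`, `soft_sparsity_penalties`), the averaging over the batch, and the small `eps` in the denominator are assumptions made for illustration; the source only gives the per‑sample formulas.

```python
import math

def topk_indices(z, k):
    """Indices of the k largest entries of z (ties broken by lower index)."""
    order = sorted(range(len(z)), key=lambda i: (-z[i], i))
    return set(order[:k])

def soft_sparsity_penalties(batch, k, beta=0.1, gamma=0.1, eps=1e-8):
    """Return (off_support, ratio) penalties, each averaged over the batch."""
    selected = [topk_indices(z, k) for z in batch]
    # Batch-active set A: units selected by Top-k for at least one sample.
    batch_active = set().union(*selected)

    off_terms, ratio_terms = [], []
    for z, top in zip(batch, selected):
        # Off-support term: batch-active units outside this sample's top-k.
        off_terms.append(sum(abs(z[i]) for i in batch_active - top))
        # l1/l2 ratio over the batch-active sub-vector z_A.
        z_a = [z[i] for i in sorted(batch_active)]
        l1 = sum(abs(v) for v in z_a)
        l2 = math.sqrt(sum(v * v for v in z_a))
        ratio_terms.append(l1 / (l2 + eps))

    n = len(batch)
    return beta * sum(off_terms) / n, gamma * sum(ratio_terms) / n

# Toy batch: 2 samples, 4 latent units, k = 1.
batch = [
    [3.0, 4.0, 1.0, 0.5],  # top-1 is unit 1
    [2.0, 0.0, 0.2, 0.0],  # top-1 is unit 0
]
off, ratio = soft_sparsity_penalties(batch, k=1, beta=1.0, gamma=1.0)
print(f"off-support: {off:.3f}, ratio: {ratio:.3f}")
```

Here $\mathcal{A} = \{0, 1\}$, so units 2 and 3 are never penalized. The first sample contributes $|3| = 3$ to the off‑support term and $7/5$ to the ratio; the second contributes $0$ and $1$, giving batch means of about 1.5 and 1.2.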

Implementation Details

The model and its loss follow the standard Top‑k SAE setup, with both regularizers computed on the pre‑Top‑k activations:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopkSAE(nn.Module):
    def __init__(self, input_dim, latent_dim, k, beta=0.1, gamma=0.1):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_dim)
        self.k = k
        self.beta = beta
        self.gamma = gamma

    def forward(self, x):
        z_pre = self.encoder(x)  # (batch, latent_dim)
        # Top-k selection (hard sparsity)
        topk_vals, topk_idx = torch.topk(z_pre, self.k, dim=1)
        z_hard = torch.zeros_like(z_pre)
        z_hard.scatter_(1, topk_idx, topk_vals)
        x_recon = self.decoder(z_hard)
        return x_recon, z_pre, topk_idx

    def loss(self, x, x_recon, z_pre, topk_idx):
        rec_loss = F.mse_loss(x_recon, x)
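        # Per-sample selection mask: True where Top-k kept the unit
        selected = torch.zeros_like(z_pre, dtype=torch.bool)
        selected.scatter_(1, topk_idx, True)
        # Batch-active units: selected by Top-k for at least one sample in the batch
        batch_active = selected.any(dim=0, keepdim=True)  # (1, latent_dim)
        batch_indicator = batch_active.float()
        # Off-support: units that are batch-active but not selected for this sample
        off_mask = (batch_active & ~selected).float()
        reg_off = self.beta * (z_pre.abs() * off_mask).sum(dim=1).mean()
        # l1/l2 ratio on batch-active units (masked entries contribute zero to both norms)
        z_batch_active = z_pre * batch_indicator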
        l1 = z_batch_active.abs().sum(dim=1)
        l2 = z_batch_active.norm(p=2, dim=1)
        reg_ratio = self.gamma * (l1 / (l2 + 1e-8)).mean()
        return rec_loss + reg_off + reg_ratio

Results & Insights

Both regularizers consistently improve monosemanticity (measured by interpretability scores) across DINOv2, MAE, and CLIP on ImageNet and CIFAR‑10, without hurting reconstruction MSE.
The ℓ₁/ℓ₂ penalty further concentrates information into fewer latents (lower effective k), making reconstruction more robust to inference‑time choice of k.
Linear probing with small budgets benefits from the ℓ₁/ℓ₂ penalty, indicating more compact and discriminative features.
Key finding: Hard architectural sparsity and soft sparsity regularization are complementary; they address different limitations and together yield better SAEs.
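
The robustness claim above concerns changing $k$ at inference time. One way to probe it is sketched below: sweep the budget and record reconstruction MSE for each value. The source does not describe its evaluation protocol, so the function names (`reconstruct_at_k`, `mse_by_k`), the layer sizes, and the set of budgets are hypothetical, and the encoder and decoder are untrained stand‑ins, so the printed numbers carry no meaning beyond showing the mechanics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reconstruct_at_k(encoder, decoder, x, k):
    """Encode x, keep the k largest pre-activations per sample, then decode."""
    z_pre = encoder(x)
    vals, idx = torch.topk(z_pre, k, dim=1)
    z_hard = torch.zeros_like(z_pre).scatter(1, idx, vals)
    return decoder(z_hard)

@torch.no_grad()
def mse_by_k(encoder, decoder, x, ks):
    """Reconstruction MSE for each inference-time budget in ks."""
    return {
        k: F.mse_loss(reconstruct_at_k(encoder, decoder, x, k), x).item()
        for k in ks
    }

torch.manual_seed(0)
input_dim, latent_dim = 16, 64               # hypothetical sizes
encoder = nn.Linear(input_dim, latent_dim)   # untrained stand-in for a trained SAE encoder
decoder = nn.Linear(latent_dim, input_dim)   # untrained stand-in for a trained SAE decoder
x = torch.randn(8, input_dim)

results = mse_by_k(encoder, decoder, x, ks=[4, 8, 16, 32, 64])
for k, mse in results.items():
    print(f"k={k:3d}  mse={mse:.4f}")
```

With trained weights in place of the stand‑ins, a flatter MSE curve across budgets would be one indication of the robustness described above.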
