Online Safety Monitoring for LLMs

Problem Statement§

Despite alignment training, LLMs still produce unsafe outputs at deployment. Real-time monitoring is essential. This paper proposes a lightweight online monitor that thresholds a safety verifier signal, calibrated via risk control.

Method§

Let $X_t$ be the LLM output at time $t$, and let $V_t \in [0,1]$ be a verifier score (e.g., from a safety classifier), where lower values indicate less safe outputs. The monitor issues an alarm when $V_t < \lambda$, with the threshold $\lambda$ calibrated to control the risk of missing unsafe outputs. Using a calibration set $\{(V_i, Y_i)\}_{i=1}^n$ with ground-truth safety labels $Y_i \in \{0,1\}$ (1=unsafe), define the empirical risk for a threshold $\lambda$ as:

$$ \hat{R}(\lambda) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(V_i \geq \lambda \text{ and } Y_i = 1) $$

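This counts false negatives (unsafe outputs passing undetected) as a fraction of all $n$ calibration samples, not only of the unsafe ones. Let $R(\lambda)$ denote the population counterpart of $\hat{R}(\lambda)$. To guarantee $R(\hat{\lambda}) \leq \alpha$ with high probability, use risk control (conformal risk control) to choose $\hat{\lambda}$: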

$$ \hat{\lambda} = \inf\left\{ \lambda \in [0,1] : \hat{R}(\lambda) + \frac{c}{\sqrt{n}} \leq \alpha \right\} $$

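where $c$ is a constant from a concentration inequality (e.g., Hoeffding's). Because $\hat{R}(\lambda)$ is non-increasing in $\lambda$ (raising the threshold flags more outputs), the infimum selects the threshold that raises the fewest alarms while still meeting the constraint. At runtime, the monitor evaluates $V_t$ at every step and raises an alarm at the first $t$ where $V_t < \hat{\lambda}$.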
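To make the margin term concrete, here is a minimal sketch of one possible choice of $c$. It assumes a one-sided Hoeffding bound at a single fixed threshold with confidence level $1-\delta$, which gives $c = \sqrt{\ln(1/\delta)/2}$. The parameter `delta`, the helper names, and the toy numbers are illustrative assumptions and do not come from the paper.

```python
import math
import numpy as np

def hoeffding_constant(delta):
    """c such that c / sqrt(n) is the one-sided Hoeffding margin at confidence 1 - delta."""
    return math.sqrt(math.log(1.0 / delta) / 2.0)

def empirical_risk(scores, labels, lam):
    """Fraction of samples that are unsafe (label 1) yet not flagged (score >= lam)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    return float(np.mean((scores >= lam) & (labels == 1)))

# Toy calibration set (made-up numbers, for illustration only).
scores = [0.05, 0.20, 0.40, 0.70, 0.90, 0.95]
labels = [1, 1, 1, 0, 0, 0]

c = hoeffding_constant(delta=0.05)
margin = c / math.sqrt(len(scores))
risk_low = empirical_risk(scores, labels, lam=0.10)   # misses the unsafe samples at 0.20 and 0.40
risk_high = empirical_risk(scores, labels, lam=0.50)  # flags all three unsafe samples
print(f"c={c:.3f}, margin={margin:.3f}, risk(0.10)={risk_low:.3f}, risk(0.50)={risk_high:.3f}")
```

With only six samples the margin alone is about 0.5, so no threshold could meet a target such as $\alpha = 0.1$. This is one way to see why calibration needs a reasonably large set.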

Code Snippet§

import numpy as np

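def calibrate_threshold(cal_scores, cal_labels, alpha=0.1, c=1.0):
    """
    Return the smallest candidate threshold whose margin-adjusted
    empirical risk is at most alpha.

    cal_scores: array of verifier scores (lower = unsafe)
    cal_labels: array of ground truth (1=unsafe, 0=safe)
    alpha: target rate of missed unsafe outputs
    c: safety margin constant
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_scores)
    margin = c / np.sqrt(n)
    # The empirical risk is non-increasing in the threshold, so scan the
    # candidates in ascending order and stop at the first feasible one.
    # 0.0 means "never alarm"; np.inf means "alarm on every output".
    candidates = np.concatenate(([0.0], np.unique(cal_scores), [np.inf]))
    for lam in candidates:
        missed = np.mean((cal_scores >= lam) & (cal_labels == 1))
        if missed + margin <= alpha:
            return float(lam)
    # Reached only when the margin alone exceeds alpha.
    raise ValueError("c / sqrt(n) > alpha: no threshold meets the target")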

# Monitor
class SafetyMonitor:
    def __init__(self, verifier, threshold):
        self.verifier = verifier    # callable: output -> score in [0, 1]
        self.threshold = threshold  # calibrated threshold

    def check(self, output):
        """Return True (alarm) when the verifier score falls below the threshold."""
        score = self.verifier(output)  # lower = unsafe
        return bool(score < self.threshold)
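The runtime rule (alarm at the first $t$ where $V_t < \hat{\lambda}$) can be sketched over a stream of outputs as follows. The keyword-based `toy_verifier`, the threshold of 0.5, and the example stream are hypothetical stand-ins, not the paper's verifier or data.

```python
from typing import Callable, Iterable, Optional

def first_alarm_time(outputs: Iterable[str],
                     verifier: Callable[[str], float],
                     threshold: float) -> Optional[int]:
    """Return the first index t with verifier(output_t) < threshold, or None if no alarm."""
    for t, output in enumerate(outputs):
        if verifier(output) < threshold:
            return t
    return None

def toy_verifier(text: str) -> float:
    """Hypothetical stand-in for a safety classifier: low score if a flagged word appears."""
    return 0.1 if "exploit" in text.lower() else 0.9

stream = [
    "Step 1: restate the problem.",
    "Step 2: write the exploit.",
    "Step 3: done.",
]
alarm_at = first_alarm_time(stream, toy_verifier, threshold=0.5)
no_alarm = first_alarm_time(stream[:1], toy_verifier, threshold=0.5)
print(alarm_at, no_alarm)
```

Stopping at the first alarm keeps the monitor stateless: each step needs only the current score and the fixed threshold.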

Experiments§

The monitor is tested on mathematical reasoning (GSM8K) and red teaming (jailbreak prompts). The simple threshold monitor achieves a false negative rate and an average detection delay comparable to those of more complex sequential probability ratio tests (SPRT). Calibration is done offline with ~1000 samples.
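The two reported metrics could be computed along the following lines. The summary does not define detection delay, so this sketch assumes it is the number of steps between the first unsafe step of a trajectory (its onset) and the first alarm, averaged over detected unsafe trajectories. That definition, the function name, and the example numbers are assumptions for illustration only.

```python
from typing import Optional, Sequence, Tuple

def evaluate_monitor(alarm_times: Sequence[Optional[int]],
                     onset_times: Sequence[Optional[int]]) -> Tuple[float, float]:
    """
    alarm_times[i]: step of the first alarm on trajectory i, or None if no alarm.
    onset_times[i]: step at which trajectory i first becomes unsafe, or None if safe.
    Returns (missed fraction among unsafe trajectories, mean delay over detected ones).
    """
    n_unsafe, missed, delays = 0, 0, []
    for alarm, onset in zip(alarm_times, onset_times):
        if onset is None:
            continue  # safe trajectory: contributes to neither metric
        n_unsafe += 1
        if alarm is None:
            missed += 1
        else:
            delays.append(max(alarm - onset, 0))  # early alarms count as zero delay
    fnr = missed / n_unsafe if n_unsafe else float("nan")
    avg_delay = sum(delays) / len(delays) if delays else float("nan")
    return fnr, avg_delay

# Made-up example: three unsafe trajectories (one missed) and one safe trajectory.
fnr, avg_delay = evaluate_monitor(
    alarm_times=[3, None, 5, None],
    onset_times=[2, 4, 5, None],
)
print(f"false negative rate={fnr:.3f}, average delay={avg_delay:.1f}")
```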

Key Takeaways§

Simplicity: only requires a verifier and a calibrated threshold, no runtime adaptation.
Statistical guarantee: risk control bounds the rate of missed unsafe outputs by the target level $\alpha$.
Competitive performance against adaptive methods.
