Online Safety Monitoring for LLMs
By Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick
"Simple threshold-based monitor with risk control calibration competes with sequential hypothesis testing for online LLM safety."
Abstract
Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
Technical Analysis & Implementation
Problem Statement§
Despite alignment training, LLMs still produce unsafe outputs at deployment. Real-time monitoring is essential. This paper proposes a lightweight online monitor that thresholds a safety verifier signal, calibrated via risk control.
Method§
Let $X_t$ be the LLM output at time $t$, and let $V_t \in [0,1]$ be a verifier score (e.g., from a safety classifier), where lower values indicate a less safe output. The monitor issues an alarm when $V_t < \lambda$, with the threshold $\lambda$ calibrated to control the risk of missing unsafe outputs. Using a calibration set $\{(V_i, Y_i)\}_{i=1}^n$ with ground-truth safety labels $Y_i \in \{0,1\}$ (1 = unsafe), define the empirical risk for a threshold $\lambda$ as:
$$ \hat{R}(\lambda) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(V_i \geq \lambda \text{ and } Y_i = 1) $$
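This counts misses (unsafe outputs that pass undetected) as a fraction of all $n$ calibration points. The false negative rate in the usual sense would instead divide by the number of unsafe points, $\sum_i Y_i$, so $\hat{R}(\lambda)$ is never larger than that rate. To keep the true risk $R(\lambda)$ at or below a target level $\alpha$, choose $\hat{\lambda}$ by risk control. Conformal risk control bounds the risk in expectation, whereas a concentration-based margin like the one below targets a bound that holds with high probability over the draw of the calibration set: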
$$ \hat{\lambda} = \inf\left\{ \lambda \in [0,1] : \hat{R}(\lambda) + \frac{c}{\sqrt{n}} \leq \alpha \right\} $$
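where $c$ is a constant derived from a concentration inequality (e.g., Hoeffding's). Because $\hat{R}(\lambda)$ is non-increasing in $\lambda$, the infimum selects the smallest threshold that meets the bound, which keeps alarms as rare as the guarantee allows. At runtime, the monitor evaluates $V_t$ at every step and raises an alarm at the first $t$ for which $V_t < \hat{\lambda}$.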
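As one illustration of where $c$ might come from: for a loss bounded in $[0,1]$ and a single threshold fixed in advance, the one-sided Hoeffding inequality gives $c = \sqrt{\ln(1/\delta)/2}$ at confidence level $1-\delta$. The sketch below computes that constant and applies the stopping rule to a stream of scores. The names `hoeffding_constant` and `first_alarm_time` and the parameter `delta` are illustrative and do not come from the paper. Since $\hat{\lambda}$ is selected from data rather than fixed in advance, a more careful correction may be needed in practice.

```python
import math
from typing import Iterable, Optional


def hoeffding_constant(delta: float) -> float:
    """Constant c such that c / sqrt(n) is a one-sided Hoeffding margin
    for a [0, 1]-bounded loss at confidence 1 - delta (fixed threshold)."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(math.log(1.0 / delta) / 2.0)


def first_alarm_time(scores: Iterable[float], threshold: float) -> Optional[int]:
    """Return the first zero-based index t with score < threshold,
    or None if the monitor never raises an alarm."""
    for t, score in enumerate(scores):
        if score < threshold:
            return t
    return None
```

With `delta = 0.05` and a calibration set of 1000 points, the margin `hoeffding_constant(0.05) / math.sqrt(1000)` is a little under 0.04, so a target of $\alpha = 0.1$ leaves roughly 0.06 for the empirical risk itself.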
Code Snippet§
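
    import numpy as np


    def calibrate_threshold(cal_scores, cal_labels, alpha=0.1, c=1.0):
        """
        cal_scores: array of verifier scores (lower = unsafe)
        cal_labels: array of ground truth (1=unsafe, 0=safe)
        alpha: target risk level (share of points that are unsafe yet pass)
        c: safety margin constant
        """
        cal_scores = np.asarray(cal_scores, dtype=float)
        cal_labels = np.asarray(cal_labels)
        n = len(cal_scores)
        margin = c / np.sqrt(n)
        # The empirical risk is non-increasing in the threshold, so the first
        # feasible candidate in ascending order is the smallest feasible one.
        for lam in np.unique(cal_scores):
            risk = np.mean((cal_scores >= lam) & (cal_labels == 1))
            if risk + margin <= alpha:
                return float(lam)
        # No candidate meets the bound: fall back to alarming on every output.
        return float("inf")


    # Monitor
    class SafetyMonitor:
        def __init__(self, verifier, threshold):
            self.verifier = verifier    # callable: output -> score in [0, 1]
            self.threshold = threshold  # calibrated threshold

        def check(self, output):
            """Return True (alarm) when the verifier score falls below the threshold."""
            score = self.verifier(output)  # lower = unsafe
            return score < self.threshold

Experiments§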
The monitor is evaluated on mathematical reasoning (GSM8K) and red teaming (jailbreak prompts). The simple threshold monitor achieves a false negative rate and an average detection delay comparable to those of more complex sequential probability ratio tests (SPRT). Calibration is done offline with ~1000 samples.
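The two metrics named above could be computed along the following lines. This is a sketch under assumptions not stated in the summary: each trajectory is a sequence of per-step verifier scores paired with a hypothetical `onset` index marking the first unsafe step (`None` for a safe trajectory), a miss is an unsafe trajectory on which no alarm is ever raised, and delay is the alarm step minus the onset. The paper's exact definitions may differ.

```python
from typing import List, Optional, Sequence, Tuple


def evaluate_monitor(
    trajectories: Sequence[Tuple[Sequence[float], Optional[int]]],
    threshold: float,
) -> Tuple[float, float]:
    """Return (miss_rate, mean_detection_delay) over the unsafe trajectories."""
    unsafe = 0
    misses = 0
    delays: List[int] = []
    for scores, onset in trajectories:
        if onset is None:
            continue  # safe trajectory: not counted in either metric
        unsafe += 1
        # First step at which the score drops below the threshold, if any.
        alarm = next((t for t, s in enumerate(scores) if s < threshold), None)
        if alarm is None:
            misses += 1
        elif alarm >= onset:
            delays.append(alarm - onset)
        # alarm < onset is an early alarm: neither a miss nor a delay sample.
    miss_rate = misses / unsafe if unsafe else 0.0
    mean_delay = sum(delays) / len(delays) if delays else float("nan")
    return miss_rate, mean_delay
```

Reporting early alarms separately, rather than folding them into the delay, keeps the delay non-negative and leaves the false alarm cost visible as its own quantity.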
Key Takeaways§
- Simplicity: only requires a verifier and a calibrated threshold, no runtime adaptation.
- Statistical guarantee: risk control bounds the rate of missed unsafe outputs at the target level $\alpha$, provided deployment data resembles the calibration set.
- Competitive performance against adaptive methods.