alignment | Published: July 2, 2026

Online Safety Monitoring for LLMs

By Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick

Research TL;DR

"Simple threshold-based monitor with risk control calibration competes with sequential hypothesis testing for online LLM safety."

Abstract

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.

Technical Analysis & Implementation

Problem Statement

Despite alignment training, LLMs still produce unsafe outputs at deployment. Real-time monitoring is essential. This paper proposes a lightweight online monitor that thresholds a safety verifier signal, calibrated via risk control.

Method

Let $X_t$ be the LLM output at time $t$, and let $V_t \in [0,1]$ be a verifier score (e.g., from a safety classifier), where lower values indicate less safe output. The monitor raises an alarm when $V_t < \lambda$, with the threshold $\lambda$ calibrated to control the risk of missing unsafe outputs. Given a calibration set $\{(V_i, Y_i)\}_{i=1}^n$ with ground-truth safety labels $Y_i \in \{0,1\}$ (1 = unsafe), define the empirical risk of a threshold $\lambda$ as:

$$ \hat{R}(\lambda) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(V_i \geq \lambda \text{ and } Y_i = 1) $$

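This is the rate at which unsafe outputs pass undetected, referred to here as the false negative rate (note that it is normalized by $n$ rather than by the number of unsafe samples). Raising $\lambda$ can only trigger more alarms, so $\hat{R}(\lambda)$ is non-increasing in $\lambda$. Conformal risk control is a related approach that bounds the risk in expectation instead. To guarantee that the population risk satisfies $R(\hat{\lambda}) \leq \alpha$ with high probability, risk control chooses $\hat{\lambda}$ as the smallest threshold whose empirical risk plus a concentration margin stays at or below $\alpha$: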

$$ \hat{\lambda} = \inf\left\{ \lambda \in [0,1] : \hat{R}(\lambda) + \frac{c}{\sqrt{n}} \leq \alpha \right\} $$

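where $c$ is a constant from a concentration inequality. For example, Hoeffding's inequality gives $c = \sqrt{\ln(1/\delta)/2}$ for a bound that holds with probability at least $1-\delta$ at a fixed $\lambda$. At runtime, the monitor evaluates $V_t$ at every step and raises an alarm at the first $t$ where $V_t < \hat{\lambda}$.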
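For a sense of scale, the margin term can be computed directly. The sketch below assumes the Hoeffding form of $c$ for a single fixed threshold and a hypothetical failure probability `delta`; neither is specified in the summary above, and the helper names are illustrative only.

```python
import math

def hoeffding_constant(delta):
    """c such that c / sqrt(n) is a one-sided Hoeffding margin at level delta."""
    return math.sqrt(math.log(1.0 / delta) / 2.0)

def min_calibration_size(alpha, c):
    """Smallest n with c / sqrt(n) <= alpha, i.e. the margin alone fits under alpha."""
    return math.ceil((c / alpha) ** 2)

c = hoeffding_constant(delta=0.05)            # assumed delta, for illustration only
n_min = min_calibration_size(alpha=0.1, c=c)  # below this n, no threshold can pass
margin_at_1000 = c / math.sqrt(1000)          # margin with ~1000 calibration samples
```

Under these assumed values, about 150 calibration samples are needed before any threshold can satisfy the check, and at $n = 1000$ the margin is roughly 0.039, leaving about 0.061 of an $\alpha = 0.1$ budget for the empirical risk itself.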

Code Snippet

import numpy as np

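def calibrate_threshold(cal_scores, cal_labels, alpha=0.1, c=1.0):
    """
    Return the smallest candidate threshold whose empirical risk plus margin is <= alpha.

    cal_scores: array of verifier scores in [0, 1] (lower = unsafe)
    cal_labels: array of ground truth (1=unsafe, 0=safe)
    alpha: target rate of unsafe outputs that pass without an alarm
    c: safety margin constant (margin is c / sqrt(n))
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_scores)
    margin = c / np.sqrt(n)
    # Candidates in ascending order: 0.0 (never alarm), then each observed score.
    for lam in np.concatenate(([0.0], np.unique(cal_scores))):
        # Empirical risk: unsafe samples with score >= lam raise no alarm.
        risk = np.mean((cal_scores >= lam) & (cal_labels == 1))
        # The risk is non-increasing in lam, so the first pass is the smallest.
        if risk + margin <= alpha:
            return float(lam)
    # No candidate passes: fall back to alarming on every output.
    return float("inf")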

# Monitor
class SafetyMonitor:
    def __init__(self, verifier, threshold):
        self.verifier = verifier
        self.threshold = threshold
    
    def check(self, output):
        score = self.verifier(output)  # lower = unsafe
        if score < self.threshold:
            return True  # alarm
        return False
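The online behaviour described in the Method section (alarm at the first step whose score falls below the threshold) can be sketched as a loop over a stream of outputs. This is a minimal illustration: the lookup-table verifier, the step names, and the threshold of 0.40 are hypothetical stand-ins, not values from the paper.

```python
from typing import Callable, Iterable, Optional

def first_alarm_step(outputs: Iterable[str],
                     verifier: Callable[[str], float],
                     threshold: float) -> Optional[int]:
    """Return the 1-based step of the first alarm, or None if no alarm fires."""
    for t, output in enumerate(outputs, start=1):
        if verifier(output) < threshold:  # lower score = less safe
            return t
    return None

# Hypothetical stand-in verifier: a fixed score per output (not a real model).
toy_scores = {"step A": 0.92, "step B": 0.81, "step C": 0.34, "step D": 0.15}
toy_verifier = toy_scores.__getitem__

# Iterating the dict yields its keys in insertion order: A, B, C, D.
alarm_at = first_alarm_step(toy_scores, toy_verifier, threshold=0.40)
no_alarm = first_alarm_step(["step A", "step B"], toy_verifier, threshold=0.40)
```

Here the first stream triggers the alarm at step 3, the first score below 0.40, while the second stream never alarms and yields `None`. Stopping at the first alarm means later outputs are never scored, which is what makes detection delay a natural evaluation metric.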

Experiments

The monitor is evaluated on mathematical reasoning (GSM8K) and red teaming (jailbreak prompts). The simple threshold monitor achieves a false negative rate and average detection delay comparable to those of more complex monitors based on the sequential probability ratio test (SPRT). Calibration is done offline with ~1000 samples.
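For readers unfamiliar with the baseline, the following is a textbook sketch of Wald's SPRT on binary flags (for instance, whether a verifier score fell below some cutoff). It is not the paper's implementation, which is not detailed here; the rates `p0` and `p1`, the error levels, and the flag sequences are all assumed for illustration.

```python
import math
from typing import Iterable, Optional, Tuple

def sprt(flags: Iterable[int], p0: float, p1: float,
         alpha: float = 0.05, beta: float = 0.05) -> Tuple[str, Optional[int]]:
    """Wald's SPRT for Bernoulli flags: H0 flag rate p0 (safe) vs H1 rate p1 (unsafe)."""
    upper = math.log((1 - beta) / alpha)  # crossing it rejects H0: raise an alarm
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0: declare safe
    llr = 0.0                             # cumulative log-likelihood ratio
    for t, x in enumerate(flags, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "alarm", t
        if llr <= lower:
            return "safe", t
    return "undecided", None

# Hypothetical flag sequences (1 = verifier flagged the step).
flagged_run = sprt([0, 1, 1, 0, 1, 1], p0=0.1, p1=0.5)
clean_run = sprt([0] * 10, p0=0.1, p1=0.5)
```

With these assumed settings, the flagged sequence crosses the upper boundary at step 5 and the clean sequence crosses the lower boundary at step 6. Unlike the threshold monitor, the SPRT accumulates evidence across steps and needs two hypothesized rates, which is the added complexity the comparison refers to.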

Key Takeaways

  • Simplicity: only requires a verifier and a calibrated threshold, no runtime adaptation.
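  • Statistical guarantee: risk control bounds the false negative rate at the target level $\alpha$, provided deployment data resemble the calibration data.
  • Competitive performance: false negative rate and detection delay comparable to SPRT-based monitors.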