alignment

Published: February 4, 2026

Reconciling safety and utility in reinforcement learning alignment

By Sarah Meade, Alex Johnson, Liam Patel

Research TL;DR

"Proposes an optimization framework to mitigate over-refusal in aligned LLMs. Balances safety bounds against instruction utility."

Abstract

Safety constraints in RLHF often lead to over-refusal and decreased utility. We present a Pareto-optimization framework that balances alignment constraints against task performance, ensuring models remain helpful while refusing malicious queries.
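
To make the trade-off concrete, here is a minimal sketch of selecting Pareto-optimal candidates on two axes, safety and utility. It is not the paper's method: the candidate names, the scores, and the `pareto_front` helper are hypothetical and serve only to illustrate what "not dominated" means when neither objective can be improved without hurting the other.

```python
from typing import List, Tuple

# (name, safety, utility); higher is better on both axes.
# All names and scores below are made up for illustration.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    front = []
    for name, safety, utility in candidates:
        # A candidate is dominated if another one is at least as good on
        # both axes and strictly better on at least one.
        dominated = any(
            (s2 >= safety and u2 >= utility) and (s2 > safety or u2 > utility)
            for _, s2, u2 in candidates
        )
        if not dominated:
            front.append((name, safety, utility))
    return front


candidates = [
    ("over_refusing", 0.99, 0.40),  # very safe, refuses too much
    ("balanced", 0.95, 0.80),
    ("permissive", 0.70, 0.90),     # helpful, weaker on safety
    ("dominated", 0.90, 0.60),      # worse than "balanced" on both axes
]

front = pareto_front(candidates)
print([name for name, _, _ in front])
```

In this toy setting the over-refusing policy still sits on the front because nothing beats it on safety; choosing among the remaining points is where a preference between safety and utility has to be stated explicitly.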

Read full paper on arXiv →

Related Research

Direct preference optimization: Your language model is secretly a reward model

Constitutional AI: Harmlessness from AI feedback