alignment
Published: February 4, 2026
Reconciling safety and utility in reinforcement learning alignment
By Sarah Meade, Alex Johnson, Liam Patel
Research TL;DR
"Proposes an optimization framework to mitigate over-refusal in aligned LLMs. Balances safety bounds against instruction utility."
Abstract
Safety constraints in RLHF often lead to over-refusal and decreased utility. We present a Pareto-optimization framework that balances alignment constraints against task performance, ensuring models remain helpful while refusing malicious queries.
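To make the Pareto idea concrete, here is a minimal sketch of selecting non-dominated candidates when each candidate policy is scored on two objectives, safety and utility, both to be maximized. This is not the authors' framework: the names `dominates` and `pareto_front`, the candidate labels, and every score below are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (name, safety_score, utility_score); higher is better for both.
# All names and numbers are made up for illustration, not reported results.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is at least as good as b on both objectives
    and strictly better on at least one."""
    no_worse = a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


candidates: List[Candidate] = [
    ("over_refusing", 0.99, 0.40),  # very safe, but refuses too much
    ("balanced", 0.95, 0.80),
    ("permissive", 0.70, 0.90),     # helpful, but weaker on safety
    ("dominated", 0.90, 0.60),      # worse than "balanced" on both axes
]

front = pareto_front(candidates)
for name, safety, utility in front:
    print(f"{name}: safety={safety:.2f}, utility={utility:.2f}")
```

In this toy setup, an over-refusing policy can still sit on the front because it is the safest option. Choosing among front members requires an additional preference, such as a minimum acceptable safety score.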