alignment | Published: May 29, 2023

Direct preference optimization: Your language model is secretly a reward model

By Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Research TL;DR

"Proposes Direct Preference Optimization (DPO) as an alternative to PPO-based RLHF. Simplifies alignment by optimizing the policy directly from human preference data."

Abstract

We present Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight algorithm for aligning LLMs with human preferences. DPO avoids the instability of traditional RLHF by optimizing the policy directly on preference pairs with a simple classification-style loss, so it needs neither a separately trained reward model nor a reinforcement learning loop.
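
To make the idea of optimizing directly from preference data concrete, here is a minimal sketch of the per-example DPO loss for one (chosen, rejected) response pair. It is an illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the inputs are assumed to be summed log-probabilities of each full response under the trained policy and under a frozen reference model, and the default `beta=0.1` is only an illustrative value.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value; controls deviation from the reference
) -> float:
    """DPO loss for a single preference pair (hypothetical helper).

    Each argument is the log-probability of a whole response given the prompt.
    """
    # Implicit rewards: how much more likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary classification loss on the scaled reward margin.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls below that as the policy raises the chosen response's likelihood relative to the rejected one. In practice the same expression would be averaged over a batch and differentiated with an autodiff framework.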

Read full paper on arXiv →