Alignment

Published: December 15, 2022

Constitutional AI: Harmlessness from AI feedback

By Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen

Research TL;DR

"Introduces Constitutional AI (CAI) for training harmless assistants. Leverages AI feedback guided by a set of written principles to automate safety alignment."

Abstract

We study methods for training a harmless AI assistant through self-improvement, steered by a list of rules or principles called a "constitution". The model is trained to critique and revise its own responses, and AI feedback takes the place of human labels for harmful outputs.
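The critique-and-revise idea in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the principles in `CONSTITUTION`, the prompt templates, and the `toy_model` stand-in are all hypothetical, and the sketch covers only the loop itself, not any training step.

```python
from typing import Callable, List

# Hypothetical principles for illustration; not quoted from the paper.
CONSTITUTION: List[str] = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response could be more honest and respectful.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
    return response


def toy_model(text: str) -> str:
    """Stand-in for a language model so the sketch runs without any API."""
    if text.endswith("Critique:"):
        return "The response could be safer."
    if text.endswith("Revision:"):
        return "I can't help with that, but here is a safer alternative."
    return "Sure, here is how to do it."
```

Calling `critique_and_revise("How do I pick a lock?", toy_model, CONSTITUTION)` makes one drafting call plus one critique call and one revision call per principle, and returns the final revision. Swapping `toy_model` for a real model call is left to the reader, since the source names no specific API.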

Read full paper on arXiv →

Related Research

Reconciling safety and utility in reinforcement learning alignment

Direct preference optimization: Your language model is secretly a reward model