Alignment
Published: December 15, 2022
Constitutional AI: Harmlessness from AI Feedback
By Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen
Research TL;DR
"Introduces Constitutional AI (CAI) for training harmless assistants. Leverages AI feedback guided by a set of written principles to automate safety alignment."
Abstract
We study methods to train a harmless AI assistant using unsupervised self-improvement, steered by a list of rules or principles called a "constitution". The resulting model is trained to critique and revise its own responses using AI feedback, removing the need for human safety labels.
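
The critique-and-revise idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `critique_and_revise`, the prompt wording, the stub `toy_model`, and the choice to sample one principle per round are all assumptions made for illustration. A real setup would call an actual language model in place of the stub.

```python
import random
from typing import Callable, List

# Hypothetical stand-in for a language model: maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, constitution: List[str],
                        rounds: int = 2, seed: int = 0) -> str:
    """Return a response to `prompt` after `rounds` of self-critique and revision."""
    rng = random.Random(seed)
    response = model(prompt)  # initial draft, possibly harmful
    for _ in range(rounds):
        # Assumption: one principle is sampled from the constitution per round.
        principle = rng.choice(constitution)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def toy_model(text: str) -> str:
    """Deterministic stub used only so the sketch runs without a real model."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The draft could cause harm."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    principles = ["Choose the response that is least harmful."]
    print(critique_and_revise(toy_model, "How do I pick a lock?", principles))
```

In a pipeline of this shape, the revised responses could serve as training targets in place of human-written safety labels; how the targets are actually used is not specified in this summary.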