alignment | Published: June 30, 2026

Freeform Preference Learning for Robotic Manipulation

By Marcel Torne, Anubha Mahajan, Abhijnya Bhat, Chelsea Finn

Research TL;DR

"Freeform Preference Learning (FPL) uses natural-language preference axes with pairwise comparisons to train a language-conditioned reward model, enabling multi-objective policy optimization and test-time steering."

Abstract

Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/

Technical Analysis & Implementation

Freeform Preference Learning (FPL)

FPL targets a limitation of binary preferences in RLHF for robotics: a single "which trajectory is better overall" judgment collapses competing notions of quality into one ambiguous signal. Instead, annotators name natural-language axes (e.g., speed, safety) and provide pairwise comparisons along each axis. These annotations train a language-conditioned reward model, which is then used to train a reward-conditioned policy.

Methodology

Data Annotation: Given trajectory pairs $(\tau_1, \tau_2)$, annotators define a freeform label $l$ (e.g., "fast") and indicate which trajectory better satisfies that property. Multiple axes can be annotated per pair.
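
As a rough illustration of what such annotations might look like as data, the sketch below stores one record per (pair, axis) judgment and flattens the records into training rows. The `AxisPreference` schema, the `to_training_rows` helper, and the episode IDs are hypothetical names introduced here; the source does not describe a storage format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class AxisPreference:
    """One pairwise judgment along one freeform axis (hypothetical schema)."""
    traj_a_id: str
    traj_b_id: str
    label: str         # freeform axis text chosen by the annotator, e.g. "fast"
    a_preferred: bool  # True if trajectory A better satisfies `label`


def to_training_rows(
    prefs: Sequence[AxisPreference],
) -> List[Tuple[str, str, str, float]]:
    """Flatten annotations into (traj_a_id, traj_b_id, label, c) rows, c in {0.0, 1.0}."""
    return [
        (p.traj_a_id, p.traj_b_id, p.label, 1.0 if p.a_preferred else 0.0)
        for p in prefs
    ]


# The same trajectory pair annotated along two axes, with opposite outcomes.
annotations = [
    AxisPreference("ep_017", "ep_042", "fast", True),
    AxisPreference("ep_017", "ep_042", "careful", False),
]
rows = to_training_rows(annotations)
```

Keeping one record per axis means a single pair can contribute several training examples, and two axes are free to disagree about which trajectory wins.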

Reward Model: The reward model $R(\tau, l)$ maps a trajectory and label to a scalar. It is trained with a Bradley-Terry preference loss:

$$\mathcal{L} = -\mathbb{E}_{(\tau_1,\tau_2,l,c)} \left[ c \log \sigma(R(\tau_1,l)-R(\tau_2,l)) + (1-c) \log \sigma(R(\tau_2,l)-R(\tau_1,l)) \right]$$

where $c=1$ if $\tau_1$ is preferred over $\tau_2$ along axis $l$, and $c=0$ otherwise.
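
To make the loss concrete, here is a minimal pure-Python sketch that evaluates it for a single comparison. The function names `sigmoid` and `bradley_terry_loss` are illustrative, and the scalar rewards are made-up inputs rather than outputs of any trained model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; fine for the small inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry_loss(r1: float, r2: float, c: float) -> float:
    """Per-example loss: -[c log s(r1 - r2) + (1 - c) log s(r2 - r1)]."""
    return -(c * math.log(sigmoid(r1 - r2))
             + (1.0 - c) * math.log(sigmoid(r2 - r1)))


# The model scores tau_1 one unit above tau_2 on the chosen axis.
agree = bradley_terry_loss(2.0, 1.0, c=1.0)      # annotator also prefers tau_1
disagree = bradley_terry_loss(2.0, 1.0, c=0.0)   # annotator prefers tau_2
tie = bradley_terry_loss(0.5, 0.5, c=1.0)        # model is indifferent

print(round(agree, 4), round(disagree, 4), round(tie, 4))
# prints: 0.3133 1.3133 0.6931
```

The loss is small when the reward gap agrees with the annotator, larger when it disagrees, and equal to $\log 2$ when the model assigns both trajectories the same reward. Because $1 - \sigma(x) = \sigma(-x)$, this is the same quantity as binary cross-entropy applied to the logit $R(\tau_1,l) - R(\tau_2,l)$.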

Architecture: The model encodes trajectories (e.g., via a transformer over state-action sequences) and labels (via a frozen BERT). Their embeddings are concatenated and passed through an MLP to compute reward.

Policy Learning: Using PPO, the policy $\pi(a \mid s, l)$ is trained to maximize $R(\tau, l)$. At test time, users specify a desired behavior via natural language (e.g., "gentle" or "fast") to condition the policy.
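
One way to write the objective implied by this description is as an expectation over both labels and rollouts. The label distribution $p(l)$ is an assumption introduced here for illustration; the source does not state how labels are sampled during policy training.

$$J(\pi) = \mathbb{E}_{l \sim p(l)} \, \mathbb{E}_{\tau \sim \pi(\cdot \mid \cdot,\, l)} \left[ R(\tau, l) \right]$$

Under this reading, steering at test time amounts to fixing $l$ to the user's chosen phrase and rolling out $\pi(a \mid s, l)$ with no further gradient updates.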

Code Snippet

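import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

class RewardModel(nn.Module):
    """Scores a trajectory against a natural-language preference axis."""

    def __init__(self, traj_dim=128, hidden_dim=128, bert_name='bert-base-uncased'):
        super().__init__()
        # An LSTM stands in for the trajectory encoder in this sketch.
        self.traj_encoder = nn.LSTM(traj_dim, hidden_dim, batch_first=True)
        self.tokenizer = BertTokenizerFast.from_pretrained(bert_name)
        self.text_encoder = BertModel.from_pretrained(bert_name)
        # Freeze BERT so only the trajectory encoder and the MLP head are trained.
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        text_dim = self.text_encoder.config.hidden_size  # 768 for bert-base
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, traj_seq, labels):
        # traj_seq: (batch, T, traj_dim); labels: sequence of `batch` strings
        _, (h, _) = self.traj_encoder(traj_seq)
        traj_feat = h[-1]  # final hidden state of the last LSTM layer
        tokens = self.tokenizer(list(labels), padding=True, truncation=True,
                                return_tensors='pt').to(traj_seq.device)
        with torch.no_grad():  # the text encoder is frozen
            label_feat = self.text_encoder(**tokens).pooler_output
        feat = torch.cat([traj_feat, label_feat], dim=-1)
        return self.fc(feat).squeeze(-1)

# Training loop
# `dataloader` is assumed to yield (traj1, traj2, labels, pref) batches, where
# pref is 1 if traj1 is preferred along the label's axis and 0 otherwise.
model = RewardModel()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
for (traj1, traj2, labels, pref) in dataloader:
    r1 = model(traj1, labels)
    r2 = model(traj2, labels)
    logits = r1 - r2  # Bradley-Terry: P(traj1 preferred) = sigmoid(r1 - r2)
    loss = F.binary_cross_entropy_with_logits(logits, pref.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()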

Results

Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference baselines by 38 percentage points. The learned reward provides dense progress signals without explicit subtask segmentation, and the policy composes preferences (e.g., "fast and precise") even when such combinations do not appear in the training data.
