Language-Critique Imitation Learning from Suboptimal Demonstrations

Overview

This paper introduces a framework that leverages natural language critiques as a structured supervision signal for imitation learning from suboptimal demonstrations. Unlike prior works that compress feedback into scalars (e.g., confidence scores, discriminator outputs), the method retains expressive intermediate reasoning about task progress, failures, and corrections.

Method

1. Language Label Construction: Given a suboptimal demonstration trajectory $\tau = \{ (s_t, a_t) \}$, a language model (or human) generates a critique label $c_t$ that describes progress, identifies suboptimal actions, and suggests corrective guidance. For example: "You are approaching the goal but too slowly; try moving faster next time."

2. Language-Critique Loss: The policy $\pi_\theta(a|s)$ is trained to minimize the language-critique loss, which directly uses these structured signals without reducing them to scalars. For behavior cloning (LC-BC), the loss is:

$$\mathcal{L}_{\text{LC-BC}} = \mathbb{E}_{(s,a,c) \sim \mathcal{D}} \left[ -\log \pi_\theta(a|s) \cdot \text{sim}(f(c), g(s,a)) \right]$$

where $f$ encodes the critique into a vector (e.g., using a pretrained language encoder), $g$ encodes the state-action pair into the same embedding space, and $\text{sim}$ is cosine similarity. For diffusion policies (LC-DP), the loss is applied per denoising step.

3. Theoretical Guarantee: The authors prove that the language-critique loss upper-bounds the expert performance gap under standard assumptions (bounded reward, Lipschitz policy), implying that minimizing this loss reduces suboptimality.
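
As a concrete instance of the LC-BC loss in step 2, the sketch below evaluates it by hand on a two-sample batch using only the standard library. It assumes a one-dimensional Gaussian policy and toy two-dimensional embeddings standing in for $f(c)$ and $g(s,a)$; the helper names (`cosine_similarity`, `gaussian_log_prob`, `lc_bc_loss`) and all numbers are illustrative and do not come from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy evaluated at `action`."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def lc_bc_loss(batch):
    """Batch average of -log pi(a|s) * sim(f(c), g(s, a))."""
    terms = []
    for item in batch:
        weight = cosine_similarity(item["f_c"], item["g_sa"])
        log_prob = gaussian_log_prob(item["action"], item["mean"], item["std"])
        terms.append(-log_prob * weight)
    return sum(terms) / len(terms)

batch = [
    # Critique and state-action embeddings are parallel: weight 1.
    {"action": 0.5, "mean": 0.5, "std": 1.0, "f_c": [1.0, 0.0], "g_sa": [2.0, 0.0]},
    # Embeddings are orthogonal: weight 0, so this sample contributes nothing.
    {"action": 3.0, "mean": 0.0, "std": 1.0, "f_c": [1.0, 0.0], "g_sa": [0.0, 1.0]},
]
loss = lc_bc_loss(batch)
```

Only the first sample contributes here, so the result is half of its negative log-likelihood. Because cosine similarity ranges over $[-1, 1]$, a sample with negative similarity would enter the average with the opposite sign, which pushes the policy away from that action when the loss is minimized.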

Implementation Details

Critique Generation: Use a frozen LLM (e.g., GPT-3.5) to generate critiques by prompting with trajectory segments. Alternatively, use human annotations.
Encoder: A small transformer encoder (e.g., DistilBERT) maps critique text to embeddings; a separate MLP encodes $(s,a)$ pairs.
Policy: For LC-BC, a simple Gaussian policy; for LC-DP, a diffusion model with U-Net backbone.
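
The critique-generation step could be organized roughly as follows. This is a minimal sketch, not the authors' pipeline: `Step`, `build_critique_prompt`, and `label_segment` are hypothetical names, the prompt wording is an assumption, and the LLM is represented by any callable that maps a prompt string to a completion string, so no specific API is implied. A stub stands in for the model here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Step:
    """One (state, action) pair from a demonstration trajectory."""
    state: Sequence[float]
    action: Sequence[float]

def build_critique_prompt(task: str, segment: List[Step]) -> str:
    """Serialize a trajectory segment into a text prompt (wording is assumed)."""
    lines = [
        f"Task: {task}",
        "Below is a segment of a possibly suboptimal demonstration.",
    ]
    for t, step in enumerate(segment):
        lines.append(f"t={t} state={list(step.state)} action={list(step.action)}")
    lines.append(
        "Describe the progress made, point out any suboptimal actions, "
        "and suggest a correction in one or two sentences."
    )
    return "\n".join(lines)

def label_segment(task: str, segment: List[Step],
                  generate: Callable[[str], str]) -> str:
    """Query the supplied text generator and return a cleaned critique."""
    return generate(build_critique_prompt(task, segment)).strip()

def fake_llm(prompt: str) -> str:
    """Stub standing in for a frozen LLM; returns a fixed critique."""
    return " You are approaching the goal but too slowly; try moving faster next time. "

segment = [Step([0.0, 0.0], [0.1, 0.0]), Step([0.1, 0.0], [0.1, 0.0])]
prompt = build_critique_prompt("reach the goal at (1, 0)", segment)
critique = label_segment("reach the goal at (1, 0)", segment, fake_llm)
```

Passing the generator in as a callable keeps the labeling code the same whether critiques come from an LLM or from human annotators entering text.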

Code Snippet

import torch
import torch.nn as nn

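class LCLoss(nn.Module):
    """Behavior-cloning loss weighted by critique/state-action similarity."""

    def __init__(self, text_encoder, state_action_encoder, policy, temp=0.07):
        super().__init__()
        # Assumed to map a batch of critique strings to [B, D] embeddings
        # (e.g., a wrapper around DistilBERT that tokenizes and pools).
        self.text_encoder = text_encoder
        self.sa_encoder = state_action_encoder  # MLP over concatenated (s, a)
        # The original snippet used self.policy without defining it; it is
        # assumed to expose log_prob(states, actions) -> [B].
        self.policy = policy
        self.temp = temp  # loss scale; not part of the LC-BC formula

    def forward(self, states, actions, critique_texts):
        # Encode critiques
        critique_embeds = self.text_encoder(critique_texts)  # [B, D]
        # Encode state-action pairs
        sa_embeds = self.sa_encoder(torch.cat([states, actions], dim=-1))  # [B, D]
        # Cosine similarity between the two embeddings, in [-1, 1]
        sim = torch.cosine_similarity(critique_embeds, sa_embeds, dim=-1)  # [B]
        # Similarity-weighted negative log-likelihood (LC-BC)
        log_prob = self.policy.log_prob(states, actions)  # [B]
        # Dividing by temp only rescales the loss and its gradients
        loss = -(log_prob * sim).mean() / self.temp
        return loss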
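
The snippet above calls `self.policy.log_prob(states, actions)` but does not show the policy itself. A minimal sketch of a Gaussian policy exposing that method might look like the following; the class name `GaussianPolicy`, the hidden size, and the state-independent log standard deviation are assumptions made for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy with a learned, state-independent log-std."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        # Small MLP that predicts the action mean from the state.
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )
        # One log-std per action dimension, initialized to log(1) = 0.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def log_prob(self, states, actions):
        mean = self.mean_net(states)  # [B, A]
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        # Sum per-dimension log-densities to get one value per sample.
        return dist.log_prob(actions).sum(dim=-1)  # [B]

policy = GaussianPolicy(state_dim=4, action_dim=2)
states = torch.randn(8, 4)
actions = torch.randn(8, 2)
log_prob = policy.log_prob(states, actions)
```

Summing over the action dimension treats the dimensions as independent, which is what a diagonal Gaussian assumes, and yields the `[B]`-shaped tensor that the loss multiplies by the similarity weights.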

Experiments

Tasks: Navigation (Maze2D), manipulation (Franka Kitchen), and gameplay (Atari). The first two are continuous-control benchmarks, while Atari uses discrete actions.
Baselines: Behavior cloning (BC), GAIL, IQL, CQL, and prior methods for imitation learning from suboptimal data.
Results: LC-BC and LC-DP outperform all baselines, with up to 40% improvement in success rate on Kitchen. Ablations show that language critiques provide richer signal than scalar confidence scores.

Conclusion

Language critiques enable effective imitation learning from suboptimal demonstrations by preserving structured supervision, leading to robust policy learning without requiring expert data.
