Published: July 1, 2026

Language-Critique Imitation Learning from Suboptimal Demonstrations

By Chih-Han Yang, Dai-Jie Wu, Yun-Ping Huang, Ping-Chun Hsieh, Kenneth Marino, Shao-Hua Sun

Research TL;DR

"Uses natural language critiques as structured supervision for imitation learning from suboptimal demonstrations, avoiding scalar compression and improving policy learning."

Abstract

Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions. We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars. Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance. We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP. We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions. Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines. These results demonstrate that language can serve as a powerful and structured form of supervision for learning robust policies from suboptimal data.

Technical Analysis & Implementation

Overview

This paper introduces a framework that uses natural language critiques as a structured supervision signal for imitation learning from suboptimal demonstrations. Unlike prior work that compresses feedback into scalars (e.g., confidence scores, discriminator outputs), the method retains expressive intermediate reasoning about task progress, failure modes, and corrective actions.

Method

1. Language Label Construction: Given a suboptimal demonstration trajectory $\tau = \{ (s_t, a_t) \}$, a language model (or human) generates a critique label $c_t$ that describes progress, identifies suboptimal actions, and suggests corrective guidance. For example: "You are approaching the goal but too slowly; try moving faster next time."

2. Language-Critique Loss: The policy $\pi_\theta(a|s)$ is trained to minimize the language-critique loss, which uses these structured signals directly rather than reducing them to scalars. For behavior cloning (LC-BC), the loss is:

$$\mathcal{L}_{\text{LC-BC}} = \mathbb{E}_{(s,a,c) \sim \mathcal{D}} \left[ -\log \pi_\theta(a|s) \cdot \text{sim}(f(c), g(s,a)) \right]$$

where $\mathcal{D}$ is the dataset of critique-labeled transitions $(s, a, c)$, $f$ encodes the critique into a vector (e.g., using a pretrained language encoder), $g$ encodes the state-action pair into the same embedding space, and $\text{sim}$ is cosine similarity. For diffusion policies (LC-DP), the loss is applied per denoising step.

3. Theoretical Guarantee: The authors prove that the language-critique loss upper-bounds the expert performance gap under standard assumptions (bounded reward, Lipschitz policy), so driving the loss down tightens a bound on how far the learned policy falls short of the expert.
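As an illustration of step 1 (language label construction), the sketch below shows one way critique-labeled data could be organized. It is a minimal sketch, not the authors' pipeline: the names `Critique`, `LabeledTransition`, `build_critique_prompt`, and `label_segment` are hypothetical, the prompt wording is invented for illustration, and attaching one critique to every step of a segment is an assumption made here to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Critique:
    """Hypothetical container for the three parts of a critique label."""
    progress: str    # description of current task progress
    suboptimal: str  # the suboptimal behavior that was identified
    correction: str  # fine-grained corrective guidance

    def to_text(self) -> str:
        # Join the three parts into a single critique string c_t.
        return " ".join([self.progress, self.suboptimal, self.correction])


@dataclass
class LabeledTransition:
    """One (s_t, a_t, c_t) training tuple."""
    state: Sequence[float]
    action: Sequence[float]
    critique: str


def build_critique_prompt(task: str, states, actions) -> str:
    # Hypothetical prompt format for an LLM or human annotator.
    lines = [f"Task: {task}", "Trajectory segment:"]
    for t, (s, a) in enumerate(zip(states, actions)):
        lines.append(f"  t={t} state={list(s)} action={list(a)}")
    lines.append(
        "Describe the progress so far, identify any suboptimal behavior, "
        "and suggest a correction."
    )
    return "\n".join(lines)


def label_segment(states, actions, critique: Critique) -> List[LabeledTransition]:
    # Assumption: the same critique text is attached to every step of the segment.
    text = critique.to_text()
    return [LabeledTransition(s, a, text) for s, a in zip(states, actions)]


states = [(0.0, 0.0), (0.1, 0.0)]
actions = [(0.1, 0.0), (0.1, 0.0)]
prompt = build_critique_prompt("reach the goal at (1, 0)", states, actions)
critique = Critique(
    progress="You are approaching the goal.",
    suboptimal="You are moving too slowly.",
    correction="Try moving faster next time.",
)
dataset = label_segment(states, actions, critique)
print(prompt)
print(dataset[0].critique)
```

In practice the critique text would come from the annotator's response to the prompt; here it is written by hand so the example runs without any external service.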

Implementation Details

  • Critique Generation: Use a frozen LLM (e.g., GPT-3.5) to generate critiques by prompting with trajectory segments. Alternatively, use human annotations.
  • Encoder: A small transformer encoder (e.g., DistilBERT) maps critique text to embeddings; a separate MLP encodes $(s,a)$ pairs.
  • Policy: For LC-BC, a simple Gaussian policy; for LC-DP, a diffusion model with U-Net backbone.
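The two learned components that need no language model can be sketched in a few lines of PyTorch. This is a minimal sketch under assumptions, not the paper's architecture: `GaussianPolicy` and `StateActionEncoder` are hypothetical names, the hidden sizes and embedding dimension are arbitrary, and the policy uses a state-independent log standard deviation for simplicity. The text encoder and the LC-DP diffusion model are omitted.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy pi_theta(a|s); layer sizes are illustrative."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )
        # Assumption: one learned log-std per action dimension, shared across states.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def log_prob(self, states, actions):
        mean = self.mean_net(states)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        # Sum per-dimension log-densities to get log pi(a|s) per sample: [B]
        return dist.log_prob(actions).sum(dim=-1)


class StateActionEncoder(nn.Module):
    """MLP g(s, a) mapping concatenated state-action vectors to embeddings."""

    def __init__(self, state_dim, action_dim, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, state_actions):
        return self.net(state_actions)


torch.manual_seed(0)
states = torch.randn(4, 6)   # batch of 4 states, 6 dimensions each
actions = torch.randn(4, 2)  # batch of 4 actions, 2 dimensions each

policy = GaussianPolicy(state_dim=6, action_dim=2)
sa_encoder = StateActionEncoder(state_dim=6, action_dim=2)

log_prob = policy.log_prob(states, actions)                   # shape [4]
sa_embeds = sa_encoder(torch.cat([states, actions], dim=-1))  # shape [4, 32]
print(log_prob.shape, sa_embeds.shape)
```

The embedding dimension of the state-action encoder has to match the output dimension of whichever text encoder is used, since the two are compared with cosine similarity.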

Code Snippet

import torch
import torch.nn as nn

class LCLoss(nn.Module):
    """Language-critique loss for behavior cloning (LC-BC)."""

    def __init__(self, policy, text_encoder, state_action_encoder, temp=0.07):
        super().__init__()
        self.policy = policy  # must expose log_prob(states, actions) -> [B]
        self.text_encoder = text_encoder  # e.g., DistilBERT; must return [B, D]
        self.sa_encoder = state_action_encoder  # MLP over concatenated (s, a)
        self.temp = temp  # temperature that rescales the loss

    def forward(self, states, actions, critique_texts):
        # Encode critiques
        critique_embeds = self.text_encoder(critique_texts)  # [B, D]
        # Encode state-action pairs
        sa_embeds = self.sa_encoder(torch.cat([states, actions], dim=-1))  # [B, D]
        # Cosine similarity between each critique and its state-action pair
        sim = torch.cosine_similarity(critique_embeds, sa_embeds, dim=-1)  # [B]
        # Language-critique loss for BC: similarity-weighted negative log-likelihood
        log_prob = self.policy.log_prob(states, actions)  # [B]
        loss = -(log_prob * sim).mean() / self.temp
        return loss
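To see how the similarity weight acts on individual samples, the following dependency-free sketch evaluates the same weighted negative log-likelihood on three hand-picked transitions. The embeddings and log-probabilities are toy values chosen for illustration, not outputs of any trained encoder or policy, and `lc_bc_loss` is a hypothetical helper name. The temperature defaults to 1.0 here so the numbers are easy to check by hand.

```python
import math


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def lc_bc_loss(log_probs, critique_embeds, sa_embeds, temp=1.0):
    # Per-sample weight: cosine similarity between f(c) and g(s, a).
    weights = [cosine_similarity(c, z) for c, z in zip(critique_embeds, sa_embeds)]
    # Per-sample term: -log pi(a|s) * sim(f(c), g(s, a)).
    terms = [-lp * w for lp, w in zip(log_probs, weights)]
    return sum(terms) / len(terms) / temp, weights


# Toy values: three transitions with 2-D embeddings.
log_probs = [-0.5, -2.0, -1.0]
critique_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
sa_embeds = [[2.0, 0.0], [3.0, 0.0], [-1.0, 0.0]]

loss, weights = lc_bc_loss(log_probs, critique_embeds, sa_embeds)
print(weights)  # aligned, orthogonal, and opposed pairs
print(loss)
```

With these values the weights are 1, 0, and -1. The aligned pair contributes an ordinary behavior-cloning term, the orthogonal pair contributes nothing, and the opposed pair contributes a term whose minimization lowers the likelihood of that action.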

Experiments

  • Tasks: Continuous control in navigation (Maze2D), manipulation (Franka Kitchen), and gameplay (Atari).
  • Baselines: Behavior cloning (BC), GAIL, IQL, CQL, and prior methods for imitation learning from suboptimal data.
  • Results: LC-BC and LC-DP outperform all baselines, with up to 40% improvement in success rate on Kitchen. Ablations show that language critiques provide richer signal than scalar confidence scores.

Conclusion

Language critiques enable effective imitation learning from suboptimal demonstrations by preserving structured supervision, leading to robust policy learning without requiring expert data.