DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand

Core Idea

DexCompose addresses the challenge of composing two pretrained full-hand dexterous manipulation policies for multi-task execution with a single hand. The key insight is that overlapping finger usage between tasks causes destructive interference. The framework decomposes action spaces at the finger level and introduces two asymmetric residual modules: a stabilizer that preserves the first skill's outcome, and an adaptor that enables the second skill within a residual subspace.

Methodology

Given two pretrained policies $\pi_1$ (e.g., object retention) and $\pi_2$ (e.g., interaction), the goal is to execute both sequentially without forgetting. DexCompose consists of three phases:

1. Finger Mask Release Test: Roll out $\pi_1$ to collect successful final states $s^*$. For each candidate finger mask $m \in \{0,1\}^{D_{finger}}$ (where $D_{finger}$ is the total finger DoF and $m_i = 1$ marks a DoF retained by $\pi_1$), release the remaining fingers (zero out their actions) and check whether the object state $s_{obj}$ remains unchanged (e.g., the object is still grasped). Record the minimal mask that maintains the state: $m^* = \arg\min_{m} \|m\|_0$ s.t. $\text{release}(s^*, m)$ preserves the object state. This identifies the fingers necessary for the first skill.

2. Residual Stabilizer: Train a small network $f_\theta$ that outputs a bounded residual $\delta a_1 = \text{tanh}(f_\theta(s, z))$, where $z$ is a context embedding. The residual is added to $\pi_1$'s action only on the necessary fingers (mask $m^*$). The stabilizer is trained with reinforcement learning (RL) to minimize the state deviation $\|s - s^*\|$ while the hand executes the second skill.

3. Context-Aware Residual: For the downstream policy $\pi_2$, its action space is projected onto the available fingers (the complement of $m^*$). A second residual module $g_\phi$ learns to output $\delta a_2$ only in that subspace. The composite action is $$a = \underbrace{(\pi_1(s) + \delta a_1) \odot m^*}_{\text{preservation}} + \underbrace{(\pi_2(s) + \delta a_2) \odot (1 - m^*)}_{\text{new task}}$$ where $\odot$ is element-wise multiplication. Both residuals are trained jointly via RL with a composite reward $r = r_1 + r_2$, where $r_1$ penalizes object state deviation and $r_2$ rewards downstream task success.
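
The release test in phase 1 can be sketched as a brute-force search over masks, tried in order of increasing size so that the first mask to pass has minimal $\|m\|_0$. This is a minimal sketch under stated assumptions, not the authors' implementation: the mask is taken per finger rather than per DoF (5 fingers give only 32 candidates), $m_i = 1$ means "kept by $\pi_1$" as in the composite action, and `minimal_retention_mask`, `preserves_object_state`, and `toy_release_test` are hypothetical names, with the toy test standing in for a simulator rollout.

```python
from itertools import combinations
from typing import Callable, Optional, Tuple

Mask = Tuple[int, ...]


def minimal_retention_mask(
    num_fingers: int,
    preserves_object_state: Callable[[Mask], bool],
) -> Optional[Mask]:
    """Return the smallest mask (1 = finger kept by pi_1) that passes the release test.

    Masks are tried in order of increasing number of kept fingers, so the first
    mask that passes has minimal L0 norm. Returns None if no mask passes.
    """
    for k in range(num_fingers + 1):
        for kept in combinations(range(num_fingers), k):
            mask = tuple(1 if i in kept else 0 for i in range(num_fingers))
            if preserves_object_state(mask):
                return mask
    return None


# Toy stand-in for the simulator rollout (an assumption for illustration):
# the grasp holds only if finger 0 and finger 1 are both kept.
def toy_release_test(mask: Mask) -> bool:
    return mask[0] == 1 and mask[1] == 1


m_star = minimal_retention_mask(5, toy_release_test)
print(m_star)  # (1, 1, 0, 0, 0)
```

Exhaustive search is exponential in the mask length, so a per-DoF mask on a high-DoF hand would likely need a greedy or grouped search instead.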

Implementation Details

Policies are based on Proximal Policy Optimization (PPO) with a Gaussian action distribution.
Stabilizer and adaptor use 2-layer MLPs (64 hidden units) with tanh activations.
The bounded residual uses $\text{tanh}$ to keep outputs within $[-1, 1]$.
Training uses the Isaac Gym simulator with a Shadow Hand.
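
To illustrate why the tanh bound matters, the sketch below squashes arbitrary pre-activations and checks that each corrected action stays within 1 of the pretrained action. The function name `bounded_residual` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math


def bounded_residual(raw_outputs):
    """Squash unbounded network outputs element-wise with tanh."""
    return [math.tanh(x) for x in raw_outputs]


# Hypothetical pre-activation values, including extreme ones.
raw = [-50.0, -0.5, 0.0, 0.5, 50.0]
delta = bounded_residual(raw)

# Hypothetical base actions from a pretrained policy.
base_action = [0.2, -0.4, 0.0, 0.9, -0.1]
corrected = [b + d for b, d in zip(base_action, delta)]

# Each residual lies in [-1, 1], so each correction moves the action by at most 1.
max_shift = max(abs(c - b) for c, b in zip(corrected, base_action))
print(max_shift <= 1.0 + 1e-12)
```

Because the correction is bounded per dimension, a poorly trained residual can perturb the pretrained behavior only by a limited amount.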

Code Snippet (PyTorch-style)

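import torch
import torch.nn as nn


class DexComposeResidual(nn.Module):
    def __init__(self, state_dim, action_dim, finger_mask, pi1, pi2, context_dim=64):
        super().__init__()
        # Pretrained base policies, assumed to map a state to an action of size action_dim
        self.pi1, self.pi2 = pi1, pi2
        # Stabilizer: bounded residual that corrects pi1 on the retained fingers
        self.stabilizer = nn.Sequential(
            nn.Linear(state_dim + context_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh()
        )
        # Adaptor: bounded residual that corrects pi2 on the released fingers
        self.adaptor = nn.Sequential(
            nn.Linear(state_dim + context_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh()
        )
        # Binary mask m* over action dimensions (1 = owned by pi1);
        # registered as a buffer so it follows the module across devices
        self.register_buffer('mask', finger_mask.float())

    def forward(self, state, z1, z2):
        # z1, z2 are context embeddings (e.g., from task encoders)
        delta1 = self.stabilizer(torch.cat([state, z1], dim=-1))
        delta2 = self.adaptor(torch.cat([state, z2], dim=-1))
        # Base policies are treated as frozen here: no gradients flow into them
        with torch.no_grad():
            base1, base2 = self.pi1(state), self.pi2(state)
        # Composite action: each finger is driven by exactly one policy
        a1 = (base1 + delta1) * self.mask
        a2 = (base2 + delta2) * (1 - self.mask)
        return a1 + a2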

Results

Evaluated on 16 composite tasks (every pairing of 4 object-retention skills with 4 interaction skills), DexCompose reaches an average success rate of 77.4%, outperforming naive chaining (19.2%) and fine-tuning baselines (39.1%). Ablations confirm that both the finger mask and the dual residuals are crucial.

Key Equations

Composite action: $$a = (\pi_1(s) + \delta a_1) \odot m^* + (\pi_2(s) + \delta a_2) \odot (1 - m^*)$$
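
As a worked instance of the composite action, the sketch below evaluates the formula on a 4-dimensional toy action and shows that each dimension is driven by exactly one policy. All numeric values and the name `compose_action` are illustrative assumptions.

```python
def compose_action(a1, a2, delta1, delta2, mask):
    """Element-wise a = (a1 + delta1) * m + (a2 + delta2) * (1 - m)."""
    return [
        (x1 + d1) * m + (x2 + d2) * (1 - m)
        for x1, x2, d1, d2, m in zip(a1, a2, delta1, delta2, mask)
    ]


mask = [1, 1, 0, 0]              # first two dimensions are owned by pi_1
a1 = [0.5, -0.25, 1.0, 1.0]      # hypothetical pi_1(s)
a2 = [-0.75, -0.75, 0.25, 0.5]   # hypothetical pi_2(s)
delta1 = [0.25, 0.0, 0.5, 0.5]   # stabilizer residual
delta2 = [0.5, 0.5, -0.25, 0.25] # adaptor residual

a = compose_action(a1, a2, delta1, delta2, mask)
print(a)  # [0.75, -0.25, 0.0, 0.75]

# Changing the adaptor residual on pi_1's dimensions has no effect on the output.
a_alt = compose_action(a1, a2, delta1, [9.0, 9.0, -0.25, 0.25], mask)
print(a_alt == a)  # True
```

The second call illustrates the ownership property: whatever the adaptor outputs on the retained fingers is multiplied by zero, so it cannot disturb the first skill through those dimensions.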

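Finger mask optimization: $$m^* = \arg\min_{m \in \mathcal{M}} \|m\|_0 \quad \text{s.t.} \quad \|\Phi(s_{\text{release}}) - \Phi(s^*)\| < \epsilon$$ where $\mathcal{M}$ is the set of candidate masks, $\Phi$ is the object-state encoder, $s_{\text{release}}$ is the state reached after releasing the fingers not retained by $m$, and $\epsilon$ is the allowed deviation.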
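
The constraint in this optimization reduces to a distance check between encoded object states. The sketch below assumes a Euclidean norm and takes $\Phi$ to be the identity on a 3D object position; the tolerance value and the name `preserves_object_state` are likewise hypothetical.

```python
import math


def preserves_object_state(phi_release, phi_target, eps):
    """Return True if ||Phi(s_release) - Phi(s*)||_2 < eps."""
    return math.dist(phi_release, phi_target) < eps


phi_target = [0.0, 0.0, 0.3]     # encoded object state after pi_1 succeeds
phi_held = [0.0, 0.01, 0.3]      # object barely moved after the release
phi_dropped = [0.0, 0.0, 0.0]    # object fell after the release

print(preserves_object_state(phi_held, phi_target, eps=0.05))     # True
print(preserves_object_state(phi_dropped, phi_target, eps=0.05))  # False
```

A predicate of this shape is what a mask search would call once per candidate mask, after simulating the release.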

Conclusion

DexCompose provides a principled way to compose dexterous policies by assigning explicit finger-level ownership, enabling effective multi-task manipulation with a single hand.
