GROW$^2$: Grounding Which and Where for Robot Tool Use

Abstract

Can a robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of $\textit{open-world affordance grounding}$: selecting an open-category object to act as a tool and localizing its specific region of action. To this end, we introduce GROW$^2$ (GROunding Which and Where), which leverages object parts as a natural abstraction to split the grounding process hierarchically into semantic and geometric levels, thus bypassing the need for data-heavy, end-to-end training. Semantically, GROW$^2$ harnesses the commonsense reasoning of Vision-Language Models (VLMs) to parse a natural-language task instruction, select a suitable object as the tool, and identify task-relevant parts on the tool and the target object. Geometrically, vision foundation models then ground the selected parts into precise 3D regions from a single RGB-D image. Experiments show that GROW$^2$ outperforms state-of-the-art baselines on established affordance prediction benchmarks. Further, it achieves zero-shot generalization over open-category objects and outperforms baselines in both simulated and real-world robot tool use experiments.

Technical Analysis & Implementation

Technical Breakdown

Core Methodology

GROW$^2$ decomposes open-world affordance grounding into two levels: semantic and geometric. At the semantic level, a Vision-Language Model (VLM, e.g., GPT-4V) interprets a natural-language task (e.g., "cut the cake"), selects a suitable tool object (e.g., a plate), and identifies task-relevant parts on both the tool (e.g., the plate's edge) and the target object (e.g., the top of the cake). This is formulated as:

$$\text{Part}_{\text{tool}}, \text{Part}_{\text{target}} = \text{VLM}_{\text{afford}}(I_{\text{scene}}, T_{\text{task}})$$

where $I_{\text{scene}}$ is the scene image and $T_{\text{task}}$ is the task instruction. The VLM outputs part names (e.g., "edge", "top"), which are then grounded geometrically; the choice of tool object is implicit in this notation.
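
One way to make the semantic step concrete is to ask the VLM for a structured answer and validate it before passing it on. The sketch below assumes a hypothetical JSON reply format with the keys `tool`, `tool_part`, `target`, and `target_part`. Neither that format nor the names `SemanticGrounding` and `parse_vlm_reply` come from the paper, and the VLM call itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticGrounding:
    """Output of the semantic level: which objects, and which part of each."""
    tool: str
    tool_part: str
    target: str
    target_part: str


# Hypothetical reply schema (an assumption); the actual prompt and
# output format used by GROW^2 are not described in this summary.
REQUIRED_KEYS = ("tool", "tool_part", "target", "target_part")


def parse_vlm_reply(reply_text: str) -> SemanticGrounding:
    """Parse and validate a JSON reply from a VLM (assumed format)."""
    data = json.loads(reply_text)
    if not isinstance(data, dict):
        raise ValueError("VLM reply must be a JSON object")
    # A field counts as missing if it is absent, not a string, or blank.
    missing = [
        key for key in REQUIRED_KEYS
        if not isinstance(data.get(key), str) or not data[key].strip()
    ]
    if missing:
        raise ValueError(f"VLM reply is missing fields: {missing}")
    # Normalize names so downstream segmentation prompts are consistent.
    return SemanticGrounding(**{key: data[key].strip().lower() for key in REQUIRED_KEYS})


reply = '{"tool": "Plate", "tool_part": "edge", "target": "cake", "target_part": "top"}'
grounding = parse_vlm_reply(reply)
print(grounding.tool, grounding.tool_part, grounding.target, grounding.target_part)
```

Validating the reply at this boundary keeps malformed or incomplete VLM output from silently propagating into the geometric stage.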

At the geometric level, vision foundation models (e.g., SAM or DINOv2) ground the selected parts from a single RGB-D image: the named parts are localized in the RGB image as segmentation masks (e.g., via cross-attention or region proposals), and the depth channel lifts those masks into precise 3D regions. This yields a tool affordance region $\mathcal{R}_{\text{tool}}$ and a target affordance region $\mathcal{R}_{\text{target}}$.
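
Lifting a 2D part mask to 3D is commonly done by back-projecting the masked depth pixels through a pinhole camera model. The following is a minimal sketch under that assumption. The intrinsics `fx`, `fy`, `cx`, `cy`, the function name `mask_to_points`, and the use of plain nested lists rather than an array library are illustrative choices, not details from the paper.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def mask_to_points(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project masked pixels with valid depth into camera-frame 3D points.

    depth[v][u] is the metric depth at row v, column u; values <= 0 are
    treated as missing measurements and skipped.
    """
    points: List[Point3D] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, selected) in enumerate(zip(depth_row, mask_row)):
            if not selected or z <= 0.0:
                continue
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x3 depth image and part mask (assumed values for illustration).
depth = [[1.0, 1.0, 0.0],
         [2.0, 2.0, 2.0]]
mask = [[False, True, True],
        [False, False, True]]
region = mask_to_points(depth, mask, fx=1.0, fy=1.0, cx=1.0, cy=0.0)
print(region)  # [(0.0, 0.0, 1.0), (2.0, 2.0, 2.0)]
```

The masked pixel with zero depth is dropped, so the resulting region contains only points with a valid measurement.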

Implementation Details

Semantic grounding uses a pretrained VLM, either prompted with in-context examples (e.g., GPT-4) or, alternatively, a CLIP-based model adapted to part-object relationships.
Geometric grounding applies SAM to generate part-level masks, then lifts them to 3D using the depth channel.
Training-free in the end-to-end sense: no affordance-specific end-to-end training is required, and the pipeline is built from off-the-shelf models.

Code Snippet (PyTorch-style pseudo code)

def grow2_grounding(scene_rgb, depth, task_instruction):
    """Pseudocode sketch of the two-level pipeline; every helper is a placeholder, not a real API."""
    # Semantic level: the VLM selects the tool, the target, and the relevant part on each
    tool_name, target_name, part_tool, part_target = vlm_predict_part(scene_rgb, task_instruction)
    # Geometric level: segment both objects in the RGB image
    masks_tool = sam_segment(scene_rgb, tool_name)
    masks_target = sam_segment(scene_rgb, target_name)
    # Localize the named part within each object's masks
    region_tool = cross_attention(masks_tool, part_tool)
    region_target = cross_attention(masks_target, part_target)
    # Lift the 2D part regions to 3D point clouds using the depth channel
    pcd_tool = depth_to_pointcloud(depth, region_tool)
    pcd_target = depth_to_pointcloud(depth, region_target)
    return pcd_tool, pcd_target

Key Results

Outperforms baselines (e.g., AffordanceNet, Where2Act) on affordance prediction benchmarks.
Zero-shot generalization to unseen objects in both simulation (MetaWorld, RLBench) and real-world robot experiments.

Why It's Important

Avoids expensive task-specific training by leveraging existing foundation models hierarchically, enabling open-vocabulary tool use.
