GROW$^2$: Grounding Which and Where for Robot Tool Use
By Yuhong Deng, Yuyao Liu, David Hsu
"Hierarchical grounding using object parts as abstraction, combining VLM commonsense reasoning with vision foundation models for zero-shot open-vocabulary tool affordance prediction."
Abstract
Can the robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of $\textit{open-world affordance grounding}$: select an open-category object to act as a tool and localize its specific region of action. To this end, we introduce GROW$^2$ (GROunding Which and Where), which leverages object parts as a natural abstraction to split the grounding process hierarchically into semantic and geometric levels, thus bypassing the need for data-heavy, end-to-end training. Semantically, GROW$^2$ harnesses the commonsense reasoning of Vision-Language Models (VLMs) to parse a natural-language task instruction, select a suitable object as the tool, and identify task-relevant parts on the tool and the target object. Geometrically, vision foundation models then ground the selected parts into precise 3D regions from a single RGB-D image. Experiments on established benchmarks show that GROW$^2$ outperforms state-of-the-art baselines on affordance prediction benchmarks. Further, it achieves zero-shot generalization over open-category objects and outperforms baselines in both simulated and real-world robot tool use experiments.
Technical Analysis & Implementation
Technical Breakdown§
Core Methodology§
GROW$^2$ decomposes open-world affordance grounding into two levels: semantic and geometric. At the semantic level, a Vision-Language Model (VLM, e.g., GPT-4V) interprets a natural-language task instruction (e.g., "cut the cake"), selects a suitable tool object (e.g., a plate), and identifies the task-relevant parts on both the tool (e.g., the plate's edge) and the target object (e.g., the top of the cake). The part-identification step can be written as:
$$\text{Part}_{\text{tool}}, \text{Part}_{\text{target}} = \text{VLM}_{\text{afford}}(I_{\text{scene}}, T_{\text{task}})$$
where $I_{\text{scene}}$ is the scene image and $T_{\text{task}}$ is the task instruction. The VLM outputs part names (e.g., "edge", "top"), which the geometric level then localizes in 3D.
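The semantic step can be sketched as prompt construction plus structured-output parsing. The sketch below is a minimal illustration, not the authors' implementation: the prompt wording, the JSON schema, and the names `PartQuery`, `build_prompt`, and `parse_vlm_reply` are all assumptions, and the actual VLM call (which would also receive the scene image) is left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PartQuery:
    """Structured output of the semantic level (hypothetical schema)."""
    tool: str
    tool_part: str
    target: str
    target_part: str


def build_prompt(task: str) -> str:
    """Compose the text half of the VLM query; the scene image is sent alongside."""
    return (
        f"Task: {task}\n"
        "From the objects visible in the image, pick one to use as a tool, and "
        "name the part of the tool and the part of the target object involved.\n"
        'Reply with JSON only, using the keys "tool", "tool_part", '
        '"target", "target_part".'
    )


def parse_vlm_reply(reply: str) -> PartQuery:
    """Extract the JSON object from a VLM reply and validate its keys."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in VLM reply")
    data = json.loads(reply[start : end + 1])
    keys = ("tool", "tool_part", "target", "target_part")
    missing = [k for k in keys if k not in data]
    if missing:
        raise ValueError(f"VLM reply is missing keys: {missing}")
    # Normalize names so later stages can use them as text prompts.
    return PartQuery(**{k: str(data[k]).strip().lower() for k in keys})


# A hand-written reply stands in for a real VLM response.
reply = 'Sure. {"tool": "plate", "tool_part": "edge", "target": "cake", "target_part": "top"}'
query = parse_vlm_reply(reply)
print(query)
```

Keeping the output as named fields rather than free text means the geometric level only has to handle short object and part names.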
At the geometric level, vision foundation models (e.g., SAM or DINOv2) ground the selected parts in a single RGB-D image. The parts are first localized in the image, for example via cross-attention or region proposals, and the resulting 2D regions are then lifted to 3D using the depth channel. This yields a tool affordance region $\mathcal{R}_{\text{tool}}$ and a target affordance region $\mathcal{R}_{\text{target}}$.
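Lifting a 2D part mask to a 3D region is commonly done by pinhole back-projection. The following is a minimal sketch, assuming a binary mask aligned with the depth image, depth in meters, and known camera intrinsics `fx`, `fy`, `cx`, `cy`. The function name and the plain-Python data layout are illustrative, and the text above does not state whether GROW$^2$ uses exactly this lifting.

```python
def mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels of a depth image into camera-frame 3D points.

    mask  : 2D list of bools (True = pixel belongs to the part)
    depth : 2D list of floats, metric depth along the optical axis (0 = invalid)
    Returns a list of (x, y, z) tuples.
    """
    points = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if not inside or z <= 0:
                continue  # skip pixels outside the part or without valid depth
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x3 example: two masked pixels, one of which has no valid depth.
mask = [[False, True, False],
        [False, True, False]]
depth = [[1.0, 2.0, 1.0],
         [1.0, 0.0, 1.0]]
region = mask_to_points(mask, depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
print(region)
```

The loop is written out for readability; with real images the same arithmetic would normally be vectorized over arrays.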
Implementation Details§
- Semantic grounding prompts a pretrained, off-the-shelf VLM (e.g., in-context learning with GPT-4V) to reason about part-object relationships; no fine-tuning is involved.
- Geometric grounding applies SAM to generate part-level masks, then lifts them to 3D using the depth data.
- Training-free: no end-to-end affordance training; only off-the-shelf models are used.
Code Snippet (Python-style pseudocode)§
    def grow2_grounding(scene_rgb, depth, task_instruction):
        # NOTE: vlm_predict_part, sam_segment, cross_attention, and
        # depth_to_pointcloud are hypothetical placeholders, not a real API.
        # Semantic level: the VLM picks the tool and names the task-relevant parts.
        tool_name, target_name, part_tool, part_target = vlm_predict_part(
            scene_rgb, task_instruction
        )
        # Geometric level: segment the tool and the target object with SAM.
        masks_tool = sam_segment(scene_rgb, tool_name)
        masks_target = sam_segment(scene_rgb, target_name)
        # Localize the named part within each object's masks.
        region_tool = cross_attention(masks_tool, part_tool)
        region_target = cross_attention(masks_target, part_target)
        # Lift the 2D part regions to 3D point clouds using the depth image.
        pcd_tool = depth_to_pointcloud(depth, region_tool)
        pcd_target = depth_to_pointcloud(depth, region_target)
        return pcd_tool, pcd_target
Key Results§
- Outperforms baselines (e.g., AffordanceNet, Where2Act) on affordance prediction benchmarks.
- Zero-shot generalization to unseen objects in both simulation (MetaWorld, RLBench) and real-world robot experiments.
Why It's Important§
GROW$^2$ avoids expensive task-specific training by composing existing foundation models hierarchically, which enables open-vocabulary tool use.