agents · Published: June 29, 2026

GROW$^2$: Grounding Which and Where for Robot Tool Use

By Yuhong Deng, Yuyao Liu, David Hsu

Research TL;DR

"Hierarchical grounding using object parts as abstraction, combining VLM commonsense reasoning with vision foundation models for zero-shot open-vocabulary tool affordance prediction."

Abstract

Can the robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of $\textit{open-world affordance grounding}$: select an open-category object to act as a tool and localize its specific region of action. To this end, we introduce GROW$^2$ (GROunding Which and Where), which leverages object parts as a natural abstraction to split the grounding process hierarchically into semantic and geometric levels, thus bypassing the need for data-heavy, end-to-end training. Semantically, GROW$^2$ harnesses the commonsense reasoning of Vision-Language Models (VLMs) to parse a natural-language task instruction, select a suitable object as the tool, and identify task-relevant parts on the tool and the target object. Geometrically, vision foundation models then ground the selected parts into precise 3D regions from a single RGB-D image. Experiments on established benchmarks show that GROW$^2$ outperforms state-of-the-art baselines on affordance prediction benchmarks. Further, it achieves zero-shot generalization over open-category objects and outperforms baselines in both simulated and real-world robot tool use experiments.

Technical Analysis & Implementation

Technical Breakdown

Core Methodology

GROW$^2$ decomposes open-world affordance grounding into two levels: semantic and geometric. At the semantic level, a Vision-Language Model (VLM, e.g., GPT-4V) interprets a natural-language task instruction (e.g., "cut the cake"), selects a suitable tool object (e.g., a plate), and identifies the task-relevant parts on both the tool (e.g., the plate's edge) and the target object (e.g., the top of the cake). The part-identification step can be written as:

$$\text{Part}_{\text{tool}}, \text{Part}_{\text{target}} = \text{VLM}_{\text{afford}}(I_{\text{scene}}, T_{\text{task}})$$

where $I_{\text{scene}}$ is the scene image and $T_{\text{task}}$ the task instruction. The VLM outputs part names (e.g., "edge", "top") that are grounded geometrically.
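
As a rough illustration of the semantic level, the VLM's answer can be requested as JSON and parsed into a small record. The sketch below assumes a hypothetical reply schema with the keys `tool`, `tool_part`, `target`, and `target_part`; neither the schema nor the `parse_vlm_reply` helper comes from the paper, and the actual VLM call is omitted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticGrounding:
    """Names produced at the semantic level (hypothetical container)."""
    tool: str
    tool_part: str
    target: str
    target_part: str


# Keys of the assumed reply schema; not taken from the paper.
REQUIRED_KEYS = ("tool", "tool_part", "target", "target_part")


def parse_vlm_reply(reply: str) -> SemanticGrounding:
    """Parse a JSON reply from a VLM into a SemanticGrounding."""
    # VLMs often wrap JSON in extra prose, so keep only the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in VLM reply")
    data = json.loads(reply[start:end + 1])
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"VLM reply is missing keys: {missing}")
    # Normalize names so later lookups are case-insensitive.
    return SemanticGrounding(**{k: str(data[k]).strip().lower() for k in REQUIRED_KEYS})


# Example reply for the instruction "cut the cake" when no knife is in the scene.
reply = 'Answer: {"tool": "plate", "tool_part": "edge", "target": "cake", "target_part": "top"}'
grounding = parse_vlm_reply(reply)
print(grounding.tool, grounding.tool_part, grounding.target, grounding.target_part)
```

Validating the reply before it reaches the geometric stage keeps a malformed VLM answer from silently turning into an empty or wrong segmentation query.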

At the geometric level, a vision foundation model (e.g., SAM or DINOv2) grounds each selected part name in a single RGB-D image. The named parts are localized in the image, for example via cross-attention or region proposals, to obtain 2D part masks, and the depth channel then lifts each mask to a 3D point cloud. The result is a tool affordance region $\mathcal{R}_{\text{tool}}$ and a target affordance region $\mathcal{R}_{\text{target}}$.
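
To make the lift from a 2D part mask to a 3D region concrete, here is a minimal NumPy sketch of standard pinhole back-projection, $x = (u - c_x) z / f_x$ and $y = (v - c_y) z / f_y$. The intrinsics `fx`, `fy`, `cx`, `cy`, the function name `mask_to_points`, and the convention that depth is in metres with 0 marking invalid pixels are illustrative assumptions, not details from the paper.

```python
import numpy as np


def mask_to_points(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to an (N, 3) array in the camera frame."""
    depth = np.asarray(depth, dtype=np.float64)
    # Keep only pixels inside the mask that have a valid (positive) depth.
    valid = np.asarray(mask, dtype=bool) & (depth > 0)
    v, u = np.nonzero(valid)  # row (v) and column (u) indices
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy 4x4 depth image at 1 m, with a 2x2 part mask in the centre.
depth = np.ones((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = mask_to_points(depth, mask, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points.shape)
```

The returned points are expressed in the camera frame; a real robot pipeline would still need a camera-to-base transform before planning a motion.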

Implementation Details

  • Semantic grounding uses a pretrained VLM (e.g., GPT-4V) prompted with the scene image and task instruction, optionally with in-context examples of part-object relationships.
  • Geometric grounding applies SAM to generate part-level masks, then lifts to 3D using depth data.
  • Training-free: no end-to-end affordance training; only off-the-shelf models are used.

Code Snippet (Python-style pseudocode)

def grow2_grounding(scene_rgb, depth, task_instruction):
    """Return 3D point clouds for the tool part and the target part.

    All helper functions are placeholders for the models described above.
    """
    # Semantic level: the VLM names the tool, the target, and the relevant part of each.
    tool_name, target_name, part_tool, part_target = vlm_predict_part(scene_rgb, task_instruction)
    # Geometric level: segment each object by name (e.g., with SAM).
    masks_tool = sam_segment(scene_rgb, tool_name)
    masks_target = sam_segment(scene_rgb, target_name)
    # Localize the named part within each object's masks.
    region_tool = cross_attention(masks_tool, part_tool)
    region_target = cross_attention(masks_target, part_target)
    # Lift the 2D part regions to 3D using the depth image.
    pcd_tool = depth_to_pointcloud(depth, region_tool)
    pcd_target = depth_to_pointcloud(depth, region_target)
    return pcd_tool, pcd_target

Key Results

  • Outperforms baselines (e.g., AffordanceNet, Where2Act) on affordance prediction benchmarks.
  • Zero-shot generalization to unseen objects in both simulation (MetaWorld, RLBench) and real-world robot experiments.

Why It's Important

GROW$^2$ avoids data-heavy, task-specific training by composing existing foundation models hierarchically, which enables open-vocabulary tool use.
