Prompt-to-Pose Grasp Estimation


Overview

This project presents a prompt-to-pose grasp planning pipeline that lets a robot interpret a natural-language instruction (e.g., “pick up the red mug”) and output 6-DoF grasp poses for the referenced object, even in cluttered scenes.
It combines grounded vision-language models with a 3D grasp estimation network, bridging perception and manipulation through learning-based inference.


System Architecture

The pipeline integrates multiple perception and learning modules, each playing a distinct role:

  1. Text-Conditioned Object Localization (Grounding DINO)
    • Accepts a user prompt describing the target object.
    • Uses Grounding DINO to detect and localize the object in 2D based on both image and text cues.
  2. Segmentation Refinement (SAM 2)
    • The detected bounding box is passed as a prompt to the Segment Anything Model 2 (SAM 2), which refines it into a precise segmentation mask.
    • Produces an accurate pixel-level mask that isolates the object from clutter.
  3. 3D Reconstruction and Grasp Estimation (Contact-GraspNet)
    • The segmented RGB-D data is converted into a point cloud.
    • NVIDIA Contact-GraspNet predicts a ranked set of 6-DoF grasp poses (position, orientation, and gripper width).
    • The top-scoring grasp is sent to the manipulator’s motion planner for execution.
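The geometry in step 3 (back-projecting the masked RGB-D pixels into a point cloud) can be sketched with NumPy under a standard pinhole camera model. The intrinsics `fx, fy, cx, cy` and the tiny synthetic depth image below are illustrative values, not calibration data from this project:

```python
import numpy as np

def masked_depth_to_pointcloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels (meters) into camera-frame 3D points.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    Only pixels inside the segmentation mask with valid depth (> 0) are kept,
    so clutter outside the object mask never enters the grasp network.
    """
    v, u = np.nonzero(mask & (depth > 0))   # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) object point cloud

# Illustrative scene: a 4x4 depth image with a 2x2 object mask at 0.5 m.
depth = np.full((4, 4), 1.0)
depth[1:3, 1:3] = 0.5
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = masked_depth_to_pointcloud(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(points.shape)  # (4, 3): one 3D point per masked pixel
```

The resulting (N, 3) array is the kind of segmented point cloud a grasp estimator such as Contact-GraspNet consumes.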
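Selecting the top-scoring grasp from the ranked candidate set can be sketched as below, with each candidate represented as a 4x4 homogeneous pose plus a confidence score and gripper width. The array shapes only mirror the spirit of Contact-GraspNet's outputs, and the candidate values are made up for illustration:

```python
import numpy as np

def select_best_grasp(poses, scores, widths):
    """Pick the highest-confidence grasp to hand to the motion planner.

    poses:  (N, 4, 4) homogeneous transforms in the camera frame.
    scores: (N,) per-grasp confidence scores.
    widths: (N,) gripper opening widths in meters.
    """
    best = int(np.argmax(scores))
    return poses[best], scores[best], widths[best]

# Illustrative candidates: three identity-orientation grasps at different depths.
poses = np.tile(np.eye(4), (3, 1, 1))
poses[:, 2, 3] = [0.40, 0.35, 0.50]   # translation along the camera z-axis
scores = np.array([0.62, 0.91, 0.48])
widths = np.array([0.06, 0.04, 0.08])
pose, score, width = select_best_grasp(poses, scores, widths)
print(score, width, pose[2, 3])  # 0.91 0.04 0.35
```

In practice the winning pose would still be transformed from the camera frame into the robot base frame before being sent to the motion planner.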

Technical Highlights


Key Learnings


Future Work


Media

Below is a demo of the prompt-to-pose grasp estimation pipeline in action:

Demonstrates end-to-end inference from natural-language prompt to 6-DoF grasp poses in a cluttered tabletop scene.
