One-Shot 3D Object-to-Object Affordance Grounding with Semantic Feature Field for Generalizable Robotic Manipulation

Author:
Tian, Tongxuan, Computer Science - School of Engineering and Applied Science, University of Virginia
Advisor:
Kuo, Yen-Ling, EN-Comp Science Dept, University of Virginia
Abstract:

Affordance grounding—the identification of functional properties that indicate how objects can be manipulated—is fundamental to embodied intelligence and robotic manipulation. While previous research has made significant progress in single-object affordance prediction, it has largely overlooked the critical reality that most real-world tasks involve interactions between multiple objects. This thesis addresses the challenge of object-to-object (O2O) affordance grounding in 3D space under limited data constraints.

We introduce O³Afford, a novel one-shot learning framework for object-to-object affordance grounding that leverages 3D semantic feature fields distilled from vision foundation models (VFMs). Our key insight is that combining the rich semantic understanding of VFMs with the geometric information captured in 3D point clouds enables effective generalization to unseen objects with minimal supervision. The framework projects multi-view VFM features onto the point clouds of the interacting objects, creating semantically enriched representations that capture the part-level awareness critical to affordance prediction.
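To make the feature-projection step concrete, the following is a minimal sketch (not the thesis implementation) of fusing multi-view 2D VFM features into a per-point semantic feature field. It assumes a pinhole camera model with known intrinsics and world-to-camera extrinsics for each view, and per-view feature maps (e.g., DINO-style patch features upsampled to image resolution); all function names are illustrative.

```python
# Sketch: fuse multi-view VFM feature maps into per-point features.
# Assumptions: known intrinsics K (3x3) and world-to-camera extrinsics T (4x4)
# per view; feature maps of shape (C, H, W). Occlusion (depth) testing against
# a rendered depth map is omitted for brevity.
import torch
import torch.nn.functional as F

def project_points(points, K, T_world2cam):
    """Project Nx3 world points into pixel coordinates and depth for one view."""
    N = points.shape[0]
    homo = torch.cat([points, torch.ones(N, 1)], dim=1)          # (N, 4)
    cam = (T_world2cam @ homo.T).T[:, :3]                        # (N, 3) camera frame
    uv = (K @ cam.T).T                                           # (N, 3)
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                  # pixel coordinates
    return uv, cam[:, 2]                                         # (N, 2), (N,) depth

def fuse_semantic_field(points, feat_maps, Ks, extrinsics, img_hw):
    """Average VFM features over all views in which each point projects inside the image."""
    H, W = img_hw
    C = feat_maps[0].shape[0]
    acc = torch.zeros(points.shape[0], C)
    cnt = torch.zeros(points.shape[0], 1)
    for feats, K, T in zip(feat_maps, Ks, extrinsics):            # feats: (C, H, W)
        uv, depth = project_points(points, K, T)
        vis = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
              & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        # Normalize pixel coordinates to [-1, 1] and bilinearly sample features.
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
        sampled = F.grid_sample(feats[None], grid[None, :, None, :],
                                align_corners=True)[0, :, :, 0].T  # (N, C)
        acc[vis] += sampled[vis]
        cnt[vis] += 1
    return acc / cnt.clamp(min=1)                                  # (N, C) per-point features
```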

At the core of our approach is a transformer-based affordance decoder that explicitly models the geometric relationships and semantic features of interacting objects, accounting for how each object's geometry shapes potential interaction regions on the other. This design captures the geometric context of object-to-object affordances while remaining aware of the distinct functional roles objects play in interactions such as pouring, cutting, and plugging.
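The abstract does not specify the decoder's exact architecture; the sketch below is an illustrative cross-attention design in the spirit described above, in which the per-point features of one object attend to the other object's features before per-point affordance scores are predicted. Layer sizes, the feature dimension, and the class name are assumptions.

```python
# Illustrative sketch only: a cross-attention affordance decoder in which
# object A's per-point features attend to object B's features. Dimensions
# (e.g., feat_dim=384, as for a small ViT backbone) are assumptions, not the
# thesis architecture.
import torch
import torch.nn as nn

class O2OAffordanceDecoder(nn.Module):
    def __init__(self, feat_dim=384, hidden=256, heads=4, layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 3, hidden)   # fuse semantic feature + xyz
        self.blocks = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=hidden, nhead=heads,
                                       batch_first=True)
            for _ in range(layers)
        ])
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, pts_a, feat_a, pts_b, feat_b):
        # pts_*: (B, N, 3) point clouds; feat_*: (B, N, C) per-point VFM features.
        src = self.proj(torch.cat([feat_a, pts_a], dim=-1))   # queries: object A
        mem = self.proj(torch.cat([feat_b, pts_b], dim=-1))   # keys/values: object B
        for blk in self.blocks:
            src = blk(src, mem)                               # A attends to B
        return torch.sigmoid(self.head(src)).squeeze(-1)      # (B, N) affordance on A
```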

We further integrate our affordance representations with large language models (LLMs) to enhance fine-grained spatial understanding in downstream tasks. Experimental evaluations demonstrate that O³Afford significantly outperforms existing methods in both affordance prediction accuracy and generalization to unseen object instances, partial observations, and novel categories. Experiments in both simulation and real-world environments further validate that our approach enables more effective manipulation planning for complex interactive tasks.
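The abstract does not detail how affordances are exposed to the language model. Purely as an illustration, one simple option is to serialize the predicted high-affordance region (its centroid and extent) into the planner's text prompt; the prompt format, threshold, and helper names below are hypothetical, not the method described in the thesis.

```python
# Hypothetical illustration only: summarizing a predicted affordance region
# for an LLM-based planner. Threshold and prompt wording are assumptions.
import numpy as np

def affordance_summary(points, scores, threshold=0.5):
    """Summarize the high-affordance region as a centroid and extent (meters)."""
    mask = scores > threshold
    region = points[mask] if mask.any() else points
    centroid = region.mean(axis=0)
    extent = region.max(axis=0) - region.min(axis=0)
    return centroid, extent

def build_planner_prompt(task, centroid, extent):
    return (
        f"Task: {task}\n"
        f"Grounded affordance region on the target object:\n"
        f"  centroid (x, y, z) = {np.round(centroid, 3).tolist()} m\n"
        f"  extent (dx, dy, dz) = {np.round(extent, 3).tolist()} m\n"
        "Propose a manipulation plan that interacts with this region."
    )
```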

This work bridges a critical gap in affordance learning by enabling robots to understand not just how humans interact with individual objects, but how objects functionally interact with each other—a fundamental capability for advanced robotic manipulation in everyday environments.

Degree:
MS (Master of Science)
Keywords:
Affordance Grounding, Vision Foundation Models, Robotic Manipulation
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2025/04/16