Enabling Human-Robot Collaboration Through Representation Learning

Author:
Yasar, Mohammad, Computer Engineering - School of Engineering and Applied Science, University of Virginia
Advisor:
Iqbal, Tariq, EN-SIE, University of Virginia
Abstract:

Robots are transitioning from working in isolated work cells to close-proximity collaboration with humans as part of human-robot teams. In this transition, robots are not just tools but active collaborators that must integrate seamlessly into human workflows. In such collaborative settings, the success of these teams hinges on the robots' ability to effectively model and understand both human-human and human-robot team dynamics. These capabilities are crucial for anticipating human intent, making informed and timely decisions, and taking appropriate actions within a constantly changing environment.

The core challenge addressed in this research is the representation learning gap that currently limits robots' ability to fully anticipate human intentions and respond with closed-loop actions. This gap also impedes their ability to adapt to non-stationary conditions—a critical requirement for real-world applications where environments and tasks can change unpredictably. To overcome these challenges, robots must not only possess advanced perception capabilities but also the ability to retain and build upon past experiences without suffering from catastrophic forgetting.

This research is structured around three main pillars: modeling humans, modeling robot decision-making, and modeling joint human-robot interaction. Each pillar addresses a fundamental aspect of the robot's role in a collaborative setting.

One of the primary challenges in human-robot interaction is the accurate modeling and prediction of human motion and intent. Human behavior is inherently complex, characterized by variability, adaptability, and context-dependent actions that are difficult to capture with traditional models. Moreover, existing perception models, often trained solely on datasets featuring human-only interactions, struggle to generalize to mixed-team environments where humans and robots interact. To overcome these limitations, our research focused on developing advanced architectural frameworks that enhance the robot's ability to perceive and predict human actions.

We have developed several architectures that improve the interpretability of motion prediction models and incorporate multimodal and interactional context when predicting human motion. We have also addressed optimization challenges that arise when training such generative models of motion. Finally, we have introduced PoseTron, a novel transformer-based framework that combines sequence learning and generative modeling techniques to predict human motion. PoseTron uses specialized attention mechanisms that efficiently weigh motion information from all agents in a scene, integrating this data into a robust representation of team dynamics. This framework significantly improves the prediction of human motion in both single-agent and multi-agent settings, allowing robots to better understand and anticipate the actions of human partners in collaborative environments.

For robot decision-making and control, the challenge lies in enabling robots to operate effectively in dynamic and uncertain environments. Traditional approaches to robot control often involve a clear separation between planning and execution: robots generate a plan based on their current knowledge of the environment and then execute it. However, in real-world human-robot interaction scenarios, this separation can be problematic. Environmental conditions and human behaviors can change rapidly, rendering pre-established plans obsolete before they can be executed.

To address this, we developed control mechanisms that integrate planning and execution, allowing robots to continuously update their plans in real time as they gather new information from their surroundings. This approach ensures that robots remain flexible and adaptive, capable of handling the unpredictability of human behavior. A key contribution in this area is the development of the LASSO algorithm, which decouples representation learning from policy learning. This modular architecture allows robots to learn robust representations of environmental dynamics that inform their actions, enabling them to navigate and operate effectively across a wide range of tasks and situations. By focusing on the representations needed to forecast future states, LASSO handles both known and novel environments flexibly and efficiently, ensuring robots can maintain high performance even in unfamiliar settings.
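A minimal sketch of this decoupling, assuming a PyTorch-style setup, is shown below; the module names (StateEncoder, LatentForecaster, PolicyHead), network sizes, and forecasting loss are illustrative assumptions, not the dissertation's actual LASSO implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StateEncoder(nn.Module):
    """Maps raw observations to a latent representation of environment dynamics."""
    def __init__(self, obs_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class LatentForecaster(nn.Module):
    """Predicts the next latent state from the current latent state and action."""
    def __init__(self, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

class PolicyHead(nn.Module):
    """Maps the learned latent state to an action; trained separately."""
    def __init__(self, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))
    def forward(self, z):
        return self.net(z)

def representation_loss(encoder, forecaster, obs, action, next_obs):
    """Stage 1: train the representation to forecast future states,
    independently of any task reward (decoupled from policy learning)."""
    z = encoder(obs)
    z_next = encoder(next_obs).detach()  # stop-gradient on the target latent
    return F.mse_loss(forecaster(z, action), z_next)

# Stage 2 would train PolicyHead on top of the (frozen) encoder, so the same
# forecasting representation can be reused across tasks and novel environments.

The design choice this sketch illustrates is that the encoder's training signal comes entirely from predicting future states, so the learned representation is not tied to any single task or reward.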

The ultimate goal of human-robot interaction is to achieve seamless and effective collaboration between humans and robots, where both parties work together fluidly, influencing and responding to each other’s actions in real time. Traditional models have often assumed a unidirectional flow of information, where robots passively respond to human actions. However, real-world scenarios demand a more dynamic, bidirectional interaction model, where robots not only react to but also anticipate and influence human behavior.

To address the identified gaps, this dissertation makes several novel contributions aimed at enabling close-proximity collaboration between humans and robots.

INTERACT Dataset: One of the primary contributions is the introduction of the INTERACT dataset. Unlike traditional datasets that focus solely on human interactions, INTERACT captures the dynamic interplay between humans and robots, providing a richer, more relevant source of data for training perception models. This inclusion is critical for developing robust models that can generalize to real-world collaborative environments. The dataset encompasses a variety of scenarios, ranging from three humans collaborating to three humans and a robot collaborating on long-horizon navigation and manipulation tasks, ensuring that models trained on it can adapt to different collaborative scenarios.

PoseTron Framework: The second contribution is the development of PoseTron, a novel framework for human motion prediction. PoseTron addresses the challenge of human motion prediction in collaborative environments by leveraging advanced sequence learning and generative modeling techniques. Traditional models often fail to capture the variability in human behavior, offering limited predictive accuracy. PoseTron addresses this gap with a novel transformer-based architecture. The encoder introduces a conditional attention mechanism that efficiently weighs motion information from all agents to incorporate team dynamics. The decoder features a novel multimodal attention mechanism that weighs representations from different modalities together with the encoder outputs to predict future motion.
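As a rough illustration of attention that weighs motion information across all agents in a scene, the PyTorch sketch below shows one way such a mechanism could be structured; the pose dimensions, layer sizes, and the MultiAgentMotionEncoder name are assumptions for exposition, not the published PoseTron architecture.

import torch
import torch.nn as nn

class MultiAgentMotionEncoder(nn.Module):
    """Attends over all agents at each timestep so that each agent's
    representation is weighted by the motion of its teammates."""
    def __init__(self, pose_dim, d_model=128, n_heads=4):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.agent_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
    def forward(self, poses):
        # poses: (batch * time, n_agents, pose_dim), flattened per timestep
        x = self.pose_proj(poses)
        attn_out, _ = self.agent_attn(x, x, x)  # each agent attends to every agent
        return self.norm(x + attn_out)          # residual connection

# Example: 2 clips x 10 timesteps, a 4-agent team, 17 joints x 3 coordinates.
poses = torch.randn(2 * 10, 4, 51)
team_repr = MultiAgentMotionEncoder(pose_dim=51)(poses)  # (20, 4, 128)

In this sketch the attention runs over the agent axis rather than the time axis, which is what lets each agent's representation absorb team dynamics before any temporal decoding takes place.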

CollabPolicy Benchmark: The final contribution is the introduction of CollabPolicy, a multi-agent policy learning benchmark designed to facilitate the development of collaborative strategies among agents. This benchmark provides a testing ground for evaluating language-conditioned imitation learning algorithms as well as foundation models on critical challenges such as spatial reasoning, object localization, physical coordination, and collaborative decision-making and policy learning. Through this framework, we aim to advance the capabilities of policy learning models in tackling complex, real-world multi-agent scenarios.

The contributions and key findings of this work have the potential to significantly advance the field of Human-Robot Interaction (HRI) through their focus on addressing the critical challenges of modeling human motion, enhancing robotic control, and facilitating joint human-robot collaboration. Through the development of frameworks such as the PoseTron and IMPRINT architectures, this research has improved the accuracy and robustness of human motion prediction, enabling robots to better understand and anticipate human actions in both single-agent and multi-agent settings. The introduction of the LASSO algorithm enables the seamless integration of planning and execution in dynamic environments, enhancing the robot's adaptability and autonomy. Furthermore, the INTERACT dataset and the CollabPolicy benchmark provide valuable resources for the research community, enabling the development and validation of models that capture the complexities of real-world human-robot interactions. Collectively, these contributions not only address existing gaps in HRI but also pave the way for more effective, efficient, and intuitive collaboration between humans and robots in diverse and unpredictable environments.

Degree:
PHD (Doctor of Philosophy)
Keywords:
Human-Robot Interaction, Representation Learning, Multimodal Learning
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2024/09/12