Multimodal and Multitask Representation Learning for Perceiving Embodied Interactions

Islam, Md Mofijul, Systems Engineering - School of Engineering and Applied Science, University of Virginia
Iqbal, Tariq, EN-SIE, University of Virginia

Humans inherently use multimodal data, such as verbal utterances and nonverbal gestures, to interact with others in shared physical environments. For AI systems to interact seamlessly with humans, they must understand how people behave and interact, both verbally and nonverbally, in these environments. Most recent approaches to understanding human behavior and interactions rely on unimodal data, using only visual data or only verbal utterances. For example, the majority of existing works use only verbal utterances to comprehend embodied interactions and only visual data to perceive human behavior. However, relying on a single modality creates a single point of failure and cannot ensure robust perception of human actions or comprehension of human interactions. For example, visual occlusion or low-light conditions can prevent a model from accurately recognizing activities. Similarly, a verbal utterance alone is insufficient to identify an object if the visual scene contains two identical objects. Therefore, a robust model for understanding human interactions needs to incorporate multimodal information.

Developing these multimodal models for multiple tasks of perceiving human-embodied interactions requires addressing several fundamental challenges. One challenge is extracting salient representations from missing and noisy data modalities. Another is fusing, aligning, and extracting complementary representations from multiple heterogeneous modalities, which is difficult due to the disparate feature distributions and feedforward learning architectures. Similarly, learning salient representations from multiple verbal and visual perspectives needs to be addressed to effectively comprehend multimodal embodied interactions with verbal utterances and nonverbal gestures. In my Ph.D. research, I developed robust models to perceive human behavior and embodied interaction using multimodal data to address these challenges.

First, we have developed multimodal learning models to robustly perceive human actions using multimodal sensor data, such as visual, depth, skeleton, and physical sensor data. These models can extract salient and complementary representations from heterogeneous modalities. Moreover, our proposed models can prioritize the modalities and extract salient representations from missing and noisy sensor modalities, whereas prior models could not effectively extract salient representations from heterogeneous and noisy sensor data. Additionally, we have developed a novel cooperative multitask learning model that can help to extract complementary multimodal representations using auxiliary information. Our extensive experimental results suggest that our proposed multimodal learning models outperform state-of-the-art models in recognizing human actions.
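The idea of prioritizing modalities while tolerating missing ones can be illustrated with a minimal sketch. This is not the model from the thesis: the function name `fuse_modalities`, the per-modality relevance scores, and the plain softmax weighting are all assumptions chosen for illustration. In a trained model the scores would be learned rather than supplied by hand.

```python
import math


def fuse_modalities(features, scores):
    """Softmax-weighted fusion that skips missing modalities.

    features: dict mapping modality name -> list of floats, or None if missing.
    scores:   dict mapping modality name -> relevance score (hypothetical;
              a real model would learn these from data).
    Returns (fused_vector, weights_over_available_modalities).
    """
    # Keep only the modalities that actually delivered data.
    available = [name for name, feat in features.items() if feat is not None]
    if not available:
        raise ValueError("at least one modality must be present")

    # Softmax over available modalities only, so a missing sensor
    # receives no weight regardless of its score.
    max_score = max(scores[name] for name in available)
    exp_scores = {name: math.exp(scores[name] - max_score) for name in available}
    total = sum(exp_scores.values())
    weights = {name: exp_scores[name] / total for name in available}

    # Weighted sum of the feature vectors (all assumed to share one dimension).
    dim = len(features[available[0]])
    fused = [
        sum(weights[name] * features[name][i] for name in available)
        for i in range(dim)
    ]
    return fused, weights


# Example: the depth sensor is missing, so only RGB and skeleton contribute.
example_features = {"rgb": [1.0, 0.0], "depth": None, "skeleton": [0.0, 1.0]}
example_scores = {"rgb": 0.0, "depth": 5.0, "skeleton": 0.0}
fused, weights = fuse_modalities(example_features, example_scores)
```

Restricting the softmax to the available modalities means the weights are renormalized whenever a sensor drops out, which is one simple way to avoid the single point of failure that unimodal models face.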

Second, comprehending embodied interaction can be studied by designing several fundamental tasks, such as understanding referring expressions, comprehending embodied question answering, and determining perspective in an interaction. As collecting real-world data is costly and existing simulators cannot generate human multimodal interactions (verbal utterances and nonverbal gestures), we have developed an embodied simulator, which we can use to generate synthetic multimodal human interactions and datasets. We can use these generated datasets to train and diagnose models for comprehending interactions with verbal utterances and nonverbal gestures.

Existing models for comprehending human interactions are designed to understand only verbal interaction from a single perspective and use the visual scene as context. However, people use multiple verbal and visual perspectives (speaker and observer perspectives) in real-world interactions. Moreover, our experimental analysis suggests that perspective awareness in the learning models is crucial to comprehending embodied interactions. We have developed a perspective-aware learning model to understand human instructions with verbal utterances and nonverbal gestures. Our experimental analysis suggests that our proposed model can effectively extract salient multimodal representations to comprehend embodied interactions.

Additionally, the existing models and datasets of visual question answering use the visual scene as a context to answer verbal questions. However, humans use multimodal expressions (verbal utterances and nonverbal gestures) to ask questions in real-world settings. We have developed an embodied question answering (EQA) dataset and designed new tasks to develop models for comprehending question-answering interactions in embodied settings. As the existing models are designed to answer verbal questions, these models are less suitable for comprehending EQA tasks. We have developed learning models to extract aligned representations from multiple verbal and visual perspectives to answer questions with multimodal expressions.

While we can generate diverse synthetic interactions using our simulators, these interactions may differ from real-world human interactions in aspects such as pointing gestures and eye gaze patterns, camera angles, object arrangements, and environments. Thus, we have curated a large-scale embodied interaction dataset with multimodal data (verbal utterances and nonverbal gestures) in real-world settings. We have evaluated baseline multimodal learning models on this real-world dataset. Existing multimodal models align multiple representations and thus lose information across modalities. To address this challenge, we have proposed a reinforced residual representation-based multimodal learning model for extracting robust multimodal representations to comprehend human interactions in real-world settings. Our experimental results suggest that our proposed model with guided attention-based reinforced residual representation outperforms the baseline visual-language models in various challenging evaluation settings.
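The abstract does not specify how the residual representation is built or reinforced, so the following is only a generic residual (skip) connection, shown to illustrate why adding unaligned modality representations back to an aligned one can retain information that alignment discards. The function name `residual_fusion` and the choice of averaging the modality representations are assumptions, not the proposed model.

```python
def residual_fusion(aligned, modality_reps):
    """Add the mean of the original modality vectors back to the aligned vector.

    aligned:       list of floats, the representation after cross-modal alignment.
    modality_reps: list of per-modality vectors (same dimension as `aligned`),
                   taken before alignment. Averaging them is an illustrative
                   assumption, not the thesis's method.
    """
    if not modality_reps:
        raise ValueError("at least one modality representation is required")
    dim = len(aligned)
    if any(len(rep) != dim for rep in modality_reps):
        raise ValueError("all representations must share one dimension")

    # Residual term: element-wise mean of the unaligned representations.
    count = len(modality_reps)
    residual = [sum(rep[i] for rep in modality_reps) / count for i in range(dim)]

    # Skip connection: aligned output plus the residual term.
    return [a + r for a, r in zip(aligned, residual)]


# Example with one aligned vector and two hypothetical modality vectors.
combined = residual_fusion([1.0, 1.0], [[2.0, 0.0], [0.0, 2.0]])
```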

Our multimodal learning models and datasets can help to develop and evaluate models for various tasks, such as embodied question answering, visual-language-based navigation, and human-robot interaction. These models can be extended to improve human interactions in both virtual and real-world settings, including providing improved user experiences for people with disabilities. Furthermore, these models can enhance the user experience of AI assistants, such as Amazon Alexa and Microsoft Cortana, strengthening their applications in online shopping, video gaming, and personalized online learning. Lastly, the findings from our work, together with the proposed models, benchmarks, datasets, and embodied simulators, can serve as valuable tools for the research community, fostering the development and evaluation of multimodal learning models for Human-AI Interaction systems.

PHD (Doctor of Philosophy)
multimodal learning, multitask learning, generative AI, human-centered AI, representation learning, visual-language model, NLP, computer vision, multimodal perception, perceiving embodied interactions
All rights reserved (no additional license for public reuse)
Issued Date: