A Multimodal Data Capture System To Enable Fluent Human-Robot Collaboration; A Socioethical Analysis Of Cameras Throughout History

Author:
Hagood, Camp, School of Engineering and Applied Science, University of Virginia
Advisors:
Earle, Joshua, EN-Engineering and Society, University of Virginia
Iqbal, Tariq, EN-SIE, University of Virginia
Sarker, Sujan, EN-Comp Science Dept Engineering Graduate, University of Virginia
Abstract:

Technical Project Abstract:
Effective human-robot collaboration (HRC) relies on intuitive and reliable communication modalities, particularly in dynamic environments where traditional verbal or wearable sensor-based systems may be unreliable. While gesture-based communication offers a natural and non-intrusive alternative, it remains challenging due to limitations in current recognition systems, such as their dependence on large labeled datasets and limited adaptability across varied environmental conditions. Recent advances in vision-language models (VLMs) have shown promise in video understanding and general reasoning. However, they often lack the domain-specific context required for accurate classification in specialized applications. To address these challenges, we introduce a novel gesture recognition system that leverages a vision-language model (VLM) guided by retrieval-augmented generation (RAG) and chain-of-thought (CoT) prompting to provide contextual understanding and reasoning. Our system captures upper-body gestures using an Azure Kinect, extracts sampled frames, and classifies them using GPT-4o enhanced by RAG over military gesture documentation and CoT reasoning strategies. Recognized gestures are encoded as ROS 2 messages and transmitted using a publisher-subscriber model to command a mobile robot to execute the corresponding actions. We validate our approach through controlled experiments using seven U.S. Marine Corps (USMC) gestures. The system achieved an accuracy of 80% and an F1 score of 89.9%, and demonstrated effective gesture-to-robot execution. Our results highlight the potential of VLMs for zero-shot gesture classification and robotic control, providing a foundation for robust, scalable, and field-deployable gesture-based HRC systems.
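To make the pipeline concrete, the sketch below (in Python, the typical language for ROS 2 nodes) illustrates how sampled frames might be sent to GPT-4o together with retrieved gesture documentation and a chain-of-thought instruction, and how the resulting label could be published as a ROS 2 message. The prompt wording, the /gesture_command topic, the label set, and the helper names are illustrative assumptions, not the system's actual implementation.

# Minimal sketch of the classification and ROS 2 publishing steps described
# above. Everything here is illustrative: prompt wording, topic name, label
# set, and frame paths are assumptions, not the project's actual code.
import base64

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["advance", "halt", "rally"]  # placeholder labels, not the real set


def encode_frame(path: str) -> dict:
    # Pack one sampled video frame as a base64 image part for the chat API.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def classify(frame_paths: list[str], retrieved_docs: str) -> str:
    # Ground the prompt in retrieved gesture documentation (RAG) and ask for
    # step-by-step reasoning (CoT) before a single final label.
    prompt = (
        "You are classifying a U.S. Marine Corps hand-and-arm signal.\n"
        f"Reference documentation:\n{retrieved_docs}\n"
        "Reason step by step about the arm positions across the frames, "
        f"then end your answer with exactly one label from: {', '.join(LABELS)}."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": prompt},
                               *map(encode_frame, frame_paths)]}],
    )
    # Crude parse: take the last token, since the prompt asks the model to
    # finish its reasoning with the label.
    return resp.choices[0].message.content.strip().split()[-1]


class GesturePublisher(Node):
    # Publishes recognized gesture labels on a hypothetical topic for the
    # robot's subscriber node to act on.
    def __init__(self):
        super().__init__("gesture_publisher")
        self.pub = self.create_publisher(String, "/gesture_command", 10)

    def send(self, label: str):
        self.pub.publish(String(data=label))
        self.get_logger().info(f"published gesture: {label}")


def main():
    rclpy.init()
    node = GesturePublisher()
    # Hypothetical frame paths sampled from the Azure Kinect stream.
    node.send(classify(["frame_0.jpg", "frame_1.jpg"], retrieved_docs=""))
    rclpy.shutdown()


if __name__ == "__main__":
    main()

On the robot side, a matching subscriber on the same topic would translate each published label into motion commands, completing the publisher-subscriber flow the abstract describes.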

STS Research Paper Abstract:
This paper takes a closer look at how cameras have shaped society, not just through their technological development, but through the social and ethical questions they raise. From early photography to modern smartphones and surveillance systems, cameras have changed the way people see themselves and each other. They have empowered people to document life, tell stories, and hold power to account, but they have also been used to monitor, manipulate, and control. Using ideas from Science, Technology, and Society (STS), this research traces the history of camera technology and asks how its growing presence affects issues like privacy, trust, and freedom. The paper explores how cameras operate in different areas of life, from journalism and activism to policing and social media, showing how these tools can both reflect and shape power dynamics. Particular emphasis is placed on contemporary discussions surrounding facial recognition technology, issues of consent, and the ownership and control of images and data. Rather than seeing cameras as purely good or bad, the paper argues that their impact depends on how we choose to use them, and who gets to decide. Ultimately, this research highlights the need for more thoughtful conversations and ethical frameworks around visual technology, so that cameras and their supporting technologies can serve society in ways that are fair, transparent, and respectful of human rights.

How These Projects Relate:
My technical project and STS research paper are connected through their shared focus on the impact of visual technologies, but each examines that impact from a different angle. In my technical project, I’m working on improving human-robot collaboration (HRC) through gesture-based communication, specifically for military applications. The system our team has developed uses vision-language models (VLMs) to recognize gestures and enable more intuitive interactions between humans and robots, especially in complex environments. While the goal is to improve robot functionality, the project also raises ethical concerns about the use of this technology in sensitive settings, like military operations.
In my STS research paper, I take a step back and explore the broader social and ethical implications of technologies like cameras, particularly around issues of privacy, surveillance, and consent. I look at how tools like facial recognition can both empower people and give rise to new forms of surveillance, raising questions about control and accountability. Both projects are tied together by the way visual technologies are transforming how humans interact with machines and each other. Each deals with how these technologies can be used for good, but each also highlights the need for careful thought about their ethical implications, especially when it comes to privacy and power.

Degree:
BS (Bachelor of Science)
Keywords:
robotics, AI, Artificial Intelligence, Vision-Language Model, VLM, LLM, Large Language Model
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2025/04/29