Abstract
Robot-assisted surgery (RAS) represents a milestone in modern surgical innovation, combining human dexterity with robotic precision. Despite widespread clinical adoption and the availability of rich kinematic and video data, current RAS platforms remain limited in autonomy and offer minimal real-time decision support. One critical unmet need is the automated detection of human operational errors (mistakes made by the surgeon during a procedure) that may compromise patient safety or reflect technical inefficiency. While existing work in skill assessment has provided tools for post hoc evaluation, these methods often rely on coarse, subjective metrics and overlook the nuanced contexts in which errors occur. Furthermore, they rarely account for the hierarchical structure of surgical activities, which ranges from high-level procedures down to fine-grained gestures and motions.
This dissertation addresses the gap between surgical activity modeling and real-time operational error detection through an activity-aware framework spanning three major thrusts.
The first thrust introduces a gesture-specific error rubric that distinguishes between executional and procedural errors. Using dry-lab demonstrations from the JIGSAWS dataset, we manually annotate errors and analyze their relationship with gesture types and kinematic profiles. We find that different gestures are associated with distinct error distributions and kinematic profiles, and that both executional and procedural errors correlate with longer trial durations and lower skill levels.
The second thrust develops a two-stage method, combining semantic segmentation with logic-based inference, for detecting surgical context states, which capture the tool-object interactions critical for understanding surgical intent and environment. Leveraging advances in video object segmentation, we adapt and evaluate the Space-Time Correspondence Network (STCN) model, which achieves state-of-the-art segmentation performance on surgical tools and objects in both dry-lab and real clinical scenarios. We also show that this method is interpretable and can detect context states in real time.
The third thrust develops runtime models for surgical error detection using two approaches. First, we develop a dual-input Siamese network for surgical error detection, and our experiments show that incorporating gesture- and task-specific information during training improves model performance. Second, we build multimodal transformer models and demonstrate that integrating gesture-level information, contextual cues, and descriptive textual prompts with video and kinematic data further improves performance.
Collectively, this work presents a novel framework for detecting surgical errors in real time by leveraging the hierarchical structure of surgical activities and integrating contextual information into the learning models. Our methods pave the way for context-aware surgical training systems and runtime safety monitors in future RAS platforms.