Long Video Content Analysis: Learning to Summarize Wireless Capsule Endoscopy Videos

Author: orcid.org/0000-0003-3834-1735
Adewole, Sodiq, Systems Engineering - School of Engineering and Applied Science, University of Virginia
Brown, Don, DS-Data Science School, University of Virginia
Syed, Sana, MD-PEDT Gastroenterology UPG-MD-PEDT Gastroenterology, University of Virginia
Barnes, Laura, EN-Eng Sys and Environment, University of Virginia
Doryab, Afsaneh, EN-Eng Sys and Environment, University of Virginia
Porter, Michael, EN-Eng Sys and Environment, University of Virginia

This dissertation has two overall goals: 1) minimizing experts’ review time on Capsule Endoscopy (CE) videos via video summarization; and 2) developing models that capture the temporal and topological relationships between frames in the videos, in contrast to the independent image analysis that has been addressed in the literature.
With an estimated 70 million Americans affected by digestive tract diseases each year, physicians use video capsule endoscopy (VCE) as a nonsurgical procedure to examine the entire digestive tract without the invasiveness of traditional upper and lower endoscopy procedures. While VCE eases the diagnosis of many digestive tract diseases, a single capsule endoscopy study can last between 8 and 11 hours, generating up to 100,000 images of various sections of the digestive tract.
Even when up to fifty thousand (50,000) images are obtained in a typical small bowel VCE study, the pathology of interest may be present in as few as a single frame. Physicians therefore have to review the entire video to identify the frames that show it.
Many researchers have proposed techniques to automate the analysis of CE frames; however, a large proportion of these techniques require fully labelled video frames for each class of abnormality in the video. Collecting frame-level annotations for medical video is not an easy task. In this dissertation, we developed novel models with three (3) levels of supervision to mitigate this problem. Our goal is to generate summaries of selected representative frames that capture the regions of abnormality in the GI tract, thereby saving the physician the time and effort required to review the entire video.
The first model in this dissertation is an unsupervised video shot boundary detector used for efficient temporal segmentation of the VCE video. The key novelty is the efficient representation of the video frame features with a 1-dimensional embedding. It is prohibitively expensive to temporally segment a video using the high-dimensional frame features extracted from a convolutional neural network (CNN) model. Therefore, we projected the frame features to a 1-dimensional embedding space to minimize the computational cost of detecting shot boundaries. Our experiments with multiple embedding algorithms show that encoding the video features using PCA achieved the best shot boundary detection performance on the videos.
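The idea of projecting frame features to one dimension before looking for shot boundaries can be illustrated with a minimal sketch. It is not the dissertation's implementation: the synthetic 512-dimensional features stand in for CNN features, and the function names, the jump rule, and the `threshold` value are all assumptions made for illustration.

```python
import numpy as np

def pca_1d(features):
    """Project (n_frames, n_dims) features onto their first principal component."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def detect_shot_boundaries(embedding, threshold=1.0):
    """Return frame indices at which a new shot starts.

    Hypothetical rule: a boundary is declared where the jump between
    consecutive 1-D values exceeds `threshold` standard deviations
    of the whole embedding.
    """
    jumps = np.abs(np.diff(embedding)) / embedding.std()
    return np.flatnonzero(jumps > threshold) + 1

# Synthetic stand-in for CNN features: three shots of 40 frames each,
# separated along one random direction, with small per-frame noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=512)
direction /= np.linalg.norm(direction)
offsets = np.repeat([0.0, 4.0, 1.5], 40)
features = offsets[:, None] * direction + rng.normal(scale=0.1, size=(120, 512))

embedding = pca_1d(features)
boundaries = detect_shot_boundaries(embedding)
print(boundaries.tolist())
```

Once the features are reduced to a single value per frame, boundary detection is a scan over a 1-D signal rather than a comparison of high-dimensional vectors, which is the source of the cost saving described above.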
Secondly, we developed a weakly-supervised temporal segmentation technique using graph-based representation learning. We believe the topological relationship between the frames is better captured by a graph convolutional neural network (GCNN) model, as it relaxes both the hard assumption of temporal dependence between the frames and the implicit frame-independence assumption of traditional CNN models. In addition, while a short video may satisfy a temporal correlation assumption, the multiple scenes and events in long videos may not. To build the model, we represented each video segment as a graph, each frame as a node in that graph, and the relationships between frames as edge weights. We trained the GCNN in a class-agnostic manner to act as a binary classifier that labels each node (frame) as normal or abnormal. During testing, chaining these per-frame predictions together allows us to detect scene changes and to segment the video into homogeneous, identifiable pathological units of normal and abnormal frames.
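A small sketch can make the graph formulation concrete. It is illustrative only: the abstract does not say how edge weights are computed or which graph-convolution variant is used, so the cosine-similarity edges within a temporal `window`, the symmetrically normalized propagation step, and all function names below are assumptions. The last function shows how per-frame binary predictions can be chained into segments.

```python
import numpy as np

def build_frame_graph(features, window=2):
    """Weighted adjacency over frames (assumed: cosine similarity,
    restricted to frames at most `window` positions apart)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = np.clip(unit @ unit.T, 0.0, None)  # keep weights non-negative
    idx = np.arange(len(features))
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    adj = sim * mask
    np.fill_diagonal(adj, 1.0)  # self-loops
    return adj

def gcn_layer(adj, x, weight):
    """One graph-convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ x @ weight, 0.0)

def chain_predictions(labels):
    """Merge runs of equal per-frame labels into
    (start, end_exclusive, label) segments."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    return [(int(s), int(e), int(labels[s])) for s, e in zip(starts, ends)]

rng = np.random.default_rng(0)
features = rng.normal(size=(7, 16))      # 7 frames, 16-D stand-in features
adj = build_frame_graph(features)
hidden = gcn_layer(adj, features, rng.normal(size=(16, 4)))  # untrained weights

# Per-frame predictions (0 = normal, 1 = abnormal) as a trained
# classifier might emit them; here they are written by hand.
segments = chain_predictions([0, 0, 0, 1, 1, 0, 0])
print(segments)  # [(0, 3, 0), (3, 5, 1), (5, 7, 0)]
```

The weights in `gcn_layer` are random here, so `hidden` only demonstrates the shape of the computation; in a trained model a final layer would map each node's embedding to the normal/abnormal decision.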
Lastly, leveraging the boundary detection technique above, we developed an end-to-end weakly supervised abnormality localization model that uses video-level labels to localize the frames in which the relevant disease is captured. A GCNN model was trained to generate an embedding for each video segment and then classify the segment into a binary category of abnormal or normal. We considered the full video as a graph, each video segment as a sub-graph, and the frames as the nodes.
The model was divided into two parts: graph classification and abnormality localization. The graph classification model, trained with a cross-entropy loss, classifies each sub-graph (video segment) into binary disease-agnostic classes, and the disease localization model selects, from each abnormal video segment, the relevant frames that contain the respective disease. An extension of this framework, which we describe in our future work, would be an end-to-end localization of the full long video with multiple abnormalities.
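The two-part structure can be sketched as follows. The abstract does not specify the pooling operation or how frames are selected, so this sketch assumes mean pooling of node embeddings, a linear classifier scored with binary cross-entropy, and top-k selection by per-node score; `w`, `b`, `top_k`, and the toy embeddings are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def segment_logit(node_embeddings, w, b):
    """Mean-pool node embeddings into a segment embedding, then score it."""
    return float(node_embeddings.mean(axis=0) @ w + b)

def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy loss for one segment with a video-level label."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-(label * np.log(prob) + (1 - label) * np.log(1.0 - prob)))

def localize(node_embeddings, w, b, top_k=2):
    """Return the sorted indices of the top_k highest-scoring frames."""
    scores = node_embeddings @ w + b
    return np.sort(np.argsort(scores)[-top_k:])

# Toy classifier parameters and node embeddings (all hypothetical).
w, b = np.array([1.0, 0.0]), -0.5
abnormal_segment = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0],
                             [4.0, 1.0], [0.0, 0.0]])
normal_segment = np.zeros((3, 2))

prob_abnormal = sigmoid(segment_logit(abnormal_segment, w, b))
prob_normal = sigmoid(segment_logit(normal_segment, w, b))

# Only segments classified as abnormal are passed on to localization.
frames = localize(abnormal_segment, w, b) if prob_abnormal > 0.5 else np.array([])
print(frames.tolist())  # [2, 3]
```

With mean pooling and a linear classifier, the segment logit is the average of the per-node scores, so ranking nodes by score is one simple way to attribute a segment-level decision back to individual frames without frame-level labels.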

PHD (Doctor of Philosophy)
Capsule Endoscopy, Video Content Analysis, Video Summarization, Abnormality Localization, Graph Neural Network
Sponsoring Agency:
The Naval Postgraduate School: Grant # N00244-19-1-0005
The National Center for Advancing Translational Science of the National Institutes of Health Award UL1TR003015/KL2TR003016
All rights reserved (no additional license for public reuse)
Issued Date: