Combining and evaluating computer vision and natural language processing techniques for use in clinical contexts

Author:
Sengupta, Saurav, School of Data Science, University of Virginia
Brown, Don, DS-Research, University of Virginia

In recent years, deep learning has led to accurate image-based disease predictors, better object detectors for isolating cell structures within images, automated tagging of clinical terms in text via accurate named entity recognition models, and many other advances that have the potential to improve healthcare outcomes by assisting medical practitioners in their work. In this work, we focus on applications of deep learning, specifically methods developed for computer vision and natural language processing, to Electronic Health Record (EHR) data. This includes structured data, such as diagnosis codes recorded over time for patients from different health systems and collected by the NIH, which we use to help pinpoint risk factors for Long COVID, as well as unstructured imaging and text data, in the form of high-resolution histopathology images paired with text descriptions, from which we develop an image captioning system that automatically generates text descriptions. We also focus on attention-based interpretability methods for deep learning models and how they can be used to explain model behavior, as we believe explaining model decisions is crucial for building trust in these models.

In the first portion of this research, we describe our image captioning system, developed to generate automatic text descriptions for high-resolution histopathology slides and built from pre-trained transformer-based models to address the lack of training data. We show that by encoding the Whole Slide Image (WSI) as a sequence of image tokens using a Vision Transformer pre-trained on a wide array of histopathology data, and feeding these tokens into a Bidirectional Encoder Representations from Transformers (BERT) based language model pre-trained on medical research and clinical notes, we can build a fairly performant captioning system that combines these two data modes, namely images and text. We also present the results of our investigation of cross-attention layers as explanations of model behavior.
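As a rough illustration of the cross-attention weights inspected for explanations, the following minimal NumPy sketch (illustrative only; all array names and dimensions are hypothetical, and the actual system uses pre-trained ViT and BERT weights rather than random vectors) computes scaled dot-product cross-attention between caption-token queries and WSI image-token keys. Each row of the resulting weight matrix is a distribution over image tokens, indicating which slide regions the decoder attended to when generating that caption token:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (n_text, d) -- one row per caption token
    keys, values: (n_img, d) -- one row per WSI image token
    Returns the attended values and the attention weight matrix;
    each row of the weights can be inspected to see which image
    tokens a caption token focused on.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (n_text, n_img)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over image tokens
    return weights @ values, weights

rng = np.random.default_rng(0)
n_text, n_img, d = 4, 16, 8                         # hypothetical sizes
q = rng.normal(size=(n_text, d))                    # caption-token queries
k = rng.normal(size=(n_img, d))                     # image-token keys
v = rng.normal(size=(n_img, d))                     # image-token values
out, attn = cross_attention(q, k, v)
# each row of `attn` sums to 1: a distribution over image tokens
print(out.shape, attn.shape)
```

In a real encoder-decoder captioner these weights come from trained projection layers inside the decoder; here the point is only the shape of the computation that makes per-token attribution over image regions possible.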

In the second portion of this research, we describe how we used a Long Short-Term Memory (LSTM) based neural network model, originally developed to analyze text sequences, to analyze temporally arranged diagnosis code data recorded up to the first COVID-19 infection and thereby investigate risk factors for Long COVID. While modeling sequential diagnosis codes has been done before, we extend this approach by re-using pre-trained embeddings to encode the diagnosis data instead of randomly initializing the embeddings. We also discuss the use of a semi-supervised learning technique called positive-unlabeled (PU) learning to improve model learning and build a better classification model. We then investigate these models using the self-attention weights generated during training, in keeping with our stated goal of building models whose behavior can be investigated.
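To make the PU step concrete, here is a hedged sketch of one common PU correction (the scaling approach of Elkan and Noto, named here as an illustrative stand-in rather than the dissertation's exact method), applied to synthetic two-dimensional data with a hand-rolled logistic regression; the labeling frequency `c`, the data, and all variable names are hypothetical:

```python
import numpy as np

def train_logreg(X, y, lr=0.3, epochs=1000):
    """Plain logistic regression fit by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic PU setup: true labels are hidden; only a fraction c of
# positives are labeled (s=1), everything else is unlabeled (s=0).
rng = np.random.default_rng(1)
n = 2000
y_true = rng.integers(0, 2, n)                        # hidden true labels
X = rng.normal(size=(n, 2)) + 2.5 * y_true[:, None]   # two shifted clusters
c = 0.4                                               # labeling frequency
s = (y_true == 1) & (rng.random(n) < c)               # observed labels

# Step 1: train a "non-traditional" classifier to predict s, not y.
w, b = train_logreg(X, s.astype(float))

# Step 2: estimate c = P(s=1 | y=1) as the mean score on labeled positives.
c_hat = predict_proba(X[s], w, b).mean()

# Step 3: rescale to recover P(y=1 | x) = P(s=1 | x) / c.
p_y = np.clip(predict_proba(X, w, b) / c_hat, 0.0, 1.0)

acc = ((p_y > 0.5) == (y_true == 1)).mean()
print(f"estimated c = {c_hat:.2f}, accuracy vs hidden labels = {acc:.2f}")
```

The design point carried over to the clinical setting is that patients without a Long COVID label are treated as unlabeled rather than confirmed negatives, and the classifier's scores are corrected accordingly.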

In this work we present different methods for applying deep learning in clinical contexts. The results presented here validate that models developed for language modeling, like LSTMs, can be successfully adapted to other sequential or time series data, such as temporally arranged diagnosis data. We also build an encoder-decoder model, inspired by existing image captioning models but adapted for high-resolution images, that can find applications in other domains, like remote sensing, where there is a need to handle high-resolution imaging data. We believe this work offers further proof that deep learning can be successfully used in clinical contexts, while also showing how model behavior can be investigated using attention-based modeling, which is crucial for developing trust in these black-box models.

PHD (Doctor of Philosophy)
computer vision, natural language processing, ai for healthcare, computational pathology, biomedical imaging
Issued Date: