Online Archive of University of Virginia Scholarship
Uncovering Signatures of Viral Sequences by Interpreting CNN Models using Integrated Gradients; Between Code and Cell: Collaborations and Conflicts over AI in Biological Research
Author
Bai, Yili, School of Engineering and Applied Science, University of Virginia
Advisors
Norton, Peter, EN-Engineering and Society, University of Virginia
Warren, Andrew, PV-BII-Biocomplexity Initiative, University of Virginia
Vullikanti, Anil, PV-BII-Biocomplexity Initiative, University of Virginia
Abstract
Artificial intelligence (AI) has become a defining force in both engineering and scientific discovery, transforming how knowledge is produced and verified. As AI systems proliferate in biology and public health, how can researchers ensure that models that excel at prediction also meet high standards of transparency, credibility, and ethical responsibility?
One domain where these concerns become especially tangible is genomics. Deep learning models now play a central role in identifying viral sequences within large genomic datasets, yet their decision-making processes remain largely opaque. To address this challenge, this project reconstructs a scalable data pipeline for metagenomic processing and introduces an Integrated Gradients-based framework designed to interpret Plinko, a convolutional neural network (CNN) trained for viral classification. Through this system, we traced the model’s predictions back to specific nucleotide and amino acid k-mer features and found that Plinko relies on broad, genome-wide compositional biases that are characteristic of viral and host genomes. Approximately 70% of the k-mer vocabulary across both sequence types contributes significant discriminative signals, reflecting well-established evolutionary and structural constraints such as CpG suppression in viral genomes and the hydrophobic composition of viral capsid proteins. By revealing these underlying patterns, the project offers a biologically grounded explanation for Plinko’s reasoning and establishes a generalizable approach for interpreting deep learning models in sequence classification tasks.
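As a rough illustration of the attribution idea described above, the following minimal sketch computes Integrated Gradients over k-mer frequency features. It is not the project's pipeline and does not use Plinko: the classifier is a hypothetical logistic regression standing in for the CNN, and the function names (kmer_frequencies, model, integrated_gradients), the all-zero baseline, the weights, and the example sequence are all assumptions made for illustration.

```python
import math
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Return the k-mer vocabulary and the normalized k-mer frequencies of seq."""
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0.0] * len(vocab)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows with ambiguous bases such as N
            counts[index[kmer]] += 1.0
    total = sum(counts) or 1.0
    return vocab, [c / total for c in counts]


def model(x, weights, bias):
    """Toy stand-in classifier (assumption): logistic regression on k-mer frequencies."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x, weights, bias):
    """Analytic gradient of the toy model's output with respect to its input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    The attribution for feature i is (x_i - baseline_i) times the average
    gradient along the straight path from the baseline to x.
    """
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_grad(point, weights, bias)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    return [(xi - b0) * a for xi, b0, a in zip(x, baseline, avg_grad)]


# Hypothetical example: weights that penalize the CG dinucleotide.
vocab, x = kmer_frequencies("ATGCGATATTAACGTTA", k=2)
weights = [-4.0 if kmer == "CG" else 0.5 for kmer in vocab]
bias = 0.0
baseline = [0.0] * len(x)  # all-zero baseline (an assumption)
attributions = integrated_gradients(x, baseline, weights, bias)

# Print the five k-mers with the largest absolute attribution.
for kmer, a in sorted(zip(vocab, attributions), key=lambda t: -abs(t[1]))[:5]:
    print(kmer, round(a, 4))
```

A useful sanity check on any such implementation is the completeness property of Integrated Gradients: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. The choice of baseline affects the attributions, so the zero vector used here is only one possible option.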
As AI proliferates in experimental biology, researchers, AI developers, and funding institutions competitively negotiate credibility and authority. Biological researchers preserve authority through validation, documentation, and empirical testing, while AI enterprises emphasize efficiency, automation, and scale. Drawing on institutional reports, practitioner forums, and case studies such as DeepMind’s AlphaFold and the European Molecular Biology Laboratory’s curation of predictive databases, the analysis shows that credibility in AI-enabled biology depends on shared practices that unite computational prediction with experimental verification. It concludes that the future of AI in scientific research depends not only on developing dual literacy, in which scientists learn to interpret algorithms and engineers learn to understand experimental rigor, but also on sustained collaboration between these two communities of expertise. Such collaboration ensures that the interpretive values of biology and the innovative capacities of computer science reinforce one another, creating systems that are both powerful and trustworthy in advancing scientific understanding.
School of Engineering and Applied Science
Bachelor of Science in Computer Science
Technical Advisors: Andrew Warren, Anil Vullikanti
STS Advisor: Peter Norton
Language
English
Rights
All rights reserved by the author (no additional license for public reuse)
Bai, Yili. Uncovering Signatures of Viral Sequences by Interpreting CNN Models using Integrated Gradients; Between Code and Cell: Collaborations and Conflicts over AI in Biological Research. University of Virginia, School of Engineering and Applied Science, BS (Bachelor of Science), 2025-12-11, https://doi.org/10.18130/hpve-vh37.