Online Archive of University of Virginia Scholarship
Transformer-Based Foundation Models and High-Performance Computational Tools for Chromatin Accessibility Analysis
Author
LeRoy, Nathaniel, Biomedical Engineering - School of Engineering and Applied Science, University of Virginia, ORCID: 0000-0002-7354-7213
Advisors
Sheffield, Nathan, Biochemistry and Molecular Genetics, University of Virginia
Abstract
Chromatin accessibility profiling through scATAC-seq has emerged as a powerful tool for understanding gene regulation and cellular heterogeneity. Yet, existing analytical methods fail to leverage the vast public datasets now available and remain computationally intensive for routine use. This dissertation presents a comprehensive framework for applying transformer-based transfer learning to scATAC-seq analysis by conceptualizing genomic regions as discrete linguistic tokens. This framework enables direct adaptation of natural language processing techniques to epigenomic data. We first present the gtars toolkit, which provides efficient Rust-based utilities for creating consensus genomic interval vocabularies and tokenizing datasets into these shared representations. Building on this foundation, we introduce scEmbed, which demonstrates that pre-trained Word2Vec-inspired region embeddings can accelerate clustering and enable cross-dataset cell-type annotation without external data modalities. Finally, to address the limitations of static embeddings, we present Atacformer, a transformer-based foundation model that generates contextualized representations of genomic intervals that capture cell-level chromatin accessibility patterns. Atacformer achieves strong performance in zero-shot clustering, annotation, and batch correction while maintaining interpretability through discrete token-level embeddings that reveal biologically meaningful regulatory relationships. Together, these contributions establish a scalable, transferable framework that bridges isolated experiments with unified biological insights, democratizing deep learning approaches for chromatin accessibility analysis across diverse biological contexts.
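The idea of treating genomic regions as discrete tokens can be illustrated with a minimal sketch. The code below is not the gtars API or the dissertation's implementation. It is a plain-Python illustration that assumes a small, hypothetical consensus vocabulary of non-overlapping, half-open intervals and maps each peak from a cell to the IDs of the vocabulary regions it overlaps. The names `VOCAB`, `build_index`, and `tokenize` are invented for this example.

```python
from collections import defaultdict

# Hypothetical consensus vocabulary: (chrom, start, end), half-open coordinates.
# A region's position in this list serves as its token ID.
VOCAB = [
    ("chr1", 100, 200),
    ("chr1", 300, 400),
    ("chr1", 500, 600),
    ("chr2", 50, 150),
]


def build_index(vocab):
    """Group vocabulary regions by chromosome, keeping each region's token ID."""
    index = defaultdict(list)
    for token_id, (chrom, start, end) in enumerate(vocab):
        index[chrom].append((start, end, token_id))
    return index


def tokenize(peaks, index):
    """Map a cell's peaks to the sorted, de-duplicated IDs of overlapping vocabulary regions."""
    tokens = set()
    for chrom, start, end in peaks:
        # .get avoids creating empty entries for chromosomes absent from the vocabulary.
        for r_start, r_end, token_id in index.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if r_start < end and start < r_end:
                tokens.add(token_id)
    return sorted(tokens)


if __name__ == "__main__":
    index = build_index(VOCAB)
    cell_peaks = [("chr1", 150, 320), ("chr2", 0, 50)]
    print(tokenize(cell_peaks, index))
```

The linear scan per peak keeps the sketch short. A production tool would more plausibly use a sorted or tree-based interval index, which is the kind of efficiency concern a compiled implementation is meant to address.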
Degree
PHD (Doctor of Philosophy)
Keywords
scATAC-seq; deep learning; representation learning
LeRoy, Nathaniel. Transformer-Based Foundation Models and High-Performance Computational Tools for Chromatin Accessibility Analysis. University of Virginia, Biomedical Engineering - School of Engineering and Applied Science, PHD (Doctor of Philosophy), 2025-12-03, https://doi.org/10.18130/nv8e-bp78.