Explainable Deep Generative Models, Ancestral Fragments, and Murky Regions of the Protein Structure Universe: Datasets, Models, and Analyses of Fold Space

Author: ORCID: orcid.org/0000-0002-5645-5050
Draizen, Eli, Biomedical Engineering - School of Engineering and Applied Science, University of Virginia
Bourne, Philip, DS-Deans Office, University of Virginia
Mura, Cameron, DS-Deans Office, University of Virginia

Modern proteins did not arise abruptly, as singular events, but rather over the course of at least 3.5 billion years of evolution. Can machine learning teach us how this occurred? The molecular evolutionary processes that yielded the intricate three-dimensional (3D) structures of proteins involve duplication, recombination and mutation of genetic elements, corresponding to short peptide fragments. Identifying and elucidating these ancestral fragments is crucial to deciphering the interrelationships amongst proteins, as well as how evolution acts upon protein sequences, structures and functions. Traditionally, structural fragments have been found using comparative approaches such as sequence alignment and 3D structural superposition, but that becomes challenging when proteins have undergone extensive permutations, such that two proteins can share a common architecture even though their topologies differ drastically (a phenomenon we term the Urfold). In my thesis, I develop several tools and datasets, leveraging decades' worth of structural biology knowledge in light of the Urfold model of protein structure, in order to decipher the underlying molecular bases for protein structural relationships.

• For my first aim, I developed a community resource to create and share protein properties (structural, biophysical and evolutionary) for use in structural bioinformatics pipelines that involve machine learning. These properties can be used as feature sets in any machine learning model; besides reusability and efficiency, such a resource also facilitates more reproducible workflows by ensuring that analyses are performed with standardized data. This project, termed ‘Prop3D’, is described in Chapter 2. The work has been written up for submission to a journal in December 2022.

• In my second aim, I designed a sequence-independent, alignment-free, rotationally-invariant similarity metric of protein inter-relationships based on Deep Generative Models and 3D structures. Motivated by the Urfold view of protein structure, this framework leverages similarities in latent spaces rather than in the 3D structures directly, and it encodes biophysical properties; this capability, in turn, allows higher orders of similarity to be detected among proteins that are presumed to be only distantly related. I used this new similarity metric to detect clusters, or ‘communities’, of similar protein structures using Stochastic Block Models. Unlike traditional clustering, this method allows proteins to span multiple clusters, thereby more explicitly accommodating the continuous nature of fold space. This project, termed ‘DeepUrfold’, is described in Chapter 3. The work, which was submitted to Bioinformatics in November 2022, is also available as a preprint at https://doi.org/10.1101/2022.07.29.501943.

• Finally, for my last aim, I sought to discover whether particular residues or peptide fragments from a given domain might be responsible for conferring the similarity or linkage to other domains (including relationships that may be exceedingly remote), using Layer-wise Relevance Propagation, an Explainable AI technique. This, in turn, creates an automatable, systematic and reproducible framework to identify new urfolds across the protein structure universe. This project, termed ‘DeepUrfold-explain’, is described in Chapter 4. Though somewhat nascent, the project has been accepted to the peer-reviewed Machine Learning in Structural Biology (MLSB) workshop at the Neural Information Processing Systems (NeurIPS) conference held in December 2022, and is also currently available as a preprint at https://doi.org/10.1101/2022.11.16.516787.
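
To make the idea behind Layer-wise Relevance Propagation concrete, the following is a minimal sketch of the generic LRP-epsilon rule on a toy, bias-free dense ReLU network. It is not the DeepUrfold-explain implementation: the function name `lrp_epsilon`, the toy weights and the three-feature input are all hypothetical, chosen only to show how an output score can be redistributed back onto individual input features (which, in the thesis setting, would correspond to residues or atoms).

```python
import numpy as np

def lrp_epsilon(weights, x, eps=1e-6):
    """Redistribute a network's output onto its inputs with the LRP-epsilon rule.

    Assumes a bias-free dense network: ReLU on hidden layers, linear output.
    `weights` is a list of (n_in, n_out) arrays (a hypothetical toy model).
    """
    # Forward pass, keeping the activations that enter each layer.
    activations = [np.asarray(x, dtype=float)]
    for i, W in enumerate(weights):
        z = activations[-1] @ W
        is_hidden = i < len(weights) - 1
        activations.append(np.maximum(z, 0.0) if is_hidden else z)

    # Backward pass: start from the output and move relevance layer by layer.
    relevance = activations[-1]
    for W, a in zip(reversed(weights), reversed(activations[:-1])):
        z = a @ W
        z = z + eps * np.where(z >= 0, 1.0, -1.0)  # stabiliser avoids division by zero
        s = relevance / z                          # relevance per unit of pre-activation
        relevance = a * (W @ s)                    # share it out by each input's contribution
    return relevance

# Toy example (illustrative values only).
toy_weights = [
    np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]),  # 3 inputs -> 2 hidden units
    np.array([[1.0], [2.0]]),                          # 2 hidden units -> 1 output
]
toy_input = np.array([1.0, 2.0, 3.0])
relevance = lrp_epsilon(toy_weights, toy_input)
print(relevance, relevance.sum())
```

In this sketch the per-input relevances sum (up to the small stabiliser) to the network output, which is the conservation property that makes the scores interpretable as each input's share of the prediction.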

PHD (Doctor of Philosophy)
Protein Structure, Protein Evolution, Deep Learning, Generative Models, Bioinformatics
Issued Date: