Efficient Genomic Interval Intersection Algorithms
Layer, Ryan, Computer Science - School of Engineering and Applied Science, University of Virginia
Robins, Gabriel, Department of Computer Science, University of Virginia
The comparison of diverse genomic datasets is fundamental to understanding genome biology. Because individual experiments typically focus on a single genomic feature (e.g., genes, regulatory elements, genetic variants), the discovery and characterization of higher-level functions requires the integration of many different experimental results. For example, certain genomic sequences called enhancers have been shown to modulate gene expression. Since these sequences also tend to have specific chemical markers (histone marks) and can be targets of sequence-specific DNA binding proteins (transcription factors), comparing the set of sequences with certain histone marks to the set that are bound by transcription factors can lead to the discovery of novel enhancers. This is a difficult problem, especially considering that it is common for these types of experiments to identify millions of sequences. However, it can be simplified by the use of a reference genome, which is a representative nucleotide sequence for the chromosomes in an organism's genome. A reference provides the coordinate system for mapping a given sequence to an interval based on the relative position of its matching subsequence. By mapping the sequences in feature sets to intervals, the comparison between sets is reduced to an interval set intersection problem. That is, if two features intersect, then they share common genomic sequences and can thus possibly have a biologically relevant relationship.
Given the utility of efficient interval intersection in genomic analysis, especially in the face of every-larger data sets, this thesis introduces three algorithms that improve upon the state-of-the-art. The Binary Interval Search (BITS) algorithm is an optimal algorithm for counting the number of intersections between two sets that is well suited for parallel processing on a GPU. By counting intersections without full enumeration (a fundamentally harder problem), BITS enables the large-scale comparison of thousands of genomic features on a single machine that would have otherwise required a cluster of over 100 nodes. The split-then-sweep algorithm extends the basic concepts introduced by BITS to the consideration of many genomic interval sets. Sets are partitioned into independent slices, which in many cases reduces the number of the intervals that must be considered, and in all cases exposes a significant amount of parallelism. Slicing improved sequential runtimes and parallel runtimes by a factor of 2 and 19, respectively. The LUMPY algorithm uses interval intersection as a platform to identify differences in genome structure from high-throughput sequencing data. Unlike any other algorithm in this domain, LUMPY is able to jointly consider many different types of evidence. This greatly increases its sensitivity, especially in cases where only a fraction of cells contain a specific variation, which is typical in cancer genomes. These algorithms significantly advance the field by greatly expanding the data that can be considered in the analysis of genome intervals and high-throughput sequencing data.
PHD (Doctor of Philosophy)
All rights reserved (no additional license for public reuse)