Characterizing and Correcting Tn5 Sequence Bias in ATAC-seq Data

Wolpe, Jacob, Cell Biology - School of Medicine, University of Virginia
Guertin, Michael, Biochemistry and Molecular Genetics, University of Virginia

Chromatin accessibility assays enable base-pair-resolution measurements of the chromatin landscape and facilitate inferences about regulatory state. These assays determine which genomic regions are actively regulated by quantifying enzymatic digestion of DNA in a given region. However, the enzymes that perform this digestion differ in their propensity to cleave specific sequences. This enzymatic sequence bias can introduce artifacts into the data and lead to misinterpretation in downstream analysis. Previous work to address this bias calculated the enzymatic bias for specific k-mers and scaled these values to the k-mers' genomic frequencies, an approach known as k-mer scaling. K-mer scaling successfully corrected nuclease bias, but not the bias of Tn5 transposase, the enzyme used in ATAC-seq. This dissertation shows that the breadth and complexity of Tn5 bias hinder the use of k-mer scaling for sequence bias correction. Comparing Tn5 bias with that of nucleases highlights why k-mer scaling is ineffective: Tn5 sequence bias depends on a region greater than 20 bp. K-mer scaling can only be applied to k-mers that have many instances in both the data set and the genome, which is not possible for k-mer sizes greater than 9 given the number of reads in most ATAC-seq data sets. To model this large bias window, we used a machine learning approach, rule ensemble, which integrates information from many input k-mers into a computational bias correction. We built a workflow around this approach, seqOutATACBias, to promote bias correction in other studies. We applied seqOutATACBias to naked DNA and found that it effectively diminishes local sequence bias and also corrects a previously unreported Tn5 regional bias for high GC content. Correcting enzymatic sequence bias is essential for determining the ground truth in chromatin accessibility assays.
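The limitation of k-mer scaling described above can be made concrete with a minimal sketch. It is not the seqOutATACBias implementation. It assumes that a scale factor is the ratio of a k-mer's background (genomic) frequency to its observed frequency at cut sites, and the function names and the 50 million read count are hypothetical choices for illustration only.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq, skipping k-mers with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_scale_factors(observed, background):
    """Assumed definition: factor = background frequency / observed frequency.

    A factor below 1 down-weights a k-mer the enzyme cuts more often than
    its genomic abundance predicts; a factor above 1 up-weights it.
    """
    obs_total = sum(observed.values())
    bg_total = sum(background.values())
    factors = {}
    for kmer, obs in observed.items():
        bg = background.get(kmer, 0)
        if bg == 0:
            continue  # no background estimate, so no factor can be computed
        factors[kmer] = (bg / bg_total) / (obs / obs_total)
    return factors


def mean_instances_per_kmer(n_reads, k):
    """Average number of reads available per possible k-mer (4**k of them)."""
    return n_reads / 4 ** k


if __name__ == "__main__":
    n_reads = 50_000_000  # hypothetical library size, not taken from the thesis
    for k in (5, 9, 12, 20):
        print(k, 4 ** k, mean_instances_per_kmer(n_reads, k))
```

Because the number of possible k-mers grows as 4^k, the average number of reads per k-mer falls quickly as k increases, and at k = 20 most k-mers would not be observed at all in a library of this size. That is the sense in which a bias window wider than 20 bp is out of reach for per-k-mer scaling.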

PHD (Doctor of Philosophy)
Bioinformatics, Computational Biology, NGS
Sponsoring Agency:
All rights reserved (no additional license for public reuse)
Issued Date: