Cross-Context Prediction of Transcription Factor Binding Sites for Unannotated ENCODE Data
Singh, Ritambhara, Computer Science - School of Engineering and Applied Science, University of Virginia
Qi, Yanjun, Department of Computer Science, University of Virginia
Robins, Gabriel, Department of Computer Science, University of Virginia
Finding transcription factor (TF) binding sites in DNA is of central importance for understanding the molecular mechanisms of gene regulation. Genome-wide maps of TF binding sites (TFBS) have recently become available for a few species through ChIP-seq experiments conducted by the ENCODE consortium. However, because existing ChIP-seq technologies are expensive and time-consuming, computational methods for accurately modeling the DNA sequence preferences of transcription factors and for predicting their binding sites remain critical, especially for under-studied cellular contexts such as the genomes of non-model species or cell types from rare diseases. Most existing studies of sequence-driven TF binding prediction have ignored cellular context, the data heterogeneity among multiple contexts, or both. We instead propose a novel method named "transfer string kernel" (TSK) that achieves better predictions of sequence-based TF binding preferences using cross-context sample adaptation. Relying on the idea of "knowledge transfer" from machine learning, TSK pursues context-specific TF binding prediction via three essential components: (1) TF binding patterns on DNA sequences are mapped to a high-dimensional feature space under the discriminative string kernel framework; (2) labeled TF binding examples from a source context are transferred to a target context by adaptively re-weighting the source samples using the kernel mean matching (KMM) estimator; (3) an instance-weighted support vector machine framework is then used to classify sequence segments as TF binding or non-binding sites in the target context.
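As a rough illustration of components (1) and (2), the following minimal sketch computes k-mer count features and approximate KMM weights. It is not the thesis implementation: a plain spectrum (k-mer count) kernel stands in for the string kernel, the KMM quadratic program is approximated by projected gradient descent with only the box constraint, and the function names, parameter values, and toy sequences are all hypothetical.

```python
import itertools
import numpy as np

def spectrum_features(seqs, k=3):
    """Map each DNA sequence to an L2-normalized vector of k-mer counts."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    X = np.zeros((len(seqs), len(kmers)))
    for row, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            col = index.get(seq[i:i + k])
            if col is not None:  # skip k-mers with symbols outside ACGT
                X[row, col] += 1
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def kmm_weights(K_ss, K_st, bound=10.0, steps=500):
    """Approximate kernel mean matching weights for the source samples.

    Minimizes 0.5 * b^T K_ss b - kappa^T b subject to 0 <= b_i <= bound,
    with kappa_i = (n_s / n_t) * sum_j K_st[i, j].
    Assumption: the usual constraint on the mean of b is omitted for brevity.
    """
    n_s, n_t = K_st.shape
    kappa = (n_s / n_t) * K_st.sum(axis=1)
    beta = np.ones(n_s)
    # Step size 1/L, where L is the largest eigenvalue of the PSD matrix K_ss.
    lr = 1.0 / (np.linalg.norm(K_ss, 2) + 1e-12)
    for _ in range(steps):
        grad = K_ss @ beta - kappa
        beta = np.clip(beta - lr * grad, 0.0, bound)
    return beta

# Toy sequences, invented for illustration only.
source = ["ACGTACGTAC", "TTTTTTTTTT", "ACGTTGCAAC"]
target = ["ACGTACGTTC", "ACGAACGTAC"]

Xs = spectrum_features(source)
Xt = spectrum_features(target)
weights = kmm_weights(Xs @ Xs.T, Xs @ Xt.T)
print(weights)
```

In this toy example the poly-T source sequence shares no 3-mers with the target sequences, so its weight is driven toward zero, while the source sequence most similar to the target set receives the largest weight. For component (3), such weights could be supplied as per-sample weights to a weighted SVM trained on the source kernel matrix, for example through the `sample_weight` argument of scikit-learn's `SVC.fit` with `kernel="precomputed"`.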
Using TF binding data from ENCODE, we experimentally verify TSK's ability to adapt from a source cell type (GM12878) to a target cell type (K562), and from a source genome (human) to a target genome (mouse). The proposed TSK method consistently improves TF binding predictions in these target contexts over baselines that do not account for heterogeneity or scarcity among samples, and over state-of-the-art sequence-based TF binding predictors.
MS (Master of Science)
All rights reserved (no additional license for public reuse)