FastGSK: Fast and Efficient Sequence Analysis Using Gapped String Kernels

Blakely, Derrick, Computer Science - School of Engineering and Applied Science, University of Virginia
Blakely, Derrick, EN-Comp Science Dept, University of Virginia

String kernel methods achieve strong classification performance on DNA, protein, and natural language data using modestly-sized training sets. However, existing kernel function algorithms suffer from slow kernel computation times, as they depend exponentially on the sub-sequence feature length and the task's alphabet size. In this work, we introduce a new string kernel algorithm using gapped k-mers called FastGSK. Compared to previous string kernels, it uses a simplified kernel algorithm that decomposes into a set of independent counting operations over the possible mismatch positions. This algorithm is naturally parallelizable, is performant on any sequence analysis task, and allows us to devise a fast Monte Carlo approximation method to scale to much greater feature lengths. We evaluate FastGSK on 10 DNA transcription factor binding site (TFBS) prediction tasks, 10 protein remote homology detection tasks, and 7 English-language medical named entity recognition tasks. FastGSK consistently matches or outperforms the state-of-the-art string kernel algorithms in AUC, while achieving average speedups in kernel computation of ~100x and speedups of ~800x for large feature lengths. We further show that FastGSK consistently outperforms character-level recurrent neural networks across these sequence analysis tasks. Our algorithm is available as a Python package and as C++ source code.

MS (Master of Science)
Sequence Analysis, String Kernels, Support Vector Machines
Issued Date: