Maximal Information from Local Alignments

Mills, Lauren, Cell Biology - Graduate School of Arts and Sciences, University of Virginia
Pearson, William, Department of Biochemistry and Molecular Genetics, University of Virginia

Accurate identification of homologs with sequence similarity search programs such as BLAST is central to converting genome sequence data into biological knowledge. BLAST, FASTA, and other widely used search programs use local alignments to identify homologous sequences based on shared domains, but the boundaries of local alignments reflect both the signal from homology and the intrinsic properties of the alignment scoring matrix. Reliable identification of homologous domains requires sensitive alignment methods, accurate statistical estimates, and accurate alignment boundaries. Matrices that produce sensitive searches can also produce inaccurate alignments for more closely related homologs. Past improvements in search strategies focused on search sensitivity and statistical accuracy but largely ignored boundary accuracy. Homologous overextension is a boundary error that occurs when two homologous domains are aligned but the alignment extends beyond the ends of the domains; it can propagate inaccurate functional predictions and contaminate models used in more sensitive similarity searches.
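As a rough illustration of the boundary error described above, the following Python sketch measures how far an alignment extends past an annotated domain on one sequence. The names `Interval`, `overextension`, and `is_overextended`, the 1-based inclusive coordinates, and the `tolerance` threshold are all assumptions made for this example; they are not taken from the thesis, which may define and measure overextension differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A residue range on one sequence (1-based, inclusive); hypothetical type."""
    start: int
    end: int


def overextension(alignment: Interval, domain: Interval) -> tuple[int, int]:
    """Return (left, right): residues by which the alignment runs past the domain.

    A value of 0 means the alignment stays inside the domain on that side.
    """
    left = max(0, domain.start - alignment.start)
    right = max(0, alignment.end - domain.end)
    return left, right


def is_overextended(alignment: Interval, domain: Interval, tolerance: int = 10) -> bool:
    """Flag an alignment whose overhang on either side exceeds `tolerance`.

    The default tolerance is an arbitrary illustrative choice, not a value
    from the thesis.
    """
    left, right = overextension(alignment, domain)
    return left > tolerance or right > tolerance


if __name__ == "__main__":
    domain = Interval(50, 200)       # annotated homologous domain
    alignment = Interval(40, 260)    # reported local alignment boundaries
    print(overextension(alignment, domain))
    print(is_overextended(alignment, domain))
```

In this sketch the alignment is judged only against a single annotated domain on one sequence; a fuller treatment would presumably check both the query and the subject coordinates.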

In this thesis, I discuss the theoretical and empirical basis for homologous overextension. In Chapter 1, I outline the properties of local similarity scoring matrices that can produce alignment overextension. In Chapter 2, I show that overextension occurs in 8% of alignments in comprehensive searches, increasing to 10% for the 100 most similar alignments. About half of this overextension occurs because of a mismatch between the alignment identity of the homologous domain and the target identity of the scoring matrix used in the initial alignment, and more than 85% of this high-identity alignment overextension can be corrected by shifting to the appropriate scoring matrix. In Chapter 3, I consider alignment overextension in other contexts and summarize additional strategies for identifying overextension. Alignment accuracy is central to effectively exploiting our growing knowledge about structure-function relationships, active sites, and variant phenotypes. Future characterizations of alignment methods should examine both internal and alignment boundary accuracy.

PHD (Doctor of Philosophy)
bioinformatics, sequence alignment, over extension, homology, BLAST, FASTA
All rights reserved (no additional license for public reuse)
Issued Date: