Protein Sequence Constraints

Lavelle, Daniel Thor, Department of Biochemistry and Molecular Genetics, University of Virginia
Khorasanizadeh, Sepideh, Department of Biochemistry and Molecular Genetics, University of Virginia
Bekiranov, Stefan, Department of Biochemistry and Molecular Genetics, University of Virginia
Kretsinger, Robert, Department of Biology, University of Virginia
Nakamoto, Robert, Department of Molecular Physiology and Biological Physics, University of Virginia
Stukenberg, Peter, Department of Biochemistry and Molecular Genetics, University of Virginia

To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino-acid words in proteins, we compared the frequencies of 4- and 5-amino-acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. While the human proteome has many over-represented clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, clump counts from non-redundant Pfam-AB sequences are well described by random models; from 1.9% (MC(0) model) to 0.1% (window shuffled model) of 4-amino-acid word clumps are 2-fold over-represented. Likewise, using 5-residue clumps from a structural 10-letter alphabet, from 4.7% (MC(0) model) to 0.5% (window shuffled model) of words are 2-fold over-represented in Pfam-AB. Using a false discovery rate q-value analysis, the number of exceptional 4- or 5-letter words in real proteins compared with random sequence models is similar to the number found when comparing words from one random model to another. Consensus over-represented words are not enriched in conserved regions of proteins, but 4- and 5-letter words are enriched in α-helical secondary structures (1.18 (i+1) to 1.61-fold (i+2)). To test whether local secondary structure sequence preferences constrain protein sequences as a whole, we examined the 9-letter binary (Hydrophobic/Polar) word clumps found in a structurally distinct, non-homologous library of topologs from CATH version 3.1 (CATH-T) and compared them to counts generated by random models based only on amino-acid frequency data ("sequence-only") or amino-acid frequencies in secondary structures ("structure-informed"). Statistically exceptional 9-letter binary (H/P) clumps were identified by q-value false discovery rate analysis.
Only 12% and 14.5% of the 512 possible words in CATH-T proteins are significantly over- and under-represented, respectively, when compared to window shuffled random sequences. However, when word clumps associated with α-helices, β-strands, and loops are examined separately, an MC(2) model that preserved tri-residue frequencies for α-helical regions fit the 9-residue clump data best. Most 9-letter words can be well described by short tri-residue frequencies. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.
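
The comparison described above (clump counts in real sequences versus counts under random sequence models) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several details are assumptions made for illustration: the hydrophobic residue set in `HYDROPHOBIC`, the reading of a "clump" as a run of self-overlapping occurrences of one word counted once, the window size passed to `window_shuffle`, and all function names. The q-value false discovery rate step is not shown.

```python
import random
from collections import Counter

# Assumption: one simple hydrophobic/polar partition; the thesis may use another.
HYDROPHOBIC = frozenset("ACFILMVW")


def to_hp(seq):
    """Map an amino-acid sequence to a binary H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in seq)


def clump_counts(seq, k):
    """Count clumps of each k-letter word.

    Assumed definition: occurrences of the same word that overlap the
    previous occurrence belong to the same clump and are counted once.
    """
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        prev = last_start.get(word)
        if prev is None or i - prev >= k:  # no overlap: a new clump starts
            counts[word] += 1
        last_start[word] = i
    return counts


def window_shuffle(seq, window, rng):
    """Shuffle residues inside consecutive windows (local composition kept)."""
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)


def mc0_sample(seq, rng):
    """Zero-order Markov model: independent draws from residue frequencies."""
    freqs = Counter(seq)
    letters = sorted(freqs)
    weights = [freqs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))


def fold_over_representation(seq, k, randomize, n_rep=100, seed=0):
    """Observed clump count divided by the mean count over random replicates.

    Words never seen in any replicate are skipped (ratio undefined).
    """
    rng = random.Random(seed)
    observed = clump_counts(seq, k)
    total = Counter()
    for _ in range(n_rep):
        total.update(clump_counts(randomize(seq, rng), k))
    return {w: observed[w] / (total[w] / n_rep)
            for w in observed if total[w] > 0}
```

For 9-letter H/P words, one might call `fold_over_representation(to_hp(protein), 9, lambda s, r: window_shuffle(s, 20, r))`, where `protein` is an amino-acid string and the window of 20 is an arbitrary placeholder. A binary alphabet gives 2^9 = 512 possible 9-letter words, which matches the count quoted above; a ratio of 2 or more corresponds to a word that is 2-fold over-represented relative to the chosen random model.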

Note: Abstract extracted from PDF text

PHD (Doctor of Philosophy)
All rights reserved (no additional license for public reuse)