Detect Positive Selection in Type IIL Restriction Enzyme Using PAML

The role of positive selection in molecular evolution has long been debated since the proposal of the neutral theory. With the explosive growth of genomic sequence data and aided by statistical methods such as Phylogenetic Analysis by Maximum Likelihood (PAML), recent years witnessed a flurry of reports of positive selection in protein-coding genes. However, with a few exceptions, the reported positively selected sites generally lacked functional data to support them. Restriction enzymes in bacteria are under constant selective pressure from the bacteria-phage arms race, and in some enzymes amino acids responsible for the specificity changes are known. Here I tested the performance of PAML using the functionally well characterized type IIL restriction enzymes as the model system. The result showed that PAML was highly consistent in detecting positive selection in the sequence data. However, in terms of identifying positively selected sites, PAML was very sensitive to the sampling bias in the dataset and had a high false negative rate and also possibly a high false positive rate. The result suggests positively selected sites identified by PAML should be treated with caution and validated by functional studies.


Introduction
Positive selection is defined as the process where variants in a population provide higher fitness and thus are favored by selection. The importance of positive selection has long been accepted at the morphological trait level. However, whether it is important at the molecular level is still debated. The neutral theory proposed that instead of natural selection, most evolutionary changes at the molecular level are caused by random fixation of neutral mutations (Kimura, 1983). Although it does not totally exclude positive selection, it does assume that positive selection is rare and thus does not play a major role in evolution at the molecular level. Ever since the proposal of neutral theory, there has been ongoing debate about the role of positive selection in molecular evolution.

Recent or ongoing positive selection
The advent of molecular biology made it possible to study molecular evolution using DNA and protein sequences. With accumulated sequence data, statistical methods have been developed for studying evolution at the molecular level. For recent or ongoing positive selection, mainly three tests are used by researchers. The first is the McDonald-Kreitman (MK) test that compares the nonsynonymous/synonymous substitution rate ratio within species (pN/pS) and between species (dN/dS) (McDonald & Kreitman, 1991). Because positive selection has a larger impact on increasing dN than pN, dN/dS significantly higher than pN/pS indicates positive selection. The second is linkage disequilibrium (LD). LD refers to non-random association of alleles at different loci.
Under neutral evolution, recombination causes LD around a new allele to decay substantially because it takes the allele a long time to reach high frequency. However, positive selection raises allele frequency too rapidly for recombination to break up the LD. Based on this principle, long-range LD is used as an indicator of positive selection (Sabeti et al., 2002). The third method is Fixation Index (F ST ) test (Lewontin & Krakauer, 1973). Recent positive selection can generate differentiation between populations. F ST is a measure of population difference and significantly different distribution of F ST among loci from the neutral expectation can be used to infer positive selection (Akey et al., 2002).
Using above methods, researchers started to find evidence of positive selection at the molecular level. Researchers not only can conduct MK test but also can calculate α (1 -dSpN/dNpS), which indicates the proportion of adaptive substitutions. Using the MK test, studies on various Drosophila species found substantial proportions of amino acid substitutions driven by positive selection (~45% - Smith & Eyre-Walker, 2002;~25% -Bierne & Eyre-Walker, 2004, ~54% -Begun et al., 2007. Besides Drosophila, another study surveyed animals from all phyla (e.g. insects, molluscs, annelids, echinoderms, reptiles, birds, and mammals) and concluded that for the majority of animals, more than 50% of amino acid substitutions were driven by positive selection (Galtier N, 2016).
Similar result (more than 50% adaptive substitutions) was also reported in enteric bacteria (Charlesworth & Eyre-Walker, 2006). After reviewing such studies, Hahn proposed a 'Selection Theory' in which he argued that positive selection is more prevalent and should serve as the null model for population genetics studies (Hahn, 2008). Besides MK test, positive selection was also detected by the other two methods.
By long-range LD, widespread signals for recent positive selection were found in the human genome (Voight et al., 2006;Sabeti et al., 2007;Williamson et al., 2007).
However, analyses in several model systems found no evidence of positive selection. No positive selection was detected in 330 human genes by the MK test comparing human, chimpanzees and Old World monkeys (Zhang & Li, 2005). Similarly, no positive selection was found in Arabidopsis or yeast (Foxe et al., 2008;Doniger et al., 2008). In addition these population genetics-based methods often do not have power to determine whether departure from the neutral model is due to positive selection or changes in population size. For example, for MK test, Eyre-Walker found that increase of effective population size can produce artificial evidence of positive selection when there were slightly deleterious amino acid substitutions (Eyre-Walker, 2002). For the other two methods, high false-positive problem has been reported. Akey reviewed several genome-wide positive selection studies in humans and found only 14.1% positive selected regions were identified by multiple studies (Akey, 2009).

Past positive selection
For past positive selection, the dN/dS (nonsynonymous/synonymous substitution rate ratio, also called ω) is the most widely used metric for detecting positive selection.
Assuming selection mostly acts only on nonsynonymous substitutions, ω can be used as a measure for selective pressure. In principle, ω can be binned into three categories: 1) ω close to 1 -neutral substitution. In this case nonsynonymous and synonymous substitutions are fixed at the same rate, which indicates no selection. 2) ω < 1 -purifying selection. In this case nonsynonymous substitutions are deleterious and thus are removed by selection. ω close to 0 indicates strong purifying selection that removes almost all nonsynonymous substitutions. 3) ω > 1 -positive selection. In this case, nonsynonymous substitutions provide selective benefits and are fixed at a higher rate than synonymous ones (Yang & Bielawski, 2000).
Traditional methods calculate an average dN/dS ratio for the entire protein sequence. This is highly conservative, because most sites of the protein may be under strong purifying selection (with dN/dS close to 0) and therefore the average dN/dS will be well under 1 even when some sites are under positive selection. Yang's group developed a ML method (PAML) that allowed ω to vary among sites. Two steps are involved: 1) a likelihood-ratio test (LRT) between the null model (ω <= 1) and the alternative positive selection model (ω <= 1 & ω > 1) is conducted. If LRT suggests positive selection, then 2) sites with high posterior probability to have ω > 1 are identified using a Bayes approach (Yang & Bielawski, 2000;Yang, 2007). Later, a branch-site model was developed to detect positive selection that only affects sites in specific lineages by allowing ω to vary both among sites and among lineages (Goldman & Yang, 1994).
Using dN/dS, various genes have been studied and showed evidence of positive selection (Ford, 2002). Among those genes, one classical example was the Major Histocompatibility Complex (MHC) gene in vertebrates (Hughes & Yeager, 1998) whose protein products present antigen peptides to T cells in the immune system. Hughes and Yeager tested the hypothesis that positive selection acting on peptide-binding region (PBR) maintains the polymorphisms of MHC by calculating dN and dS of PBR and non-PBR. They found significant higher dN relative to dS only in PBR, indicative of positive selection. Besides MHC, many immune-related genes have been shown to be under positive selection. Positively selected sites were identified in various members of Tolllike receptor gene family in mammals (TLR2 - Tschirren et al., 2011;TLR9 -Park et al., 2010;TLR22 -Sundaram et al., 2012), as well as in the CC chemokine receptor genes of mammalian immune system (Metzger & Thomas, 2010). Ford surveyed over 100 published studies and found 119 genes with statistical evidence of positive selection and found the largest group (47 genes) was involved in either host defense or parasite response (Ford, 2002). Similarly, a study that surveyed all genes in the Drosophila genomes also found that immune response genes contained extremely high proportion of positively selected codons (Heger & Ponting, 2007). This makes sense because immune system genes are thought to be under strong positive selection due to ongoing arms races between hosts and pathogens (Heger & Ponting, 2007;Aguileta et al., 2009;Ford, 2002).
Despite all these findings, multiple studies have also challenged PAML. False-positive problems have been reported by several papers. Suzuki and Nei found that PAML produced an unacceptably high false-positive rate in a simulation study (Suzuki & Nei, 2002). Hughes and Friedman showed that due to the stochastic nature of substitutions, dN/dS > 1 can happen just by chance and the chance was higher for short branches (Hughes & Friedman, 2008). By calculating the number of nonsynonymous and synonymous substitutions, they also found that sites identified by the branch-site model were largely due to lack of synonymous substitutions (Hughes & Friedman, 2008). One of the more convincing cases against PAML was presented by Yokoyama et al. using dim-light vision proteins in vertebrates as their model system (Yokoyama et al., 2008).
Using phylogenetic methods, they inferred ancestral states of the protein and then engineered 11 ancestral proteins by mutagenesis. After comparing the phenotypes of ancestral proteins and extant proteins, they identified 15 adaptive changes at 12 amino acid sites. However, PAML identified 8 totally different sites under positive selection (with ω > 1) but none of those sites were associated with any functional change.
Disagreement between the experimental results and PAML predictions demonstrated that detecting positive selection purely by statistical methods can be problematic. Using a similar approach, Zhuang et al. studied odorant receptor genes in primates and found little overlap between sites predicted by PAML and those with experimental support (Zhuang et al., 2009).
PAML has been widely used and become influential to our understanding of the role of the positive selection in molecular evolution. However, there is still an ongoing debate about whether findings from PAML analysis are reliable and biologically meaningful.
Due to the lack of model systems with functional data, PAML's performance was mostly evaluated by simulation studies (Anisimova et al., 2001(Anisimova et al., , 2002Wong et al., 2004;Yang & dos Reis, 2011). There is a clear need for more studies in real systems to test how well PAML works. Using type IIL restriction enzymes in bacteria as the model system, I will try to answer two questions: 1) Can PAML detect positive selection? 2) Does PAML correctly identify positively selected sites?

The model system
Bacteria in the natural environment are constantly attacked by phages, which outnumber the bacterial cells by a ratio of 10:1. In response, bacteria have evolved immune systems such as the restriction-modification system (RM system) to protect against phage infections, leading to a pronounced evolutionary arms race (Stern & Sorek, 2010). All RM systems consist of two functional modules, a methyltransferase to protect the host DNA by methylating specific nucleotide bases within the recognition sequence and an endonuclease to cleave unprotected foreign DNA at or near the recognition sequence (i.e., restriction site). The enormous selective pressure from the rapid turnover and evolution of phage particles has driven rapid evolution and diversification of the RM genes, particularly those involved in recognizing the restriction sites (Sharp et al., 1992;Murray et al., 1993;. There are 4,596 known restriction enzymes from more than 900 species that recognize 729 restriction sites (http://rebase.neb.com/rebase/rebase.html).
Unlike most genes studied for positive selection, the precise functions including the recognition sequence are known for most RM genes (Roberts et al., 2015). The type II RM system is the most common type with high specificity in both methylation and cleavage. For most type II RM systems, the methyltransferase and endonuclease are encoded by two separate genes and also act independently. However, both proteins need to recognize the same target DNA sequence (i.e., restriction site) for normal function.
Specificity change in merely one module or non-synchronous change in two modules will cause the host DNA to be cleaved by its own endonuclease. For this reason, there is strong constraint for evolutionary changes for type II RM, since synchronous changes are rare. However, in one family of type II RM proteins (type IIL REs), the two functional modules are merged into one protein and they share one target DNA recognition domain (TRD) (Callahan et al., 2016). As a result, the TRD of type IIL REs should be more tolerant to specificity changes and amenable to positive selection.
In support of this idea, evolutionary changes of type IIL RE specificity are frequent and labile. Among 21 type IIL REs compared in a study, each had its unique specificity (Morgan & Luyten, 2009). With a few exceptions, type IIL REs recognize 6-base restriction sites. Most bases in the recognition sequence are free to change with the exception of the 5th base, which is highly conserved. The molecular basis of sequence recognition in type IIL RE has been elucidated. Amino acids that correlated with the specificity changes were identified and 7 of them (corresponding to positions 645, 751, 773, 774, 806, 808, 810 in MmeI, Figure 1) were confirmed to cause specificity change at base 2, 3, 4 and 6 by mutagenesis experiment and the structure of enzyme-DNA complex (Morgan & Luyten, 2009;Callahan et al., 2016). Because those specificity changes correspond to functional changes in nature, the 7 sites are expected to be under positive selection to match the rapid evolution of the restriction sites in phages. As such, type IIL REs represent an excellent model system for studying adaptive molecular evolution. Simulation studies showed that sequence divergence affected the accuracy and power of PAML analysis (Anisimova et al., 2001(Anisimova et al., , 2002Wong et al., 2004;Yang & dos Reis, 2011). Accuracy was defined as the probability that a site predicted to be under positive selection was truly under positive selection, and power was defined as the probability that a site truly under positive selection was predicted to be under positive selection. To test the effect of sequence divergence on PAML analysis, I downloaded additional sequences that are close to the 6 seed sequences. Homologs were downloaded as above with one difference: all genomes of species in the top 50 hits of the blastp search were downloaded. I selected four subtrees that contained the seed sequences to assess the effect of sequence divergence on PAML analysis ( Figure 2). Within each subtree, different levels of sequence divergence were achieved by iteratively extracting smaller subclades using the original sequences as anchors till about 10 sequences were left.

Sequence datasets
Sequence divergence was measured in two different ways. The max dS measures the synonymous substitution rate (dS) between the two most distantly related sequences in the dataset, and the average dS is the average synonymous substitution rate between all pairs of sequences in the dataset.
To further expand the range of sequence divergence for PAML analysis, I also compiled a dataset consisting of homologs from a single species, Corynebacterium diphtheriae.
This species was chosen because in the original dataset, there was one pair of highly similar sequences from this species with different specificities, suggesting possible recent positive selection. I downloaded 180 C. diphtheriae genomes from NCBI from which I identified 28 unique homolog sequences.

Phylogenetic analysis
For all analyses in this study, protein sequences were aligned using MAFFT (version 7.205, Katoh & Standley, 2013) and the alignment was trimmed by ZORRO (Wu et al., 2012) with the cutoff 4 to remove low-quality aligned columns. The protein alignments were used to guide the alignment of corresponding nucleotide sequences using an inhouse perl script. RAxML (version 8.2.4, Stamatakis, 2014) was used in all phylogeny reconstruction. Substitution models used for protein and nucleotide alignment were PROTCATWAG and GTRCAT respectively. The topology of phylogenetic tree was based on the trimmed protein alignment and the branch lengths estimated using the nucleotide alignment were multiplied by 3 to get codon-based branch lengths for the following dN/dS ratio analyses.

Ancestral state/sequence reconstruction
Using the phylogeny of the 47 original sequences, the ancestral states of recognition site were reconstructed by parsimony using Mesquite (version 3.31, Maddison WP & DR Maddison, 2017). The ancestral sequences of the enzymes were reconstructed by maximum likelihood method using CODEML in PAML (version 4.9e, Yang, 2007).

PAML analysis
Two models of CODEML in PAML (version 4.9e, Yang, 2007) were used. The site model that allowed ω ratio to vary across all codons was used unless otherwise noted. I compared the recommended model pair M7 (beta) and M8 (beta&ω). M7 assumes a beta distribution (in the interval [0, 1]) for all ω among sites and serves as the null model. M8 adds one more class of ω that allows ω to be higher than 1 and serves as the alternative model. The likelihood ratio test was conducted for the pair of models first and if the difference was statistically significant, a Bayesian approach was then used to calculate the posterior probability of ω > 1 for each site and sites with a probability higher than 0.95 were identified as positively selected sites. To avoid likelihood being trapped in a local optimum, for each analysis I tried two initial ω values (0.4 and 1.5) as the starting points and only included the result with the higher likelihood. In all analyses except one case (the whole dataset), starting with different initial values always resulted in convergence to the same likelihood.
Because the 6th base in the recognition sequence is partially conserved (G or C), the branch-site model was applied to detect positive selection that might only act on a few branches (branches involving changes between G and C). The branch-site model was only applied on the original sequences because functional information was required to bin the branches into two different categories. I applied the branch-site model on the 47 original sequences and the 10 original sequences from subtree 2.

Bootstrap analysis
To investigate the consistency of PAML, I conducted a bootstrap analysis with 100 pseudoreplicates. I used subclade 2 dataset of subtree 2 for bootstrap analysis because PAML identified the largest number of sites under positive selection in this dataset. For each pseudoreplication, I randomly selected 50 sequences without replacement from the 82 sequences of the subclade 2 and carried out a PAML analysis as described above. The bootstrap support of a site was calculated as the number of times it was identified as positively selected sites in the 100 pseudoreplicates.

Evolution of type IIL RE specificity
As expected, the evolution of recognition sequences of type IIL REs was rapid (Figure 3).
Among the 6 bases of the recognition site, only the 5th base was highly conserved (A).
The 6th base changed with some constraint (G/C) while the remaining four bases all changed frequently. Using parsimony, I reconstructed the evolutionary history of type IIL RE specificity. The number of specificity change events inferred from the tree in Figure 3 was 20 for the 1st base, 19 for the 2nd base, 19 for the 3rd base, 20 for the 4th base and 11 for the 6th base ( Figure 4).
Next I reconstructed the ancestral states of the 7 amino acids that conferred restriction site specificity. In particular, I was interested in whether the amino acid substitutions that cause specificity changes in the mutagenesis experiment actually occurred during the history of evolution. Table 1

PAML analysis of the whole dataset
When the whole dataset was used (47 original sequences + 274 homologs), the likelihood ratio test indicated the existence of positive selection (p < 0.01, Table 2). However, the Bayesian approach did not identify any site with ω>1. The overall distribution of ω is shown in Figure 5A. Relatively high ω values were mostly found within the methyltransferase domain and TRD, but no ω exceeded 1.
The Max dS in the whole dataset was 9.02 synonymous substitution per synonymous site (Table 2), which meant on average 9 synonymous substitutions per codon had occurred since the divergence of the sequence pair. It was quite possible that the sequences in the whole dataset were too divergent for PAML to work. To reduce the sequence divergence, I carried out PAML analyses in subclades of the tree.

Effect of sequence divergence
Among the 22 subclades of sequences that I tested, PAML detected positive selection in 20 subclades (likelihood ratio test, Table 2). However, only the subtree 2 consistently identified positively selected sites at different sequence divergence levels. The distributions of ω from the subtree datasets were similar to that of the whole dataset in that relative high ω appeared in the methyltransferase and TRD ( Figure 5B). For the third subtree, although likelihood ratio tests were always significant along different sequence divergence, only the highest and the lowest divergent datasets identified sites. However, the sites were not consistent with each other (Table 2).
For 28 close sequences from C. diphtheriae, sequence divergence was substantially lower (max dS = 0.4 synonymous substitutions per synonymous site). I found no evidence of positive selection as there was no significant different likelihoods between the null and alternative models.  (594,631,656,669,708,745,767,774) appeared in more than half of the 100 datasets (bootstrap > 50, Figure 6) and their locations in the structure of the enzyme were shown in Figure 7.

Results from the branch-site model
For the branch-site model, no matter which dataset (the 47 original sequences or the 10 original sequences, Table 2) was used, no positive selection was detected as there was no significant difference in the likelihood between the null and the alternative model.

Evidence for positive selection in type IIL REs
My ancestral reconstruction of the restriction sites of type IIL REs showed there had been to the sampling bias in the dataset as discussed below.

Impact of sequence divergence
Sequence divergence had a large impact on positively selected sites identified by PAML.
PAML identified no positively selected sites in datasets containing the most and least divergent sequences (i.e., the whole dataset and sequences from C. diphtheriae). Within subtree 2, zero and two sites were found in the least and most divergent datasets respectively, while more sites were identified in the medium divergent datasets. This was expected because neither too divergent sequences nor too similar sequences are informative. Simulation study by Yang's group has shown that PAML's accuracy and power were higher when analyzing medium and high divergent datasets (Anisimova et al., 2002).
It has also been shown that adding more sequences largely increased both the accuracy and power of PAML (Anisimova et al., 2002). As a result, larger datasets were more tolerant of higher sequence divergence than smaller datasets and there was no one cutoff for sequence divergence in PAML analyses. As seen in this study, the effect of sequence divergence in the more divergent datasets was mitigated by the increasing number of sequences, as the average dS increased at a lower rate than the max dS. For reference, I surveyed some of the previous studies in which positively selected sites had been identified. The sequence divergence used in those studies measured by max dS ranged from 0.15 to 4.45 synonymous substitutions per synonymous site, and measured by average dS ranged from 0.067 to 1.52 (Yang, 1998;Yang, Swanson & Vacquier, 2000;Yang & Swanson, 2002;Viljakainen & Pamilo, 2008;Areal et al., 2011). The divergence of the datasets in this study overlapped with previous studies but spanned a larger range (max dS ranged from 0.4 to 9.66, and average dS ranged from 0.21 to 2.79 synonymous substitutions per synonymous site). The more divergent datasets in my study were compensated by including many more (>50) sequences than the previous studies.

False positives and negatives
7 sites responsible for specificity changes at bases 2, 3, 4, 6 in type IIL restriction enzymes have been identified by mutagenesis experiments. Most of mutants at these sites retained comparable activities (Morgan & Luyten, 2009;Callahan et al., 2016), suggesting few if any sites other than these 7 amino acids contributed to sequence recognition. The 7 sites were all located within the TRD and close to DNA ( Figure 1B).
Moreover, extensive parallel changes at sites (645,773,806,808,810) strongly implicated the importance of the amino acid replacement in the functional adaptation. As such, they represented the best candidates for positively selected sites in this enzyme and therefore were used as true positives to benchmark the performance of PAML.
Among the 17 sites identified by PAML, only one site (774) belonged to the set of true positive sites. All the other sites were either in the TRD or methyltransferase domain.
None of them were located close to DNA in the structure or correlated significantly with specificity changes (Morgan & Luyten, 2009;Callahan et al., 2016). Because amino acids not making direct contact with DNA can also influence specificity (Lukacs et al., 2000), I cannot exclude the possibility that other positively selected sites exist. On the other hand, for the reason I discussed above, I think it is unlikely many more such sites exist, except for the unidentified sites responsible for specificity changes at the 1st base. One possible reason why only site 774 has been identified by PAML is that site 774 was under diversifying selection, while other 5 sites with functional support were mostly under directional selection (Table 1). Yang's group also proposed the possibility that PAML is better at detecting recurrent diversifying selection, but lack the power in directional selection detection (Anisimova et al., 2002). The little overlap between sites identified by PAML and the 7 true positive sites showed that PAML had clearly a high false negative rate and also quite possible a high false positive rate. The same problem has been reported in previous studies that tested PAML in systems with functional data (Yokoyama et al., 2008;Zhuang et al., 2009).
My results also suggest that neither the posterior probability nor bootstrap value is a good metric of confidence that can be placed on the sites identified by PAML. All the identified sites had a posterior probability of >0.95 to have ω>1. Two of them (sites 708 and 656) also had very high bootstrap support values (> 80%).

Consistency of PAML
PAML was very consistent in detecting positive selection in type IIL REs. However, it was not so consistent in identifying positively selected sites. The average bootstrap value of the 55 identified sites was 20.5% and only 8 of them received a bootstrap support greater than 50%. The low bootstrap support suggested that signal used by PAML to identify positively selected sites was weak and PAML was very sensitive to the sampling bias in the dataset. Similarly, study using HA1 gene of human H3N2 influenza virus as the system also found similar inconsistency problem of PAML. They showed large variation among identified sites from seven highly overlapped datasets (Chen & Sun, 2011).

Conclusion
Using Type IIL as the model system, my results showed that PAML was very consistent at detecting positive selection in molecular sequences, but performed poorly in identifying positively selected sites. PAML failed to identify the vast majority of sites that were presumably under positive selection and instead identified many possible false positive sites, resulting in both low sensitivity and specificity. Given the likely high false positive rate, we should treat sites identified by PAML with caution and validate them with functional data whenever possible.  2ΔlnL (twice the log likelihood difference between models) is compared to a χ2 distribution with 2 degrees of freedom (critical values 5.99, 9.21 at 5% and 1% significance respectively, '**' means p<0.01). Only sites with probability higher than 0.95 to be with ω > 1 are included (sites with probability higher than 0.99 are underlined).

Tables
A C G C A G AchA6III A G C C A G