--- title: "scoup: Simulate Codons with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process" author: - name: Hassan Sadiq affiliation: Department of Statistics and Actuarial Science, Stellenbosch University, South Africa output: BiocStyle::html_document: toc_float: true package: scoup bibliography: citation.bib abstract: | Genetic simulations are an important part of molecular biology. They are useful for assessing the efficiency and the sensitivity of models of evolution. Despite their relevance, hardly any simulator dedicated to sequence generation for natural selection inference analyses exist on the Bioconductor platform. In the broader molecular evolution genre, existing genetic simulators are yet to fully exploit the correspondence between the population genomic and the phylogenetic literature. `scoup` was designed as a contribution toward filling these voids. With `scoup`, it is possible to explore the implications of the interplay between mutation and genetic drift on the phenomenological inferences of natural selection obtained from phylogenetic models. The ratio of the variance of non-synonymous to synonymous selection coefficient (vN/vS) is also presented as a reliable selection discriminant metric. Example code of how to use the package are presented with elaborate comments. vignette: | %\VignetteIndexEntry{scoup Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Introduction Statistical inference of the extent at which Darwinian natural selection had impacted on genetic data from multiple populations commands a healthy quota of the phylogenetic literature [@jacques2023]. Validation of these codon-based models relies heavily on simulated data. A search of the entries on the Bioconductor [@bioc2004] platform, on 29 July 2024, with keywords `codon`, `mutation`, `selection`, `simulate` and `simulation` returned a total of 72 unique (out of the 2300 available Software) packages. None of the retrieved entries was dedicated to codon data simulation for natural selection analyses. Given the ever increasing diverse types of models of natural selection inference from molecular data that exist, there is indeed need for applicable packages on the platform. Population genomic studies provided the mathematical foundation upon which phylogenetics thrived [@wright1931; @fisher1922; @hardy1908; @weinberg1908; @darwin1859]. The thirst to bridge the gap between these two genres of evolutionary biology continue to drive the invention of more complex models of evolution [@brosou2012]. Consequently, there is need to develop codon sequence simulators to match the growth. `scoup` is designed on the basis of the mutation-selection (MutSel) framework [@halpern1998] as a contribution to this quest. Only a couple of existing selection-focused simulators in the literature used the MutSel framework [@wilke2015; @arenas2014]. This is most probably due to the perceived complexity of the methodology. In `scoup`, the versatility of the @uhlenbeck1930 algorithm as a framework for evolutionary analyses [@bartoszek2017] was exploited to circumvent the complexity. In a bid to identify an appropriate quantifier that permits direct comparison between the degree of selection signatures imposed during simulation and that inferred, the ratio of the variance of the non-synonymous to synonymous selection coefficients (vN/vS) was discovered to be appropriate. The vN/vS statistic is consequently posited as a quality selection discriminant metric. `scoup` therefore represents an important contribution to the phylogenetic modelling literature. Example code of how to successfully use the package is presented below. # Installation Use the following code to install `scoup` from the Bioconductor platform. ```{r, eval=FALSE} if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("scoup") ``` # Sample Code Three primary evolutionary algorithms are available in `scoup`. These include the frequency-dependent [@jones2017; @ayala1974], the Ornstein-Uhlenbeck (OU) and the discrete algorithms. Example of `R` [@cran2024] code where these functions were utilised are presented. The homogeneous [@muse1994; @goldman1994] and heterogeneous (site-wise and branch-wise) [@nielsen1998; @kpond2005] selection inference modelling contexts were explored. Data quality was assessed by comparing the maximum likelihood inferred ($\omega$) and the analytically calculated ($\mathrm{d}N/\mathrm{d}S$) estimates to the magnitude of the imposed selection pressure (measured by vN/vS). Template code used to analyse the output data (to obtain $\omega$) with `PAML` [@sandra2023; @yang2007] and `FUBAR` [@murrell2013; @hyphy2005] are presented in the Appendix. The `R` code presented subsequently, require that the user should have already installed the `scoup` package. ## Ornstein-Uhlenbeck Sensitivity ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of extant taxa ## Excluded values contributed to results presented in article leaves <- 8 # 64 # Number of codon sites ## Excluded values contributed to results presented in article sSize <- 15 # 250 # Number of data replications for each parameter combination ## Edited count was used for the results presented in article sims <- 1 # 50 # OU reversion parameter (Theta) value ## Excluded values contributed to results presented in article eThta <- c(0.01) # c(0.01, 0.1, 1) # OU asymptotic variance value ## Excluded values contributed to results presented in article eVary <- c(0.0001) # c(0.0001, 0.01, 1) # OU landscape shift parameters hbrunoStat <- hbInput(c(vNvS=1, nsynVar=0.01)) # Sequence alignment size information seqStat <- seqDetails(c(nsite=sSize, ntaxa=leaves)) # Iterate over all listed OU variance values for(g in seq(1,length(eVary))){ # Iterate over all listed OU reversion parameter values for(h in seq(1,length(eThta))){ # Create appropriate simulation function ("ou") object adaptStat <- ouInput(c(eVar=eVary[g],Theta=eThta[h])) # Iterate over the specified number of replicates for(i in seq(1,sims)){ # Execute simulation simData <- alignsim(adaptStat, seqStat, hbrunoStat, NULL) } } } # Print simulated alignment seqCOL(simData) ``` As expected, the correlation coefficient estimate was approximately $0.9974$ when the means of the inferred ($\omega$) and the calculated ($\mathrm{d}N/\mathrm{d}S$) selection effects were compared. The correlation estimation included all the commented values. ## vN/vS Sensitivity ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of extant taxa ## Omitted value was used for the results presented in article xtant <- 8 # 64 # Number of codon sites ## Omitted count was used for the results presented in article siteSize <- 15 # 64 # Number of data replications for each parameter combination ## Omitted count was used for the results presented in article simSize <- 1 # 50 # Variance of the non-synonymous selection coefficients ## Excluded values contributed to results presented in article nsynVary <- c(0) # c(0, 0.001, 0.1) # Ratio of the variance of the non-synonymous to synonymous coeff. ## Excluded values contributed to results presented in article vNvSvec <- c(0) # c(0, 0.001, 1, 10) # Sequence alignment size information seqStat <- seqDetails(c(nsite=siteSize, ntaxa=xtant)) # Iterate over all listed coefficient variance ratios for(a in seq(1,length(vNvSvec))){ # Iterate over all listed non-synonymous coefficients variance for(b in seq(1,length(nsynVary))){ # Create appropriate simulation function ("omega") object adaptData <- wInput(list(vNvS=vNvSvec[a],nsynVar=nsynVary[b])) # Iterate over the specified number of replicates for(i in seq(1,simSize)){ # Execute simulation simulateSeq <- alignsim(adaptData, seqStat, NA) } } } # Print simulated alignment cseq(simulateSeq) ``` Sequences generated with the presented code (with the excluded values activated) produced strongly correlated selection inferences (correlation coefficient $\approx 0.9923$) when the average $\mathrm{d}N/\mathrm{d}S$ and the $\omega$ values were compared. This implementation is an example of how to execute the frequency-dependent evolutionary technique with the package. ## Site-wise Application ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of codon sites ## Commented value was used for results presented in article sitesize<- 15 # 100 # Variance of non-synonymous selection coefficients nsynVary <- 0.01 # Number of extant taxa ## Commented value was used for results presented in article taxasize <- 8 # 1024 # Sequence alignment size information seqsEntry <- seqDetails(c(nsite=sitesize, ntaxa=taxasize)) # Create the applicable ("ou") object for simulation function ## eVar= OU asymptotic variance, Theta=OU reversion parameter adaptEntry <- ouInput(c(eVar=0.1,Theta=1)) # Ratio of the variance of the non-synonymous to synonymous coeff. ## Excluded values contributed to results presented in article sratio <- c(0) # c(0, 1e-06, 1e-03, 0.1, 1, 10, 1000) # Iterate over all listed coefficient variance ratios for(a0 in seq(1,length(sratio))){ # OU landscape shift parameters mValues <- hbInput(c(vNvS=sratio[a0], nsynVar=nsynVary)) # Execute simulation simSeq <- alignsim(adaptEntry, seqsEntry, mValues, NA) } # Print simulated codon sequence cseq(simSeq) ``` This is another example of how to call the OU shifting landscape evolutionary approach. The results obtained yielded a pairwise correlation coefficient estimate of approximately $0.9988$ between the means of $\mathrm{d}N/\mathrm{d}S$ and $\omega$. The correlation coefficient estimates were approximately $0.8123$ and $0.8305$ when the averages were each compared to vN/vS, respectively. ## Branch-wise (Episodic) Test ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of internal nodes on the desired balanced tree iNode <- 3 # Number of required codon sites ## Excluded value was used for the results presented in article siteCount <- 15 # 1000 # Variance of non-synonymous selection coefficients nsnV <- 0.01 # Number of data replications for each parameter combination ## Edited count was used for the results presented in article nsim <- 1 # 50 # Ratio of the variance of the non-synonymous to synonymous coeff. ## Excluded values contributed to results presented in article vNvSvec <- c(0) # c(0, 1e-06, 1e-03, 0.1, 1, 10, 100) # Sequence alignment size information seqsBwise <- seqDetails(c(nsite=siteCount, blength=0.10)) # Iterate over all listed coefficient variance ratios for(h in seq(1,length(vNvSvec))){ # Iterate over the specified number of replicates for(i in seq(1,nsim)){ # Create the parameter set applicable at each internal tree node scInput <- rbind(vNvS=c(rep(0,iNode-1),vNvSvec[h]), nsynVar=rep(nsnV,iNode)) # Create the applicable ("discrete") object for simulation function adaptBranch <- discreteInput(list(p02xnodes=scInput)) # Execute simulation genSeq <- alignsim(adaptBranch, seqsBwise, NULL) } } # Print simulated sequence data seqCOL(genSeq) ``` The correlation coefficient between the averages of the analytical $\mathrm{d}N/\mathrm{d}S$ and the inferred $\omega$ estimates was approximately $0.9998$, obtained from 50 independent iterations of the code for all the listed vN/vS values. The correlation coefficient estimate was approximately $0.6349$ for vN/vS vs $\omega$ and $0.6360$ for vN/vS vs $\mathrm{d}N/\mathrm{d}S$. # Conclusion Reference `scoup` code were presented to facilitate use of the package. Although not explicitly presented, it is also possible to generate data with signatures of directional selection by setting the `aaPlus` element of the `wInput` entry of the `alignsim` function accordingly. The capacity of the package is expected to be extended in future versions. # Citation A more appropriate citation will be provided for the package after it has been accepted to the Bioconductor platform and after the corresponding article has been accepted for publication. In the meantime, to cite this package, use Sadiq, H. 2024. "scoup: Simulate Codon Sequences with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process". R Package. doi:10.18129/B9.bioc.scoup. # References # Appendix: Sample Data Analyses (Non-R) Code `CODEML` script executed in `PAML` to infer single alignment-wide $\omega$ estimates for data sets generated from 50 independent executions of each of the sensitivity analyses code presented above. The same `CODEML` script was used to analyse data (also 50 replicates) from the episodic analyses code, with the `model` entry replaced with `2`. The `scoup` simulated sequence data and tree are `seq.nex` and `seq.tre`, respectively. ```{bash, eval=FALSE} seqfile = seq.nex treefile = seq.tre outfile = seq.out noisy = 0 verbose = 0 seqtype = 1 ndata = 1 icode = 0 cleandata = 0 model = 0 NSsites = 0 CodonFreq = 3 estFreq = 0 clock = 0 fix_omega = 0 omega = 1e-05 ``` \vspace*{1cm} `FUBAR` command executed with `HyPhy` through the terminal in MacBook. Note that `HyPhy` was already installed on the computer. The `seq.nex` input is the `scoup` simulated codon sequence data that is saved in the same NEXUS file with the tree data. The NEXUS file resides in the working directory. ```{bash, eval=FALSE} hyphy fubar --code Universal --alignment seq.nex --tree seq.nex ``` # Session info The output of `sessionInfo()` from the computer where this file was generated is provided below. ```{r sessionInfo, echo=FALSE} sessionInfo() ```