--- title: "scoup: Simulate Codons with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process" author: - name: Hassan Sadiq affiliation: Department of Statistics and Actuarial Science, Stellenbosch University, South Africa output: BiocStyle::html_document: toc_float: true package: scoup bibliography: citation.bib abstract: | Genetic simulations are an important part of molecular biology. They are useful for assessing the efficiency and the sensitivity of models of evolution. Despite their relevance, hardly any simulator dedicated to sequence generation for natural selection inference analyses exist on the Bioconductor platform. In the broader molecular evolution genre, existing genetic simulators are yet to fully appreciate the correspondence between the population genomic and the phylogenetic literature. `scoup` is designed as a contribution toward filling these voids. With `scoup`, it is possible to explore the implications of the interplay between mutation and genetic drift on the phenomenological inferences of natural selection obtained from phylogenetic models. The ratio of the variance of non-synonymous to synonymous selection coefficient (vN/vS) is also presented as a reliable selection discriminant metric. Example code of how to use the package are presented with elaborate comments. vignette: | %\VignetteIndexEntry{scoup Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Introduction Statistical inference of the extent at which Darwinian natural selection had impacted on genetic data from multiple populations commands a healthy portion of the phylogenetic literature [@jacques2023]. Validation of these codon-based models relies heavily on simulated data. A search of the entries on the Bioconductor [@bioc2004] platform, on 29 July 2024, with keywords `codon`, `mutation`, `selection`, `simulate` and `simulation` returned a total of 72 unique (out of the 2300 available Software) packages. None of the retrieved entries was dedicated to codon data simulation for natural selection analyses. Given the ever increasing diversity natural selection inference models that exist, there is indeed need for applicable packages on the platform. Population genomic studies provided the mathematical foundation upon which phylogenetics thrived [@wright1931; @fisher1922; @hardy1908; @weinberg1908; @darwin1859]. The thirst to bridge the gap between these two genres of evolutionary biology continue to drive the invention of more complex models of evolution [@brosou2012]. Consequently, there is need to develop codon sequence simulators to match the growth. `scoup` is designed on the basis of the mutation-selection (MutSel) framework [@halpern1998] as a contribution to this quest. Only a couple of existing selection-focused simulators in the literature used the MutSel framework [@wilke2015; @arenas2014]. This is most probably due to the perceived complexity of the methodology. In `scoup`, the versatility of the @uhlenbeck1930 algorithm as a framework for evolutionary analyses [@bartoszek2017] was used to circumvent the complexity. In a bid to identify an appropriate quantifier that permits direct comparison between the degree of selection signatures imposed during simulation and that inferred, the ratio of the variance of the non-synonymous to synonymous selection coefficients (vN/vS) was discovered to be appropriate. The vN/vS statistic is consequently posited as a quality selection discriminant metric. `scoup` therefore represents an important contribution to the phylogenetic modelling literature. Example code of how to successfully use the package is presented below. # Installation Use the following code to install `scoup` from the Bioconductor platform. ```{r, eval=FALSE} if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("scoup") ``` # Sample Code Three primary evolutionary algorithms are available in `scoup`. These include the frequency-dependent [@jones2017; @ayala1974], the Ornstein-Uhlenbeck (OU) and the episodic algorithms. Example of `R` [@cran2024] code where these functions are utilised are presented. The homogeneous [@muse1994; @goldman1994] and heterogeneous (site-wise and branch-wise) [@nielsen1998; @kpond2005] selection inference modelling contexts are explored. Data quality is assessed by comparing the maximum likelihood inferred ($\omega$) and the analytically calculated ($\mathrm{d}N/\mathrm{d}S$) estimates to the magnitude of the imposed selection pressure (measured by vN/vS). Template code used to analyse the output data (to obtain $\omega$) with `PAML` [@sandra2023; @yang2007] and `FUBAR` [@murrell2013; @hyphy2005] are presented in the Appendix. The `R` code presented subsequently, require that the user should have already installed the `scoup` package. ## Ornstein-Uhlenbeck Sensitivity ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of extant taxa leaves <- 8 # Number of codon sites sSize <- 15 # Number of data replications for each parameter combination sims <- 1 # OU reversion parameter (Theta) value eThta <- c(0.01) # OU asymptotic variance value eVary <- c(0.0001) # OU landscape shift parameters hbrunoStat <- hbInput(c(vNvS=1, nsynVar=0.01)) # Sequence alignment size information seqStat <- seqDetails(c(nsite=sSize, ntaxa=leaves)) # Iterate over all listed OU variance values for(g in seq(1,length(eVary))){ # Iterate over all listed OU reversion parameter values for(h in seq(1,length(eThta))){ # Create appropriate simulation function ("ou") object adaptStat <- ouInput(c(eVar=eVary[g],Theta=eThta[h])) # Iterate over the specified number of replicates for(i in seq(1,sims)){ # Execute simulation simData <- alignsim(adaptStat, seqStat, hbrunoStat, NULL) } } } # Print simulated alignment seqCOL(simData) ``` As expected, the correlation coefficient estimate was approximately $0.9974$ when the means of the inferred ($\omega$) and the calculated ($\mathrm{d}N/\mathrm{d}S$) selection effects were first compared. The correlation estimation included more iterations. ## vN/vS Sensitivity ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of extant taxa xtant <- 8 # Number of codon sites siteSize <- 15 # Number of data replications for each parameter combination simSize <- 1 # Variance of the non-synonymous selection coefficients nsynVary <- c(0.001) # Ratio of the variance of the non-synonymous to synonymous coeff. vNvSvec <- c(1) # Sequence alignment size information seqStat <- seqDetails(c(nsite=siteSize, ntaxa=xtant)) # Iterate over all listed coefficient variance ratios for(a in seq(1,length(vNvSvec))){ # Iterate over all listed non-synonymous coefficients variance for(b in seq(1,length(nsynVary))){ # Create appropriate simulation function ("omega") object adaptData <- wInput(list(vNvS=vNvSvec[a],nsynVar=nsynVary[b])) # Iterate over the specified number of replicates for(i in seq(1,simSize)){ # Execute simulation simulateSeq <- alignsim(adaptData, seqStat, NA) } } } # Print simulated alignment cseq(simulateSeq) ``` Sequences generated with the presented code (with more iterations included) produced strongly correlated selection inferences (correlation coefficient $\approx 0.9923$) when the average $\mathrm{d}N/\mathrm{d}S$ and the $\omega$ values were compared. This implementation is an example of how to execute the frequency-dependent evolutionary technique with the package. ## Site-wise Application ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of codon sites sitesize<- 15 # Variance of non-synonymous selection coefficients nsynVary <- 0.01 # Number of extant taxa ## Commented value was used for results presented in article taxasize <- 8 # Sequence alignment size information seqsEntry <- seqDetails(c(nsite=sitesize, ntaxa=taxasize)) # Create the applicable ("ou") object for simulation function ## eVar= OU asymptotic variance, Theta=OU reversion parameter adaptEntry <- ouInput(c(eVar=0.1,Theta=1)) # Ratio of the variance of the non-synonymous to synonymous coeff. sratio <- c(1e-06) # Iterate over all listed coefficient variance ratios for(a0 in seq(1,length(sratio))){ # OU landscape shift parameters mValues <- hbInput(c(vNvS=sratio[a0], nsynVar=nsynVary)) # Execute simulation simSeq <- alignsim(adaptEntry, seqsEntry, mValues, NA) } # Print simulated codon sequence cseq(simSeq) ``` This is another example of how to call the OU shifting landscape evolutionary approach. The results obtained yielded a pairwise correlation coefficient estimate of approximately $0.9988$ between the means of $\mathrm{d}N/\mathrm{d}S$ and $\omega$. The correlation coefficient estimates were approximately $0.8123$ and $0.8305$ when the averages were each compared to vN/vS, respectively. ## Branch-wise (Episodic) Test ```{r, eval=TRUE} # Make package accessible in R session library(scoup) # Number of internal nodes on the desired balanced tree iNode <- 3 # Number of required codon sites siteCount <- 15 # Variance of non-synonymous selection coefficients nsnV <- 0.01 # Number of data replications for each parameter combination nsim <- 1 # Ratio of the variance of the non-synonymous to synonymous coeff. vNvSvec <- c(1e-06) # Sequence alignment size information seqsBwise <- seqDetails(c(nsite=siteCount, blength=0.10)) # Iterate over all listed coefficient variance ratios for(h in seq(1,length(vNvSvec))){ # Iterate over the specified number of replicates for(i in seq(1,nsim)){ # Create the parameter set applicable at each internal tree node scInput <- rbind(vNvS=c(rep(0,iNode-1),vNvSvec[h]), nsynVar=rep(nsnV,iNode)) # Create the applicable ("discrete") object for simulation function adaptBranch <- discreteInput(list(p02xnodes=scInput)) # Execute simulation genSeq <- alignsim(adaptBranch, seqsBwise, NULL) } } # Print simulated sequence data seqCOL(genSeq) ``` The correlation coefficient between the averages of the analytical $\mathrm{d}N/\mathrm{d}S$ and the inferred $\omega$ estimates was approximately $0.9998$, obtained from 50 independent iterations of the code for a set of vN/vS values. The correlation coefficient estimate was approximately $0.6349$ for vN/vS vs $\omega$ and $0.6360$ for vN/vS vs $\mathrm{d}N/\mathrm{d}S$. # Conclusion Reference `scoup` code are presented to facilitate use of the package. Although not explicitly presented, it is also possible to generate data with signatures of directional selection by setting the `aaPlus` element of the `wInput` entry of the `alignsim` function accordingly. The capacity of the package is expected to be extended in future versions. # Citation A more appropriate citation will be provided for the package after it has been accepted for publication. In the meantime, to cite this package, use Sadiq, H. 2024. "scoup: Simulate Codon Sequences with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process". R Package. doi:10.18129/B9.bioc.scoup. # References # Appendix: Sample Data Analyses (Non-R) Code `CODEML` script executed in `PAML` to infer single alignment-wide $\omega$ estimates for data sets generated from 50 independent executions of the sensitivity analyses code presented above. The same `CODEML` script was used to analyse data (also 50 replicates) from the episodic analyses code, with the `model` entry replaced with `2`. The `scoup` simulated sequence data and tree are `seq.nex` and `seq.tre`, respectively. ```{bash, eval=FALSE} seqfile = seq.nex treefile = seq.tre outfile = seq.out noisy = 0 verbose = 0 seqtype = 1 ndata = 1 icode = 0 cleandata = 0 model = 0 NSsites = 0 CodonFreq = 3 estFreq = 0 clock = 0 fix_omega = 0 omega = 1e-05 ``` \vspace*{1cm} `FUBAR` command executed with `HyPhy` through the terminal in MacBook. Note that `HyPhy` was already installed on the computer. The `seq.nex` input is the `scoup` simulated codon sequence data that is saved in the same NEXUS file with the tree data. The NEXUS file resides in the working directory. ```{bash, eval=FALSE} hyphy fubar --code Universal --alignment seq.nex --tree seq.nex ``` # Session info The output of `sessionInfo()` from the computer where this file is generated is provided below. ```{r sessionInfo, echo=FALSE} sessionInfo() ```