library(BG2)
The BG2
package provides statistical tools for the analysis of Poisson and Binary GWAS data. Currently, BG2
contains functions to perform BG2 which is a novel two step Bayesian procedure that, when compared to single marker association testing (SMA), drastically reduces the rate of false discoveries while maintaining the same level of recall of true causal SNPs. Full details on the BG2 procedure can be found in the manuscript (Xu, Williams, and Ferreira, n.d.). The two packages in Bioconductor are GWASTools (Gogarten et al. 2012) and GWAS.BAYES (Williams, Ferreira, and Ji 2022). While GWASTools provides frequentist methods, BG2 provides Bayesian methods. While GWAS.Bayes provides methods for analysis of Gaussian data, BG2 provides methods for the analysis of non-Gaussian data.
This vignette explores two toy examples as well as a case study presented in the BG2 manuscript (Xu, Williams, and Ferreira, n.d.) to illustrate how the functions provided in BG2
perform the BG2 procedure. Data has been simulated under a generalized linear mixed model from 9,000 SNPs of 328 A. Thaliana ecotypes. The BG2
package includes as R
objects the simulated data; 9,000 SNPs, the simulated phenotypes (both binary and Poisson), and the kinship matrix used to simulate the data. Further, the Github repo that contains the BG2
package also contains contains the data for the A. Thaliana case study.
The function implemented in BG2
is described below:
BG2
Performs BG2, as described in the BG2 manuscript (Xu, Williams, and Ferreira, n.d.), using generalized linear mixed models for a given numeric phenotype vector either binary or assumed Poisson distributed Y
, a SNP matrix encoded numerically SNPs
, fixed covariates Fixed
, and random effects and their projection matrices (covariance
and Z
respectively). The BG2
function returns the indices of the SNP matrix that were identified in the best model found by the BG2 procedure.The model for GWAS analysis used in the BG2
package is
\[\begin{equation*} g(\textbf{y}) = X \boldsymbol{\beta} + X_f \boldsymbol{\beta}_f + Z_1 \boldsymbol{\alpha}_1 + \ldots + Z_l \boldsymbol{\alpha}_l \end{equation*}\]
where
Currently, BG2
can analyze binary responses (family = "bernoulli"
) and Poisson responses (family = "poisson"
). The BG2 manuscript (Xu, Williams, and Ferreira, n.d.) provides full details on the priors for \(\boldsymbol{\beta}\). BG2
utilizes spectral decomposition techniques similar to that of Kang et al. (2008) as well as population parameters previously determined (Zhang et al. 2010) to speed up computation.
The BG2
function requires a vector of observed phenotypes (either binary or assumed Poisson distributed), a matrix of SNPs, and the specification of the random effects. First, the vector of observed phenotypes must be a numeric vector or a numeric \(n \times 1\) matrix. BG2
does not allow the analysis of multiple phenotypes simultaneously. In the BG2
package, there are two simulated phenotype vectors. The first simulated phenotype vector comes from a Poisson generalized linear mixed model with both a kinship random effect and an overdispersion random effect. The data is assumed to have four replicates for each A. Thaliana ecotype. Here are the first five elements of the Poisson simulated vector of phenotypes:
data("Y_poisson")
Y_poisson[1:5]
#> [1] 74 10 215 1 0
The second simulated phenotype vector comes from a binary generalized linear mixed model with only a kinship random effect. The first five elements of the binary simulated vector of phenotypes are
data("Y_binary")
Y_binary[1:5]
#> [1] 0 0 1 1 0
Second, the SNP matrix has to contain numeric values where each column corresponds to a SNP of interest and the \(i\)th row corresponds to the \(i\)th observed phenotype. In this example, the SNPs are a subset of the A. Thaliana TAIR9 genotype dataset and all SNPs have minor allele frequency greater than 0.01. Each simulated phenotype vector is simulated using this SNP matrix. Here are the first five rows and five columns of the SNP matrix:
data("SNPs")
SNPs[1:5,1:5]
#> SNP2555 SNP2556 SNP2557 SNP2558 SNP2559
#> [1,] 1 1 1 0 0
#> [2,] 0 1 1 1 1
#> [3,] 0 0 1 1 1
#> [4,] 1 1 0 0 1
#> [5,] 1 1 1 1 1
Third, the kinship matrix is an \(n \times n\) positive semi-definite matrix containing only numeric values. The \(i\)th row or \(i\)th column quantifies how observation \(i\) is related to other observations. Since, both simulated phenotypes are simulated from the same SNP matrix they also are simulated from the same kinship structure. The first five rows and five columns of the kinship matrix are
data("kinship")
kinship[1:5,1:5]
#> V1 V2 V3 V4 V5
#> [1,] 0.78515873 0.15800700 0.04264546 0.02057071 0.05643574
#> [2,] 0.15800700 0.78146476 0.05135891 0.01476357 0.05482448
#> [3,] 0.04264546 0.05135891 0.80199976 0.10558970 0.04888596
#> [4,] 0.02057071 0.01476357 0.10558970 0.80030413 0.02935703
#> [5,] 0.05643574 0.05482448 0.04888596 0.02935703 0.78401489
The function BG2
implements the BG2 method for generalized linear mixed models with either Poisson or Bernoulli distributed responses. This function takes inputs of the observed phenotypes, the SNPs coded numerically, the distributional family the response follows, fixed covariates treated as a matrix, the covariance matrices of the random effects, the design matrices of the random effects, the number of replicates of individuals or ecotypes you may have, and the choice of a fixed value or a prior for the dispersion parameter of the nonlocal prior. Further, the other inputs of BG2
are the FDR nominal level, the maximum number of iterations of the genetic algorithm in the model selection step, and the number of consecutive iterations of the genetic algorithm with the same best model for convergence. The full details of BG2 are available in the BG2 manuscript (Xu, Williams, and Ferreira, n.d.). The default values of maximum iterations and the number of iterations are the values used in the simulation study in the BG paper, that is, 4000 and 400 respectively.
Here we illustrate the use of BG2 with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The first random effect is assumed to be \(\boldsymbol{\alpha}_1 \sim N(0,\kappa_1 K)\), where \(K\) is the realized relationship matrix or kinship matrix. The second random effect is assumed to be \(\boldsymbol{\alpha}_1 \sim N(0,\kappa_2 I)\), where the covariance matrix is an identity matrix times a scalar. This second random effect is to account for overdispersion in the Poisson model. The Covariance
argument takes a list of random effect covariance matrices. For this example, the list of covariance matrices is set as:
n <- length(Y_poisson)
covariance <- list()
covariance[[1]] <- kinship
covariance[[2]] <- diag(1, nrow = n, ncol = n)
The design matrices \(Z_i\) do not need to be specified in Z
as the observations have no other structure such as a grouping structure. Z
is set to be NULL implying that \(Z_i = I_{n \times n}\). Further, the number of ecotype replications is 4. Finally, we let the dispersion parameter of the nonlocal prior have a uniform prior.
set.seed(1330)
output_poisson <- BG2(Y=Y_poisson, SNPs=SNPs, Fixed = NULL,
Covariance=covariance, Z=NULL, family="poisson",
replicates=4, Tau="uniform",FDR_Nominal = 0.05,
maxiterations = 4000, runs_til_stop = 400)
output_poisson
#> [1] 385 450 462 463 1261 1350 1410 2250 3150 4050 7630
BG2
outputs the column indices of the SNPs
matrix that are in best model or column indices of SNPs perfectly correlated to SNPs in the best model. The data was generated with causal SNPs at positions 450, 1,350, 2,250, 3,150, 4,050, 4,950, 5,850, 6,750, 7,650,and 8,550. BG2 identifies 5 causal SNPs.
Here we illustrate the use of BG2 with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The only random effect is assumed to be \(\boldsymbol{\alpha} \sim N(0,\kappa_1 K)\), where \(K\) is the realized relationship matrix or kinship matrix. For this example, the list of covariance matrices is set as:
covariance <- list()
covariance[[1]] <- kinship
In this example, the design matrices \(Z_i\) do not need to be specified in Z
as the observations have no other structure such as a grouping structure. Z
is set to be NULL implying that \(Z_i = I_{n \times n}\). With binary data, setting the number of replicates provides no computation gain and is not required. Finally, we let the dispersion parameter of the nonlocal prior have a Inverse Gamma distribution. Details on the Inverse Gamma Distribution are provide in the BG2 manuscript (Xu, Williams, and Ferreira, n.d.).
set.seed(1330)
output_binary <- BG2(Y=Y_binary, SNPs=SNPs, Fixed = NULL,
Covariance=covariance, Z=NULL, family="bernoulli",
replicates=NULL, Tau="IG",FDR_Nominal = 0.05,
maxiterations = 4000, runs_til_stop = 400)
output_binary
#> [1] 512 2250 7650 8550
Similarly to the Poisson example in Section 4.1.1, the data was generated with causal SNPs at positions 450, 1,350, 2,250, 3,150, 4,050, 4,950, 5,850, 6,750, 7,650,and 8,550. BG2 identifies 4 causal SNPs and no false SNPs.
sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BG2_1.0.0 BiocStyle_2.28.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.3 xfun_0.39 bslib_0.4.2
#> [4] ggplot2_3.4.2 recipes_1.0.6 lattice_0.21-8
#> [7] vctrs_0.6.2 tools_4.3.0 generics_0.1.3
#> [10] stats4_4.3.0 parallel_4.3.0 tibble_3.2.1
#> [13] fansi_1.0.4 ModelMetrics_1.2.2.2 pkgconfig_2.0.3
#> [16] Matrix_1.5-4 data.table_1.14.8 lifecycle_1.0.3
#> [19] compiler_4.3.0 stringr_1.5.0 munsell_0.5.0
#> [22] codetools_0.2-19 htmltools_0.5.5 class_7.3-21
#> [25] sass_0.4.5 yaml_2.3.7 prodlim_2023.03.31
#> [28] crayon_1.5.2 pillar_1.9.0 jquerylib_0.1.4
#> [31] MASS_7.3-59 cachem_1.0.7 gower_1.0.1
#> [34] iterators_1.0.14 rpart_4.1.19 foreach_1.5.2
#> [37] nlme_3.1-162 parallelly_1.35.0 lava_1.7.2.1
#> [40] tidyselect_1.2.0 digest_0.6.31 stringi_1.7.12
#> [43] future_1.32.0 dplyr_1.1.2 reshape2_1.4.4
#> [46] purrr_1.0.1 bookdown_0.33 listenv_0.9.0
#> [49] splines_4.3.0 fastmap_1.1.1 grid_4.3.0
#> [52] colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
#> [55] survival_3.5-5 utf8_1.2.3 future.apply_1.10.0
#> [58] withr_2.5.0 scales_1.2.1 lubridate_1.9.2
#> [61] timechange_0.2.0 rmarkdown_2.21 globals_0.16.2
#> [64] nnet_7.3-18 timeDate_4022.108 GA_3.2.3
#> [67] memoise_2.0.1 evaluate_0.20 knitr_1.42
#> [70] hardhat_1.3.0 caret_6.0-94 rlang_1.1.0
#> [73] Rcpp_1.0.10 glue_1.6.2 BiocManager_1.30.20
#> [76] pROC_1.18.0 ipred_0.9-14 jsonlite_1.8.4
#> [79] R6_2.5.1 plyr_1.8.8
Gogarten, Stephanie M, Tushar Bhangale, Matthew P Conomos, Cecelia A Laurie, Caitlin P McHugh, Ian Painter, Xiuwen Zheng, et al. 2012. “GWASTools: An R/Bioconductor Package for Quality Control and Analysis of Genome-Wide Association Studies.” Bioinformatics 28 (24): 3329–31.
Kang, Hyun Min, Noah A. Zaitlen, Claire M. Wade, Andrew Kirby, David Heckerman, Mark J. Daly, and Eleazar Eskin. 2008. “Efficient Control of Population Structure in Model Organism Association Mapping.” Genetics 178 (3): 1709–23. https://doi.org/10.1534/genetics.107.080101.
Williams, Jacob, Marco A. R. Ferreira, and Tieming Ji. 2022. “BICOSS: Bayesian iterative conditional stochastic search for GWAS.” BMC Bioinformatics 23 (475). https://doi.org/10.1186/s12859-022-05030-0.
Xu, Shuangshuang, Jacob Williams, and Marco A. R. Ferreira. n.d. “BG2: Bayesian Variable Selection in Generalized Linear Mixed Models with Non-Local Priors for Non-Gaussian Gwas Data.” Bioinformatics Submitted.
Zhang, Zhiwu, Elhan Ersoz, Chao-Qiang Lai, Rory J Todhunter, Hemant K Tiwari, Michael A Gore, Peter J Bradbury, et al. 2010. “Mixed Linear Model Approach Adapted for Genome-Wide Association Studies.” Nature Genetics 42 (4): 355–60.