--- title: "Genotype Conditional Association Test Vignette" author: "Wei Hao, Minsun Song, Alejandro Ochoa, John D. Storey" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: gcatest.bib vignette: > %\VignetteIndexEntry{gcatest Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The `gcatest` package provides an implementation of the Genotype Conditional Association Test (GCAT) [@song_testing_2015]. GCAT is a test for genetic association that is powered by Logistic Factor Analysis (LFA) [@hao_probabilistic_2016]. LFA is a method of modeling population structure in a genome wide association study. GCAT performs a test for association between each SNP and a trait (either quantitative or binary). We have shown that GCAT is robust to confounding from population structure. # Sample usage We include a sample dataset with the package. `sim_geno` is a simulated genotype matrix. `sim_trait` is a simulated trait. There are 10,000 SNPs and 1,000 individuals. The first five SNPs are associated with the trait. This simulations were done under the Pritchard-Stephens-Donnelly model with $K=3$, with Dirichlet parameter $\alpha=0.1$ and variance allotment in the trait corresponding to $5\%$ genetic, $5\%$ environmental, and $90\%$ noise. This dataset is simulated under identical parameters as the PSD simulation in Figure 2 of the paper [@song_testing_2015], except that we have adjusted the size of the simulation to be appropriate for a small demo. ```{R, warning=FALSE} library(lfa) library(gcatest) dim(sim_geno) length(sim_trait) ``` ## `gcat` The first step of `gcat` is to estimate the logistic factors: ```{R} LF <- lfa(sim_geno, 3) dim(LF) ``` Then, we call the `gcat` function, which returns a vector of p-values: ```{R} gcat_p <- gcat(sim_geno, LF, sim_trait) ``` We can look at the p-values for the associated SNPs: ```{R} gcat_p[1:5] ``` And also plot the histogram of the unassociated SNPs: ```{R, fig.height = 3, fig.align = 'center'} library(ggplot2) dat <- data.frame(p = gcat_p[6:10000]) ggplot(dat, aes(p, after_stat(density))) + geom_histogram(binwidth=1/20) + theme_bw() ``` # Data Input The `genio` package provides the function `read_plink` for parsing PLINK binary genotypes (extension: `.bed`) into an R object of the format needed for the `gcat` function. A `BEDMatrix` object (from the eponymous function and package) is also supported, and can result in reduced memory usage (at a small runtime penalty). # References