\documentclass[10pt]{article} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{gcat Package} \usepackage{fullpage} \usepackage{hyperref} \title{Genotype Conditional Association Test Vignette} \author{Wei Hao, Minsun Song, John D. Storey} \date{\today} \begin{document} \maketitle \section{Introduction} The \texttt{gcatest} package provides an implementation of the Genotype Conditional Association Test (GCAT)~\cite{song_testing_2015}. GCAT is a test for genetic association that is powered by Logistic Factor Analysis (LFA)~\cite{hao_probabilistic_2016}. LFA is a method of modeling population structure in a genome wide association study. GCAT performs a test for association between each SNP and a trait (either quantitative or binary). We have shown that GCAT is robust to confounding from population structure. \section{Sample usage} We include a sample dataset with the package. \texttt{sim\_geno} is a simulated genotype matrix. \texttt{sim\_trait} is a simulated trait. There are 10,000 SNPs and 1,000 individuals. The first five SNPs are associated with the trait. This simulations were done under the Pritchard-Stephens-Donnelly model with $K=3$, with Dirichlet parameter $\alpha=0.1$ and variance allotment in the trait corresponding to $5\%$ genetic, $5\%$ environmental, and $90\%$ noise. This dataset is simulated under identical parameters as the PSD simulation in Figure 2 of the paper~\cite{song_testing_2015}, except that we have adjusted the size of the simulation to be appropriate for a small demo. <>= library(lfa) library(gcatest) dim(sim_geno) length(sim_trait) @ \subsection{\texttt{gcat}} The first step of \texttt{gcat} is to estimate the logistic factors: <>= LF <- lfa(sim_geno, 3) dim(LF) @ Then, we call the \texttt{gcat} function, which returns a vector of p-values: <>= gcat_p <- gcat(sim_geno, LF, sim_trait) @ We can look at the p-values for the associated SNPs: <>= gcat_p[1:5] @ And also plot the histogram of the unassociated SNPs: <>= library(ggplot2) dat <- data.frame(p = gcat_p[6:10000]) ggplot(dat, aes(p, after_stat(density))) + geom_histogram(binwidth=1/20) + theme_bw() @ \section{Data Input} The \texttt{genio} package provides the function \texttt{read\_plink} for parsing PLINK binary genotypes (extension: \texttt{.bed}) into an R object of the format needed for the \texttt{gcat} function. A \texttt{BEDMatrix} object (from the eponymous function and package) is also supported, and can result in reduced memory usage (at a small runtime penalty). \bibliographystyle{plain} \bibliography{gcatest} \end{document}