%\VignetteIndexEntry{gaston package} %\VignetteDepends{gaston} %\VignettePackage{gaston} %\VignetteEngine{knitr::knitr} \documentclass{article} %\usepackage[noae]{Sweave} \usepackage[top=35mm, bottom=40mm, left=25mm , right=25mm]{geometry} %\usepackage{moreverb} \usepackage[utf8]{inputenc} \usepackage{amsfonts} \usepackage{amsmath} \usepackage{etoolbox} %\setkeys{Gin}{width=0.4\textwidth} %\SweaveOpts{echo=TRUE, eps=FALSE, pdf=TRUE} \raggedbottom \pagestyle{empty} \parindent0pt \parskip8pt \def\thesection{\arabic{section}} \def\theequation{\arabic{equation}} \let\epsilon\varepsilon %<>= %options(continue=" ", prompt = " ", SweaveHooks=list(fig.mar=function() par(mar=c(5.1,4.1,3.1,2.1))), width=90) %@ <>= require(knitr) options(width = 90, prompt="> ") knit_hooks$set(fig.mar=function() par(mar=c(5.1,4.1,3.1,2.1))) opts_chunk$set(out.width='0.4\\textwidth', fig.align='center', highlight=TRUE, comment=NA, fig.height=6, fig.width=6) opts_knit$set(unnamed.chunk.label='gaston') @ %<>= %options(prompt="> ", continue = " "); %@ <>= opts_chunk$set(prompt=TRUE, continue = " "); @ %<>= %options(prompt=" ", continue=" "); %@ <>= opts_chunk$set(prompt=FALSE, continue=" "); @ <>= <> @ <>= require(gaston) desc <- packageDescription("gaston") @ \title{{\bfseries Gaston}\\ {\large Version \Sexpr{desc$Version}}} \author{Hervé Perdry, Claire Dandine-Roulland} %\DefineVerbatimEnvironment{Sinput}{Verbatim}{} %\DefineVerbatimEnvironment{Soutput}{Verbatim}{} %\DefineVerbatimEnvironment{Scode}{Verbatim}{} %\fvset{listparameters={\setlength{\topsep}{-0.5em}}} %\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \makeatletter \preto{\@verbatim}{\topsep=2pt \partopsep=-10pt } \preto{\alltt}{\topsep=-3pt \partopsep=-10pt } \makeatother \begin{document} \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Introduction} Gaston offers functions for efficient manipulation of large genotype (SNP) matrices, and state-of-the-art implementation of algorithms to fit Linear Mixed Models, that can be used to compute heritability estimates or to perform association tests. Thanks to the packages \verb!Rcpp!, \verb!RcppParallel!, \verb!RcppEigen!, Gaston functions are mainly written in C++. Many are multithreaded. For better performances, we recommend to build Gaston with \verb!clang++!. To do so, it is sufficient to create in your home directory a file \verb!~/.R/Makevars! containing \verb!CXX1X = clang++! or if you use a version of R $>=$ 3.4.0, \verb!CXX11 = clang++! In this vignette, we illustrate Gaston using the included data sets \verb!AGT!, \verb!LCT!, and \verb!TTN! (see the corresponding manual pages for a description). Gaston also includes some example files in the \verb!extdata! folder. Not all options of the functions are described here, but rather their basic usage. The reader is advised to look at the manual pages for details. Note that the package name is written \verb!gaston! when dealing with R commands, but Gaston (with a capital) in human language. \section{Modifying Gaston's behaviour with global options} \subsection{Number of threads} The number of threads used by multithreaded functions can be modified through \verb!RcppParallel! function \verb!setThreadOptions!. It is advised to try several values for the number of threads, as using too many threads might be counterproductive due to an important overhead. The default value set by \verb!RcppParallel! is generally too high. Some functions have a \verb!verbose! argument, which controls the function verbosity. To mute all functions at once you can use \verb!options(gaston.verbose = FALSE)!. \subsection{Basic statistics computations} Since version 1.4, the behaviour of all functions that output a matrix of genotypes (a bed.matrix, described in the next section) can be modified by setting \verb!options(gaston.auto.set.stats = FALSE)!. The effects of this option is described in section \ref{stats} below. Note that some examples in the manual pages might not work if you use this option. \subsection{Autosomes, gonosomes, mitochondria} Since version 1.4.7, some functions take into account wether a SNP is autosomal, X or Y linked, or mitochondrial. The ids of the corresponding chromosomes are defined by options \verb!gaston.autosomes!, \verb!gaston.chr.x!, \verb!gaston.chr.y!, \verb!gaston.chr.mt!. Default values are \verb!c(1:22, 25)!, \verb!23!, \verb!24! and \verb!26!, following to Plink convention (in this convention, \verb!25! corresponds to the pseudo-autosomal XY region: we chose to include it in the autosomes). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Genotype matrices} An S4 class for genotype matrices is defined, named \verb!bed.matrix!. Each row correspond to an individual, and each column to a SNP. \subsection{Reading bed.matrices from files} Bed.matrices be read from files using \verb!read.bed.matrix!. The function \verb!read.vcf! reads VCF files; it relies on the package \verb!WhopGenome!. Gaston includes example files that can be used for illustration: <<>>= x <- read.bed.matrix( system.file("extdata", "LCT.bed", package="gaston") ) x @ The folder \verb!extdata/! contains files \verb!LCT.bed!, \verb!LCT.rds!, \verb!LCT.bim! and \verb!LCT.fam!. The \verb!.bed!, \verb!.bim! and \verb!.fam! files follow the PLINK specifications. The \verb!.rds! file is a R data file; if it is present, the \verb!.bim! and \verb!.fam! files are ignored. You can ignore the \verb!.rds! file using option \verb!rds = NULL!: <<>>= x <- read.bed.matrix( system.file("extdata", "LCT.bed", package="gaston"), rds = NULL ) x @ A bed.matrix can be saved using \verb!write.bed.matrix!. The default behavior is to write \verb!.bed!, \verb!.bim!, \verb!.fam! and \verb!.rds! files; see the manual page for more details. \subsection{Conversion from and to R objects} A numerical matrix \verb!x! containing genotype counts (0, 1, 2 or \verb!NA!) can be transformed in a bed.matrix with \verb!as(x, "bed.matrix")!. The resulting object will lack individual and SNP informations (if the rownames and colnames of \verb!x! are set, they will be used as SNP and individual ids respectively). Conversely, a numerical matrix can be retrieved from a bed.matrix using \verb!as.matrix!. The function \verb!as.bed.matrix! allows to provide data frames corresponding to the \verb!.fam! and \verb!.bim! files. They should have colnames \verb!famid!, \verb!id!, \verb!father!, \verb!mother!, \verb!sex!, \verb!pheno!, and \verb!chr!, \verb!id!, \verb!dist!, \verb!pos!, \verb!A1!, \verb!A2! respectively. This function is widely used in the examples included in manual pages. <<>>= data(TTN) x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim) x @ \subsection{The insides of a bed.matrix} In first approach, a bed.matrix behaves as a "read-only" matrix containing only 0, 1, 2 and NAs, unless the genotypes are standardized (use \verb!standardize<-!). They are stored in a compact form, each genotype being coded on 2 bits (hence 4 genotypes per byte). Bed.matrices are implemented using S4 classes and methods. Let's have a look on the slots names of the bed.matrix \verb!x! created above using the dataset \verb!LCT!. <<>>= data(TTN) x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim) slotNames(x) @ The slot \verb!x@bed! is an external pointer, which indicates where the genetic data are stored in memory. It will be used by the C++ functions called by Gaston. <<>>= x@bed @ Let's look at the contents of the slots \verb!x@ped! and \verb!x@snps!. The other slots will be commented later. The slot \verb!x@ped! gives informations on the individuals. The first 6 columns correspond to the contents of a \verb!.fam! file, or to the first 6 columns of a \verb!.ped! file (known as linkage format). The other columns are simple descriptive stats that are computed by Gaston, unless \verb!options(gaston.auto.set.stats = FALSE)! was set (see below section \ref{stats}). <<>>= dim(x@ped) head(x@ped) @ The slot \verb!x@snps! gives informations on the SNPs. Its first 6 columns corresponds to the contents of a \verb!.bim! file. The other columns are simple descriptive stats that are computed by Gaston. Again, if the global option \verb!options(gaston.auto.set.stats = FALSE)! was set (see below), these columns will be absent (see below section \ref{stats}). \pagebreak <<>>= dim(x@snps) head(x@snps) @ \textbf{Note:} We refer to the allele \verb!A1! as the \textit{reference allele} and to \verb!A2! as the \textit{alternative allele}. The choice of the reference allele is arbitrary, it doesn't to be e.g. the major allele. If the matrix is produced by \verb!read.bed.matrix!, Gaston will keep them in the order they appear in the \verb!.bim! file, their won"t be any reordering. \subsection{Basic statistics included in a bed.matrix}\label{stats} Some simple descriptive statistics can be added to a bed.matrix with \verb!set.stats!. Since version 1.4 of gaston, this function is called by default by all functions that create a bed.matrix, unless the global option \verb!options(gaston.auto.set.stats = FALSE)! was set. This option can be used to gain some time in some particular cases. We illustrate here the effect of this option: <<>>= options(gaston.auto.set.stats = FALSE) data(TTN) x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim) head(x@ped) head(x@snps) @ The reader is invited to compare with what we obtained in the previous section of this document. The function \verb!set.stats! can be called to add the missing descriptive statistics to the \verb!ped! and \verb!snps! slots: <<>>= x <- set.stats(x) head(x@ped) @ The columns \verb!N0!, \verb!N1!, \verb!N2! and \verb!NAs! give for each individual the number of autosomal SNPs with a genotype equal to 0, 1, 2 (corresponding to \verb!A1A1!, \verb!A1A2! and \verb!A2A2!) and missing, respectively. The homologous columns with names ending with \verb!.x!, \verb!.y! and \verb!.mt! give the same for SNPs on the X, Y, and mitochondria (this is why you get only 0s here). The columns \verb!callrate!, \verb!callrate.x!, \verb!callrate.y!, \verb!callrate.mt! give the individual callrate for autosomal, X, Y, mitochondrial SNPs, and similarly, \verb!hz!, \verb!hz.x!, \verb!hz.y!, \verb!hz.mt! give the individual heterozygosity. <<>>= head(x@snps) @ The columns \verb!N0!, \verb!N1!, \verb!N2! and \verb!NAs! contain the number of genotypes 0, 1, 2 and \verb!NA! for each SNP. For X-linked SNPs only, the homologous columns with names ending with \verb!.f! give the same, taking only women into account. The callrate is the proportion of non-missing genotypes. In the \verb!snps! slot, \verb!maf! is the minor allele frequency, and \verb!hz! it the heterozygosity rate. Note that the callrate for Y linked SNPs taking only men into account, while the heterozygosity rate for X linked SNPs is computed only on women. The default is too also also update the slots \verb!x@p!, \verb!x@mu! and \verb!x@sigma!: <<>>= str(x@p) str(x@mu) str(x@sigma) @ \verb!p! contains the alternative allele (\verb!A2!) frequency; \verb!mu! is equal to \verb!2*p! and is the expected value of the genotype (coded in 0, 1, 2); and \verb!sigma! is the genotype standard error. If the Hardy-Weinberg equilibrium holds, \verb!sigma! should be close to \verb!sqrt(2*p(1-p))!. This is illustrated on the figure below. %\begin{center} %\setkeys{Gin}{width=0.7\textwidth} <>= plot(x@p, x@sigma, xlim=c(0,1)) t <- seq(0,1,length=101); lines(t, sqrt(2*t*(1-t)), col="red") @ %\end{center} There are also functions \verb!set.stats.snps! and \verb!set.stats.ped! to update only \verb!x@snps! and \verb!x@ped!. The option \verb!gaston.auto.set.stats! can be useful when extracting individuals from bed.matrices (seen next section): if \verb!gaston.auto.set.stats = FALSE!, the slots \verb!@p!, \verb!@mu! and \verb!@sigma! are left unaltered; in contrast, the default behavior is to call \verb!set.stats! which recomputes the values of these slots. Note that when using \verb!set.stats! ``manually'', one can use options to alter its behavior, allowing to update only some slots. The following code illustrates this (remember that here \verb!gaston.auto.set.stats = FALSE!) : <<>>= head(x@p) y <- x[1:30,] head(y@p) y <- set.stats(y, set.p = FALSE) head(y@p) y <- set.stats(y) head(y@p) @ Hardy-Weinberg Equilibrium can be tested for all SNPs simply by \verb!x <- set.hwe(x)!. This command adds to the data frame \verb!x@snps! a column \verb!hwe!, containing the $p$-value of the test (both $\chi^2$ and ``exact'' tests are available, see the man page). Note that when writting a bed.matrix \verb!x! to disk with \verb!write.bed.matrix!, the \verb!.rds! file contains all the slots of \verb!x!, and thus contains the additionnal variables added by \verb!set.stats! or \verb!set.hwe!, which are not saved in the \verb!.bim! and \verb!.fam! files. For the remaining of this vignette, we restore the default behaviour: <<>>= options(gaston.auto.set.stats = TRUE) @ \pagebreak \subsection{Subsetting bed.matrices} It is possible to subset bed.matrices just as normal matrices, writing e.g. \verb!x[1:100,]! to extract the first 100 individuals, or \verb!x[1:100,10:19]! to extract the SNPs 10 to 19 for these 100 individuals: <<>>= x[1:100,] x[1:100,10:19] @ The use of logical vectors for subsetting is allowed too. The following code extracts the SNPs with minor allele frequency $> 0.1$: <<>>= x[,x@snps$maf > 0.1] @ For this kind of selection, Gaston provides the functions \verb!select.inds! and \verb!select.snps!, which have a nicer syntax. Hereafter we use the same condition on the minor allele frequency, and we introduce also a condition on the Hardy-Weinberg Equilibrium $p$-value: <<>>= x <- set.hwe(x) select.snps(x, maf > 0.1 & hwe > 1e-3) @ The conditions in \verb!select.snps! can use any of the variables defined in the data frame \verb!x@snps!, as well as variables defined in the user session. The function \verb!select.inds! is similar, using variables of the data frame \verb!x@ped!. One can for example select individuals with callrate greater than 95\% with \verb!select.inds(x, callrate > 0.5)!. These functions should make basic Quality Control easy. \noindent{\bfseries Note:}\ \ If \verb!options(gaston.auto.set.stats = FALSE)! was set, subsetting will erase some or all statistics added by \verb!set.stats!, except the \verb!p!, \verb!mu! and \verb!sigma! slots. \subsection{Genomic sex} When dealing with genome wide data, the function \verb!set.genomic.sex! can be use to add a \verb!x@ped$genomic.sex! column, determined by clustering on individuals X heterozygosity rate and Y callrate. \verb!x <- select.inds(x, sex == genomic.sex)! will then allow to keep only individuals whose sex and genomic sex are concordant. An other possibility is to use \verb!x@ped$sex <- x@ped$genomic.sex; x <- set.stats(x)! to erase the original sex indication and use genomic sex in subsequent analyses. \subsection{Merging bed.matrices} Gaston defines methods \verb!rbind! and \verb!cbind! which can be used to merge several bed.matrices. \verb!cbind! checks that individuals famids and ids are identical ; all remaining columns are taken from the first argument, without further control. <<>>= data(AGT) x1 <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim) x1 data(LCT) x2 <- as.bed.matrix(LCT.gen, LCT.fam, LCT.bim) x2 x <- cbind(x1,x2) x @ \verb!rbind! similarly checks whether the SNP ids are identical, and also if the reference alleles are identical. If needed, it will perform reference allele inversions (changing a SNP A/G to G/A) or allele strand flips (changing A/G to T/C) or both. <<>>= x3 <- x[1:10, ] x4 <- x[-(1:10), ] rbind(x3,x4) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Standardized matrices} Gaston allows to ``standardize'' a genotype matrix, replacing each genotype $X_{ij}$ ($i$ is the individual index, $j$ is the SNP index) by \begin{equation} X_{ij} - \mu_j \over \sigma_j \end{equation} where $\mu_j = 2 p_j$ is the mean of the 0,1,2-coded genotype ($p_j$ being the alternative allele frequency), and $\sigma_j$ is either its standard error, or the expected standard error under Hardy-Weinberg Equilibrium, that is $\sqrt{2p_j(1-p_j)}$. Consider the TTN data set. The unscaled matrix contains 0,1,2 values: <<>>= x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim) X <- as.matrix(x) X[1:5,1:4] @ To scale it using the standard error, use \verb!standardize(x) <- "mu_sigma"! (or \verb!standardize(x) <- "mu"!, as the function uses \verb!match.arg!): <<>>= standardize(x) <- "mu" as.matrix(x)[1:5, 1:4] @ The result is similar to what would be obtained from the base function \verb!scale!: <<>>= scale(X)[1:5,1:4] @ To use $\sqrt{2p_j(1-p_j)}$ instead of the standard error, use \verb!standardize(x) <- "p"!; a similar result can again be obtained from \verb!scale!, with the right options: <<>>= standardize(x) <- "p" as.matrix(x)[1:5, 1:4] scale(X, scale = sqrt(2*x@p*(1-x@p)))[1:5,1:4] @ To go back to the 0,1,2-coded genotypes, use \verb!standardize(x) <- "none"!. \noindent{\bfseries Note:}\ \ In standardized matrices, the \verb!NA! values are replaced by zeroes, which amount to impute the missing genotypes by the mean genotype. \subsection{Matrix multiplication} Standardized matrices can be multiplied with numeric vectors or matrices with the operator \verb!%*%!. This feature can be used e.g. to simulate quantitative phenotypes with a genetic component. The following code simulates a phenotype with an effect of the SNP \#350: <<>>= y <- x %*% c(rep(0,350),0.25,rep(0,ncol(x)-351)) + rnorm(nrow(x), sd = 1) @ \pagebreak \subsection{Genetic Relationship Matrix and Principal Component Analysis} If $X_s$ is a standardized $n\times q$ genotype matrix, a Genetic Relationship Matrix (GRM) of the individuals can be computed as \begin{equation*} GRM = {1\over q-1} X_sX_s' \end{equation*} where $q$ is the number of SNPs. This computation is done by the function \verb!GRM!. \textbf{NOTE:} Since version 1.4.7, \verb!GRM! has an argument \verb!which.snps! which allows to give a logical vector specifying which SNPs to use in the computation. The default is to use only autosomal SNPs (the above formula makes little to no sense for X linked SNPs, unless there are only women in the sample). The Principal Component Analysis (PCA) of large genomic data sets is used to retrieve population stratification. It can be obtained from the eigen decomposition of the GRM. To illustrate this, we included in the \verb!extdata! folder a dataset extracted from the 1000 genomes project. This data set comprehend the 503 individuals of european ascent, with 10025 SNPs on the chromosome 2. These SNPs have been selected with the function \verb!LD.thin! so that they have very low LD with each other ($r^2 < 0.05$). We added in the data frame \verb!x@ped! a variable \verb!population! which is a factor with levels \verb!CEU!, \verb!FIN!, \verb!GBR!, \verb!IBS! and \verb!TSI!. We can load this data set as follows: <<>>= x <- read.bed.matrix( system.file("extdata", "chr2.bed", package="gaston") ) x head(x@ped) table(x@ped$population) @ We compute the Genetic Relationship Matrix, and its eigen decomposition (we don't need to standardize \verb!x! explicitely: is \verb!x! isn't standardized, \verb!GRM! will use \verb!standardize(x) = "p"!): <<>>= K <- GRM(x) eiK <- eigen(K) # deal with a small negative eigen value eiK$values[ eiK$values < 0 ] <- 0 @ The eigenvectors are normalized. The Principal Components (PC) can be computed by multiplying them by the square root of the associated eigenvalues. Here we plot the projection of the 503 individuals on the first two PCs, with colors corresponding to their population. %\begin{center} <>= PC <- sweep(eiK$vectors, 2, sqrt(eiK$values), "*") plot(PC[,1], PC[,2], col=x@ped$population) legend("bottomleft", pch = 1, legend = levels(x@ped$population), col = 1:5) @ %\end{center} As $K$ can be written \begin{equation*} K = \left( {1 \over \sqrt{q-1}} X_s \right) \left( {1 \over \sqrt{q-1}} X_s \right)', \end{equation*} the PCs are the left singular vectors of ${1 \over \sqrt{q-1}} X_s$. The vectors of loadings are the right singlar vectors of this matrix. The vector of loadings corresponding to a PC $u$ is the vector $v$ with unit norm, such that $u = {1 \over \sqrt{q-1}} X_s v$. They can be retrieved with the function \verb!bed.loadings!: <<>>= # one can use PC[,1:2] instead of eiK$vectors[,1:2] as well L <- bed.loadings(x, eiK$vectors[,1:2]) dim(L) head(L) @ We verify that the loadings have unit norm: <<>>= colSums(L**2) @ And that the PCs are retrieved by right multiplying $X_s$ by the loadings (here we need to explicitely standardize \verb!x!): <<>>= standardize(x) <- 'p' head( (x %*% L) / sqrt(ncol(x)-1) ) head( PC[,1:2] ) @ \vfill\eject \subsection{Linkage Disequilibrium} Doing the crossproduct in the reverse order produces a moment estimate of the Linkage Disequilibrium (LD): \begin{equation*} LD = {1\over n-1} X_s'X_s \end{equation*} where $n$ is the number of individuals. This computation is done by the function \verb!LD! (usually, only parts of the whole LD matrix is computed). The fonction \verb!LD! can compute a square symmetric matrix giving the LD for a given region, measured by $r^2$ (the default), $r$ or $D$. It can also compute the LD between two different regions. \begin{center} \setkeys{Gin}{width=\textwidth} <<>>= data(AGT) x <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim) ld.x <- LD(x, c(1,ncol(x))) LD.plot(ld.x, snp.positions = x@snps$pos, write.ld = NULL, max.dist = 20e3, write.snp.id = FALSE, draw.chr = FALSE, pdf = "LD_AGT.pdf") @ \includegraphics[width=\textwidth]{LD_AGT.pdf} \end{center} This method is also used by \verb!LD.thin! to extract a set of SNPs in low linkage disequilibrium (it is often recommended to perform this operation before computing the GRM). We illustrate this function on the AGT data set: <<>>= y <- LD.thin(x, threshold = 0.4, max.dist = 500e3) y @ The argument \verb!max.dist = 500e3! is to specify that the LD won't be computed for SNPs more than 500 kb appart. We verify that there is no SNP pair with LD $r^2 > 0.4$ (note that the LD matrix has ones on the diagonal): <<>>= ld.y <- LD( y, lim = c(1, ncol(y)) ) sum( ld.y > 0.4 ) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \vfill\eject \section{Linear Mixed Models} \def\R{\mathbb{R}} \def\N{\mathcal{N}} Linear Mixed Models are usually written under the form \begin{equation*} Y = X\beta + Z_1 u_1 + \cdots + Z_k u_k + \varepsilon \end{equation*} where $Y \in\R^{n}$ is the response vector, and $X\in \R^{n\times p}, Z_1 \in \R^{n\times q_1}, \dots, Z_k \in\R^{n\times q_k}$ are design matrices. The vector $\beta\in\R^p$ is the vector of fixed effects; the random effects are drawn in normal distributions, $u_1\sim \N(0, \tau_1 I_{q_1}), \dots, u_k\sim \N(0, \tau_k I_{q_k})$, as the residual error $\epsilon\sim\N(0,\sigma^2 I_n)$. Here we will use the equivalent form \begin{equation*} Y = X\beta + \omega_1 + \ldots + \omega_k + \varepsilon \end{equation*} where $K_i = Z_i Z_i'$ $(i = 1, \dots, k)$, the random terms are $\omega_i \sim N(0,\tau_i K_i)$ for $i \in 1, \dots,k$ and $\varepsilon \sim N(0,\sigma^2 I_n)$. Note that using the R function \verb!model.matrix! can help you to rewrite a model under this form. Gaston provides two functions for estimating the model parameters when the model is written in the second form. We are going to illustrate these functions on simulated data. \subsection{Data simulation} We will use the above notations. First generate some (random) design matrices: <<>>= set.seed(1) n <- 100 q1 <- 20 Z1 <- matrix( rnorm(n*q1), nrow = n ) X <- cbind(1, 5*runif(n)) @ Then the vector of random effects $u_1$ and a vector $y$ under the model $Y = X (1,2)' + Z u_1 + \varepsilon$ with $u_1 \sim \N(0, 2 I_{q_1})$ and $\varepsilon \sim\N(0, 3 I_n)$: <<>>= u1 <- rnorm(q1, sd = sqrt(2)) y <- X %*% c(1,2) + Z1 %*% u1 + rnorm(n, sd = sqrt(3)) @ To illustrate the case where there are several random effects vectors, we generate an other matrix $Z_2$, the corresponding vector of random effects $u_2$, and a vector $y_2$ under the model $Y = X (1,2)' + Z u_1 + Z_2 u_2 + \varepsilon$. <<>>= q2 <- 10 Z2 <- matrix( rnorm(n*q2), nrow = n ) u2 <- rnorm(q2, sd = 1) y2 <- X %*% c(1,2) + Z1 %*% u1 + Z2 %*% u2 + rnorm(n, sd = sqrt(3)) @ \subsection{Model fitting with the AIREML algorithm} \verb!lmm.aireml! is a function for linear mixed models parameter estimation and BLUP computations. \subsubsection{One random effects vector} Let's start with the simple case (only one random effects vector). As \verb!lmm.aireml! uses the second form of the linear mixed model, and we give it the matrix $K_1 = Z_1 Z_1'$. <<>>= K1 <- tcrossprod(Z1) fit <- lmm.aireml(y, X, K = K1, verbose = FALSE) str(fit) @ The result is a list giving all kind of information. Here we see that $\tau$ is estimated to $\Sexpr{round(fit[["tau"]],2)}$ and $\sigma^2$ to $\Sexpr{round(fit[["sigma2"]],2)}$. The Best Linear Unbiased Predictor (BLUP) for $\beta$ is in the component \verb!BLUP_beta! and here it is $(\Sexpr{round(fit[["BLUP_beta"]][1],2)}, \Sexpr{round(fit[["BLUP_beta"]][2],2)})$. The component \verb!BLUP_omega! holds the BLUP for $\omega_1 = Z u_1$. The BLUP for $u_1$ can be retrieved by the formula $\widehat{u_1} = \tau_1 Z_1' P y$, the value of $Py$ being in the component \verb!Py!. The plots below compare the true values of $\omega_1$ and $u_1$ with their BLUPs. %\setkeys{Gin}{width=0.7\textwidth} %\begin{center} <>= par(mfrow = c(1, 2)) plot(Z1 %*% u1, fit$BLUP_omega); abline(0, 1, lty = 2, col = 3) BLUP_u1 <- fit$tau * t(Z1) %*% fit$Py plot(u1, BLUP_u1); abline(0, 1, lty = 2, col = 3) @ %\end{center} \subsubsection{Several random effects vectors} It is sufficient to give to \verb!lmm.aireml! a list with the two matrices $K_1$ and $K_2$: <<>>= K2 <- tcrossprod(Z2) fit2 <- lmm.aireml(y2, X, K = list(K1, K2), verbose = FALSE) str(fit2) @ The component \verb!tau! now holds the two values $\tau_1$ and $\tau_2$. Note that there is only one \verb!BLUP_omega! vector. It corresponds to the BLUP for $\omega_1 + \omega_2$. The BLUPs for each $\omega_i$ can be retrieved using $\widehat{\omega_i} = \tau_i K_i Py$: %\begin{center} <>= par(mfrow = c(1, 2)) omega1 <- fit2$tau[1] * K1 %*% fit2$Py omega2 <- fit2$tau[2] * K2 %*% fit2$Py plot(Z1 %*% u1, omega1); abline(0, 1, lty = 2, col = 3) plot(Z2 %*% u2, omega2); abline(0, 1, lty = 2, col = 3) @ %\end{center} The BLUPs for $u_1$ and $u_2$ can as above be retrieved using $\widehat{u_i} = \tau_i Z_i' P y$. \subsection{Model fitting with the diagonalization trick} In the case where there is only one vector of random effects ($k = 1)$, the diagonalization trick uses the eigen decomposition of $K_1$ to speed up the computation of the restricted likelihood, which allows to use a generic algorithm to maximize it. Of course, the eigen decomposition of $K_1$ needs to be computed beforehand, which induces an overhead. The trick relies on a transformation of the data which leads to a diagonal variance matrix for $Y$. The computation of a the restricted likelihood involves the computation of the inverse of this matrix, which is then easy to compute. Write the eigen decomposition of $K_1$ as $K_1 = U \Sigma^2 U'$ with $\Sigma^2$ a diagonal matrix of positive eigenvalues, and $U$ an orthogonal matrix. If we let $\tilde Y = U'Y$, $\tilde X = U'X$, $\tilde \omega_1 = U' \omega_1$, and $\tilde\varepsilon = U' \varepsilon$, we have $$ \tilde Y = \tilde X\beta + \tilde \omega_1 + \tilde \varepsilon $$ with $\tilde \omega_1 \sim N\left(0,\tau \Sigma^2\right)$ and $ \tilde\varepsilon \sim N(0,\sigma^2 I_n) $. As stated above, $\mathrm{var}\left(\tilde Y\right) = \tau \Sigma^2 + \sigma^2 I_n$ is diagonal. We fit the first model again: <<>>= eiK1 <- eigen(K1) fit.d <- lmm.diago(y, X, eiK1) str(fit.d) @ You can check that the result is similar to the result obtained earlier with \verb!lmm.aireml!. The likelihood computation with the diagonalization trick is fast enough to plot the likelihood: %\setkeys{Gin}{width=0.4\textwidth} %\begin{center} <>= TAU <- seq(0.5,2.5,length=50) S2 <- seq(2.5,4,length=50) lik <- lmm.diago.likelihood(tau = TAU, s2 = S2, Y = y, X = X, eigenK = eiK1) lik.contour(TAU, S2, lik, heat = TRUE, xlab = "tau", ylab = "sigma^2") @ %\end{center} \subsection{Genomic heritability estimation with Gaston} Heritability estimates based on Genetic Relationship Matrices (GRM) are computed under the following model: \begin{equation*} Y = \beta + \omega + \epsilon \end{equation*} where $\omega \sim \N(0, \tau K)$ and $\epsilon\sim \N(0, \sigma^2 I_n)$, $K$ being a GRM computed on whole genome data (e.g. by the function \verb!GRM!). The heritability is $h^2 = {\tau \over \tau + \sigma^2}$. Note that as $K = {1\over q-1} X_sX_s'$ with $X_s$ a standardized genotype matrix, letting $Z = (q-1)^{-{1 \over 2}} X_s$, this model is equivalent to \begin{equation*} Y = \beta + Zu + \epsilon \end{equation*} with $u \sim \N(0,\tau I_q)$. The function \verb!random.pm! generates random positive matrices with eigenvalues roughly similar to those typically observed when using whole genome data. It outputs a list with a member \verb!K!: a positive matrix, and a member \verb!eigen!: its eigen decomposition (as the base function \verb!eigen! would output it). The function \verb!lmm.simu! can be used simulate data under the linear mixed model. It uses the eigen decomposition of $K$. Hereafter we use $\tau = 1$, $\sigma^2 = 2$, hence a $33.3\%$ heritability. <<>>= set.seed(1) n <- 2000 R <- random.pm(n) y <- 2 + lmm.simu(tau = 1, sigma2 = 2, eigenK = R$eigen)$y @ We can use \verb!lmm.diago! to estimates the parameters of the model. <<>>= fit <- lmm.diago(y, eigenK = R$eigen) @ <>= h2 <- fit$tau/(fit$tau + fit$sigma) @ We have $\widehat\tau = \Sexpr{round(fit[["tau"]],3)}$ and $\widehat{\sigma^2} = \Sexpr{round(fit[["sigma2"]],2)}$, hence $h^2$ is estimated to $\Sexpr{round(h2*100,1)}\%$. The function \verb!lmm.diago.likelihood! allows to compute a profile likelihood with parameter $h^2 = {\tau \over \tau + \sigma^2}$. It can be useful to get confidence intervals. Here, we simply plot it: %\begin{center} <>= H2 <- seq(0,1,length=51) lik <- lmm.diago.likelihood(h2 = H2, Y = y, eigenK = R$eigen) plot(H2, exp(lik$likelihood-max(lik$likelihood)), type="l", yaxt="n", ylab="likelihood") @ %\end{center} It is often advised to include the first few (10 or 20) Principal Components (PC) in the model as fixed effects, to account for a possible population stratification. The function \verb!lmm.diago! has an argument \verb!p! for the number of PCs to incoporate in the model with fixed effects. We simulate a trait with a large effect of the two first PCs: <<>>= PC <- sweep(R$eigen$vectors, 2, sqrt(R$eigen$values), "*") y1 <- 2 + PC[,1:2] %*% c(5,5) + lmm.simu(tau = 1, sigma2 = 2, eigenK = R$eigen)$y @ Here are the heritability estimates with $p = 0$ (the default) and $p = 10$. <<>>= fit0 <- lmm.diago(y1, eigenK = R$eigen) fit0$tau/(fit0$tau+fit0$sigma2) fit10 <- lmm.diago(y1, eigenK = R$eigen, p = 10) fit10$tau/(fit10$tau+fit10$sigma2) @ As expected, the estimate is inflated when no PCs are incorporated in the model. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \vfill\eject \section{Association tests} The function \verb!association.test! performs genome-wide association tests with either (generalized) linear models, or with generalized) linear (mixed) models, for quantitative or binary (0/1) traits. \subsection{Quantitative trait} When the function \verb!association.test! is called with parameter \verb!method = "lm"! it performs a test of association between SNPs and a quantitative trait $Y$ under the model \begin{equation*} Y = (X|PC)\alpha + G\beta + \varepsilon \end{equation*} where $X$ is the matrix of covariables with fixed effects (including a column of ones for the intercept), $G$ is the vector of genotypes at the SNP under consideratione. A few PCs can be included in the model with fixed effects to correct for population stratification (the parameter \verb!p! gives the number of PCs to include). The Wald test is performed to test for $\beta = 0$. When the function is called with parameter \verb!method = "lmm"! the following mixed model is used instead: \begin{equation*} Y = (X|PC)\alpha + G\beta + \omega + \varepsilon \end{equation*} where $X$ and $G$ are as above, and $\omega\sim\N(0,\tau K)$ where $K$ is a Genetic Relationship Matrix (computed on the whole genome). Three tests are proposed for $\beta = 0$: a score test, a Wald test on $\widehat\beta$, or a Likelihood Ratio Test. We illustrate this function on a simple example, built on the AGT data set: <<>>= data(AGT) x <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim) standardize(x) <- 'p' @ As the whole genome is not available, we generate a random positive matrix to play the role of the GRM: <<>>= set.seed(27) R <- random.pm(nrow(x)) @ And we simulate a phenotype with an effect of the SNP \#351, and a polygenic component: <<>>= y <- 2 + x %*% c(rep(0,350),0.35,rep(0,ncol(x)-351)) + lmm.simu(tau = 0.3, sigma2 = 1, eigenK=R$eigen)$y @ The following code performs the association test with a score test and a wald test, and compares the resulting $p$-values: %\begin{center} <>= t_wald <- association.test(x, y, K = R$K, method = "lm", test = "wald") t_wald_mixed <- association.test(x, y, eigenK = R$eigen, method = "lmm", test = "wald") plot( t_wald$p, t_wald_mixed$p, log = "xy", xlab = "lm (wald)", ylab = "lmm (wald)") abline(0,1,lty=3) @ %\end{center} We can display the results under the form of a (mini) Manhattan plot: %\begin{center} %\setkeys{Gin}{width=0.7\textwidth} <>= manhattan(t_wald_mixed, col = ifelse(1:ncol(x) == 351, "red", "black")) @ %\end{center} We colored the point corresponding to the SNP \#351 in red. Note that when used on genome-wide data, the function \verb!manhattan! will draw a classical Manhattan plot, alternating colors between chromosomes. There are other SNPs with low association $p$-values: these are the SNPs in LD with SNP \#351. We can confirm this by plotting the $p$-values (again on log scale) as a function of the LD (measured by $r^2$): %\begin{center} <>= lds <- LD(x, 351, c(1,ncol(x))) plot(lds, -log10(t_wald_mixed$p), xlab="r^2", ylab="-log(p)") @ %\end{center} \subsection{Binary phenotype} We use the quantitative phenotype previously generated to construct a binary phenotype, and we perform the association test with a logistic model (Wald test) or logistic mixed model (score test). %\begin{center} %\setkeys{Gin}{width=0.7\textwidth} <>= y1 <- as.numeric(y > mean(y)) t_binary <- association.test(x, y1, K = R$K, method = "lm", response = "binary", test="wald") t_binary_mixed <- association.test(x, y1, K = R$K, method = "lmm", response = "binary") plot( t_binary$p, t_binary_mixed$p, log = "xy", xlab = "lm (wald)", ylab = "lmm (score)" ) abline(0,1,lty=3) @ %\end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \vfill\eject \section{What you can hope for in later releases} A few things that may be added to the package in the near future: \begin{itemize} \item Reading \verb!.ped! files \item Maximum Likelihood Linkage Disequilibrium estimation \item LMM models fitting written in equation form \item More functions and models for association testing \item Anything you asked the maintainer for \end{itemize} \end{document}