%\VignetteIndexEntry{SKAT}
%\documentclass{article}
\documentclass[11pt]{article}
\setlength{\topmargin}{-0.25in}
\setlength{\oddsidemargin}{0.15in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{8.6in}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage[tableposition=top]{caption}
\usepackage{ifthen}
\usepackage[utf8]{inputenc}

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{SKAT Package}
\author{Seunggeun (Shawn) Lee}
\maketitle

\section{Overview}

The SKAT package has functions to 1) test for associations between SNP sets and continuous/binary phenotypes while adjusting for covariates and kinship, and 2) compute power/sample size for future studies.

\section{Association test}

An example dataset (SKAT.example) contains a genotype matrix (Z) of 2000 individuals and 67 SNPs, vectors of continuous (y.c) and binary (y.b) phenotypes, and a covariates matrix (X).

<<>>=
library(SKAT)
data(SKAT.example)
names(SKAT.example)

attach(SKAT.example)
@

To test for associations, the SKAT\_Null\_Model function should be run prior to SKAT to estimate parameters under the null model of no association.

<<>>=
# continuous trait
obj<-SKAT_Null_Model(y.c ~ X, out_type="C")
out.c<-SKAT(Z, obj)
out.c$p.value

# dichotomous trait
obj<-SKAT_Null_Model(y.b ~ X, out_type="D")
out.b<-SKAT(Z, obj)
out.b$p.value
@

The object returned by SKAT contains information such as the number of markers in Z and the number of markers used for the test. Version 2.1.0 adds test.snp.mac, which contains the minor allele count (MAC) of each marker used in the test.

<<>>=
out.c$param
out.c$test.snp.mac
@

When the trait is binary and the sample size is small, SKAT can produce conservative results. We developed a moment matching adjustment (MA) that adjusts the asymptotic null distribution by estimating the empirical variance and kurtosis. By default, SKAT conducts the MA adjustment when the sample size is $< 2000$. In the following code, we use only 200 samples to run SKAT.

<<>>=
IDX<-c(1:100,1001:1100)

# With-adjustment
obj.s<-SKAT_Null_Model(y.b[IDX] ~ X[IDX,],out_type="D")
SKAT(Z[IDX,], obj.s, kernel = "linear.weighted")$p.value
@

If you do not want to use the adjustment, set Adjustment=FALSE in the SKAT\_Null\_Model function.

<<>>=
# Without-adjustment
obj.s<-SKAT_Null_Model(y.b[IDX] ~ X[IDX,],out_type="D", Adjustment=FALSE)
SKAT(Z[IDX,], obj.s, kernel = "linear.weighted")$p.value
@

Resampling-based approaches to adjust for binary traits have been developed and implemented in the SKATBinary function. When you use the SKATBinary function, Adjustment=TRUE in SKAT\_Null\_Model is not necessary. The implemented methods are 1) efficient resampling (ER); 2) ER with adaptive resampling (ER.A); 3) quantile adjusted moment matching (QA); 4) moment matching adjustment (MA); 5) no adjustment (UA); and 6) Hybrid. ``Hybrid'' (the default method) selects a method based on the total minor allele count (MAC), the number of individuals with minor alleles (m), and the degree of case-control imbalance. A detailed description of these methods can be found in the following reference: \\

Lee, S., Fuchsberger, C., Kim, S., Scott, L. (2016) An efficient resampling method for calibrating single and gene-based rare variant association analysis in case-control studies. \textit{Biostatistics}, 17(1): 1-15.

<<>>=
# default hybrid approach
out<-SKATBinary(Z[IDX,], obj.s, kernel = "linear.weighted")
out$p.value
@
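You can also select the adjustment method explicitly rather than relying on the hybrid default. The following minimal sketch (not part of the original examples) assumes the method.bin parameter of SKATBinary, which accepts the method labels listed above; see ?SKATBinary.

<<>>=
# explicitly select the binary-trait adjustment method
# (assumes the method.bin argument of SKATBinary)
SKATBinary(Z[IDX,], obj.s, method.bin="ER")$p.value
SKATBinary(Z[IDX,], obj.s, method.bin="QA")$p.value
@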
We have recently developed a more scalable and accurate method for binary traits, which is implemented in the SKATBinary\_Robust function. A detailed description of this method can be found in the following reference: \\

Zhao, Z., Bi, W., Zhou, W., VanderHaar, P., Fritsche, L.G., Lee, S. (2020) UK Biobank Whole-Exome Sequence Binary Phenome Analysis with Robust Region-based Rare-Variant Test. \textit{AJHG}, 106: 3-12. doi:10.1016/j.ajhg.2019.11.012

<<>>=
# Robust approach
out<-SKATBinary_Robust(Z[IDX,], obj.s, kernel = "linear.weighted")
out$p.value
@

\subsection{Assign weights for each SNP}

It is assumed that rarer variants are more likely to be causal variants with large effect sizes. To incorporate this assumption, the linear weighted kernel uses a weighting scheme and is formulated as $ZWWZ'$, where $Z$ is a genotype matrix, and $W = diag \{ w_1, \ldots, w_m \}$ is a weight matrix. In the previous examples, we used the default beta(1,25) weight, $w_i = dbeta(p_i, 1, 25)$, where $dbeta$ is the beta density function, and $p_i$ is the minor allele frequency (MAF) of SNP $i$. Different parameters for the beta weight can be used by changing weights.beta. For example, weights.beta=c(0.5,0.5) will use the Madsen and Browning weight.

<<>>=
SKAT(Z, obj, kernel = "linear.weighted", weights.beta=c(0.5,0.5))$p.value
@

You can use your own weight vector via the weights parameter. For the logistic weight, we provide a function to generate the weight.

<<>>=
# Shape of the logistic weight
MAF<-1:1000/1000
W<-Get_Logistic_Weights_MAF(MAF, par1=0.07, par2=150)
par(mfrow=c(1,2))
plot(MAF,W,xlab="MAF",ylab="Weights",type="l")
plot(MAF[1:100],W[1:100],xlab="MAF",ylab="Weights",type="l")
par(mfrow=c(1,1))   # reset plotting layout

# Use logistic weight
weights<-Get_Logistic_Weights(Z, par1=0.07, par2=150)
SKAT(Z, obj, kernel = "linear.weighted", weights=weights)$p.value
@

\subsection{SKAT-O: Combined Test of burden test and SKAT}

A test statistic of the combined test is
$$Q_{\rho} = (1-\rho) Q_S + \rho Q_B,$$
where $Q_S$ is the test statistic of SKAT, and $Q_B$ is the score test statistic of the burden test. The $\rho$ value can be specified with the r.corr parameter (default: r.corr=0).

<<>>=
#rho=0, SKAT
SKAT(Z, obj, r.corr=0)$p.value

#rho=0.9
SKAT(Z, obj, r.corr=0.9)$p.value

#rho=1, Burden test
SKAT(Z, obj, r.corr=1)$p.value
@

If method=``optimal.adj'' or ``SKATO'' (both are equivalent), the SKAT-O method will be performed, which computes p-values with eight different values of $\rho=(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1)$ and then uses the minimum p-value as a test statistic. If you want to use the original implementation of SKAT-O, use method=``optimal'', which uses eleven equally spaced $\rho$ values from 0 to 1 as the grid of $\rho$s. We recommend using ``SKATO'' or ``optimal.adj'', since they have better type I error control.

<<>>=
#Optimal Test
SKAT(Z, obj, method="SKATO")$p.value
@
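To see how the p-value moves between the two extremes, the following sketch (an illustration, not one of the original examples) evaluates SKAT over the same grid of $\rho$ values that SKAT-O uses:

<<>>=
# p-values across a grid of rho values, tracing the path from
# SKAT (rho=0) to the burden test (rho=1)
rhos<-c(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1)
sapply(rhos, function(r) SKAT(Z, obj, r.corr=r)$p.value)
@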
\subsection{Combined test of common and rare variants}

It is possible that both common and rare variants are associated with phenotypes. To test for the combined effects of common and rare variants, the SKAT\_CommonRare function can be used. A detailed description of the combined test can be found in the following reference: \\

Ionita-Laza, I., Lee, S., Makarov, V., Buxbaum, J., Lin, X. (2013) Sequence kernel association tests for the combined effect of rare and common variants. \textit{AJHG}, 92(6): 841-53.
\\

<<>>=
# Combined sum test (SKAT-C and Burden-C)
SKAT_CommonRare(Z, obj)$p.value
SKAT_CommonRare(Z, obj, r.corr.rare=1, r.corr.common=1 )$p.value

# Adaptive test (SKAT-A and Burden-A)
SKAT_CommonRare(Z, obj, method="A")$p.value
SKAT_CommonRare(Z, obj, r.corr.rare=1, r.corr.common=1, method="A" )$p.value
@

\subsection{Impute missing genotypes}

If there are missing genotypes, SKAT automatically imputes them based on Hardy-Weinberg equilibrium. You can choose one of ``bestguess'', ``fixed'' or ``random''. ``bestguess'' imputes missing genotypes as the most likely values (0, 1, 2); ``fixed'' imputes missing genotypes with the mean genotype value ($2p$, where $p$ is the MAF); and ``random'' imputes missing genotypes by generating binomial(2, $p$) random variables. The default imputation method is ``fixed'' for the SKAT function and ``bestguess'' for the SKATBinary function.

<<>>=
# Assign missing
Z1<-Z
Z1[1,1:3]<-NA

# bestguess imputation
SKAT(Z1,obj,impute.method = "bestguess")$p.value

# fixed imputation
SKAT(Z1,obj,impute.method = "fixed")$p.value

# random imputation
SKAT(Z1,obj,impute.method = "random")$p.value
@

\subsection{Resampling}

The SKAT package provides functions to carry out resampling to compute empirical p-values and to control the family-wise error rate. Two different resampling methods were implemented. ``bootstrap'' conducts a parametric bootstrap to resample residuals from $H_0$ while adjusting for covariates; when there are no covariates, ``bootstrap'' is equivalent to permutation. ``perturbation'' perturbs the residuals by multiplying them by standard normal random variables. The default method is ``bootstrap''; since version 0.7, the ``perturbation'' method is no longer provided.

<<>>=
# parametric bootstrap
obj<-SKAT_Null_Model(y.b ~ X, out_type="D", n.Resampling=5000,
type.Resampling="bootstrap")

# SKAT p-value
re<- SKAT(Z, obj, kernel = "linear.weighted")
re$p.value	# SKAT p-value
Get_Resampling_Pvalue(re)	# get resampling p-value

detach(SKAT.example)
@

When there are many genes/SNP sets to test, resampling methods can be used to control the family-wise error rate. Examples are provided in the next section.

\subsection{Adjust for kinship}

If related individuals exist in your data, you need to adjust for kinship. The SKAT\_NULL\_emmaX function uses a linear mixed model (EMMAX) to estimate the variance component, which is subsequently used to adjust for kinship. For the kinship adjustment, the SKAT\_NULL\_emmaX function should be used instead of SKAT\_Null\_Model.

<<>>=
data(SKAT.fam.example)
attach(SKAT.fam.example)

# K: kinship matrix
obj<-SKAT_NULL_emmaX(y ~ X, K=K)
SKAT(Z, obj)$p.value

# SKAT-O
SKAT(Z, obj, method="SKATO")$p.value

detach(SKAT.fam.example)
@
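As an illustration (not in the original vignette, and assuming the phenotype y in SKAT.fam.example is continuous), you can compare p-values with and without the kinship adjustment to see the effect of ignoring relatedness:

<<>>=
# For illustration: with vs. without kinship adjustment
attach(SKAT.fam.example)
obj.lm<-SKAT_Null_Model(y ~ X, out_type="C")   # ignores relatedness
obj.mm<-SKAT_NULL_emmaX(y ~ X, K=K)            # adjusts for kinship
SKAT(Z, obj.lm)$p.value
SKAT(Z, obj.mm)$p.value
detach(SKAT.fam.example)
@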
\subsection{X chromosome test}

Since males have only one copy of the X chromosome, special care is needed to test for associations on the X chromosome. We have developed region-based rare-variant tests for the X chromosome with and without X-inactivation. To use them, you need to use SKAT\_Null\_Model\_ChrX to fit the null model and SKAT\_ChrX for the association tests. A detailed description of association tests on the X chromosome can be found in the following reference: \\

Ma, C., Boehnke, M., Lee, S., the GoT2D Investigators (2015) Evaluating the Calibration and Power of Three Gene-based Association Tests of Rare Variants for the X Chromosome. \textit{Genetic Epidemiology}, 39(7): 499-508.

<<>>=
data(SKAT.example.ChrX)
attach(SKAT.example.ChrX)

Z = SKAT.example.ChrX$Z

#############################################################
# Compute the P-value of SKAT

# binary trait
obj.x<-SKAT_Null_Model_ChrX(y ~ x1 +x2 + Gender, SexVar="Gender",
out_type="D", data=SKAT.example.ChrX)

# run SKAT-O
SKAT_ChrX(Z, obj.x, method="SKATO")$p.value
@

For the Y chromosome, you can use the same null model function as for the X chromosome with Model.Y=TRUE. The p-value can then be calculated with the SKAT\_ChrY function. The following example uses the same genotype matrix as before to show how these functions can be used.

<<>>=
#############################################################
# Compute the P-value of SKAT

# binary trait
obj.x<-SKAT_Null_Model_ChrX(y ~ x1 +x2 + Gender, SexVar="Gender",
out_type="D", data=SKAT.example.ChrX, Model.Y=TRUE)

# run SKAT-O
SKAT_ChrY(Z, obj.x, method="SKATO")$p.value

detach(SKAT.example.ChrX)
@

\section{Plink Binary format files}

For genome-wide data analysis, plink binary format files can be used in SKAT. To use plink files, the plink bed, bim and fam files, along with your own SetID file that contains the SNP set information, are needed. Example files can be found on the SKAT/MetaSKAT google group page.

<<>>=
# To run this code, first download and unzip example files

##############################################
# Generate SSD file

# File names
File.Bed<-"./Example1.bed"
File.Bim<-"./Example1.bim"
File.Fam<-"./Example1.fam"
File.SetID<-"./Example1.SetID"
File.SSD<-"./Example1.SSD"
File.Info<-"./Example1.SSD.info"

# To use binary ped files, you have to generate an SSD file first.
# If you already have an SSD file, you do not need to call this function.
Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
@

Now you can open the SSD and Info files and run SKAT.

<<>>=
FAM<-Read_Plink_FAM(File.Fam, Is.binary=FALSE)
y<-FAM$Phenotype

# To use an SSD file, please open it first.
# After finishing using it, you must close it.
SSD.INFO<-Open_SSD(File.SSD, File.Info)

# Number of samples
SSD.INFO$nSample

# Number of Sets
SSD.INFO$nSets

obj<-SKAT_Null_Model(y ~ 1, out_type="C")
@

<<>>=
out<-SKAT.SSD.All(SSD.INFO, obj)
@

<<>>=
out
@

If you have a plink covariate file, the Read\_Plink\_FAM\_Cov function can be used to read both FAM and covariate files.

<<>>=
File.Cov<-"./Example1.Cov"

FAM_Cov<-Read_Plink_FAM_Cov(File.Fam, File.Cov, Is.binary=FALSE)

# First 5 rows
FAM_Cov[1:5,]

# Run with covariates
X1 = FAM_Cov$X1
X2 = FAM_Cov$X2
y<-FAM_Cov$Phenotype

obj<-SKAT_Null_Model(y ~ X1 + X2, out_type="C")
@

<<>>=
out<-SKAT.SSD.All(SSD.INFO, obj)
@

<<>>=
out
@

To use custom weights, you need to make a weight file and read it using the ``Read\_SNP\_WeightFile'' function. The weight file should have two columns: SNP ID and weight value. The output object of ``Read\_SNP\_WeightFile'' can be used as a parameter in the SKAT.SSD functions.

<<>>=
# Custom weight
# File: Example1_Weight.txt
obj.SNPWeight<-Read_SNP_WeightFile("./Example1_Weight.txt")
@

<<>>=
out<-SKAT.SSD.All(SSD.INFO, obj, obj.SNPWeight=obj.SNPWeight)
@

<<>>=
out
@

The output object of SKAT.SSD.All has an output dataframe object ``results''. You can save it using the write.table function.

<<>>=
output.df = out$results
write.table(output.df, file="./save.txt", col.names=TRUE, row.names=FALSE)
@

If more than one gene/SNP set is to be tested, the multiple tests should be adjusted for to control the family-wise error rate. This can be done with the Bonferroni correction. If gene/SNP sets are correlated, however, this approach can be conservative.
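For example, a Bonferroni adjustment can be applied directly to the set-level p-values in the results dataframe. This sketch uses p.adjust from base R and stores the adjusted values in a hypothetical P.adj column (not part of the standard output):

<<>>=
# Bonferroni adjustment of the set-level p-values
# (p.adjust is from base R; P.adj is an illustrative new column)
output.df$P.adj<-p.adjust(output.df$P.value, method="bonferroni")
output.df[1:5,]
@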
Alternatively, you can directly control the family-wise error rate (FWER) using the resampling method.

<<>>=
obj<-SKAT_Null_Model(y ~ 1, out_type="C", n.Resampling=1000,
type.Resampling="bootstrap")
out<-SKAT.SSD.All(SSD.INFO, obj)
@

<<>>=
# No gene is significant with controlling FWER = 0.05
Resampling_FWER(out,FWER=0.05)

# 1 gene is significant with controlling FWER = 0.5
Resampling_FWER(out,FWER=0.5)
@

The ``SKAT.SSD.OneSet'' or ``SKAT.SSD.OneSet\_SetIndex'' functions can be used to test a single gene/SNP set. Alternatively, you can obtain a genotype matrix using the ``Get\_Genotypes\_SSD'' function and then run SKAT.

<<>>=
obj<-SKAT_Null_Model(y ~ 1, out_type="C")

# test the second gene
id<-2
SetID<-SSD.INFO$SetInfo$SetID[id]
SKAT.SSD.OneSet(SSD.INFO,SetID, obj)$p.value

SKAT.SSD.OneSet_SetIndex(SSD.INFO,id, obj)$p.value

# test the second gene with the logistic weight.
Z<-Get_Genotypes_SSD(SSD.INFO, id)
weights = Get_Logistic_Weights(Z, par1=0.07, par2=150)
SKAT(Z, obj, weights=weights)$p.value
@

The SKAT\_CommonRare function can also be used with SSD files.

<<>>=
# test all genes in SSD file
obj<-SKAT_Null_Model(y ~ X1 + X2, out_type="C")
out<-SKAT_CommonRare.SSD.All(SSD.INFO, obj)
@

<<>>=
out
@

After finishing using the SSD files, please close them.

<<>>=
Close_SSD()
@

\subsection{Plink Binary format files: SKATBinary}

SKATBinary functions can also be used with plink formatted files. This section shows example code. Example plink files can be found on the SKAT/MetaSKAT google group page.

<<>>=
# File names
File.Bed<-"./SKATBinary.example.bed"
File.Bim<-"./SKATBinary.example.bim"
File.Fam<-"./SKATBinary.example.fam"
File.Cov<-"./SKATBinary.example.cov"
File.SetID<-"./SKATBinary.example.SetID"
File.SSD<-"./SKATBinary.example.SSD"
File.Info<-"./SKATBinary.example.SSD.info"

# Generate SSD file, and read fam and cov files
# If you already have an SSD file, you do not need to call this function.
Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
FAM<-Read_Plink_FAM_Cov(File.Fam, File.Cov, Is.binary=TRUE, cov_header=FALSE)

# open SSD files
SSD.INFO<-Open_SSD(File.SSD, File.Info)

# No adjustment is needed
obj<-SKAT_Null_Model(Phenotype ~ COV1 + COV2, out_type="D", data=FAM, Adjustment=FALSE)
@

<<>>=
# SKAT
out.skat<-SKATBinary.SSD.All(SSD.INFO, obj, method="SKAT")

# SKAT-O
out.skato<-SKATBinary.SSD.All(SSD.INFO, obj, method="SKATO")
@

<<>>=
# First 5 variant sets, SKAT
out.skat$results[1:5,]
@

The effective number of tests and QQ plots can be obtained using the minimum achievable p-values (MAP).

<<>>=
# The effective number of tests is smaller than 30 (the number of variant sets)
# Use SKAT results
Get_EffectiveNumberTest(out.skat$results$MAP, alpha=0.05)

# QQ plot
QQPlot_Adj(out.skat$results$P.value, out.skat$results$MAP)
@

\section{Power/Sample Size calculation}

\subsection{Dataset}

The SKAT package provides a haplotype dataset (SKAT.haplotypes) which contains a haplotype matrix of 10,000 haplotypes over a 200kb region (Haplotype), and a dataframe with information on each SNP (SNPInfo). These haplotypes were simulated using a calibrated coalescent model (cosi), mimicking the linkage disequilibrium structure of European ancestry samples. If no haplotype data are available, this dataset can be used to compute power/sample size.

<<>>=
data(SKAT.haplotypes)
names(SKAT.haplotypes)

attach(SKAT.haplotypes)
@
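A quick look at the dataset (an illustrative sketch, not in the original text) shows its dimensions and the SNP information used in the power calculations below:

<<>>=
# inspect the simulated haplotype data
dim(Haplotype)     # 10,000 haplotypes x number of SNPs
SNPInfo[1:3,]      # SNP information, including CHROM_POS
@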
\subsection{Power/Sample Size calculation}

The following example uses the haplotypes in SKAT.haplotypes with the following parameters.

\begin{enumerate}
\item Subregion length = 5k bp
\item Causal percent = $20 \%$
\item Negative percent = $20 \%$
\item For continuous traits, $\beta = c |log_{10}(MAF)|$ (BetaType = ``Log'') with $\beta = 2$ at MAF = $10^{-4}$
\item For binary traits, $log(OR) = c |log_{10}(MAF)|$ (OR.Type = ``Log'') with OR $= 7$ at MAF = $10^{-4}$, and $50 \%$ of samples are cases and $50 \%$ of samples are controls
\end{enumerate}

<<>>=
set.seed(500)
out.c<-Power_Continuous(Haplotype,SNPInfo$CHROM_POS, SubRegion.Length=5000,
Causal.Percent= 20, N.Sim=10, MaxBeta=2,Negative.Percent=20)

out.b<-Power_Logistic(Haplotype,SNPInfo$CHROM_POS, SubRegion.Length=5000,
Causal.Percent= 20, N.Sim=10 ,MaxOR=7, Negative.Percent=20)

out.c
out.b

Get_RequiredSampleSize(out.c, Power=0.8)
Get_RequiredSampleSize(out.b, Power=0.8)
@

In this example, N.Sim=10 was used to obtain the results quickly. When you run the power calculation, please increase it to more than 100. When BetaType = ``Log'' or OR.Type = ``Log'', the effect size of a continuous trait and the log odds ratio of a binary trait are $c |log_{10}(MAF)|$, where $c$ is determined by MaxBeta or MaxOR. For example, $c = 2/4 = 0.5$ when MaxBeta = 2, since $|log_{10}(10^{-4})| = 4$. In this case, a causal variant with MAF=0.01 has $\beta = 1$. For binary traits, $c = log(7)/4 = 0.486$ with MaxOR=7, and thus a causal variant with MAF=0.01 has log OR = 0.972.

The Power\_Continuous\_R or Power\_Logistic\_R functions can be used to compute power with non-zero r.corr ($\rho$). Since these functions use a slightly different method to compute power, power estimates from Power\_Continuous\_R and Power\_Logistic\_R can be slightly different from estimates from Power\_Continuous and Power\_Logistic, even when r.corr=0. If you want to compute the power of SKAT-O by estimating the optimal r.corr, please use r.corr=2. The estimated optimal r.corr is
$$r.corr = p_1^2 (2 p_2 - 1)^2,$$
where $p_1$ is the proportion of nonzero $\beta$s, and $p_2$ is the proportion of negative (or positive) $\beta$s among the nonzero $\beta$s.

<<>>=
set.seed(500)
out.c<-Power_Continuous_R(Haplotype,SNPInfo$CHROM_POS, SubRegion.Length=5000,
Causal.Percent= 20, N.Sim=10, MaxBeta=2,Negative.Percent=20, r.corr=2)

out.c

Get_RequiredSampleSize(out.c, Power=0.8)
@
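As a quick sanity check of the effect-size mapping described above (an illustration, not part of the original text), the constant $c$ and the resulting per-variant effects can be computed in plain R:

<<>>=
# Worked check of the effect-size mapping (plain R, no SKAT functions)
c.beta<- 2/abs(log10(1e-4))        # c = 0.5 when MaxBeta = 2
c.beta*abs(log10(0.01))            # beta = 1 for a causal variant with MAF=0.01

c.or<- log(7)/abs(log10(1e-4))     # c = 0.486 when MaxOR = 7
c.or*abs(log10(0.01))              # log OR = 0.972 at MAF=0.01
@

\end{document}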