%\VignetteIndexEntry{GWAS Data Cleaning} %\VignetteDepends{GWASTools, GWASdata, SNPRelate} \documentclass[11pt]{article} \usepackage{fullpage} \usepackage{Sweave} \usepackage{amsmath} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage[pdftex,plainpages=false, letterpaper, bookmarks, bookmarksnumbered, colorlinks, linkcolor=blue, citecolor=blue, filecolor=blue, urlcolor=blue]{hyperref} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \begin{document} \title{GWAS Data Cleaning} \author{GENEVA Coordinating Center\\ Department of Biostatistics\\ University of Washington} \maketitle \tableofcontents \SweaveOpts{keep.source=TRUE, eps=FALSE} \section{Overview} This vignette takes a user through the data cleaning steps developed and used for genome wide association data as part of the Gene Environment Association studies (GENEVA) project. This project (\href{http://www.genevastudy.org}{http://www.genevastudy.org}) is a collection of whole-genome studies supported by the NIH-wide Gene-Environment Initiative. The methods used in this vignette have been published in Laurie et al.\ (2010).\footnote{Laurie, Cathy C., et al. Quality Control and Quality Assurance in Genotypic Data for Genome-Wide Association Studies. \textit{Genetic Epidemiology} \textbf{34}, 591-602 (August 2010).} For replication purposes the data used here are taken from the HapMap project. These data were kindly provided by the Center for Inherited Disease Research (CIDR) at Johns Hopkins University and the Broad Institute of MIT and Harvard University (Broad). The data are in the same format as these centers use in providing data to investigators: the content and format of these data are a little different from those for processed data available at the HapMap project site. The data supplied here should not be used for any purpose other than this tutorial. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preparing Data} \subsection{Data formats used in GWASTools} The \Rpackage{GWASTools} package provides containers for storing annotation data called \Rcode{SnpAnnotationDataFrame} and \Rcode{ScanAnnotationDataFrame} (derived from the \Rcode{AnnotatedDataFrame} class in the \Rpackage{Biobase} package). The name ``scan'' refers to a single genotyping instance. Some subjects in a study are usually genotyped multiple times for quality control purposes, so these subjects will have duplicate scans. Throughout this tutorial, ``scan'' or ``sample'' refers to a unique genotyping instance. The AnnotationDataFrame classes provide a way to store metadata about an annotation variable in the same R object as the variable itself. When a new column is added to an AnnotationDataFrame, we also add a column to the metadata describing what that data means. The SNP and scan AnnotationDataFrame objects are stored in R data objects (.RData files) which can be directly loaded into R. The raw and called genotype data can be stored in the Genomic Data Structure (GDS) format (\href{http://corearray.sourceforge.net}{http://corearray.sourceforge.net}), or the Network Common Data Format (NetCDF) (\href{http://www.unidata.ucar.edu/software/netcdf/}{http://www.unidata.ucar.edu/software/netcdf/}). In the \Rpackage{GWASTools} package, access to the GDS files is provided by the \Rcode{GdsGenotypeReader} and \Rcode{GdsIntensityReader} classes. These classes are built on top of the \Rpackage{gdsfmt} package and provide access to a standard set of variables defined for GWAS data. NetCDF files can be accessed with the equivalent classes \Rcode{NcdfGenotypeReader} and \Rcode{NcdfIntensityReader}, which are built on top of the \Rpackage{ncdf} package. The union classes \Rcode{GenotypeReader} and \Rcode{IntensityReader} allow the \Rpackage{GWASTools} functions to use either storage format interchangeably. Additionally, the GDS (or NetCDF) files and SNP and scan annotation can be linked through the \Rcode{GenotypeData} and \Rcode{IntensityData} classes, which have slots for a \Rcode{GenotypeReader} (or \Rcode{IntensityReader}) object, a \Rcode{SnpAnnotationDataFrame} object, and a \Rcode{ScanAnnotationDataFrame} object. When an object of one of these classes is created, it performs checks to ensure that the annotation matches the data stored in the data file and all required information is present. The majority of the functions in the \Rpackage{GWASTools} package take \Rcode{GenotypeData} or \Rcode{IntensityData} objects as arguments. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Creating the SNP Annotation Data Object} \label{create_snpann} All of the functions in \Rpackage{GWASTools} require a minimum set of variables in the SNP annotation data object. The minimum required variables are \begin{itemize} \item{\Rcode{snpID}, a unique integer identifier for each SNP.} \item{\Rcode{chromosome}, an integer mapping for each chromosome, with values 1-27, mapped in order from 1-22, 23=X, 24=XY (the pseudoautosomal region), 25=Y, 26=M (the mitochondrial probes), and 27=U (probes with unknown positions)}. (It is possible to change the default integer mapping, e.g., to match the codes used by PLINK, but this should be done with caution. See the manual pages for more details.) \item{\Rcode{position}, the base position of each SNP on the chromosome.} \end{itemize} We create the integer chromosome mapping for a few reasons. The chromosome is stored as an integer in the data files, so in order to link the SNP annotation with the data file, we use the integer values in the annotation as well. For convenience when using \Rpackage{GWASTools} functions, the chromosome variable is most times assumed to be an integer value. Thus, for the sex chromosomes, we can simply use the \Rcode{chromosome} values. For presentation of results, it is important to have the mapping of the integer values back to the standard designations for the chromosome names, thus the \Rcode{getChromosome()} functions in the \Rpackage{GWASTools} objects have a \Rcode{char=TRUE} option to return the characters 1-22, X, XY, Y, M, U. The position variable should hold all numeric values of the physical position of a probe. {\em The SNP annotation file is assumed to list the probes in order of chromosome and position within chromosome.} <<>>= library(GWASTools) library(GWASdata) # Load the SNP annotation (simple data frame) data(illumina_snp_annot) # Create a SnpAnnotationDataFrame snpAnnot <- SnpAnnotationDataFrame(illumina_snp_annot) # names of columns varLabels(snpAnnot) # data head(pData(snpAnnot)) # Add metadata to describe the columns meta <- varMetadata(snpAnnot) meta[c("snpID", "chromosome", "position", "rsID", "alleleA", "alleleB", "BeadSetID", "IntensityOnly", "tAA", "tAB", "tBB", "rAA", "rAB", "rBB"), "labelDescription"] <- c("unique integer ID for SNPs", paste("integer code for chromosome: 1:22=autosomes,", "23=X, 24=pseudoautosomal, 25=Y, 26=Mitochondrial, 27=Unknown"), "base pair position on chromosome (build 36)", "RS identifier", "alelleA", "alleleB", "BeadSet ID from Illumina", "1=no genotypes were attempted for this assay", "mean theta for AA cluster", "mean theta for AB cluster", "mean theta for BB cluster", "mean R for AA cluster", "mean R for AB cluster", "mean R for BB cluster") varMetadata(snpAnnot) <- meta @ Variables in the SNP annotation data frame can be accessed either with the data frame operators \Rcode{\$} and \Rcode{[[} or with ``get'' methods. <<>>= snpID <- snpAnnot$snpID snpID <- getSnpID(snpAnnot) chrom <- snpAnnot[["chromosome"]] chrom <- getChromosome(snpAnnot) table(chrom) chrom <- getChromosome(snpAnnot, char=TRUE) table(chrom) position <- getPosition(snpAnnot) rsID <- getVariable(snpAnnot, "rsID") @ The following methods are equivalent and can all be used on \Rcode{SnpAnnotationDataFrame} objects: \vspace{5mm} \begin{tabular}{ll} \hline \Rcode{AnnotatedDataFrame} method & \Rpackage{GWASTools} method \\ \hline pData & getAnnotation \\ varMetadata & getMetadata \\ varLabels & getVariableNames \\ \hline \end{tabular} \vspace{5mm} However, only the \Rcode{AnnotatedDataFrame} methods have corresponding ``set'' methods. New variables can be added with \Rcode{\$} or \Rcode{[[}. \Rcode{[} also behaves as expected for standard data frames. <<>>= tmp <- snpAnnot[,c("snpID", "chromosome", "position")] snp <- getAnnotation(tmp) snp$flag <- sample(c(TRUE, FALSE), nrow(snp), replace=TRUE) pData(tmp) <- snp meta <- getMetadata(tmp) meta["flag", "labelDescription"] <- "flag" varMetadata(tmp) <- meta getVariableNames(tmp) varLabels(tmp)[4] <- "FLAG" rm(tmp) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Creating the Scan Annotation Data Object} \label{create_sampann} The scan annotation file holds attributes for each genotyping scan that are relevant to genotypic data cleaning. These data include processing variables such as tissue type, DNA extraction method, and genotype processing batch. They also include individual characteristics such as sex, race, ethnicity, and case status. Since a single subject may have been genotyped multiple times as a quality control measure, it is important to distinguish between the scanID (unique genotyping instance) and subjectID (person providing a DNA sample). The miniumum required variables for the scan annotation data object are \begin{itemize} \item{\Rcode{scanID}, a unique identifier for each scan}. \item{\Rcode{sex}, coded as ``M''/``F''. (Note that a \Rcode{ScanAnnotationDataFrame} object may be valid without a sex variable, but it is required for many \Rpackage{GWASTools} functions.)} \end{itemize} <<>>= # Load the scan annotation (simple data frame) data(illumina_scan_annot) # Create a ScanAnnotationDataFrame scanAnnot <- ScanAnnotationDataFrame(illumina_scan_annot) # names of columns varLabels(scanAnnot) # data head(pData(scanAnnot)) # Add metadata to describe the columns meta <- varMetadata(scanAnnot) meta[c("scanID", "subjectID", "family", "father", "mother", "CoriellID", "race", "sex", "status", "genoRunID", "plate", "batch", "file"), "labelDescription"] <- c("unique ID for scans", "subject identifier (may have multiple scans)", "family identifier", "father identifier as subjectID", "mother identifier as subjectID", "Coriell subject identifier", "HapMap population group", "sex coded as M=male and F=female", "simulated case/control status" , "genotyping instance identifier", "plate containing samples processed together for genotyping chemistry", "simulated genotyping batch", "raw data file") varMetadata(scanAnnot) <- meta @ As for \Rcode{SnpAnnotationDataFrame}, variables in the scan annotation data frame can be accessed either with the data frame operators \Rcode{\$} and \Rcode{[[} or with ``get'' methods. <<>>= scanID <- scanAnnot$scanID scanID <- getScanID(scanAnnot) sex <- scanAnnot[["sex"]] sex <- getSex(scanAnnot) subjectID <- getVariable(scanAnnot, "subjectID") @ The \Rcode{AnnotatedDataFrame} methods and their \Rpackage{GWASTools} equivalents described in Section~\ref{create_snpann} apply to \Rcode{ScanAnnotationDataFrame} as well. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Creating the Data Files} The data for genotype calls, allelic intensities and other variables such as B Allele Frequency are stored as GDS or NetCDF files. This format is used for the ease with which extremely large multi-dimensional arrays of data can be stored and accessed, as many GWAS datasets are too large to be stored in memory at one time. We will create three different GDS files to be used in subsequent cleaning and analysis steps. All data files contain variables for \Rcode{scanID}, \Rcode{snpID}, \Rcode{chromosome}, and \Rcode{position}. The \Rcode{scanID} ordering must match the \Rcode{scanID} values as listed in the sample annotation object (Section~\ref{create_sampann}). Since \Rcode{snpID} is in chromosome and position order, these variables also provide a check on ordering and are often used to select subsets of SNPs for analysis. Analogous to the sample ordering, these values must match the \Rcode{snpID} values listed in the SNP annotation object (Section~\ref{create_snpann}). To prevent errors in ordering samples or SNPs, the functions in the \Rpackage{GWASTools} package take as arguments R objects which will return an error on creation if the sample and SNP annotation does not match the data file. \subsubsection*{Genotype Files} The genotype files store genotypic data in 0, 1, 2 format indicating the number of ``A'' alleles in the genotype (i.e. AA=2, AB=1, BB=0 and missing=-1). The conversion from AB format and forward strand (or other) allele formats can be stored in the SNP annotation file. The genotypic data are stored as a two-dimensional array, where rows are SNPs and columns are samples. To store the genotype data, the raw data files are opened and checked to ensure the sample identifier from the sample annotation file and the genotype data file match. If no discrepencies exist, the probes listed in the file are checked against the expected list of probes, then ordered and written to the data file. This process iterates over each file (sample). Diagnostics are stored as the process continues so that after the data are written one can ensure the function performed as expected. \subsubsection*{Creating the Genotype file} We create a GDS file from a set of plain text files containing the genotypes, one file per sample. The data are written to the GDS file one sample at a time and, simultaneously, the corresponding sample identifier \Rcode{scanID} is written to the sample ID variable. The \Rcode{file} variable from the scan annotation holds the name of the raw data file for each sample/scan; these are the files we must read in to get genotype data for each sample. The function \Rcode{createDataFile} creates the common SNP variables as described above. In this case, we also want the \Rcode{genotype} variable to be created, so the \Rcode{variables} argument must be set to \Rcode{"genotype"}. \Rcode{col.nums} is an integer vector indicating which columns of the raw text file contain variables for input. A set of diagnostic values are written and stored in \Rcode{diag.geno}, so we must look at those to ensure no errors occurred. <<>>= # Define a path to the raw data files path <- system.file("extdata", "illumina_raw_data", package="GWASdata") geno.file <- "tmp.geno.gds" # first 3 samples only scan.annotation <- illumina_scan_annot[1:3, c("scanID", "genoRunID", "file")] names(scan.annotation)[2] <- "scanName" snp.annotation <- illumina_snp_annot[,c("snpID", "rsID", "chromosome", "position")] # indicate which column of SNP annotation is referenced in data files names(snp.annotation)[2] <- "snpName" col.nums <- as.integer(c(1,2,12,13)) names(col.nums) <- c("snp", "sample", "a1", "a2") diag.geno.file <- "diag.geno.RData" diag.geno <- createDataFile(path = path, filename = geno.file, file.type = "gds", variables = "genotype", snp.annotation = snp.annotation, scan.annotation = scan.annotation, sep.type = ",", skip.num = 11, col.total = 21, col.nums = col.nums, scan.name.in.file = 1, diagnostics.filename = diag.geno.file, verbose = FALSE) # Look at the values included in the "diag.geno" object which holds # all output from the function call names(diag.geno) # `read.file' is a vector indicating whether (1) or not (0) each file # specified in the `files' argument was read successfully table(diag.geno$read.file) # `row.num' is a vector of the number of rows read from each file table(diag.geno$row.num) # `sample.match' is a vector indicating whether (1) or not (0) # the sample name inside the raw text file matches that in the # sample annotation data.frame table(diag.geno$sample.match) # `snp.chk' is a vector indicating whether (1) or not (0) # the raw text file has the expected set of SNP names table(diag.geno$snp.chk) # `chk' is a vector indicating whether (1) or not (0) all previous # checks were successful and the data were written to the data file table(diag.geno$chk) @ Run the function \Rcode{checkGenotypeFile} to check that the GDS file contains the same data as the raw data files. <<>>= check.geno.file <- "check.geno.RData" check.geno <- checkGenotypeFile(path = path, filename = geno.file, file.type = "gds", snp.annotation = snp.annotation, scan.annotation = scan.annotation, sep.type = ",", skip.num = 11, col.total = 21, col.nums = col.nums, scan.name.in.file = 1, check.scan.index = 1:3, n.scans.loaded = 3, diagnostics.filename = check.geno.file, verbose = FALSE) # Look at the values included in the "check.geno" object which holds # all output from the function call names(check.geno) # 'geno.chk' is a vector indicating whether (1) or not (0) the genotypes # match the text file table(check.geno$geno.chk) @ \subsubsection*{Reading the Genotype file} The \Rcode{GdsGenotypeReader} class provides a convenient interface for retrieving data from a genotype GDS file. Some of the same ``get'' methods that applied to SNP and scan annotation data objects can be used for \Rcode{GdsGenotypeReader} objects. <<>>= (gds <- GdsGenotypeReader(geno.file)) nscan(gds) nsnp(gds) head(getScanID(gds)) head(getSnpID(gds)) head(getChromosome(gds)) head(getPosition(gds)) # genotypes for the first 3 samples and the first 5 SNPs getGenotype(gds, snp=c(1,5), scan=c(1,3)) close(gds) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection*{Intensity Files} The intensity files store quality scores and allelic intensity data for each SNP. The normalized X and Y intensities as well as the confidence scores are written to the file for all samples, for all SNPs. (To keep file sizes manageable, a separate file will store the B Allele Frequency and Log R Ratio data.) In addition to the sample and SNP identifiers, chromosome, and position, the intensity and quality data are written to the intensity file in a two dimensional format, with SNPs corresponding to rows and samples corresponding to columns. To write the intensity data, the raw data files are opened and the intensities and quality score are read. Like with the genotype data, if all sample and probe identifiers match between the data files and the annotation files, the data are populated in the file and diagnostics are written. \subsubsection*{Creating the Intensity file} We call \Rcode{createDataFile} again, this time specifying quality, X, and Y in the \Rcode{variables} argument. A set of diagnostic values are written and stored in \Rcode{diag.qxy}. <<>>= qxy.file <- "tmp.qxy.gds" col.nums <- as.integer(c(1,2,5,16,17)) names(col.nums) <- c("snp", "sample", "quality", "X", "Y") diag.qxy.file <- "diag.qxy.RData" diag.qxy <- createDataFile(path = path, filename = qxy.file, file.type = "gds", variables = c("quality","X","Y"), snp.annotation = snp.annotation, scan.annotation = scan.annotation, sep.type = ",", skip.num = 11, col.total = 21, col.nums = col.nums, scan.name.in.file = 1, diagnostics.filename = diag.qxy.file, verbose = FALSE) @ Run the function \Rcode{checkIntensityFile} to check that the GDS file contains the same data as the raw data files. <<>>= check.qxy.file <- "check.qxy.RData" check.qxy <- checkIntensityFile(path = path, filename = qxy.file, file.type = "gds", snp.annotation = snp.annotation, scan.annotation = scan.annotation, sep.type = ",", skip.num = 11, col.total = 21, col.nums = col.nums, scan.name.in.file = 1, check.scan.index = 1:3, n.scans.loaded = 3, diagnostics.filename = check.qxy.file, verbose = FALSE) @ \subsubsection*{Reading the Intensity file} The \Rcode{GdsIntensityReader} class provides a convenient interface for retrieving data from an intensity GDS file. Its methods are similar to \Rcode{GdsGenotypeReader}. <<>>= (gds <- GdsIntensityReader(qxy.file)) nscan(gds) nsnp(gds) head(getScanID(gds)) head(getSnpID(gds)) head(getChromosome(gds)) head(getPosition(gds)) # quality score for the first 3 samples and the first 5 SNPs getQuality(gds, snp=c(1,5), scan=c(1,3)) # X intensity for the first 3 samples and the first 5 SNPs getX(gds, snp=c(1,5), scan=c(1,3)) # Y intensity for the first 3 samples and the first 5 SNPs getY(gds, snp=c(1,5), scan=c(1,3)) close(gds) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection*{B Allele Frequency and Log R Ratio Files}\label{sec:baf_lrr} The B Allele Frequency (BAF) and Log R Ratio (LRR) file stores these values for every sample by SNP. For Illumina data, these values are calculated by the BeadStudio software and may be provided by the genotyping center. For a thorough explanation and presentation of an application of these values, please refer to Peiffer, Daniel A., et al. (2006).\footnote{Peiffer, Daniel A., et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. \textit{Genome Research} \textbf{16}, 1136-1148 (September 2006).} For a given sample and SNP, R and $\theta$ are calculated using the X and Y intensities, where \begin{equation} \label{rthe} R=X+Y \end{equation} \begin{equation*} \theta = \frac{2\arctan(Y/X)}{\pi} \end{equation*} $\theta$ corresponds to the polar coordinate angle and R is the sum of the normalized X and Y intensities (not, as one might assume, the magnitude of the polar coordinate vector). The LRR is given below. The expected value of R is derived from a plot of $\theta$ versus R for a given SNP. It is the predicted value of R derived from a line connecting the centers of the two nearest genotype clusters. \begin{equation} {\rm LRR} =\log \left( \frac{R_{\text{observed values}}}{R_{\text{expected values}}}\right) \end{equation} Variation in the LRR across a single chromosome indicates possible duplication or deletion, and is an indication of overall sample quality. The BAF is the frequency of the B allele in the population of cells from which the DNA is extracted. Each sample and SNP combination has a BAF value. Note the BAF values vary for a subject with each DNA extraction and tissue used. After all SNPs have been read and all samples have been clustered for a probe, the mean $\theta$ ``cluster'' value is calculated for each probe, for each of the three genotype clusters, resulting in $\theta_{AA}, \theta_{AB} \text{ and }\theta_{BB}$ for every probe. Then the $\theta$ value for each sample, call it $\theta_{n}$, is compared to $\theta_{AA}, \theta_{AB}\text{ and }\theta_{BB}$. The BAF is calculated \begin{equation*} \text{BAF}= \begin{cases} 0 & \text{if } \theta_{n} < \theta_{AA}\\ \mbox{}\\ \frac{{\textstyle (1/2)(\theta_{n}-\theta_{AA})}}{{\textstyle \theta_{AB}-\theta_{AA}}} & \text{if } \theta_{AA} \leq \theta_{n} < \theta_{AB}\\ \mbox{}\\ \frac{{\textstyle 1}}{{\textstyle 2}}+\frac{{\textstyle (1/2)(\theta_{n}-\theta_{AB})}} {{\textstyle \theta_{BB}-\theta_{AB}}} & \text{if } \theta_{AB} \leq \theta_{n} < \theta_{BB}\\ \mbox{}\\ 1 & \text{if } \theta_{n} \geq \theta_{BB} \end{cases} \end{equation*} A $\theta_{n}$ value of 0 or 1 corresponds to a homozygote genotype for sample $n$ at that particular probe, and a $\theta_{n}$ value of 1/2 indicates a heterozygote genotype. Thus, $BAF \in[0,1]$ for each probe. Across a chromosome, three bands are expected, one hovering around 0, one around 1 and one around 0.5, and any deviation from this is considered aberrant. We use the BAF and LRR values to detect mixed samples or samples of low quality, as well as chromosomal duplications and deletions. Samples that have a significantly large (partial or full chromosome) aberration for a particular chromosome as detected from the BAF values are recommended to be filtered out, for the genotype data are not reliable in these situations. Because of these applications, the BAF and LRR values are a salient part of the data cleaning steps. \subsubsection*{Creating the BAF and LRR file} We call \Rcode{createDataFile} again, this time specifying BAlleleFreq and LogRRatio in the \Rcode{variables} argument. <<>>= bl.file <- "tmp.bl.gds" col.nums <- as.integer(c(1,2,20,21)) names(col.nums) <- c("snp", "sample", "BAlleleFreq", "LogRRatio") diag.bl.file <- "diag.bl.RData" diag.bl <- createDataFile(path = path, filename = bl.file, file.type = "gds", variables = c("BAlleleFreq","LogRRatio"), snp.annotation = snp.annotation, scan.annotation = scan.annotation, sep.type = ",", skip.num = 11, col.total = 21, col.nums = col.nums, scan.name.in.file = 1, diagnostics.filename = diag.bl.file, verbose = FALSE) @ \subsubsection*{Reading the BAF and LRR file} We also use the \Rcode{GdsIntensityReader} class for BAF/LRR data. <<>>= (gds <- GdsIntensityReader(bl.file)) getBAlleleFreq(gds, snp=c(1,5), scan=c(1,3)) getLogRRatio(gds, snp=c(1,5), scan=c(1,3)) close(gds) @ <>= file.remove(geno.file, qxy.file, bl.file) file.remove(diag.geno.file, diag.qxy.file, diag.bl.file) file.remove(check.geno.file, check.qxy.file) @ \subsection{Combining data files with SNP and Scan annotation} The \Rcode{GenotypeData} and \Rcode{IntensityData} objects combine SNP and scan annotation with GDS (or NetCDF) files, ensuring that scanID, snpID, chromosome, and position are consistent. The constructor for a \Rcode{GenotypeData} object takes a \Rcode{GdsGenotypeReader} object as its first argument. Either or both of the \Rcode{scanAnnot} and \Rcode{snpAnnot} slots may be empty (\Rcode{NULL}), but if annotation objects are provided to the constructor, the relevant columns will be checked against the data file during object creation. <<>>= genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") gds <- GdsGenotypeReader(genofile) # only GDS file genoData <- GenotypeData(gds) # with scan annotation genoData <- GenotypeData(gds, scanAnnot=scanAnnot) # with scan and SNP annotation genoData <- GenotypeData(gds, snpAnnot=snpAnnot, scanAnnot=scanAnnot) genoData @ \Rcode{GenotypeData} objects have methods in common with \Rcode{GdsGenotypeReader}, \Rcode{SnpAnnotationDataFrame}, and \Rcode{ScanAnnotationDataFrame}, along with methods to access variables in the annotation slots. <<>>= nsnp(genoData) nscan(genoData) # scan annotation range(getScanID(genoData)) hasSex(genoData) table(getSex(genoData)) hasScanVariable(genoData, "subjectID") head(getScanVariable(genoData, "subjectID")) getScanVariableNames(genoData) # snp annotation range(getSnpID(genoData)) table(getChromosome(genoData, char=TRUE)) head(getPosition(genoData)) hasSnpVariable(genoData, "rsID") head(getSnpVariable(genoData, "rsID")) getSnpVariableNames(genoData) # genotypes getGenotype(genoData, snp=c(1,5), scan=c(1,5)) close(genoData) @ \Rcode{IntensityData} objects behave in the same way as \Rcode{GenotypeData} objects, but take a \Rcode{GdsIntensityReader} object as the first argument. <<>>= # quality score, X and X intensity qxyfile <- system.file("extdata", "illumina_qxy.gds", package="GWASdata") gds <- GdsIntensityReader(qxyfile) qxyData <- IntensityData(gds, snpAnnot=snpAnnot, scanAnnot=scanAnnot) qxyData getQuality(qxyData, snp=c(1,5), scan=c(1,5)) getX(qxyData, snp=c(1,5), scan=c(1,5)) getY(qxyData, snp=c(1,5), scan=c(1,5)) close(qxyData) # BAF/LRR blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata") gds <- GdsIntensityReader(blfile) blData <- IntensityData(gds, snpAnnot=snpAnnot, scanAnnot=scanAnnot) blData getBAlleleFreq(blData, snp=c(1,5), scan=c(1,5)) getLogRRatio(blData, snp=c(1,5), scan=c(1,5)) close(blData) @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Batch Quality Checks} The overall goal of this step is to check the quality of the sample batches. Substantial quality control is done by the genotyping centers prior to releasing the genotype data; however it is our experience that despite the stringent quality controls it is still possible for batches with lower than desired quality to pass the pre-release data quality checks. If a lower quality batch is detected then it may be necessary to re-run the genotyping for that batch. We check the batch quality by comparing the missing call rates between batches and looking for significant allele frequency differences between batches. \subsection{Calculate Missing Call Rate for Samples and SNPs} The first step is to calculate the missing call rates for each SNP and for each sample. A high missing call rate for a sample is often indicative of a poorly performing sample. It has been seen that samples from DNA that has undergone whole-genome amplification (WGA) have a relatively higher missing call rate. Similarly a high missing call rate for a SNP is indicative of a problem SNP. Experience from the GENEVA studies has shown that there seem to be a subset of SNPs from which genotype calls are more difficult to make than others. We calculate the missing call rates in a two step process: first the missing call rates over all samples and SNPs are calculated, then the missing call rates are calculated again, filtering out SNPs and samples that have an initial missing call rate greater than 0.05. The initial SNP missing call rate over all samples is saved in the SNP annotation data file as \Rcode{missing.n1}. The analogous idea is applied to the samples: \Rcode{missing.e1} is saved in the sample annotation file and corresponds to the missing call rate per sample over all SNPs, excluding those SNPs with all calls missing. The \Rcode{missing.n2} is calculated as the call rate per SNP over all samples whose \Rcode{missing.e1} is less than 0.05. Again, similarly for the samples, \Rcode{missing.e2} is calculated for each sample over all SNPs with \Rcode{missing.n2} values less than 0.05. It is important to remember that the Y chromosome values should be calculated for males only, since we expect females to have no genotype values for the Y chromosome. \subsubsection*{Calculate \Rcode{missing.n1} } \label{sec:missing.n1} This step calculates and examines \Rcode{missing.n1}, the missing call rate per SNP over all samples by calling the function \Rcode{missingGenotypeBySnpSex}. This function takes a \Rcode{GenotypeData} object as an argument, and requires that the scan annotation of this object contains a ``sex'' column. There is also an option to send a vector of SNPs to exclude from the calculation, which is what we will use later to find \Rcode{missing.n2}. For now, we will use all SNPs for each sample, being sure to calculate by sex. The function returns a list, with one element that holds the missing counts for each SNP, one element that holds the sex counts, and one element that holds the fraction of missing calls. <<>>= # open the GDS file and create a GenotypeData object gdsfile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") gds <- GdsGenotypeReader(gdsfile) # sex is required for this function, so we need the scan annotation genoData <- GenotypeData(gds, scanAnnot=scanAnnot) # Calculate the number of missing calls for each snp over all samples # for each sex separately miss <- missingGenotypeBySnpSex(genoData) # Examine the results names(miss) head(miss$missing.counts) miss$scans.per.sex head(miss$missing.fraction) @ The Y chromosome should be missing for all females, but an occasional probe on the Y chromosome is called in a female. \texttt{missingGenotypeBySnpSex} excludes females when calculating the missing rate for Y chromosome SNPs. Note this may need to be changed later if there are some sex mis-annotations because the Y chromosome SNP missing call rates may change. We add the missing call rates to the SNP annotation. <<>>= # Make sure ordering matches snp annotation allequal(snpAnnot$snpID, as.numeric(names(miss$missing.fraction))) snpAnnot$missing.n1 <- miss$missing.fraction varMetadata(snpAnnot)["missing.n1", "labelDescription"] <- paste( "fraction of genotype calls missing over all samples", "except that females are excluded for Y chr SNPs") summary(snpAnnot$missing.n1) @ We plot the missing call rates so we can easily identify any outliers. We also find the number of SNPs with 100\% missing, and the fraction of SNPs with missing call rate less than 0.05 for each chromosome type. <>= hist(snpAnnot$missing.n1, ylim=c(0,100), xlab="SNP missing call rate", main="Missing Call Rate for All Probes") @ <<>>= # Find the number of SNPs with every call missing length(snpAnnot$missing.n1[snpAnnot$missing.n1 == 1]) # Fraction of autosomal SNPs with missing call rate < 0.05 x <- snpAnnot$missing.n1[snpAnnot$chromosome < 23] length(x[x < 0.05]) / length(x) # Fraction of X chromosome SNPs with missing call rate < 0.05 x <- snpAnnot$missing.n1[snpAnnot$chromosome == 23] length(x[x < 0.05]) / length(x) # Fraction of Y chromosome SNPs with missing call rate < 0.05 x <- snpAnnot$missing.n1[snpAnnot$chromosome == 25] length(x[x < 0.05]) / length(x) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection*{Calculate \Rcode{missing.e1} } This step calculates \Rcode{missing.e1}, which is the missing call rate per sample over all SNPs, by chromosome. We read in the new SNP annotation file which holds the \Rcode{missing.n1} variable. For those SNPs with a \Rcode{missing.n1} value less than one, we call the \Rcode{missingGenotypeByScanChrom} function that returns a list with one element holding the missing counts per sample by chromosome, one element holding the number of SNPs per chromosome, and one element holding the fraction of missing calls over all chromosomes. <<>>= # Want to exclude all SNP probes with 100% missing call rate # Check on how many SNPs to exclude sum(snpAnnot$missing.n1 == 1) # Create a variable that contains the IDs of these SNPs to exclude snpexcl <- snpAnnot$snpID[snpAnnot$missing.n1 == 1] length(snpexcl) # Calculate the missing call rate per sample miss <- missingGenotypeByScanChrom(genoData, snp.exclude=snpexcl) names(miss) head(miss$missing.counts) head(miss$snps.per.chr) # Check to make sure that the correct number of SNPs were excluded sum(miss$snps.per.chr) nrow(snpAnnot) - sum(miss$snps.per.chr) @ \Rcode{missingGenotypeByScanChrom} calculates the missing call rate for each sample over all SNPs. For females, the missing call rate does not include the probes on the Y chromosome. The values for \Rcode{missing.e1} are added to the sample annotation file. <<>>= head(miss$missing.fraction) # Check the ordering matches the sample annotation file allequal(names(miss$missing.fraction), scanAnnot$scanID) # Add the missing call rates vector to the sample annotation file scanAnnot$missing.e1 <- miss$missing.fraction varMetadata(scanAnnot)["missing.e1", "labelDescription"] <- paste( "fraction of genotype calls missing over all snps with missing.n1<1", "except that Y chr SNPs are excluded for females") summary(scanAnnot$missing.e1) @ We will create a histogram of the overall missing call rate per sample in order to identify any samples with a relatively larger missing call rate. It is known that genotype data taken from DNA that has been through whole-genome amplification (WGA) has an overall higher missing call rate; this is something that we would see at this step if any samples are of WGA origin. We also look at the summary of the missing call rate for females and males separately to ensure there are no large sex differences. Finally, we calculate the number of samples with a missing call rate greater than 0.05. In this case, there are no such samples but in other data this may not be the case. If any samples have a high missing rate, we recommend further investigation of what may be causing the missing calls; the samples with a missing call rate greater than 0.05 should be filtered out due to low sample quality. <>= hist(scanAnnot$missing.e1, xlab="Fraction of missing calls over all probes", main="Histogram of Sample Missing Call Rate for all Samples") @ <<>>= # Look at missing.e1 for males summary(scanAnnot$missing.e1[scanAnnot$sex == "M"]) # Look at missing.e1 for females summary(scanAnnot$missing.e1[scanAnnot$sex == "F"]) # Number of samples with missing call rate > 5% sum(scanAnnot$missing.e1 > 0.05) @ For some analyses we require the missing call rate for autosomes and the X chromosome to be separated. We calculate these values here and add them to the sample annotation file. Also, we will create a logical \Rcode{duplicated} variable. We can identify the duplicated scans in the sample annotation file by identifying the subject ids that occur more than once. Among samples with the same subject id, the one with the lowest \Rcode{missing.e1} value will have the variable \Rcode{duplicated} set to \Rcode{FALSE}. <<>>= auto <- colnames(miss$missing.counts) %in% 1:22 missa <- rowSums(miss$missing.counts[,auto]) / sum(miss$snps.per.chr[auto]) summary(missa) missx <- miss$missing.counts[,"X"] / miss$snps.per.chr["X"] summary(missx) # check they match sample annotation file allequal(names(missa), scanAnnot$scanID) allequal(names(missx), scanAnnot$scanID) # Add these separate sample missing call rates to the sample # annotation scanAnnot$miss.e1.auto <- missa scanAnnot$miss.e1.xchr <- missx # Order scanAnnot by missing.e1 so duplicate subjectIDs # with a higher missing rate are marked as duplicates scanAnnot <- scanAnnot[order(scanAnnot$subjectID, scanAnnot$missing.e1),] scanAnnot$duplicated <- duplicated(scanAnnot$subjectID) table(scanAnnot$duplicated, useNA="ifany") # Put scanAnnot back in scanID order; this is very important!! scanAnnot <- scanAnnot[order(scanAnnot$scanID),] allequal(scanAnnot$scanID, sort(scanAnnot$scanID)) varMetadata(scanAnnot)["duplicated", "labelDescription"] <- "TRUE for duplicate scan with higher missing.e1" @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection*{Calculate \Rcode{missing.n2} } This step calculates \Rcode{missing.n2}, which is the missing call rate per SNP with \Rcode{missing.e1} less than 0.05 over all samples. In some cases, there will be samples with missing call rate greater than 0.05. However, because of the high quality of the HapMap data, there are no such samples in this case. We will continue with the steps as if there are samples we must exclude from the \Rcode{missing.n2} calculation. We call the \Rcode{missingGenotypeBySnpSex} function just as we did to calculate for \Rcode{missing.n1}, but this time we include the list of sample numbers to exclude from the calculation (although here that list is empty). <<>>= # Find the samples with missing.e1 > .05 and make a vector of # scanID to exclude from the calculation scan.exclude <- scanAnnot$scanID[scanAnnot$missing.e1 > 0.05] # Call missingGenotypeBySnpSex and save the output miss <- missingGenotypeBySnpSex(genoData, scan.exclude=scan.exclude) snpAnnot$missing.n2 <- miss$missing.fraction varMetadata(snpAnnot)["missing.n2", "labelDescription"] <- paste( "fraction of genotype calls missing over all samples with missing.e1<0.05", "except that females are excluded for Y chr SNPs") summary(snpAnnot$missing.n2) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection*{Calculate \Rcode{missing.e2} } This step calculates \Rcode{missing.e2}, which is the missing call rate per sample over all SNPs with \Rcode{missing.n2} less than 0.05. <<>>= # Create a vector of the SNPs to exclude. snpexcl <- snpAnnot$snpID[snpAnnot$missing.n2 >= 0.05] length(snpexcl) miss <- missingGenotypeByScanChrom(genoData, snp.exclude=snpexcl) # Add the missing call rates vector to the sample annotation file scanAnnot$missing.e2 <- miss$missing.fraction varMetadata(scanAnnot)["missing.e2", "labelDescription"] <- paste( "fraction of genotype calls missing over all snps with missing.n2<0.05", "except that Y chr SNPs are excluded for females") summary(scanAnnot$missing.e2) @ We will create a histogram of the overall missing call rate per sample in order to identify any samples with a relatively larger missing call rate. <>=> hist(scanAnnot$missing.e2, xlab="Fraction of missing calls over all probes with missing call rate < 0.05", main="Histogram of Sample Missing Call Rate for all Samples") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Calculate Missing Call Rates by Batch} Next, the missing call rate by batch is calculated to check that there are no batches with comparatively lower call rates. Usually a ``batch'' is a plate containing samples that were processed together through the genotyping chemistry. In this case all samples were run on different plates (as controls for another dataset), so we use the simulated variable ``batch.'' We calculate the mean missing call rate for all samples in each of the batches. <<>>= varLabels(scanAnnot) # Check how many batches exist and how many samples are in each batch length(unique(scanAnnot$batch)) table(scanAnnot$batch, useNA="ifany") @ <>= # Plot the distribution of the number of samples per batch. barplot(table(scanAnnot$batch), ylab="Number of Samples", xlab="Batch", main="Distribution of Samples per Batch") @ <<>>= # Examine the mean missing call rate per batch for all SNPs batches <- unique(scanAnnot$batch) bmiss <- rep(NA,length(batches)); names(bmiss) <- batches bn <- rep(NA,length(batches)); names(bn) <- batches for(i in 1:length(batches)) { x <- scanAnnot$missing.e1[is.element(scanAnnot$batch, batches[i])] bmiss[i] <- mean(x) bn[i] <- length(x) } @ To find the slope of the regression line from the mean missing call rate per batch regressed on the number of samples per batch, we will take the results from ANOVA. Then we can plot the mean missing call rate against the number of samples in the batch with the regression line. For studies with more batches, this test can identify any batch outliers with regard to missing call rate for samples in a given batch. We can do the same analysis using the mean missing call rate for autosomal SNPs, or SNPs on the X chromosome in the exact same way, substituting \Rcode{missing.e1} with either \Rcode{miss.e1.auto} or \Rcode{miss.e1.xchr}. Because the results are nearly identical, we will not show them here. <<>>= y <- lm(bmiss ~ bn) anova(y) @ <>= plot(bn, bmiss, xlab="Number of samples per batch", ylab="Mean missing call rate", main="Mean Missing Call Rate vs\nSamples per Batch") abline(y$coefficients) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Chi-Square Test of Allelic Frequency Differences in Batches} In this step, the chi-square test for differences in allelic frequency is performed between each batch individually and a pool of all the other batches in the study. We then look at the mean $\chi^2$ statistic over all SNPs for each batch as a function of the ethnic composition of samples in a batch. We use the \Rcode{batch} variable in the scan annotation to identify the samples in each batch, so we must include the scan annotation in the \Rcode{GenotypeData} object. Then we call the function \Rcode{batchChisqTest} which calculates the $\chi^2$ values from $2 \times 2$ tables for each SNP, comparing each batch with the other batches. This function returns the genomic inflation factors for each batch, as well as matrix of $\chi^2$ values for each SNP. <<>>= res <- batchChisqTest(genoData, batchVar="batch", return.by.snp=TRUE) close(genoData) # chi-square values for each SNP dim(res$chisq) # genomic inflation factor res$lambda # average chi-square test statistic for each of the batches res$mean.chisq @ Next we test for association between batches and population groups, using a $\chi^2$ contingency test. Then we look at the relationship between the ethnic composition of each batch and the previously calculated $\chi^2$ test of allelic frequency between each batch and a pool of the other batches. The point is to look for batches that differ from others of similar ethnic composition, which might indicate a batch effect due to genotyping artifact. In this experiment, there are only a few batches and wide variations in race among batches, so it is difficult to interpret the results. In larger GWAS experiments, we generally observe a U-shaped curve of allelic frequency test statistic as a function of ethnic composition. <<>>= x <- table(scanAnnot$race, useNA="ifany") x x[1] / sum(x) x[2] / sum(x) x <- table(scanAnnot$race, scanAnnot$batch) x # Run an approximate chi-square test to see if there are ethnic effects chisq <- chisq.test(x) chisq$p.value # Calculate the fraction of samples in each batch that are CEU batches <- unique(scanAnnot$batch) eth <- rep(NA,length(batches)); names(eth) <- sort(batches) for(i in 1:length(batches)){ x <- scanAnnot$race[is.element(scanAnnot$batch, batches[i])] xl <- length(x[x == "CEU"]) eth[i] <- xl / length(x) } allequal(names(eth), names(res$mean.chisq)) @ <>= # Plot the average Chi-Square test statistic against the # fraction of samples that are CEU plot(eth, res$mean.chisq, xlab="Fraction of CEU Samples per Batch", ylab="Average Chi-square Test Statistic", main="Fraction of CEU Samples per Batch vs Average Chi-square Test Statistic") abline(v=mean(eth), lty=2, col="red") @ The $\chi^2$ test is not suitable when the $2 \times 2$ tables for each SNP have very small values. For arrays in which many SNPs have very low minor allele frequency, Fisher's exact test is more appropriate. The function \Rcode{batchFisherTest} can be used in a very similar way to \Rcode{batchChisqTest}, but the run time is significantly longer, as it iterates over each SNP. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Sample Quality Checks} In this step we examine sample quality using three methods. We check for outliers in genotype quality score; we check for anomalous sample-chromosome pairs using BAF variance analysis; lastly, we check sample missingness and heterozygosities. \subsection{Sample genotype quality scores} Genotype calling algorithms report quality scores and classify genotypes with insufficient confidence as missing. This code calculates the mean and median genotype quality score for each sample. Calculate quality scores by sample. The \Rcode{qualityScoreByScan} function requires both an \Rcode{IntensityData} object, to read the quality scores, and a \Rcode{GenotypeData} object, to determine which scans have missing genotypes and should be omitted from the calculation. <<>>= qxyfile <- system.file("extdata", "illumina_qxy.gds", package="GWASdata") qualGDS <- GdsIntensityReader(qxyfile) qualData <- IntensityData(qualGDS, scanAnnot=scanAnnot) genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") genoGDS <- GdsGenotypeReader(genofile) genoData <- GenotypeData(genoGDS, scanAnnot=scanAnnot) qual.results <- qualityScoreByScan(qualData, genoData) close(qualData) @ We plot the distribution of median quality scores; it is unsurprising that these are all good, given that some quality checking happens at the genotyping centers. Clear outliers in this plot would be cause for concern that the sample(s) in question were of significantly lower quality than the other samples. <>= hist(qual.results[,"median.quality"], main="Median Genotype Quality Scores of Samples", xlab="Median Quality") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{B Allele Frequency variance analysis} BAF is a standardized version of the polar coordinate angle (Section \ref{sec:baf_lrr}). It calculates the frequency of the B allele within a single sample. Under normal circumstances, the true frequency is 0, $\frac{1}{2}$, or 1. In cases of allelic imbalance the true frequencies may vary. For example, in a population of trisomic cells, the true frequencies would be 0, $\frac{1}{3}$, $\frac{2}{3}$, or 1. Here we calculate the variance of BAF (for SNPs called as heterozygotes) within a sliding window along each chromosome for each sample. Each chromosome is divided into 12 sections with equal numbers of SNPs and the variance is calculated in a window of two adjacent sections (one-sixth of the chromosome), which slides along the chromosome in increments of one section. Regions (windows) with very high BAF variance can indicate chromosomal anomalies. \subsubsection*{Calculate the sliding window BAF standard deviation} This process identifies chromosome-sample pairs that have windows with very high BAF standard deviation, with ``very high'' defined as more than 4 standard deviations from the window's mean BAF standard deviation over all samples. The output is a matrix listing all sample-chromosome pairs with high BAF standard deviations, the number of windows with high SDs in each pair, and the sample's sex. We examine plots of BAF by position for each identified chromosome-sample pair (though only a subset of plots are shown here). First, run the \Rcode{meanBAFbyScanChromWindow} function. This requires both an \Rcode{IntensityData} object with BAF and a \Rcode{GenotypeData} object. Its output is a list of matrices, with one matrix for each chromosome containing the standard deviation of BAF at each window in each scan. <<>>= blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata") blGDS <- GdsIntensityReader(blfile) blData <- IntensityData(blGDS, scanAnnot=scanAnnot) nbins <- rep(12, 3) slidingBAF12 <- sdByScanChromWindow(blData, genoData, nbins=nbins) names(slidingBAF12) dim(slidingBAF12[["21"]]) @ The function \Rcode{meanBAFSDbyChromWindow} calculates the mean and standard deviation of the BAF standard deviations in each window in each chromosome over all samples. For the X chromosome, males and females are calculated separately, and we save the results split by sex. <<>>= sds.chr <- meanSdByChromWindow(slidingBAF12, scanAnnot$sex) sds.chr[["21"]] sds.chr[["X"]] @ Next, identify windows within sample-chromosome pairs that have very high BAF standard deviations compared to the same window in other samples. <<>>= res12bin4sd <- findBAFvariance(sds.chr, slidingBAF12, scanAnnot$sex, sd.threshold=4) head(res12bin4sd) table(res12bin4sd[, "chromosome"]) @ Call \Rcode{chromIntensityPlot} to plot the BAF of all SNPs on the indicated chromosome-sample pairs against position. This yields many plots that must be individually examined to distinguish noisy data from chromosomal abnormalities. <>= scanID <- as.integer(res12bin4sd[, "scanID"]) chrom <- as.integer(res12bin4sd[, "chromosome"]) chrom[res12bin4sd[, "chromosome"] == "X"] <- 23 bincode <- paste("Bin", res12bin4sd[, "bin"], sep = " ") chromIntensityPlot(blData, scanID, chrom, info=bincode, ideogram=FALSE) close(blData) @ At this stage, we have generated plots of those chromosomes (over all chromosomes and samples) that have unusually high BAF standard deviation. The next step in the process is to examine each of these plots to look for evidence of sample contamination or other quality issues. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Missingness and heterozygosity within samples} This step calculates the percent of missing and heterozygous genotypes in each chromosome of each sample. We create boxplots of missingness by individual chromosome, as well as autosomal and X chromosome heterozygosity in each population. This allows for identification of samples that may have relatively high heterozygosity for all chromosomes, indicating a possible mixed sample. Further, we are able to identify any outliers with regard to missingness. Plotting by chromosome enables visualization of chromosomal artifacts on a particular subset of SNPs that lie on a chromosome. We will call the function \Rcode{missingGenotypeByScanChrom} to calculate the missing call rate. Since the function returns missing counts per chromosome as well as snps per chromosome, we divide to find the missing call rate per chromosome. We then make a boxplot of missingness in the autosomes, the X chromosome, and the pseudoautosomal region, and a boxplot of X chromosome missingness for each sex. <<>>= miss <- missingGenotypeByScanChrom(genoData) miss.rate <- t(apply(miss$missing.counts, 1, function(x) { x / miss$snps.per.chr})) miss.rate <- as.data.frame(miss.rate) @ <>= cols <- names(miss.rate) %in% c(1:22, "X", "XY") boxplot(miss.rate[,cols], main="Missingness by Chromosome", ylab="Proportion Missing", xlab="Chromosome") @ <>= boxplot(miss.rate$X ~ scanAnnot$sex, main="X Chromosome Missingness by Sex", ylab="Proportion Missing") @ We will call the function \Rcode{hetByScanChrom} to calculate the heterozygosity. We store the heterozygosity calculations in the sample annotation. <<>>= # Calculate heterozygosity by scan by chromosome het.results <- hetByScanChrom(genoData) close(genoData) # Ensure heterozygosity results are ordered correctly allequal(scanAnnot$scanID, rownames(het.results)) # Write autosomal and X chr heterozygosity to sample annot scanAnnot$het.A <- het.results[,"A"] scanAnnot$het.X <- het.results[,"X"] varMetadata(scanAnnot)["het.A", "labelDescription"] <- "fraction of heterozygotes for autosomal SNPs" varMetadata(scanAnnot)["het.X", "labelDescription"] <- "fraction of heterozygotes for X chromosome SNPs" @ There are two plots for heterozygosity. First is a boxplot of heterozygosity over the autosomes, subsetted by population. We recommend examining BAF plots for high heterozygosity outliers, to look for evidence of sample contamination (more than 3 bands on all chromosomes). Examination of low heterozygosity samples may also identify chromosomal anomalies with wide splits in the intermediate BAF band. Second is a boxplot of female heterozygosity on the X chromosome, subsetted by population. <>= boxplot(scanAnnot$het.A ~ scanAnnot$race, main="Autosomal Heterozygosity") @ <>= female <- scanAnnot$sex == "F" boxplot(scanAnnot$het.X[female] ~ scanAnnot$race[female], main="X Chromosome Heterozygosity in Females") @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Sample Identity Checks} This step performs a series of identity checks on the samples. First, samples are analyzed to determine if there exist any discrepancies between the annotated sex and genetic sex in the sample. Next, the relatedness among samples is investigated through IBD estimation. Finally, the samples are checked for potential population substructure, which if unidentified can threaten the validity of subsequent analyses. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Mis-annotated Sex Check} This section looks for discrepancies between the annotated sex and genetic sex. Sex is usually inferred from X chromosome heterozygosity, but our experience is that this variable can give ambiguous results when used alone (for example, in XXY males or due to genotyping artifacts). Plots of the mean allelic intensities of SNPs on the X and Y chromosomes can identify mis-annotated sex as well as sex chromosome aneuploidies. It is important to have accurate sex annotation not only for completeness but also for analyses which treat male and female samples separately. Any found sex mis-annotations are presented to the investigators in order to resolve discrepancies. If a genetic and recorded sex do not match, a collective decision must be made regarding the inclusion of those genetic data. In some cases a recording error explains the discrepancy, but more often the discrepancy is unexplained. These cases are assumed to be a sample mis-identification and these samples are excluded from subsequent analyses. In order to compare the mean X and Y chromosome intensities for all samples, we must calculate the mean intensity for each sample by chromosome. The function \Rcode{meanIntensityByScanChrom} calculates for each sample the mean and standard deviation of the sum of the two allelic intensities for each probe on a given chromosome. A matrix with one row per sample and one column per chromosome with entries $[i,j]$ corresponding to either the mean or standard deviation of all probe intensities for the $i^{\text th}$ sample and the $j^{\text th}$ chromosome is returned from the function. Note that ``X'' and ``Y'' in the list names refer to the X and Y intensity values and not to the chromosomes. <<>>= qxyfile <- system.file("extdata", "illumina_qxy.gds", package="GWASdata") intenGDS <- GdsIntensityReader(qxyfile) inten.by.chrom <- meanIntensityByScanChrom(intenGDS) close(intenGDS) names(inten.by.chrom) @ Now we will use the calculated mean intensities by sample to identify any sex mis-annotation or sex chromosome aneuploidies. For the plots, we will create a color coding corresponding to the annotated sex, with blue for males and red for females. We also use the SNP annotation to find the probe counts for the X and Y chromosomes; we use these in the plot axis labels. <<>>= mninten <- inten.by.chrom[[1]] # mean intensities dim(mninten) # Check to be sure sample ordering is consistent allequal(scanAnnot$scanID, rownames(mninten)) # Assign each sex a color xcol <- rep(NA, nrow(scanAnnot)) xcol[scanAnnot$sex == "M"] <- "blue" xcol[scanAnnot$sex == "F"] <- "red" nx <- sum(snpAnnot$chromosome == 23) ny <- sum(snpAnnot$chromosome == 25) @ For two of the plots we will create next, we use the autosome and X chromosome heterozygosity values calculated in an earlier step and stored in the sample annotation. Four plots will now be created: mean X chromosome intensity versus mean Y chromosome intensity, mean X chromosome intensity versus X chromosome heterozygosity, mean X chromosome heterozygosity versus mean Y chromosome intensity and mean autosomal heterozygosity versus mean X chromosome heterozygosity. The fourth plot applies to annotated females only, since males are expected to have zero heterozygosity on the X chromosome. <>= #All intensities x1 <-mninten[,"X"]; y1 <- mninten[,"Y"] main1 <- "Mean X vs \nMean Y Chromosome Intensity" #Het on X vs X intensity x2 <- mninten[,"X"]; y2 <- scanAnnot$het.X main2 <- "Mean X Chromosome Intensity vs Mean X Chromosome Heterozygosity" # Het on X vs Y intensity y3 <- mninten[,"Y"]; x3 <- scanAnnot$het.X main3 <- "Mean X Chromosome Heterozygosity vs Mean Y Chromosome Intensity" # X vs A het x4 <- scanAnnot$het.A[scanAnnot$sex == "F"] y4 <- scanAnnot$het.X[scanAnnot$sex == "F"] main4 <- "Mean Autosomal Heterozygosity vs Mean X Chromosome Heterozygosity" cols <- c("blue","red") mf <- c("male", "female") xintenlab <- paste("X intensity (n=", nx, ")", sep="") yintenlab <- paste("Y intensity (n=", ny, ")", sep="") pdf("DataCleaning-sex.pdf") par(mfrow=c(2,2)) plot(x1, y1, xlab=xintenlab, ylab=yintenlab, main=main1, col=xcol, cex.main=0.8) legend("topright",mf,col=cols,pch=c(1,1)) plot(x2, y2, col=xcol, xlab=xintenlab, ylab="X heterozygosity", main=main2, cex.main=0.8) plot(x3, y3, col=xcol, ylab=yintenlab, xlab="X heterozygosity", main=main3, cex.main=0.8) plot(x4,y4, col="red", xlab="Autosomal heterozygosity", ylab="X heterozygosity", main=main4, cex.main=0.8) dev.off() @ \includegraphics[width=5.67in,height=5.67in,keepaspectratio]{DataCleaning-sex} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Relatedness and IBD Estimation} In most studies, there are discrepancies between pedigrees provided and relatedness inferred from the genotype data. To infer genetic relatedness, we estimate coefficients of identity by descent (IBD). It is important to identify and record unannotated relationships so that analyses assuming all subjects are unrelated can use a filtered subset of samples. From our experience, it is difficult to accurately estimate low levels of relatedness, but higher levels can be more reliably determined. Users are encouraged to employ analyses which take into accounts the IBD estimates themselves rather than discrete relationship coefficients for any relationships. The \Rpackage{SNPRelate} package includes three methods for calculating IBD: maximum likelihood estimation (MLE), which is accurate but computationally intensive, PLINK Method of Moments (MoM), which is faster but does not perform well with multiple ancestry groups analyzed together, and KING, which is robust to population structure\footnote{Manichaikul et al, Robust relationship inference in genome-wide association studies, Bioinformatics, 26(22):2867–2873, 2010}. This example will use the KING method. <<>>= library(SNPRelate) gdsfile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") gdsobj <- snpgdsOpen(gdsfile) ibdobj <- snpgdsIBDKING(gdsobj) snpgdsClose(gdsobj) names(ibdobj) dim(ibdobj$kinship) ibdobj$kinship[1:5,1:5] @ We find the expected relationships between samples based on the pedigree data that is stored in the sample annotation file. We will create a subset of the sample annotation that has one line per sample and columns that hold family, father and mother ids, where an entry of \Rcode{0} indicates no familial data. Then the function \Rcode{pedigreeCheck} is called, which determines if there are any duplicates, singleton families, mothers/fathers whose sex does not match, impossible relationships, subfamilies, or missing entries. <<>>= ped <- pData(scanAnnot)[,c("family", "subjectID", "father", "mother", "sex")] dim(ped) names(ped) <- c("family", "individ", "father", "mother", "sex") ped[1:5, ] (chk <- pedigreeCheck(ped)) @ The functions that determine expected relationships require no duplicates in the pedigree, so we remove them with \Rcode{pedigreeDeleteDuplicates}. <<>>= dups <- chk$duplicates uni.ped <- pedigreeDeleteDuplicates(ped, dups) (chk <- pedigreeCheck(uni.ped)) @ There is one parent with no individual entry, so we add a row for that parent. <<>>= ni <- chk$parent.no.individ.entry parent <- data.frame(family=ni$family, individ=ni$parentID, father=0, mother=0, sex="F", stringsAsFactors=FALSE) ped.complete <- rbind(uni.ped, parent) (chk <- pedigreeCheck(ped.complete)) @ There are multiple subfamilies identified, so we will need to assign new family IDs to the subfamilies. One subfamily has two unrelated people (likely founders), so we remove this family from the pedigree. <<>>= ped.complete <- ped.complete[ped.complete$family != 58,] subf <- chk$subfamilies.ident table(subf$family) subf.ids <- subf$individ[subf$subfamily == 2] newfam <- ped.complete$individ %in% subf.ids ped.complete$family[newfam] <- paste0(ped.complete$family[newfam], "-2") table(ped.complete$family) pedigreeCheck(ped.complete) @ The revised pedigree passes all checks. Now from the verified sample list excluding duplicate samples, we can calculate the expected relationships among the samples by calling the function \Rcode{pedigreePairwiseRelatedness}. The relationships looked for as annotated include: unrelated (U), parent/offspring (PO), full siblings (FS), second-degree relatives (half-siblings, avuncular and grandparent-grandchild), and third-degree relatives (first cousins). Families where mothers and fathers are related are also looked for among the family annotations. <<>>= rels <- pedigreePairwiseRelatedness(ped.complete) length(rels$inbred.fam) relprs <- rels$relativeprs relprs[1:5,] table(relprs$relation) @ In order to plot the IBD coefficient estimates color coded by expected relationships, we retrieve a data.frame of sample pairs with $KC > 1/32$. The samples must be coded in terms of subject id and each pair of samples must be annotated with the expected relationship. We can also assign observed relationships based on the values of $k0$ and $k1$. <<>>= samp <- pData(scanAnnot)[,c("scanID", "subjectID")] samp <- samp[match(ibdobj$sample.id, samp$scanID),] names(samp) <- c("scanID", "Individ") ibd <- snpgdsIBDSelection(ibdobj, kinship.cutoff=1/32) ibd <- merge(ibd, samp, by.x="ID1", by.y="scanID") ibd <- merge(ibd, samp, by.x="ID2", by.y="scanID", suffixes=c("1","2")) ibd$ii <- pasteSorted(ibd$Individ1, ibd$Individ2) relprs$ii <- pasteSorted(relprs$Individ1, relprs$Individ2) ibd <- merge(ibd, relprs[,c("ii","relation")], all.x=TRUE) names(ibd)[names(ibd) == "relation"] <- "exp.rel" ibd$exp.rel[ibd$Individ1 == ibd$Individ2] <- "Dup" ibd$exp.rel[is.na(ibd$exp.rel)] <- "U" table(ibd$exp.rel, useNA="ifany") # assign observed relationships ibd$obs.rel <- ibdAssignRelatednessKing(ibd$IBS0, ibd$kinship) table(ibd$obs.rel, useNA="ifany") @ Now the pedigree information is in the proper format for the IBD estimates to be plotted for each pair of samples, color coded by expected relationship. <>= ## thresholds for assigning relationships using kinship coefficients ## in table 1 of Manichaikul (2010) cut.dup <- 1/(2^(3/2)) cut.deg1 <- 1/(2^(5/2)) cut.deg2 <- 1/(2^(7/2)) cut.deg3 <- 1/(2^(9/2)) cols <- c(Dup="magenta", PO="cyan", U="black") plot(ibd$IBS0, ibd$kinship, col=cols[ibd$exp.rel], xlab="Fraction of IBS=0", ylab="Kinship coefficient") abline(h=c(cut.deg1, cut.deg2, cut.deg3, cut.dup), lty=2, col="gray") legend("topright", legend=names(cols), col=cols, pch=1) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Population Structure} \subsubsection*{Principal Component Analysis on all ethnic groups} In this section, we perform principal component analysis (PCA) in order to detect any population substructure that may exist among samples in a study. After calculating the eigenvectors for the samples, we plot the values for each of the first 4 eigenvectors in a pairwise fashion for each individual. By color coding the plots by annotated race and/or ethnicity, we can identify any individuals whose recorded self-identified race/ethnicity differs from their inferred genetic ancestry. Further, we can use the PCA-identified continental ancestry when stratifying samples by population group. It may also be useful to include the values of some eigenvectors as covariates in association tests. For PCA, we use linkage disequilibrium (LD) pruning (\Rcode{snpgdsLDpruning}) to select a set of SNPs within which each pair has a low level of LD (e.g.\ $r^2<0.1$ in a sliding 10 Mb window), from a starting pool of autosomal SNPs with \Rcode{missing.n1} $<0.05$ and MAF $<0.05$. We also remove SNPs in regions with known correlation (2q21 (LCT), HLA, 8p23, and 17q21.31). We must also ensure no duplicate samples are used for the principal component calculations. <<>>= filt <- get(data(pcaSnpFilters.hg18)) chrom <- getChromosome(snpAnnot) pos <- getPosition(snpAnnot) snpID <- getSnpID(snpAnnot) snp.filt <- rep(TRUE, length(snpID)) for (f in 1:nrow(filt)) { snp.filt[chrom == filt$chrom[f] & filt$start.base[f] < pos & pos < filt$end.base[f]] <- FALSE } snp.sel <- snpID[snp.filt] length(snp.sel) sample.sel <- scanAnnot$scanID[!scanAnnot$duplicated] length(sample.sel) gdsobj <- snpgdsOpen(gdsfile) snpset <- snpgdsLDpruning(gdsobj, sample.id=sample.sel, snp.id=snp.sel, autosome.only=TRUE, maf=0.05, missing.rate=0.05, method="corr", slide.max.bp=10e6, ld.threshold=sqrt(0.1)) snp.pruned <- unlist(snpset, use.names=FALSE) length(snp.pruned) @ The \Rcode{snpgdsPCA} function is called with the SNP and sample subsets to calculate the first 32 eigenvectors. <<>>= pca <- snpgdsPCA(gdsobj, sample.id=sample.sel, snp.id=snp.pruned) names(pca) length(pca$eigenval) dim(pca$eigenvect) @ We will make a pairs plot showing the first four eigenvectors. A simple calculation is made to find the fraction of variance among the samples as explained by each eigenvector. <<>>= # Calculate the percentage of variance explained # by each principal component. pc.frac <- pca$eigenval/sum(pca$eigenval) lbls <- paste("EV", 1:4, "\n", format(pc.frac[1:4], digits=2), sep="") samp <- pData(scanAnnot)[match(pca$sample.id, scanAnnot$scanID),] cols <- rep(NA, nrow(samp)) cols[samp$race == "CEU"] <- "blue" cols[samp$race == "YRI"] <- "red" @ <>= pairs(pca$eigenvect[,1:4], col=cols, labels=lbls, main = "CEU: blue, YRI: red") @ \subsubsection*{Parallel Coordinates Plot} A handy method of visualizing the effects of eigenvectors on clusters for a principal components analysis is the parallel coordinates plot. The genetic diversity in the YRI group is apparent in the later eigenvectors, while the remaining groups remain in clusters throughout. <>= par.coord <- pca$eigenvect rangel <- apply(par.coord, 2, function(x) range(x)[1]) rangeh <- apply(par.coord, 2, function(x) range(x)[2]) std.coord <- par.coord for (i in 1:14) std.coord[,i] <- (par.coord[,i] - rangel[i])/(rangeh[i]-rangel[i]) plot(c(0,15), c(0,1), type = 'n', axes = FALSE, ylab = "", xlab = "", main = "Parallel Coordinates Plot CEU: blue, YRI: red") for (j in 1:13) for (i in sample(1:nrow(std.coord)) ) lines(c(j,j+1), std.coord[i,c(j,j+1)], col=cols[i], lwd=0.25) axis(1, at = 1:14, labels = paste("PC",1:14, sep = ".")) @ \subsubsection*{SNP-PC correlation} We confirm that there are no correlations between SNP regions and specific eigenvectors by examining plots of SNP correlation vs. position on the chromosome. Usually we check the first 8--12 eigenvectors, but here we plot only 1--4. <>= corr <- snpgdsPCACorr(pca, gdsobj, eig.which=1:4) snpgdsClose(gdsobj) snp <- snpAnnot[match(corr$snp.id, snpID),] chrom <- getChromosome(snp, char=TRUE) pdf("DataCleaning-corr.pdf") par(mfrow=c(4,1)) for (i in 1:4) { snpCorrelationPlot(abs(corr$snpcorr[i,]), chrom, main=paste("Eigenvector",i), ylim=c(0,1)) } dev.off() @ \includegraphics[width=5.67in,height=5.67in,keepaspectratio]{DataCleaning-corr} \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case-Control Confounding} We recommend checking for case-control confounding as part of the data cleaning process for GWAS. This involves checking both the principal components and the missing call rate for a relationship with case status. \subsection{Principal Components Differences} \label{sec:pca_case_cntrl} This step examines differences in principal components according to case-control status. Collate PCA information with sample number, case-control status, and population group. <<>>= princomp <- as.data.frame(pca$eigenvect) samples.nodup <- pData(scanAnnot)[!scanAnnot$duplicated,] princomp$scanID <- as.factor(samples.nodup$scanID) princomp$case.ctrl.status <- as.factor(samples.nodup$status) princomp$race <- as.factor(samples.nodup$race) @ The code below gives what percent of variation is accounted for by the principal component for the first 32 PCs. <<>>= pc.percent <- 100 * pca$eigenval[1:32]/sum(pca$eigenval) pc.percent lbls <- paste("EV", 1:3, "\n", format(pc.percent[1:3], digits=2), "%", sep="") table(samples.nodup$status) cols <- rep(NA, nrow(samples.nodup)) cols[samples.nodup$status == 1] <- "green" cols[samples.nodup$status == 0] <- "magenta" @ We plot the principal component pairs for the first three PCs by case-control status. We then make boxplots for the first few PCs to show differences between cases and controls, along with a two-factor ANOVA accounting for case-control status and population group. Since we are using randomized case-control status, we do not expect to see a significant difference in principal components between cases and controls, when considering population group. <>= pairs(pca$eigenvect[,1:3], col=cols, labels=lbls, main = "First Three EVs by Case-Control Status") @ <>= boxplot(princomp[, 1] ~ princomp$case.ctrl.status, ylab = "PC1", main = "PC1 vs. Case-control Status") @ <>= boxplot(princomp[, 2] ~ princomp$case.ctrl.status, ylab = "PC2", main = "PC2 vs. Case-control Status") @ <>= boxplot(princomp[, 3] ~ princomp$case.ctrl.status, ylab = "PC3", main = "PC3 vs. Case-control Status") @ <<>>= aov.p1 <- aov(princomp[,1] ~ princomp$race * princomp$case.ctrl.status, princomp) summary(aov.p1) aov.p2 <- aov(princomp[,2] ~ princomp$race * princomp$case.ctrl.status, princomp) summary(aov.p2) aov.p3 <- aov(princomp[,3] ~ princomp$race * princomp$case.ctrl.status, princomp) summary(aov.p3) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Missing Call Rate Differences} This step determines whether there are differences in missing call rates between cases and controls. As in section~\ref{sec:pca_case_cntrl}, we use simulated case-control status to demonstrate this step, since the HapMap II data does not contain information on cases and controls. Investigate the difference in mean missing call rate by case-control status, using the sample annotation variable \Rcode{missing.e1}. Here, since the case-control status was randomly assigned, we do not expect to see a difference in any of the missing call rates with respect to case-control status. <<>>= lm.all <- lm(scanAnnot$missing.e1 ~ scanAnnot$status) summary(aov(lm.all)) @ <>= boxplot(scanAnnot$missing.e1 ~ scanAnnot$status, ylab = "Mean missing call rate", main="Mean missing call rate by case status") @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Chromosome Anomaly Detection} This step looks for large chromosomal anomalies that may be filtered out during the final analysis. \subsection{B Allele Frequency filtering} Create an \texttt{IntensityData} object and a \texttt{GenotypeData} object. <<>>= blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata") blgds <- GdsIntensityReader(blfile) blData <- IntensityData(blgds, snpAnnot=snpAnnot, scanAnnot=scanAnnot) genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") genogds <- GdsGenotypeReader(genofile) genoData <- GenotypeData(genogds, snpAnnot=snpAnnot, scanAnnot=scanAnnot) @ Identify some low quality samples by looking at the standard deviation of BAF. <<>>= baf.sd <- sdByScanChromWindow(blData, genoData, var="BAlleleFreq") med.baf.sd <- medianSdOverAutosomes(baf.sd) low.qual.ids <- med.baf.sd$scanID[med.baf.sd$med.sd > 0.05] @ Decide which SNPs to exclude based on genome build. <<>>= chrom <- snpAnnot$chromosome pos <- snpAnnot$position hla.df <- get(data(HLA.hg18)) hla <- chrom == "6" & pos >= hla.df$start.base & pos <= hla.df$end.base xtr.df <- get(data(pseudoautosomal.hg18)) xtr <- chrom == "X" & pos >= xtr.df["X.XTR", "start.base"] & pos <= xtr.df["X.XTR", "end.base"] centromeres <- get(data(centromeres.hg18)) gap <- rep(FALSE, nrow(snpAnnot)) for (i in 1:nrow(centromeres)) { ingap <- chrom == centromeres$chrom[i] & pos > centromeres$left.base[i] & pos < centromeres$right.base[i] gap <- gap | ingap } ignore <- snpAnnot$missing.n1 == 1 #ignore includes intensity-only and failed snps snp.exclude <- ignore | hla | xtr | gap snp.ok <- snpAnnot$snpID[!snp.exclude] @ We use circular binary segmentation to find change points in BAF. <<>>= scan.ids <- scanAnnot$scanID[1:10] chrom.ids <- 21:23 baf.seg <- anomSegmentBAF(blData, genoData, scan.ids=scan.ids, chrom.ids=chrom.ids, snp.ids=snp.ok, verbose=FALSE) head(baf.seg) @ Filter segments to detect anomalies, treating the low quality samples differently. <<>>= baf.anom <- anomFilterBAF(blData, genoData, segments=baf.seg, snp.ids=snp.ok, centromere=centromeres, low.qual.ids=low.qual.ids, verbose=FALSE) names(baf.anom) baf.filt <- baf.anom$filtered head(baf.filt) @ \subsection{Loss of Heterozygosity} We look for Loss of Heterozygosity (LOH) anomalies by identifying homozygous runs with change in LRR. Change points in LRR are found by circular binary segmentation. Known anomalies from the BAF detection are excluded. <<>>= loh.anom <- anomDetectLOH(blData, genoData, scan.ids=scan.ids, chrom.ids=chrom.ids, snp.ids=snp.ok, known.anoms=baf.filt, verbose=FALSE) names(loh.anom) loh.filt <- loh.anom$filtered head(loh.filt) @ \subsection{Statistics} Calculate statistics for the anomalous segments found with the BAF and LOH methods. <<>>= # create required data frame baf.filt$method <- "BAF" if (!is.null(loh.filt)) { loh.filt$method <- "LOH" cols <- intersect(names(baf.filt), names(loh.filt)) anoms <- rbind(baf.filt[,cols], loh.filt[,cols]) } else { anoms <- baf.filt } anoms$anom.id <- 1:nrow(anoms) stats <- anomSegStats(blData, genoData, snp.ids=snp.ok, anom=anoms, centromere=centromeres) names(stats) @ Plot the anomalies with relevant statistics, one anomaly per plot. Each plot has two parts: upper part is a graph of LRR and lower part is a graph of BAF. <>= snp.not.ok <- snpAnnot$snpID[snp.exclude] anomStatsPlot(blData, genoData, anom.stats=stats[1,], snp.ineligible=snp.not.ok, centromere=centromeres, cex.leg=1) @ \subsection{Identify low quality samples} To identify low quality samples, one measure we use is the standard deviation of BAF and LRR. BAF results were found previously, now we find results for LRR. Unlike for BAF, all genotypes are included. <<>>= lrr.sd <- sdByScanChromWindow(blData, var="LogRRatio", incl.hom=TRUE) med.lrr.sd <- medianSdOverAutosomes(lrr.sd) @ We also need the number of segments found using circular binary segmentation in anomaly detection. <<>>= baf.seg.info <- baf.anom$seg.info loh.seg.info <- loh.anom$base.info[,c("scanID", "chromosome", "num.segs")] @ We identify low quality samples separately for BAF and LOH, using different threshold parameters. A SnpAnnotationDataFrame with an ``eligible'' column is required. BAF detected anomalies for low quality BAF samples tend to have higher false positive rate. LOH detected anomalies for low quality LOH samples tend to have higher false positive rate. <<>>= snpAnnot$eligible <- !snp.exclude baf.low.qual <- anomIdentifyLowQuality(snpAnnot, med.baf.sd, baf.seg.info, sd.thresh=0.1, sng.seg.thresh=0.0008, auto.seg.thresh=0.0001) loh.low.qual <- anomIdentifyLowQuality(snpAnnot, med.lrr.sd, loh.seg.info, sd.thresh=0.25, sng.seg.thresh=0.0048, auto.seg.thresh=0.0006) @ Close the \texttt{IntensityData} and \texttt{GenotypeData} objects. <<>>= close(blData) close(genoData) @ \subsection{Filter anomalies}\label{sec:filt_anom} We can set genotypes in anomaly regions to missing for future analyses (such as Hardy-Weinberg equilibrium and association tests). We use the function \Rcode{setMissingGenotypes} to create a new GDS file with anomaly regions set to \Rcode{NA}. We recommend inspecting the plots from \Rcode{anomStatsPlot} for large anomalies (e.g., $>5$ Mb) to identify those anomalies that case genotyping errors. We can also exclude certain samples, such as duplicates, low quality samples, and samples with unresolved identity issues. <<>>= # anomalies to filter anom.filt <- stats[,c("scanID", "chromosome", "left.base", "right.base")] # whole.chrom column is required and can be used for sex chromosome # anomalies such as XXX anom.filt$whole.chrom <- FALSE # select unique subjects subj <- scanAnnot$scanID[!scanAnnot$duplicated] subj.filt.file <- "subj_filt.gds" setMissingGenotypes(genofile, subj.filt.file, anom.filt, file.type="gds", sample.include=subj, verbose=FALSE) (gds <- GdsGenotypeReader(subj.filt.file)) close(gds) @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{SNP Quality Checks} This step finds SNPs that may not be suitable for use in GWAS studies due to genotyping artifacts. Three methods are used to look at the genotyping error rates for each SNP: duplicate sample discordance, Mendelian error rates and deviation from Hardy-Weinberg equilibrium. \subsection{Duplicate Sample Discordance} This step calculates the discordance of genotype calls between samples that are duplicates. Genotype discordance is evaluated by comparing the genotypes of samples that were genotyped more than once. We can examine the discordance rate with respect to samples or SNPs. The discordance rate for a pair of samples is the fraction of genotype calls that differ over all SNPs for which both calls are non-missing. The discordance rate for a SNP is the number of calls that differ divided by the number of duplicate pairs in which both calls are non-missing. Keep the samples with a low enough value for the missing call rate, \Rcode{missing.e1}. The threshold chosen here is 0.05. <<>>= scan.excl <- scanAnnot$scanID[scanAnnot$missing.e1 >= 0.05] length(scan.excl) @ We make a vector of SNP \Rcode{snpIDs} with \Rcode{missing.n1} = 1 to exclude from the comparison. We then call the \Rcode{duplicateDiscordance} function and save the output file. This function finds subjectIDs for which there is more than one scanID. To look at the discordance results, we will calculate the percentage value and look at the summary of the values for each of the duplicate pairs. We will plot the rates color coded by continental ancestry, since experience has shown the values often differ based upon the population group. <<>>= snp.excl <- snpAnnot$snpID[snpAnnot$missing.n1 == 1] length(snp.excl) genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") genoGDS <- GdsGenotypeReader(genofile) genoData <- GenotypeData(genoGDS, snpAnnot=snpAnnot, scanAnnot=scanAnnot) dupdisc <- duplicateDiscordance(genoData, subjName.col="subjectID", scan.exclude=scan.excl, snp.exclude=snp.excl) names(dupdisc) head(dupdisc$discordance.by.snp) length(dupdisc$discordance.by.subject) dupdisc$discordance.by.subject[[2]] # each entry is a 2x2 matrix, but only one value of each # is important since these are all pairs npair <- length(dupdisc$discordance.by.subject) disc.subj <- rep(NA, npair) subjID <- rep(NA, npair) race <- rep(NA, npair) for (i in 1:npair) { disc.subj[i] <- dupdisc$discordance.by.subject[[i]][1,2] subjID[i] <- names(dupdisc$discordance.by.subject)[i] race[i] <- scanAnnot$race[scanAnnot$subjectID == subjID[i]][1] } dat <- data.frame(subjID=subjID, disc=disc.subj, pop=race, stringsAsFactors=FALSE) summary(dat$disc) # Assign colors for the duplicate samples based on population group. dat$col <- NA dat$col[dat$pop == "CEU"] <- "blue" dat$col[dat$pop == "YRI"] <- "red" dat <- dat[order(dat$disc),] dat$rank <- 1:npair @ <>= # Plot the sample discordance rates color coded by race. plot(dat$disc, dat$rank, col=dat$col, ylab="rank", xlab="Discordance rate between duplicate samples", main="Duplicate Sample Discordance by Continental Ancestry") legend("bottomright", unique(dat$pop), pch=rep(1,2), col=unique(dat$col)) @ Genotyping error rates can be estimated from duplicate discordance rates. The genotype at any SNP may be called correctly, or miscalled as either of the other two genotypes. If $\alpha$ and $\beta$ are the two error rates, the probability that duplicate genotyping instances of the same participant will give a discordant genotype is $2[(1-\alpha - \beta )(\alpha + \beta ) + \alpha \beta ]$. When $\alpha$ and $\beta$ are very small, this is approximately $2(\alpha + \beta)$ or twice the total error rate. Potentially, each true genotype has different error rates (i.e. three $\alpha$ and three $\beta$ parameters), but here we assume they are the same. A rough estimate of the mean error rate is half the median discordance rate over all sample pairs. Duplicate discordance estimates for individual SNPs can be used as a SNP quality filter. The challenge here is to find a level of discordance that would eliminate a large fraction of SNPs with high error rates, while retaining a large fraction with low error rates. The probability of observing $> x$ discordant genotypes in a total of $n$ pairs of duplicates can be calculated using the binomial distribution. <<>>= duplicateDiscordanceProbability(npair) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Mendelian Error Checking} This step calculates and examines the Mendelian error rates. Mendelian errors are detected in parent-offspring trios or pairs as offspring genotypes that are inconsistent with Mendelian inheritance. We use the \Rcode{mendelErr} function to calculate a Mendelian error rate per SNP. Lastly some checks are done on Mendelian error rates per family. To call the Mendelian error checking function, we first must create a \Rcode{mendelList} object. We will call \Rcode{mendelList} that creates a list of trios, checking for any sex inconsistencies among annotated father and mother samples. Then, \Rcode{mendelListAsDataFrame} puts this list into a data frame for easier checking. Finally, we can call the \Rcode{mendelErr} function to find the Mendelian errors for SNPs with \Rcode{missing.n1} less than 0.05. <<>>= men.list <- with(pData(scanAnnot), mendelList(family, subjectID, father, mother, sex, scanID)) res <- mendelListAsDataFrame(men.list) head(res) dim(res) # Only want to use SNPs with missing.n1 < 0.05 snp.excl <- snpAnnot$snpID[snpAnnot$missing.n1 >= 0.05] length(snp.excl) mend <- mendelErr(genoData, men.list, snp.exclude=snp.excl) names(mend) head(mend$trios) names(mend$snp) @ \subsection*{Mendelian Errors per SNP} The Mendelian error rate is calculated for each SNP by dividing the number of errors per SNP for all trios by the number of trios used in the error checking. <<>>= # Calculate the error rate err <- mend$snp$error.cnt / mend$snp$check.cnt table(err == 0, useNA="ifany") @ <>= plot(err, rank(err), xlab="Error Rate (fraction)", ylab="rank", main="Mendelian Error Rate per SNP, ranked") @ Next we will look at the Mendelian error rates among the trios we have in the HapMap data. Looking at the summary of the number of families with at least one error over all SNPs, we can see the maximum number of errors per SNP. Next, we can look at subsets of SNPs with greater than 0 and 1 errors per SNP. Finally, for those SNPs that have valid trios to detect errors, we get the fraction of SNPs with no errors. <<>>= fam <- mend$snp$error.cnt n <- mend$snp$check.cnt summary(fam) # SNPs with errors length(fam[n > 0 & fam > 0]) # SNPs for which more than one family has an error length(fam[n > 0 & fam > 1]) # Get the SNPs with valid trios for error detection val <- length(fam[n > 0]) noerr <- length(fam[n > 0 & fam == 0]) # Divide to get fraction with no errors noerr / val @ We add the Mendelian error values to the SNP annotation. The number of families with at least one error per SNP, \Rcode{mend\$snp\$error.cnt}, gets saved as \Rcode{mendel.err.count}. The number of valid families for checking, \Rcode{mend\$snp\$check.cnt}, gets saved as \Rcode{mendel.err.sampsize}. <<>>= snp.sel <- match(names(mend$snp$error.cnt), snpAnnot$snpID) snpAnnot$mendel.err.count[snp.sel] <- mend$snp$error.cnt snpAnnot$mendel.err.sampsize[snp.sel] <- mend$snp$check.cnt allequal(snpAnnot$snpID, sort(snpAnnot$snpID)) # The high number of NA values is due to the filtering out of SNPs # before the Mendelian error rate calculation sum(is.na(snpAnnot$mendel.err.count)) sum(is.na(snpAnnot$mendel.err.sampsize)) varMetadata(snpAnnot)["mendel.err.count", "labelDescription"] <- paste("number of Mendelian errors detected in trios averaged over", "multiple combinations of replicate genotyping instances") varMetadata(snpAnnot)["mendel.err.sampsize", "labelDescription"] <- "number of opportunities to detect Mendelian error in trios" @ To further investigate SNPs with a high Mendelian error rate, we will make genotyple cluster plots for the 9 SNPs with the highest Mendelian error rate. Blue indicates a sample with a ``BB'' genotype, green is an ``AB'' genotype and red is an ``AA'' genotype. The black X marks indicate a sample with a missing genotype for that SNP. We expect the plots to lack defined genotype clusters, leading to a poor call rate. <>= # Get a vector of SNPs to check snp <- pData(snpAnnot) snp$err.rate <- snp$mendel.err.count / snp$mendel.err.sampsize snp <- snp[order(snp$err.rate, decreasing=TRUE),] snp <- snp[1:9,] xyfile <- system.file("extdata", "illumina_qxy.gds", package="GWASdata") xyGDS <- GdsIntensityReader(xyfile) xyData <- IntensityData(xyGDS, snpAnnot=snpAnnot, scanAnnot=scanAnnot) pdf(file="DataCleaning-mendel.pdf") par(mfrow = c(3,3)) mtxt <- paste("SNP", snp$rsID, "\nMendelian Error Rate", format(snp$err.rate, digits=5)) genoClusterPlot(xyData, genoData, snpID=snp$snpID, main.txt=mtxt, cex.main=0.9) dev.off() close(xyData) @ \includegraphics[width=5.67in,height=5.67in,keepaspectratio]{DataCleaning-mendel} \subsection*{Mendelian Errors per Family} This section does some analyses on the Mendelian Errors for each family (trio). The variable \Rcode{all.trios} contains results of all combinations of duplicate samples. The variable \Rcode{trios} contains the averages of unique trios (averages of duplicates from \Rcode{all.trios}). <<>>= # Calculate the fraction of SNPs with an error for each trio trios <- mend$trios trios$Mend.err <- trios$Men.err.cnt/trios$Men.cnt summary(trios$Mend.err) # Start by pulling out the vectors needed from `trios' tmp <- trios[, c("fam.id", "Mend.err")]; dim(tmp) # Change fam.id to match the sample annotation column name names(tmp) <- c("family", "Mend.err.rate.fam") # Merge the variables into the sample annotation file scanAnnot$mend.err.rate.fam <- NA for (i in 1:nrow(tmp)) { ind <- which(is.element(scanAnnot$family, tmp$family[i])) scanAnnot$mend.err.rate.fam[ind] <- tmp$Mend.err.rate.fam[i] } head(scanAnnot$mend.err.rate.fam) varMetadata(scanAnnot)["mend.err.rate.fam", "labelDescription"] <- "Mendelian error rate per family" @ The Mendelian error rate per family, broken up by continental ancestry, could illuminate issues with SNPs that may not be accurately called across all ethnicities for the minor allele. We will plot the Mendelian error rate per family, color coded by population group. The error rates are higher for the YRI families as a whole, which is expected due to the higher level of genetic diversity. <<>>= # Get the families that have non-NA values for the family # Mendelian error rate fams <- pData(scanAnnot)[!is.na(scanAnnot$mend.err.rate.fam) & !duplicated(scanAnnot$family), c("family", "mend.err.rate.fam", "race")] dim(fams) table(fams$race, useNA="ifany") # Assign colors for the different ethnicities in these families pcol <- rep(NA, nrow(fams)) pcol[fams$race == "CEU"] <- "blue" pcol[fams$race == "YRI"] <- "red" @ <>= plot(fams$mend.err.rate.fam*100, rank(fams$mend.err.rate.fam), main="Mendelian Error rate per Family, ranked", xlab="Mendelian error rate per family (percent)", ylab="rank", col=pcol) legend("bottomright", c("CEU", "YRI"), pch=c(1,1), col=c("blue", "red")) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Hardy-Weinberg Equilibrium Testing} This section uses Fisher's exact test to examine each SNP for departure from Hardy-Weinberg Equilibrium. For each SNP, p-values are obtained; those SNPs with extremely low values will be considered for filtering. QQ-plots of the p-values are made for both the autosomes and X chromosome. To run the Hardy-Weinberg test, we will filter out duplicates and non-founders. We will run \Rcode{exactHWE} for the samples with European continental ancestry only, although the process is just the same for all population groups. The X chromosome must be run separately from the autosomes since it filters out males. <<>>= head(pData(scanAnnot)[,c("father", "mother")]) nonfounders <- scanAnnot$father != 0 & scanAnnot$mother != 0 table(nonfounders) scan.excl <- scanAnnot$scanID[scanAnnot$race != "CEU" | nonfounders | scanAnnot$duplicated] length(scan.excl) chr <- getChromosome(genoData) auto <- range(which(chr %in% 1:22)) X <- range(which(chr == 23)) hwe <- exactHWE(genoData, scan.exclude=scan.excl, snpStart=auto[1], snpEnd=auto[2]) hweX <- exactHWE(genoData, scan.exclude=scan.excl, snpStart=X[1], snpEnd=X[2]) hwe <- rbind(hwe, hweX) close(genoData) @ We will look at the values calculated from the function call to \Rcode{exactHWE}, which include p-values, minor allele frequency, and genotype counts for each SNP on each of the chromosome types. <<>>= names(hwe) dim(hwe) # Check on sample sizes for autosomes and X chromosome hwe$N <- hwe$nAA + hwe$nAB + hwe$nBB summary(hwe$N[is.element(hwe$chr,1:22)]) summary(hwe$N[is.element(hwe$chr,23)]) hwe$pval[1:10] sum(is.na(hwe$pval[hwe$chr == 23])) # X hwe$MAF[1:10] hwe[1:3, c("nAA", "nAB", "nBB")] @ Next we want to estimate the inbreeding coefficient per SNP calculated using the minor allele frequencies and the total sample number count. A histogram shows the distribution is centered around 0, which indicates there is most likely no significant population substructure. <<>>= summary(hwe$f) @ <>= hist(hwe$f, main="Histogram of the Inbreeding Coefficient For CEU Samples", xlab="Inbreeding Coefficient") @ <<>>= # Check the MAF of those SNPs with f=1 chkf <- hwe[!is.na(hwe$f) & hwe$f==1,]; dim(chkf) summary(chkf$MAF) @ To see at what value the SNPs begin to deviate from the Hardy-Weinberg expected values, we will make QQ-plots that exclude SNPs where MAF = 0. We plot of the observed p-values vs. the expected p-values for the autosomes and X chromosome separately by calling the function \Rcode{qqPlot}. <<>>= hwe.0 <- hwe[hwe$MAF > 0,]; dim(hwe.0) # Only keep the autosomal SNPs for first plot pval <- hwe.0$pval[is.element(hwe.0$chr, 1:22)] length(pval) pval <- pval[!is.na(pval)] length(pval) # X chromosome SNPs for plot 2 pval.x <- hwe.0$pval[is.element(hwe.0$chr, 23)] length(pval.x) pval.x <- pval.x[!is.na(pval.x)] length(pval.x) @ <>= pdf(file = "DataCleaning-hwe.pdf") par(mfrow=c(2,2)) qqPlot(pval=pval, truncate = FALSE, main="Autosomes, all") qqPlot(pval=pval, truncate = TRUE, main="Autosomes, truncated") qqPlot(pval=pval.x, truncate = FALSE, main="X chromosome, all") qqPlot(pval=pval.x, truncate = TRUE, main="X chromosome, truncated") dev.off() @ \includegraphics[width=5.67in,height=5.67in,keepaspectratio]{DataCleaning-hwe} We plot the p-values against MAF for all SNPs with MAF greater than zero. <>= plot(hwe.0$MAF, -log10(hwe.0$pval), xlab="Minor Allele Frequency", ylab="-log(p-value)", main="Minor Allele Frequency vs\nP-value") @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preliminary Association Tests} The final step in the data cleaning process is to perform preliminary association tests. This step creates and examines QQ, `Manhattan' signal, and genotype cluster plots. If significant SNPs appear as a result of the association test, SNP cluster plots must be examined to determine if the association is driven from a poorly clustering SNP. Note that HapMap data do not come with phenotypic outcomes, thus, for purposes of the tutorial we use simulated binary outcomes instead. The tests conducted are usually logistic regression based tests; the samples are filtered by quality criteria and only unrelated subjects are included. In the code below we do not include any covariates in the logistic regression as these data are not part of a case control study. For data in the GENEVA project and other GWA studies we discuss which variables should be considered for inclusion as covariates in the preliminary association tests. The determination is made by analyzing a model including these covariates but without genotypes; covariates with significant effects are then included in the final model. \subsection{Association Test} To run the association test, we call the function \Rcode{assocRegression}. (For survival analysis, the function \Rcode{assocCoxPH} may be used.) We will use a logistic regression model with status as the outcome, and sex and the first principal component as covariates. We will use the filtered GDS file with unique subjects only that we made in section~\ref{sec:filt_anom}. Here we use all unique subjects, but we would use the argument \Rcode{scan.exclude} for those samples we wish to filter out for the association test (such as low-quality or related samples). Typically, we do not filter out SNPs for the association test -- we run all SNPs and then filter the results. The omission of filters may cause some SNPs to return significant p-values for association. In this example, we will use the \Rcode{snpStart} and \Rcode{snpEnd} arguments to select a few SNPs on each chromosome. <<>>= genoGDS <- GdsGenotypeReader(subj.filt.file) subjAnnot <- scanAnnot[scanAnnot$scanID %in% getScanID(genoGDS),] subjAnnot$sex <- as.factor(subjAnnot$sex) subjAnnot$EV1 <- pca$eigenvect[match(subjAnnot$scanID, pca$sample.id), 1] genoData <- GenotypeData(genoGDS, scanAnnot=subjAnnot) chr <- getChromosome(genoData) assoc.list <- lapply(unique(chr), function(x) { ## Y chromsome only includes males, cannot have sex as a covariate covar <- ifelse(x == 25, "EV1", c("sex", "EV1")) start <- which(chr == x)[1] assocRegression(genoData, outcome="status", covar=covar, model.type="logistic", snpStart=start, snpEnd=start+50) }) assoc <- do.call(rbind, assoc.list) close(genoData) @ After running the association test on the selected subset of SNPs and samples we must analyze the results to determine if any probes with significant p-values are spurious or truly associated with the phenotype of interest. Quantile-quantile, `Manhattan' and SNP cluster plots will all be used to further understand those probes with significant p-values. \subsection{QQ Plots} We create QQ plots of the ordered Wald test p-values versus the ordered expected p-values. Given that the samples were split randomly between cases and controls, not surprisingly there are no outliers visible in the QQ plot. <>= qqPlot(pval=assoc$Wald.pval, truncate=TRUE, main="QQ Plot of Wald Test p-values") @ \subsection{``Manhattan'' Plots of the P-Values} To create the `Manhattan' plots, we will call the function \Rcode{manhattanPlot}. We take the negative log transformation of the p-values and plot them for each probe. <>= chrom <- getChromosome(snpAnnot, char=TRUE) snp.sel <- getSnpID(snpAnnot) %in% assoc$snpID manhattanPlot(assoc$Wald.pval, chromosome=chrom[snp.sel]) @ \subsection{SNP Cluster Plots} Next, we will create SNP cluster plots for the probes with significant p-values. It is important to examine cluster plots of all top hits, as poor clusters not picked up by other quality checking steps may still show up as having low p-values. We plot SNPs with the 9 most significant p-values from the Wald test. <>= # Identify SNPs with lowest p-values snp <- pData(snpAnnot)[snp.sel, c("snpID", "rsID")] snp$pval <- assoc$Wald.pval snp <- snp[order(snp$pval),] snp <- snp[1:9,] xyfile <- system.file("extdata", "illumina_qxy.gds", package="GWASdata") xyGDS <- GdsIntensityReader(xyfile) xyData <- IntensityData(xyGDS, snpAnnot=snpAnnot, scanAnnot=scanAnnot) genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata") genoGDS <- GdsGenotypeReader(genofile) genoData <- GenotypeData(genoGDS, snpAnnot=snpAnnot, scanAnnot=scanAnnot) pdf(file="DataCleaning-cluster.pdf") par(mfrow = c(3,3)) mtxt <- paste("SNP", snp$rsID, "\np =", format(snp$pval, digits=4)) genoClusterPlot(xyData, genoData, snpID=snp$snpID, main.txt=mtxt) dev.off() close(xyData) close(genoData) @ \includegraphics[width=5.67in,height=5.67in,keepaspectratio]{DataCleaning-cluster} These cluster plots look like high-quality SNPs with well-defined clusters (red=AA, green=AB, blue=BB). <>= unlink(subj.filt.file) @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Acknowledgements} This manual reflects the work of many people. In the first place the methods described were developed and implemented by a team headed by Cathy Laurie. The team included David Crosslin, Stephanie Gogarten, David Levine, Caitlin McHugh, Sarah Nelson, Jess Shen, Bruce Weir, Qi Zhang and Xiuwen Zheng. Before any the work started, valuable advice was provided by Thomas Lumley and Ken Rice. Preparation of the manual began with a team headed by Ian Painter and Stephanie Gogarten. The team included Marshall Brown, Matthew Conomos, Patrick Danaher, Kevin Rubenstein, Emily Weed and Leila Zelnick. The data cleaning activities of the GENEVA Coordinating Center have been greatly helped by the experience and advice from other participants in the GENEVA program: the genotyping centers at CIDR and the Broad; the dbGaP group at the National Center for Biotechnology Information (NCBI); and the many study investigators. Particular thanks to Kim Doheny and Elizabeth Pugh at CIDR and Stacey Gabriel and Daniel Mirel at the Broad and Justin Paschall at NCBI. Funding for the GENEVA project includes HG 004446 (PI: Bruce Weir) for the Coordinating Center, U01 HG 004438 (PI: David Vallee) for CIDR, HG 004424 (PI: Stacey Gabriel) for the Broad. The continuing guidance of Dr. Teri Manolio of NHGRI is deeply appreciated. \end{document}