%\VignetteIndexEntry{1. XPS Vignette: Overview} %\VignetteDepends{xps} %\VignetteKeywords{Expression, Affymetrix, Oligonucleotide Arrays} %\VignettePackage{xps} \documentclass{article} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\xps}{\Rpackage{xps}} \newcommand{\ROOT}{\Robject{ROOT}} \begin{document} \title{Introduction to the xps Package: Overview} \date{March, 2012} \author{Christian Stratowa} \maketitle \tableofcontents \section{Introduction} Affymetrix GeneChip oligonucleotide arrays use several probes to assay each targeted transcript. Specialized algorithms have been developed to summarize low-level probe set intensities to get one expression measure for each transcript. Some of these methods, such as MAS 4.0's AvDiff \citep{affy4}, MAS5's signal \citep{affy5} or RMA \citep{iriz:etal:2003}, are implemented in package \Rpackage{affy} \citep{gaut:cope:bols:iriz:2003}. Further methods, such as FARMS \citep{hochr:etal:2006} or DFW \citep{chen:etal:2007} are custom methods that can be registered for use with package \Rpackage{affy}. Advantages in technology allow Affymetrix to supply whole-genome expression arrays such as the new GeneChip Exon array systems (Exon 1.x ST) and Gene array systems (Gene 1.x ST). The amount of data created with the new exon arrays poses a great challenge, since R stores all objects in memory. Package \xps\ ~- \Rclass{eXpression Profiling System} - is designed to analyze Affymetrix GeneChip expression and exon arrays on computers with limited amounts of memory (1 GB RAM). To achieve this goal, \xps\ takes advantage of \ROOT, a framework especially developed to handle and analyse large amounts of data in a memory efficient way. \noindent{\bf Important installation note:} Package \xps\ is based on two powerful frameworks, namely \Robject{R} and \ROOT. It is thus absolutely essential to install the \ROOT\ framework before \xps\ can be built and installed. For instructions how to install \ROOT\ see the README file provided with package \xps. \section{Why ROOT?} ROOT (\url{http://root.cern.ch}) is an object-oriented framework that has been developed at CERN for distributed data warehousing and data mining of particle data in the petabyte range, such as the data created with the new LHC collider. Data are stored as sets of objects in machine-independent files, and specialized storage methods are used to get direct access to separate attributes of selected data objects. For more information see the ROOT User Guide (The ROOT \citet{root:2009}). Taking advantage of these features, package \xps\ stores all data in portable \ROOT\ files. Data describing microarray layout, probe information and metadata for genes are stored as \ROOT\ Trees in \Rclass{scheme} files, accessible from \Robject{R} as \Rclass{scheme} objects. Raw probe intensities, i.e. CEL-files for each project are stored as \ROOT\ Trees in \Rclass{data} files, accessible from \Robject{R} as \Rclass{data} objects. All analysis is done independent of \Robject{R} such avoiding inherent memory limitations. {\bf Note:} Absolutely no knowledge of \ROOT\ is required to use package \xps. However, the interested user could use package \xps\ independent of \Robject{R} by writing \ROOT\ macros, examples of which can be found in file "macro4XPS.C", located in subdirectory xps/examples. \section{Getting Started} First you need to load the \xps\ package. \begin{Sinput} R> library(xps) \end{Sinput} <>= library(xps) @ As an initial step, which needs to be done only once, you must import Affymetrix chip definition files, probe files and annotation files for all arrays that you are using, into \ROOT\ \Rclass{scheme} files. This is described in Appendix A1, here we use the \ROOT\ \Rclass{scheme} file supplied with the package. Throughout this tutorial we will use a set of four {\it CEL} files supplied with the package. The necessary \ROOT\ \Rclass{scheme} file \Rclass{SchemeTest3.root} for GeneChip {\it Test3.CDF} is also supplied as well as the \ROOT\ \Rclass{data} file \Rclass{DataTest3\_cel.root}. These files need to be loaded for every new \Robject{R}-session, unless the session has been saved. {\bf Note:} Please see Appendix A2 for many additional examples on how to use \xps. \subsection{Reading CEL file information} The {\it CEL} files can be located in a common directory or in different directories, see \Rcode{?import.data} how to import {\it CEL} files from different directories. {\it CEL} files will be imported into a \ROOT\ \Rclass{data} file as \ROOT\ \Rclass{Trees}. Once the \ROOT\ \Rclass{data} file is created, the {\it CEL} files are no longer needed, since subsequent \Robject{R}-sessions need only load the \ROOT\ \Rclass{data} file. However, it is possible to load only a subset of {\it CEL} files, and it is also possible to save new {\it CEL} files in the same \ROOT\ \Rclass{data} file at a later time. In this demo we will show how to achieve this. First we load the \xps\ package. <>= library(xps) @ For this demonstration {\it CEL} files are located in a common directory, in our case in: <<>>= celdir <- file.path(.path.package("xps"), "raw") @ Since our {\it CEL} files were created for GeneChip {\it Test3.CDF}, we need to load the corresponding \ROOT\ \Rclass{scheme} file first: <<>>= scheme.test3 <- root.scheme(file.path(.path.package("xps"), "schemes", "SchemeTest3.root")) @ Now we can import the {\it CEL} files, in our case a subset first: <<>>= celfiles <- c("TestA1.CEL","TestA2.CEL") data.test3 <- import.data(scheme.test3, "tmpdt_DataTest3", celdir=celdir, celfiles=celfiles, verbose=FALSE) @ To see, which {\it CEL} files were imported as \ROOT\ \Rclass{Trees}, we can do: <<>>= unlist(treeNames(data.test3)) @ Now we can import additional {\it CEL} files: <<>>= celfiles <- c("TestB1.CEL","TestB2.CEL") data.test3 <- addData(data.test3, celdir=celdir, celfiles=celfiles, verbose=FALSE) @ Instead of getting the imported tree names from the created instance \Rclass{data.test3} of S4 class \Rclass{DataTreeSet}, we can also get the tree names directly from the \ROOT\ \Rclass{data} file: <<>>= getTreeNames(rootFile(data.test3)) @ Now we have all {\it CEL} files imported as \ROOT\ \Rclass{Trees}. In later \Robject{R}-sessions we only need to load the corresponding \ROOT\ \Rclass{data} file using function \Rfunction{root.data()}. In this tutorial we will not use the file just created but the \ROOT\ \Rclass{data} file \Rclass{DataTest3\_cel.root}. \\ {\bf Note 1:} It is also possible to import `phenotypic-data' describing samples and further project--relevant data for the experiment, see S4 class \Rclass{ProjectInfo}. {\bf Note 2:} Since \ROOT\ \Rclass{data} files contain the raw data, it is recommended to create them in a common system directory, e.g. 'rootdata', which is accessible to other users, too. {\bf Note 3:} In order to distinguish \ROOT\ \Rclass{data} files containing the raw data from other \ROOT\ files, extension `\_cel' is automatically added to the file name. Thus creating a raw data file with name \Rclass{DataTest3} will result in a \ROOT\ file with name \Rclass{DataTest3\_cel.root}. Extension `root' is always added to each \ROOT\ file. {\bf Note 4:} Usually, \ROOT\ \Rclass{data} files are kept permanently. Thus it is not possible to accidently overwrite a \ROOT\ \Rclass{data} file with another file of the same name; you will get an error message. If you want to create a temporary \ROOT\ \Rclass{data} file, which can be overwritten, the name must start with `tmp\_'. However, in the example above we needed to use `tmpdt\_' otherwise R CMD check would produce an error on Windows. Please note that `tmpdt\_' will not work with function \Rfunction{import.data()} for the reason described in Note 3 above. {\bf Note 5:} It is highly recommended to keep the default setting \Rcode{verbose=TRUE}, especially when working with exon arrays. On Windows you will see the verbose messages only when starting \Robject{R} from the command line, i.e. using RTerm. \subsection{Accessing raw data} Currently, the data from the imported {\it CEL} files are saved as \ROOT\ \Rclass{Trees} in the \ROOT\ \Rclass{data} file, however, they are not accessible from within \Robject{R}. The corresponding slot \Rcode{data} of instance \Rclass{data.test3} of S4 class \Rclass{DataTreeSet}, a data.frame, is empty. This setting allows to import e.g. an (almost) unlimited number of {\it CEL} files from GeneChip Exon arrays on computers with 1GB RAM only. When we try to access the raw data, we get: <<>>= tmp <- intensity(data.test3) head(tmp) @ Thus, we need to attach the raw data first to \Rclass{data.test3}: <>= data.test3 <- attachInten(data.test3) @ Now we get: <<>>= tmp <- intensity(data.test3) head(tmp) @ Alternatively, it is also possible to attach only a subset to the current object \Rclass{data.test3}, or to a copy \Rclass{subdata.test3}: <<>>= subdata.test3 <- attachInten(data.test3, c("TestB1.cel","TestA2")) tmp <- intensity(subdata.test3) head(tmp) @ Often it is useful to obtain the intensities for a certain probeset only. As an example let us find the intensities for probeset '93822\_at'. For this purpose we need to get the internal UNIT\_ID first: <<>>= data.test3 <- attachUnitNames(data.test3) id <- transcriptID2unitID(data.test3, transcriptID="93822_at", as.list=FALSE) id @ If we know the gene symbol, we could also do: <<>>= id <- symbol2unitID(data.test3, symbol="Rpl37a", as.list=FALSE) id @ Now we can extract the PM intensities for the UNIT\_ID: <<>>= data <- validData(data.test3, which="pm", unitID=id) data @ To avoid the above message that slot 'mask' is empty we can do: <<>>= data.test3 <- attachMask(data.test3) @ \pagebreak Finally we can plot the PM and MM intensities, in this case for a subset only. \begin{center} <>= probesetplot(data.test3, unitID="93822_at", unittype="transcript", which="both", names=c("TestA1","TestA2"), add.legend="topright") @ \end{center} When we no longer need the raw data, we can remove them from \Rclass{data.test3}, thus avoiding memory consumption of \Robject{R}: <<>>= data.test3 <- removeInten(data.test3) tmp <- intensity(data.test3) head(tmp) @ This step is not necessary for small datasets or if the computer has sufficient RAM. \pagebreak \section{Converting raw data to expression measures} When we start a new \Robject{R}-session, it is necessary to load the \ROOT\ \Rclass{scheme} and \ROOT\ \Rclass{data} files first: <>= library(xps) scheme.test3 <- root.scheme(file.path(.path.package("xps"), "schemes", "SchemeTest3.root")) data.test3 <- root.data(scheme.test3, file.path(.path.package("xps"),"rootdata", "DataTest3_cel.root")) @ This step is not necessary when objects \Rclass{scheme.test3} and \Rclass{data.test3} are already saved in an \Robject{R}-session. \\ Converting raw data to expression measures and computing detection calls is fairly simple. It is not necessary to attach any \Rcode{data} or \Rcode{mask} data.frames, since all computations are done independently from \Robject{R}. \subsection{Calculating expression levels} Let us first preprocess the raw data using method `RMA' to compute expression levels, and store the results as \ROOT\ \Rclass{Trees} in \ROOT\ file \Rclass{tmpdt\_Test3RMA.root}: <>= data.rma <- rma(data.test3, "tmpdt_Test3RMA", verbose=FALSE) @ {\bf Note:} In this example and the following examples we suppress the usual output. Furthermore, once again we use `tmpdt\_', which adds date and time to the tmp-file, otherwise R CMD check would produce an error on Windows. Usually, you want to create a permanent file, however, if you want to create a temporary file it is recommended to use `tmp\_' as temporary file which will be overwritten. \\ Then we preprocess the raw data using method `MAS5' to compute expression levels, and store the results in \ROOT\ file \Rclass{tmpdt\_Test3MAS5.root}: <>= data.mas5 <- mas5(data.test3, "tmpdt_Test3MAS5", normalize=TRUE, sc=500, update=TRUE, verbose=FALSE) @ Now we want to compare the results by plotting the expression levels for the first sample. For this purpose we need to extract the expression levels from the resulting S4 classes \Rclass{ExprTreeSet} as data.frames first: <<>>= expr.rma <- validData(data.rma) expr.mas5 <- validData(data.mas5) @ Now we can plot the results for the first sample: \begin{center} <>= plot(expr.rma[,1],expr.mas5[,1],log="xy",xlim=c(1,20000),ylim=c(1,20000)) @ \end{center} {\bf Note:} For both methods, `RMA' and `MAS5', true expression levels are extracted, which is in contrast to other packages which extract the $\log_2$-values for `RMA'. \subsection{Calculating detection calls} Let us now compute the MAS5 detection calls: <>= call.mas5 <- mas5.call(data.test3,"tmpdt_Test3Call", verbose=FALSE) @ Alternatively, let us compute the DABG (detection above background) calls: <>= call.dabg <- dabg.call(data.test3,"tmpdt_Test3DABG", verbose=FALSE) @ {\bf Note:} YES, in principle it is indeed possible to compute the DABG call not only for exon arrays but for expression arrays, too. However, computation may take a long time, e.g. on a computer with 2.3GHz Intel Core 2 Duo processor and 2GB RAM, computing DABG calls for HG-U133\_Plus\_2 arrays will take about 45 min/array. \\ \pagebreak Both, detection call and detection p-value can be extracted as \Robject{data.frame}: <<>>= pres.mas5 <- presCall(call.mas5) head(pres.mas5) @ <<>>= pval.mas5 <- pvalData(call.mas5) head(pval.mas5) @ \section{Quality control through data exploration} Quality Control (QC) assessment is a crucial step in successful analysis of microarray data, it has to be done at every step of the analysis. For this purpose every S4 class of package \xps\ provides it's own set of methods. In addition \xps\ contains a special S4 class, called \Rclass{QualTreeSet}, to allow a more extensive quality control. \subsection{\Rclass{DataTreeSet} based evaluation of raw data} Class \Rclass{DataTreeSet} allows an initial evaluation of the quality of the raw data. \pagebreak \subsubsection{Basic quality plots} As a first step we want create some plots with the raw data. {\bf Note:} Since the following plots import the necessary data directly from the \ROOT\ \Rclass{data} file it is no longer necessary to \Rfunction{attachInten()}. \\ First, we create a density plot: \begin{center} <>= hist(data.test3) @ \end{center} \pagebreak The corresponding boxplots are: \begin{center} <>= boxplot(data.test3, which="userinfo:fIntenQuant") @ \end{center} {\bf Note:} Using parameter \Rcode{which} with \Rcode{userinfo} allows to use pre-calculated quantile values for function \Rfunction{boxplot()}, see the help \Rcode{?treeInfo}. This allows to use \Rfunction{boxplot()} without the need to fill slot \Rcode{data}. \pagebreak It is also possible to create an image for e.g. sample TestA1: \begin{center} <>= image(data.test3, names="TestA1.cel") @ \end{center} {\bf Note 1:} With the current version of package \xps\ the above plots no longer depend on filling slot \Rcode{data} using function \Rfunction{attachInten()}. Instead, all data will be imported from the corresponding \ROOT\ \Rclass{data} file on demand. Thus, it is now possible to apply functions \Rfunction{hist()}, \Rfunction{boxplot()} and \Rfunction{image()}, respectively, to datasets containing many samples, and to exon array data on computers with 1-2 GB RAM only. \\ {\bf Note 2:} In addition to the R-graphics, package \xps\ also supports \ROOT\ graphics as an alternative possibility to create plots from large data. This is described in Appendix A4. \\ \pagebreak \subsubsection{Additional quality assessment} As an additional QC step we include a PM-MM-plot of the data. However, in this case we need not only attach the raw data, as shown above, but also slot \Rcode{mask} of \Rclass{scheme.test3}, since slot \Rcode{mask} contains the information which oligos on the array are PM, MM, or control oligos, respectivly. See Appendix A1 for an explanation and how to avoid this step. <>= data.test3 <- attachMask(data.test3) data.test3 <- attachInten(data.test3) @ {\bf Note:} We have applied method \Rfunction{attachMask()} to \Rclass{data.test3} and not to \Rclass{scheme.test3}, since \Rclass{data.test3} contains its own copy of \Rclass{scheme.test3}. \\ Now we create the PM-MM-plot: \begin{center} <>= pmplot(data.test3) @ \end{center} After we are done, we remove the data from \Rclass{data.test3} to free \Robject{R} memory: <<>>= data.test3 <- removeInten(data.test3) data.test3 <- removeMask(data.test3) @ Since the dependence of intensity on probe sequence is a well established fact it may be of interest to visualize the influence that the G/C content of all probes has on the intensity distribution of each hybridization. For this purpose we can draw boxplots of the $\log_2$-intensities as a function of the G/C content. \\ First we need to attach the pre-computed G/C content to slot \Rcode{probe} and optionally also slot \Rcode{mask}: <>= data.test3 <- attachProbeContentGC(data.test3) data.test3 <- attachMask(data.test3) @ Now we can create the boxplot of probe intensities stratified by GC content: \begin{center} <>= intensity2GCplot(data.test3, treename = "TestA1.cel", which="mm") @ \end{center} Here we have have used the MM probes only to demonstrate the strong dependency of the background $\log_2$-intensities of sample "TestA1.cel" on the number of G or C bases in the probe sequency. \\ After we are done, we remove the data from \Rclass{data.test3} to free \Robject{R} memory: <<>>= data.test3 <- removeMask(data.test3) data.test3 <- removeProbeContentGC(data.test3) @ \pagebreak \subsection{\Rclass{ExprTreeSet} based evaluation of normalized expression measures} Class \Rclass{ExprTreeSet} has some methods to assess the quality of expression measures. \subsubsection{Basic quality plots} In the following sections we want to create some quality plots for the expression levels. In contrast to the raw data, expression levels are already stored in slot \Rcode{data} of S4 class\Rclass{ExprTreeSet}, e.g. in \Rclass{data.rma}. \\ First, we create a density plot: \begin{center} <>= hist(data.rma, add.legend=TRUE) @ \end{center} \pagebreak The corresponding boxplots are: \begin{center} <>= boxplot(data.rma, bmar=list(b=9, cex.axis=0.8)) @ \end{center} \pagebreak It is also possible to create M vs A plots for one or more samples: \begin{center} <>= mvaplot(data.rma, pch=20, ylim=c(-2,2), names="TestB1.mdp_LEVEL") @ \end{center} \pagebreak \subsubsection{Additional quality assessment} Another possibility to identify problematic arrays is to do between array comparisons. For this purpose we can compute between arrays correlations and between arrays distances. \\ In order to correlate all arrays from an experiment with each other we compute the array-array Spearman rank correlation coefficients and draw a heat map: \begin{center} <>= corplot(data.rma, add.legend=TRUE) @ \end{center} \begin{figure}[h] \begin{center} \includegraphics{Corrplot_legend.png} % \includegraphics[width=70mm]{Corrplot_legend.png} \end{center} \end{figure} Correlation plots are useful for detecting outliers, failed hybridizations, or mistracked samples. Specifically, these plots can assess between array quality, e.g. arrays belonging to the same set of replicates should show high correlations, and are able to show patterns that reveal the experimental design. \pagebreak Now let us determine the between arrays distances, computed as the MAD of the M-values of each pair of arrays, and drawn as an array-array expression level distance plot (heat map): \begin{center} <>= madplot(data.rma, add.legend=TRUE) @ \end{center} \begin{figure}[h] \begin{center} \includegraphics{MADplot_legend.png} % \includegraphics[width=70mm]{MADplot_legend.png} \end{center} \end{figure} A MAD plot is an exploratory plot that can help detecting outlier arrays and batch effects: If there is an outlier array you will see vertical and horizontal stripes of darker color in the plot. Batch effects can be seen as blocks along the diagonal. \pagebreak Finally we can plot the first two principal components from a principal components analysis (PCA). This is used to show the overall structure of the data: \begin{center} <>= pcaplot(data.rma, group=c("GrpA","GrpA","GrpB","GrpB"), add.labels=TRUE, add.legend=TRUE) @ \end{center} PCA-plots can be very useful to detect outlier arrays between replicates as well as between different groups. In most cases we expect replicates or groups to group together, indicating general similarity in overall expression patterns. \pagebreak \subsection{\Rclass{CallTreeSet} based evaluation of detection calls} Another way to evaluate chip quality is to compare the percentage of present/absent calls. Since the statistics is already pre-calcualted it can be obtained as follows: <<>>= treeInfo(call.mas5, treetype="dc5", varlist ="userinfo:fPcAbsent:fPcMarginal:fPcPresent") @ We can also plot the detection calls: \begin{center} <>= callplot(call.mas5) @ \end{center} \pagebreak \subsection{\Rclass{QualTreeSet} based quality assessment} In addition to the quality assessments presented above, the dedicated S4 Class \Rclass{QualTreeSet} allows a detailed evalutation of raw data and normalized data by fitting probe level models: <>= rlm.all <- fitRLM(data.test3,"tmpdt_Test3RLM", qualopt="all", option="transcript", verbose=FALSE) @ \subsubsection{Evaluating chip quality} First we produce an RNA degradation plot, which can give some idea of how much degradation of mRNA has occured: \begin{center} <>= rnadeg <- AffyRNAdeg(rlm.all) plotAffyRNAdeg(rnadeg, add.legend=TRUE) @ \end{center} {\bf Note:} Although RNA degradation plots were initially created for expression arrays only function \Rfunction{AffyRNAdeg()} can also be applied to whole genome arrays and exon arrays. \\ \pagebreak Next we create a "border elements plot" by analyzing the positive and negative control elements on the outer edges of the Affymetrix arrays. This helps to visualise how consistent the hybridization is around the edges of the arrays: \begin{center} <>= borderplot(rlm.all) @ \end{center} Large variations in positive controls can indicate non-uniform hybridization or gridding problems. Variations in the negative controls indicate background fluctuations. \\ \pagebreak As a further test we create "Center Of Intensity" (COI) plots of positive and negative border elements: \begin{center} <>= coiplot(rlm.all) @ \end{center} If the hybridization is uniform across the array, the location of the COI for the positive/negative elements will be located at the physical center of the array. In this case \Rfunction{coiplot()} will return \Rcode{NULL}. Spatial variations in the hybridization or misalignment of the grid used to determine the cell intensities will cause the COI to move from center. Then the names of affected samples will be returned. \\ \pagebreak \subsubsection{Fitting probe level models} Chip pseudo-images are used to detect artifacts on arrays that could pose potential quality problems such as e.g. bubbles or scratches on the chip. Weights and residuals from model fitting procedures can be accessed using methods \Rmethod{weights()} and \Rmethod{residuals()}, respectively, and can be graphically displayed using the method \Rmethod{image()}. \\ As an example we plot a pseudo-image of one array with the "weights": \begin{center} <>= image(rlm.all, type="weights", names="TestA1_raw.res", add.legend=FALSE) @ \end{center} {\bf Note:} Chip pseudo-images can also be applied to whole genome arrays and exon arrays. \\ \pagebreak Normalized Unscaled Standard Errors (NUSE) can also be used for assessing chip quality. The SE estimates are normalized such that for each probe set the median standard error across all arrays is equal to 1. An array were there are elevated standard errors relative to the other arrays is typically of lower quality. Boxplots of NUSE values can be used to compare the arrays: \begin{center} <>= nuseplot(rlm.all, names="namepart") @ \end{center} {\bf Note:} Here we show NUSE plots for raw data, background-corrected data and normalized data. However, usually boxplots are drawn for normalized data only: <>= nuseplot(rlm.all, names="namepart:normalized") @ \pagebreak Relative Log Expression (RLE) plots are another useful measure to assess array quality. For each probeset and array ratios are calculated between the log-expression of a probeset and the median expression of this probeset across all arrays. Assuming that only few genes are differentially expressed across arrays means that most of these RLE values will be centered close to 0. An RLE boxplot can be produced using: \begin{center} <>= rleplot(rlm.all, names="namepart") @ \end{center} {\bf Note:} Here we show RLE plots for raw data, background-corrected data and normalized data. However, usually boxplots are drawn for normalized data only: <>= rleplot(rlm.all, names="namepart:normalized") @ \pagebreak \section{Filtering expression measures} The \xps\ package can also be used to filter (select) genes according to a variety of different filtering mechanisms, similar to Bioconductor package \Rpackage{genefilter}. It is important to note that filters can be split into the non--specific filters and the specific filters. Usually, non--specific filters are used to reduce the number of genes remaining for further analysis e.g. by reducing the noise in the dataset. In contrast, specific means that we are filtering with reference to a particular covariate. For example we want to select genes that are differentially expressed in two groups. Here we use the term `prefilter' for non--specific filters and the term `unifilter' for specific filters applied to two groups. \subsection{Applying non--specific filters: PreFilter} Applying non--specific filters is a simple two-step process: First, select the filters of interest using constructor \Rcode{PreFilter}. Second, apply the resulting class \Rclass{PreFilter} to an instance of class \Rclass{ExprTreeSet} using function \Rcode{prefilter}. Currently it is possible to select up to ten non--specific filters which are defined in S4 class \Rclass{PreFilter}. For this example let us initialize the following three non--specific filters: \begin{enumerate} \item \Rcode{madFilter}: A `median absolute deviation' filter, which selects only genes where \Rcode{mad} across all samples is at least 0.5, i.e. \Rcode{mad >= 0.5}. \item \Rcode{lowFilter}: A `lower threshold' filter to select genes where the trimmed mean of the \Rcode{log2}--expression levels is above 7.0 (with \Rcode{trim = 0.02}). \item \Rcode{highFilter}: An `upper threshold' filter to select genes that are \Rcode{log2}--expressed below 10.5 in at least 80 percent of the samples. \end{enumerate} Furthermore, a gene should be selected for further analysis only if it satisfies at least two of the three filters. \\ Initialization of the filters is done using the constructor \Rcode{PreFilter}: <>= prefltr <- PreFilter(mad=c(0.5), lothreshold=c(7.0,0.02,"mean"), hithreshold=c(10.5,80.0,"percent")) str(prefltr) @ This filter is then applied to expression data \Rclass{data.rma} created earlier, using function \Rfunction{prefilter} with parameter \Rcode{minfilters=2}: <>= data.rma@rootfile <- paste(.path.package("xps"),"rootdata/tmp_Test3RMA.root",sep="/") data.rma@filedir <- paste(.path.package("xps"),"rootdata",sep="/") @ <>= rma.pfr <- prefilter(data.rma, "tmpdt_Test3Prefilter", getwd(), filter=prefltr, minfilters=2, verbose=FALSE) @ The resulting filter mask can be extracted as \Robject{data.frame}: <<>>= tmp <- validData(rma.pfr) head(tmp) dim(tmp[tmp[,"FLAG"]==1,]) @ The data show that 181 genes of the 345 genes on the Test3 GeneChip are selected for further analysis. \subsection{Applying specific filters for two groups: UniFilter} Applying univariate filters is also a simple two-step process: First, select the filters of interest using constructor \Rcode{UniFilter}. Second, apply the resulting class \Rclass{UniFilter} to an instance of class \Rclass{ExprTreeSet} using function \Rcode{unifilter}. Currently it is possible to select three univariate filters which are defined in S4 class \Rclass{UniFilter}. For this example let us initialize the following two filters: \begin{enumerate} \item \Rcode{fcFilter}: A `fold--change' filter, which selects only genes with an absolute fold--change of at least 1.3, i.e. \Rcode{abs(mean(GrpB)/mean(GrpA)) >= 1.3}. \item \Rcode{unitestFilter}: A `unitest' filter to select genes where the p--value of the applied unitest, i.e. the \Rcode{t.test}, is less than 0.1 (\Rcode{pval <= 0.1}). \end{enumerate} Only genes satisfying both filters are considered to be differentially expressed. {\bf Note:} If you want to change the default settings for \Rcode{t.test} and/or compute an adjusted p--value for multiple comparisons you need to initialize method \Rcode{uniTest}, too. \\ Initialization of the filters is done using the constructor \Rcode{UniFilter}: <>= unifltr <- UniFilter(foldchange=c(1.3,"both"), unifilter=c(0.1,"pval")) @ This filter is then applied to expression data \Rclass{data.rma} using function \Rfunction{unifilter} where parameter \Rcode{group} allocates each sample to one of two groups. Furthermore, since we want to use only the pre--selected genes from \Rfunction{prefilter} we need to set \Rcode{xps.fltr=rma.pfr}: <>= data.rma@rootfile <- paste(.path.package("xps"),"rootdata/tmp_Test3RMA.root",sep="/") data.rma@filedir <- paste(.path.package("xps"),"rootdata",sep="/") @ <>= rma.ufr <- unifilter(data.rma, "tmpdt_Test3Unifilter", getwd(), unifltr, group=c("GrpA","GrpA","GrpB","GrpB"), xps.fltr=rma.pfr, verbose=FALSE) @ The resulting data can be extracted as \Robject{data.frame}: <<>>= tmp <- validData(rma.ufr) tmp @ The data show that only 9 genes of the pre--selected 181 genes are considered to be differentially expressed. \\ {\bf Note:} If you want to extract all data as \Robject{data.frame} as well as the resulting filter mask you can do: <>= msk <- validFilter(rma.ufr) tmp <- validData(rma.ufr, which="UnitName") tmp <- cbind(tmp, msk) @ However, the recommended way to extract all data together with the filter mask as well as the gene annotation is: <<>>= tmp <- export.filter(rma.ufr, treetype="stt", varlist="fUnitName:fName:fSymbol:fc:pval:flag", as.dataframe=TRUE, verbose=FALSE) head(tmp) @ Now all 181 pre--selected genes are extracted as \Robject{data.frame} together with the corresponding annotation and the filter mask. \\ \pagebreak It is also possible to create a fold-change vs p-value plot, called \Rcode{volcanoplot}. Setting the parameter \Rcode{labels="fSymbol"} allows us to draw the corresponding gene symbols, if known: \begin{center} <>= volcanoplot(rma.ufr, labels="fSymbol") @ \end{center} \pagebreak \appendix \section{Appendices} \subsection{Importing chip definition and annotation files} %\label{appA1} In contrast to other packages, which rely on the Bioconductor method for creating cdf environments, we need to create \ROOT\ \Rclass{scheme} files directly from the Affymetrix source files, which need to be downloaded first from the Affymetrix web site. However, once created, it is in principle possible to distribute the \ROOT\ \Rclass{scheme} files, too. Here we will demonstrate, how to create a \ROOT\ \Rclass{scheme} file for Affymetrix GeneChip {\it Test3.CDF}. We assume that the following files were downloaded, unzipped, and saved in subdirectories \Rcode{libraryfiles} and \Rcode{Annotation}, respectively: \begin{itemize} \item GeneChip chip definition file: {\it Test3.CDF} \item Probe sequence file: {\it Test3\_probe.tab} \item Probeset annotation file: {\it Test3.na32.annot.csv} \end{itemize} In a new \Robject{R}-session we load our library and define the directories, where the library files and the annotation files are saved, respectively, and the directory, where the \ROOT\ \Rclass{scheme} files should be saved: <>= library(xps) libdir <- "/path/to/Affy/libraryfiles" anndir <- "/path/to/Affy/Annotation" scmdir <- "/path/to/CRAN/Workspaces/Schemes" @ Now we can create a \ROOT\ \Rclass{scheme} file: <>= scheme.test3 <- import.expr.scheme("SchemeTest3", filedir = scmdir, schemefile = file.path(libdir, "Test3.CDF"), probefile = file.path(libdir, "Test3_probe.tab"), annotfile = file.path(anndir, "Test3.na32.annot.csv")) @ The \Robject{R} object \Rclass{scheme.test3} is not needed lateron, since in every new \Robject{R}-session the \ROOT\ \Rclass{scheme} file need to be imported first, using: <>= scmdir <- "/path/to/CRAN/Workspaces/Schemes" scheme.test3 <- root.scheme(file.path(scmdir,"SchemeTest3.root")) @ Package \xps\ includes a file "script4schemes.R" which contains code to import some of the main CDF and annotation files, which can be copied to an \Robject{R}-session, including code to create \ROOT\ \Rclass{scheme} files for the currently available Exon arrays (Exon 1.x ST) and Whole Genome arrays (Gene 1.x ST). {\bf Note 1:} Since \ROOT\ \Rclass{scheme} files need to be created only once, it is recommended to save them in a common system directory, e.g. 'Schemes', which is accessible to other users, too. {\bf Note 2:} As mentioned earlier, slot \Rcode{mask} of \Rclass{scheme.test3} needs to be attached to instances of S4 class \Rclass{DataTreeSet} before accessing raw data, since slot \Rcode{mask} contains the information which oligos on the array are PM, MM, or control oligos, respectivly. If you want to avoid this step you can create instances of \Rclass{SchemeTreeSet}, which contain this information already, by setting parameter \Rcode{add.mask} of function \Rfunction{import.expr.scheme} to \Rcode{add.mask=TRUE}, e.g.: <>= scheme.test3 <- import.expr.scheme("SchemeTest3",..., add.mask=TRUE) @ {\bf Note 3:} Please note that for the new GeneChip Exon array systems and Whole Genome array systems Affymetrix no longer supports CDF-files, but uses the new CLF-files and PGF-files instead. For this reason package \xps\ also uses CLF-, PGF-files to create the root scheme files, and does not use the inofficial CDF-files. See the help files \Rfunction{?import.exon.scheme} and \Rfunction{?import.genome.scheme} for more information. \subsection{Additional examples} %\label{appA2} Additional examples how to use package \xps\ can be found in file "script4xps.R", located in subdirectory 'xps/examples'. Most of these examples are easily adaptable to users need and can be copied with no or only minor modifications. Furthermore, a second file, "script4exon.R", shows how to use \xps\ with the novel Affymetrix Whole Genome and Exon arrays. Both files use the Affymetrix "Human Tissue Datasets" for arrays HG-U133\_Plus\_2, HuEx-1\_0-st-v2 and HuGene-1\_0-st-v1, respectively. \subsection{Using Biobase class ExpressionSet} Some users may prefer to use S4 class \Rclass{ExpressionSet}, defined in the \Rpackage{Biobase} package of Bioconductor, for further analysis of expression measures. Package \Rpackage{Biobase} contains a vignette "ExpressionSetIntroduction.pdf", which describes how to build an \Rclass{ExpressionSet} from scratch. Here we create a minimal \Rclass{ExpressionSet} containing the expression measures determined using RMA: First, we need to load library \Rpackage{Biobase}, then extract the expression levels from instance \Robject{data.rma} of class \Rclass{ExprTreeSet}, convert the data.frame to a matrix, and finally create an instance of class \Rclass{ExpressionSet}: <>= library(Biobase) expr.rma <- validData(data.rma) minimalSet <- new("ExpressionSet", exprs = as.matrix(expr.rma)) @ As described in vignette "ExpressionSetIntroduction.pdf", we can now access the data elements. For this example we create a new \Rclass{ExpressionSet} consisting of the 5 features and the first 3 samples: <>= vv <- minimalSet[1:5,1:3] featureNames(vv) sampleNames(vv) exprs(vv) @ This class \Rclass{ExpressionSet} can now be used from within other Bioconductor packages. \subsection{ROOT graphics} As noted earlier, package \xps\ allows to analyze Exon arrays on computers with only 1GB RAM. However, in some cases it may not possible to use R-based plots. For this purpose \xps\ takes advantage of the \ROOT\ graphics capabilities, which do not suffer from such memory limitations. In the following we will demonstrate some of the \ROOT\ graphics capabilities using the 33 exon array data of all 11 tissues from the Affymetrix Exon Array Data Set "Tissue Mixture" (see file "script4exon.R"). \\ Let us first create an image using function \Rfunction{root.image}: <>= root.image(data.exon, treename="BreastA.cel", zlim=c(3,11), w=400, h=400) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Image.png} \includegraphics[width=70mm]{ImageZoom.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak The left side of the figure shows the image created, while the right side shows the figure after zooming-in (see \Rcode{?root.image} how to save the image and how to zoom-in). \\ Now let us create density-plots for the raw intensities of all 33 exon arrays, as well as for the RMA-normalized expression levels: <>= root.density(data.exon, "*", w=400, h=400) root.density(data.x.rma, "*", w=400, h=400) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot.png} \includegraphics[width=70mm]{DensityPlotRMA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} In addition we create profile plots for the RMA-normalized expression levels: <>= root.profile(data.x.rma, w=640, h=400) @ \pagebreak \begin{figure}[h] \begin{center} \includegraphics{ProfileExonRMA.png} % \includegraphics[width=70mm]{ProfileExonRMA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} As you see, the profile plots shows both a histogram and a boxplot for each sample. \\ It is also possible to draw scatter-plots for the raw intensities between any two arrays, as well as between two RMA-normalized expression levels: <>= root.graph2D(data.exon, "BreastA.cel", "BreastB.cel") root.graph2D(data.x.rma, "BreastA.mdp", "BreastB.mdp") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Graph2D.png} \includegraphics[width=70mm]{Graph2DRMA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} The left scatter-plot compares the raw intensities of two breast tissue replicas for all probes on the exon array, while the right scatter-plot compares the respective normalized expression levels. \\ \pagebreak Besides using scatter-plots it is also possible to plot the same data as 2D-histograms: <>= root.hist2D(data.exon, "BreastA.cel", "BreastB.cel", option="COLZ") root.hist2D(data.x.rma, "BreastA.mdp", "BreastB.mdp", option="COLZ") root.hist2D(data.x.rma, "BreastA.mdp", "BreastB.mdp", option="SURF2") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Histogram2DRMA.png} \includegraphics[width=70mm]{Histogram2DRMAsurf2.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} Here we show two different ways to plot the 2D-histogram for the normalized expression levels by simply changing the parameter \Rcode{option}. The left histogram uses the default \Rcode{option="COLZ"} while the right histogram was created using \Rcode{option="SURF2"} to allow a 3-dimensional view of the expression level distribution. \\ Finally, it is also possible to create 3D-histograms: <>= root.hist3D(data.exon, "BreastA.cel", "BreastB.cel", "BreastC.cel", option="SCAT") root.hist3D(data.x.rma, "BreastA.mdp", "BreastB.mdp", "BreastC.mdp", option="SCAT") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Histogram3D.png} \includegraphics[width=70mm]{Histogram3DRMA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak The left 3D-histogram compares the raw intensities of the breast tissue triplicates for all probes on the exon array, while the right scatter-plot compares the respective normalized expression levels. Since quite often samples are hybridized onto arrays as triplicates, 3D-histograms are helpful in getting a first impression on the quality of the triplicates. {\bf Note:} The 3D-histograms can be rotated interactively, see \Rcode{?root.hist3D}. \subsection{Using methods FARMS and DFW} Analogously to method \Rcode{medianpolish}, used for \Rcode{rma}, both \Rcode{farms} and \Rcode{dfw} are multichip summarization methods. The algorithm for FARMS (Factor Analysis for Robust Microarray Summarization) is described in \citep{hochr:etal:2006} and is available as package \Rpackage{farms}. The algorithm for DFW (Distribution Free Weighted Fold Change) is described in \citep{chen:etal:2007} and the R implementation can be downloaded from the web site of M.McGee. Both authors claim that their respective methods outperform method \Rcode{rma} (see also Affycomp II: A benchmark for Affymetrix GeneChip expression measures). \\ The \Robject{R} implementation of both methods requires package \Rpackage{affy} since both methods must be registered with \Rpackage{affy}. In contrast, package \Rpackage{xps} implements both summarization methods in C++ and thus does not require any additional package. In general, summarization methods are implemented in package \Rpackage{xps} as C++ classes derived from base class \Rmethod{XExpressor}. Thus summarization method \Rcode{medianpolish} is implemented as class \Rmethod{XMedianPolish}, while methods \Rcode{farms} and \Rcode{dfw} are implemented as classes \Rmethod{XFARMS} and \Rmethod{XDFW}, respectively. \\ To use FARMS you simply do: <>= data.farms <- farms(data.test3,"tmp_Test3FARMS",verbose=FALSE) @ To use DFW you simply do: <>= data.dfw <- dfw(data.test3,"tmp_Test3DFW",verbose=FALSE) @ Since the authors of both algorithms recommend to use their summarization methods with quantile normalization but without background correction, methods \Rcode{farms} and \Rcode{dfw} follow these suggestions. Users wanting to use both methods with a background correction method need to use the general method \Rcode{express} (see \Rcode{?express}). \\ In addition to FARMS as summarization method the authors have also implemented a novel filtering method, called I/NI-calls, to exlcude non-informative genes, see \citep{tallo:etal:2007}. This method cannot only be used with FARMS but also together with other methods to compute expression measures such as RMA. \\ To use I/NI-calls you simply do: <>= call.ini <- ini.call(data.test3,"tmp_Test3INI",verbose=FALSE) @ {\bf Note:} Although package \Rpackage{farms} is available under the GNU General Public License, the authors state on their web site that: "This package (i.e. \Rpackage{farms\_1.x}) is only free for non-commercial users. Non-academic users must have a valid license." Since I do not know if this statement applies for my C++ implementation, too, it is recommended that respective users contact the authors of the original package. \bibliographystyle{plainnat} \bibliography{xps} \end{document}