%\VignetteIndexEntry{virtualArray Vignette} %\VignetteDepends{} %\VignetteKeywords{microarray virtual combine batch effect} %\VignettePackage{virtualArray} \documentclass[12pt]{article} \usepackage{Sweave} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \SweaveOpts{keep.source=TRUE} \begin{document} \title{virtualArray Package Vignette} \author{Andreas Heider} \date{11 August 2011} \maketitle \section{Introduction} Current tools for the analysis of microarray data only allow the comparison of datasets generated on the same platform and chip generation, and restrict meta-analysis studies to the evaluation of study results from different groups, rather than the direct comparison of the available raw datasets. The aim of the virtualArray package is to enable the combination of raw expression data of different microarray platforms into one "virtual array". Thus, the user may compare his own data to other datasets including public sources, regardless of the platform and chip generation used, or perform meta-analysis directly based on available raw datasets. The package generates a combined virtual array as a "ExpressionSet" object from different datasets by matching raw data entries based on probe, transcript, gene or protein identifiers. Redundancies, gaps, and batch effects are removed before proceeding with data analysis. virtualArray consists of several subsequent functions, requiring minimal user input. Briefly, (1) raw data are loaded into R, (2) probe sets for each platform are annotated, (3) genes, proteins or transcripts common to all platforms are matched, including checks for redundancy or missing values, (4) data are compiled into a new "virtual array", (5) normalized and (6) subjected to batch effect removal using empirical Bayes methods [1] or another method of choice. The generated "virtual array" can than be directly analyzed in R/Bioconducor or exported for use in other suitable software (e.g. MeV). There are essentially four modes of operation: Firstly, the "virtualArrayCompile" function can integrate the major (but not all) human microarray platfroms in a default mode requiering minimal user input. Secondly the "virtualArrayExpressionSets" function. This approach allows to integrate any kind of raw expression data that can be loaded into an ExpressionSet object in R/BioC. The downside in this case is that the user will have to deal with details such as log2-transformations, 16 bit - 20 bit transformations, assignment of correct annotations, etc. Additionally, each of these two approaches can be used in a supervised or non-supervised mode. The non-supervised mode uses empirical Bayes networks (implemented through "ComBat.R", [1]) to adjust for batch effects between the individual datasets. In the supervised mode the user assigns additional covariates next to the batch assignment, such as "treated" and "untreated", or "non-differentiated" and "differentiated". Note that this information has to be valid, as it impacts on the results you will get. Last but not least it is possible to use the package to integrate data without batch effect removal, so that other, user-defined, methods of batch effect removal can be employed. The combined data is presented as a regular Bioconductor "ExpressionSet" object, which permits using all of R/Bioconductor's power on the dataset as a whole. To load the package type: <>= options(width=72) @ <<>>= library(virtualArray) @ \section{More in depth step by step explanations} Now that you read the introduction I will explain in more detail the different steps that are made during processing and finally combining the data to the new virtual array. \begin{enumerate} \item The raw data of each chip/platform or simply batch are to be read in to form an ExpressionSet in Bioconductor. This is to be done by means of other packages e.g. "affy", "lumi" or "limma". Depending on the chip or package used, you may have to fill or fix the "annotation" slot of the ExpressionSet to contain available Bioconductor annotation packages without the "*.db" extension. This is particularly important when pulling data from NCBI GEO or EBI ArrayExpress. \item Each batch, even if it is based on the same platform, may have been scanned in using different hardware or different modes of usage. Because of this, we need to transform each dataset to common scale (log2, log10 or linear) and resolution (12, 14 , 16 or 20 bit). Again, we can do this using standard R functions on the "exprs" slot of the ExpressionSets. \item When we want to combine data from different platforms we cannot use manufacturer identifiers like 1000\_s\_at or ILM\_123456 to find matching pairs. Indeed we have to annotate each dataset with additional identifiers. When starting the processing using e.g. "virtualArrayExpressionSets()" we define which additional identifier to pull from the annotation package. The default is to use gene symbols (named "SYMBOL" in the annotation packages). However, it can be anything present in the annotation packages that gives a 1:1 mapping of identifiers. This will be fixed in future versions of the package. \item Before attempting to match common identifiers we need to collapse rows that target the same identifier or gene in the default case. This is done by either "median" (the default) or a user supplied function. This results in a reduced expression matrix. \item Now we proceed with matching common identifiers. A new expression matrix is built, that just includes rows for identifiers present in all datasets. \item virtualArray constructs a new ExpressionSet object with the just built expression matrix and a "pData" that contains relations between batches and samples. \item The newly generated ExpressionSet can now either be returned without further modifications or directly subjected to batch effect removal using empirical Bayes methods or another method of choice. Please see the documentation of the "virtualArray.ExpressionSets" function for details. This can be decided by the user with the logical or character vector "removeBatchEffects". Note, however, that the contents of the resulting ExpressionSet is not, and actually can not be, a simple concatenation of the input expression matrices. On the one hand incompatible probes/probesets are excluded during the process. On the other hand expression values targetting the same identifier (e.g. gene) are collapsed by the function (e.g. "median") defined in the first place. The option to not remove the batch effect has been implemented, so you can actually see the impact of it in the combined data. In other words, you could not create a figure like in section 3.3. (first figure) in an easy way without using this package. Furthermore this option can be used to pass the non-corrected ExpressionSet on to other batch effect removal methods. \end{enumerate} \section{A short example with real world data} I will now go through a short example using the "virtualArrayExpressionSets" approach. For this we will have to pull some data from the NCBI GEO database. I selected two induced pluripotent stem cell (iPSCs) datasets. The first one includes human embryonic stem cells (ESCs), fibroblasts and iPSCs [2]. The second one includes fibroblasts and iPSCs derived from them [3]. The task in this example is to combine and compare data from highly similar cell types that were generated on different platforms: the Agilent G4112A and the Affymetrix HG-U133Plus2 respectively. \subsection{Loading and preparing the data} We use the getGEO function from the GEOquery package to retrieve the desired GEO series. <<>>= library("GEOquery") GSE23402 <- getGEO("GSE23402",GSEMatrix=T,AnnotGPL=FALSE) GSE26428 <- getGEO("GSE26428",GSEMatrix=T,AnnotGPL=FALSE) @ We select a subset of the data, so the example will run faster. <<>>= GSE23402 <- GSE23402[[1]][,1:24] GSE26428 <- GSE26428[[1]] @ Agilent and Affimetrix use different raw data formats. Therefore it is absolutely essential to check if the raw data need to be transformed initially. Let's have a look at the data with the simple summary function. <<>>= summary(exprs(GSE23402)[,1:3]) summary(exprs(GSE26428)) @ We can see that GSE23402 data is not log-scaled, as we have values ranging high over 20000. On the other hand GSE26428 data is in log-scale at a resolution of 20 bit, because we have all values under 100. The maxima are above 16 and 17, but below 20 and 19. Now we cannot have 17 or 19 bit, but 20 bit. Also you can find information from Agilent, that the chip used here is scanned at 20 bit resolution. To resolve this problem, we log2-transform the Affymetrix data, and transform the Agilent data from 20 bit to 16 bit resolution. <<>>= exprs(GSE23402) <- log2(exprs(GSE23402)) exprs(GSE26428) <- (exprs(GSE26428)/20*16) @ Here are the results of our transformations. You can appreciate that both datasets now spread to the same data format: 16 bit log2-transformed. <<>>= summary(exprs(GSE23402)[,1:4]) summary(exprs(GSE26428)) @ We next need to set the correct Bioconductor annotations <<>>= annotation(GSE23402) <- "hgu133plus2" annotation(GSE26428) <- "hgug4112a" @ Now we create an empty object to hold our virtual arrays <<>>= my_virtualArrays <- NULL @ We have generated two ExpressionSets, which have their expression data in the same "space", and also have the correct annotation packages attached to them. We have also created an empty object to hold the ExpressionSets being generated. This is necessary, because we will create 2 of them and the function we call next, scans the current environment for ExpressionSets to combine them. So if we stored it in the current environment, we would end up adding our newly generated virtual array to the input data in the second run. \subsection{Building the virtual array} In the next step, we will compile the new ExpressionSet. <>= options(width=60) @ <<>>= my_virtualArrays$iPSC_hESC_noBatchEffect <- virtualArrayExpressionSets() @ We will now compile a second ExpressionSet WITHOUT REMOVING BATCH EFFECTS. \begin{Schunk} \begin{Sinput} > my_virtualArrays$iPSC_hESC_withBatchEffect <- + virtualArrayExpressionSets(removeBatcheffect=F) \end{Sinput} \begin{Soutput} Now preprocessing raw data of GSE23402 : Loading annotations... Now preprocessing raw data of GSE23402 : Collapsing expression values to their median ... Now preprocessing raw data of GSE23402 : Annotating expression values with SYMBOL ... [1] 19798 24 Size of expression matrix of GSE23402 : 19798 rows and 24 columns Now preprocessing raw data of GSE26428 : Loading annotations... Now preprocessing raw data of GSE26428 : Collapsing expression values to their median ... Now preprocessing raw data of GSE26428 : Annotating expression values with SYMBOL ... [1] 18575 3 Size of expression matrix of GSE26428 : 18575 rows and 3 columns Matching and merging all data sets. This could take some time... Size of expression matrix of whole dataset: 17608 rows and 27 columns Now combining pData of all data sets... Now setting up pData of GSE23402 ... Now setting up pData of GSE26428 ... The file 'sample_info.txt' has been written to your current working directory. Please modify it appropriately! Using only 'Batch' column as information for batch effect removal. \end{Soutput} \end{Schunk} This step just adds some sensible phenoData, including colors(!). <<>>= pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[5] <- c(as.character(pData(GSE23402)[,8]),as.character(pData(GSE26428)[,1])) pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[6] <- c(rep("red",24),rep("blue1",3)) @ The virtualArray package is now completely executed. At this stage one could proceed with your intended analysis of the first ExpressionSet, or apply an alternative batch effect removal method using the second ExpressionSet generated without batch effect removal. Note that we used some defaults of the function that are good to know about: We used gene symbols (identifiers="SYMBOL") to annotate each expression matrix and we selected the median (collapse\_fun=median) of rows targetting the same gene symbol (that is e.g. Affymetrix probe sets targeting the same gene). This results in the removal of redundancies and the reduction of the number of rows in each expression matrix (see the output above). Furthermore, the package removes gaps by selecting only those entries that are present in all datasets. This is necessary, because all platforms detect slightly different fractions of the genome, but not the whole genome. Hence, the number of rows is further reduced. \subsection{A glimpse at the results} To have a first look at the data, we create distance matrices and perform a hierarchical clustering. Distances between observations are calculated using euclidian distances. <<>>= dist_iPSC_hESC_noBatchEffect <- dist(t(exprs(my_virtualArrays$iPSC_hESC_noBatchEffect)), method="euclidian") @ \begin{Schunk} \begin{Sinput} > dist_iPSC_hESC_withBatchEffect <- + dist(t(exprs(my_virtualArrays$iPSC_hESC_withBatchEffect)), + method="euclidian") \end{Sinput} \end{Schunk} Trees are formed using average distance between clusters <<>>= hc_iPSC_hESC_noBatchEffect <- hclust(dist_iPSC_hESC_noBatchEffect, method="average") hc_iPSC_hESC_noBatchEffect$call <- NULL @ \begin{Schunk} \begin{Sinput} > hc_iPSC_hESC_withBatchEffect <- + hclust(dist_iPSC_hESC_withBatchEffect, method="average") > hc_iPSC_hESC_withBatchEffect$call <- NULL \end{Sinput} \end{Schunk} We will plot the hclust in color, because the naming itself isn't very clear \begin{Schunk} \begin{Sinput} > virtualArrayHclust(hc_iPSC_hESC_withBatchEffect, + lab.col=pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[,6], + lab=pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[,5], + main="batch effect NOT removed",cex=0.7, + xlab="sample names") \end{Sinput} \end{Schunk} \includegraphics{virtualArray-016} <>= virtualArrayHclust(hc_iPSC_hESC_noBatchEffect, lab.col=pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[,6], lab=pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[,5], main="batch effect removed",cex=0.7, xlab="sample names") @ We colored all samples from Guenther et al. in red and all sampels from Maekama et al. in blue. the first image shows, that there are indeed significant batch effects between the two experiments, as fibroblasts and iPS cells cluster according to the platform they were generated on, rather than according to their biological similarity. After batch effect removal, we see that the non-supervised mode yields a significant improvement. The terminally differentiated fibroblasts now cluster together, whereas the pluripotent iPS cells cluster together with pluripotent ESC lines, irrespective of the origin of the data. \subsection{A final run in supervised mode} In order to get the supervised mode to run, we need to edit the "sample\_info.txt" file, that is written to the current working directory on the fly. We will modify the 4th column to look like in table 1. % latex table generated in R 2.13.0 by xtable 1.5-6 package % Mon Oct 17 10:27:59 2011 \begin{table}[htbp] \label{tab:table1} \begin{center} \caption{sample\_info.txt} \begin{tabular}{rcccc} & $Array.name$ & $Sample.name$ & $Batch$ & $Covariate.1$ \\ \hline 1 & GSM574058 & GSM574058 & GSE23402 & fibroblast \\ 2 & GSM574059 & GSM574059 & GSE23402 & fibroblast \\ 3 & GSM574060 & GSM574060 & GSE23402 & fibroblast \\ 4 & GSM574061 & GSM574061 & GSE23402 & ESC\_or\_iPSC \\ 5 & GSM574062 & GSM574062 & GSE23402 & ESC\_or\_iPSC \\ 6 & GSM574063 & GSM574063 & GSE23402 & ESC\_or\_iPSC \\ 7 & GSM574064 & GSM574064 & GSE23402 & ESC\_or\_iPSC \\ 8 & GSM574065 & GSM574065 & GSE23402 & ESC\_or\_iPSC \\ 9 & GSM574066 & GSM574066 & GSE23402 & ESC\_or\_iPSC \\ 10 & GSM574067 & GSM574067 & GSE23402 & ESC\_or\_iPSC \\ 11 & GSM574068 & GSM574068 & GSE23402 & ESC\_or\_iPSC \\ 12 & GSM574069 & GSM574069 & GSE23402 & ESC\_or\_iPSC \\ 13 & GSM574070 & GSM574070 & GSE23402 & ESC\_or\_iPSC \\ 14 & GSM574071 & GSM574071 & GSE23402 & ESC\_or\_iPSC \\ 15 & GSM574072 & GSM574072 & GSE23402 & ESC\_or\_iPSC \\ 16 & GSM574073 & GSM574073 & GSE23402 & ESC\_or\_iPSC \\ 17 & GSM574074 & GSM574074 & GSE23402 & ESC\_or\_iPSC \\ 18 & GSM574075 & GSM574075 & GSE23402 & ESC\_or\_iPSC \\ 19 & GSM574076 & GSM574076 & GSE23402 & ESC\_or\_iPSC \\ 20 & GSM574077 & GSM574077 & GSE23402 & ESC\_or\_iPSC \\ 21 & GSM574078 & GSM574078 & GSE23402 & ESC\_or\_iPSC \\ 22 & GSM574079 & GSM574079 & GSE23402 & ESC\_or\_iPSC \\ 23 & GSM574080 & GSM574080 & GSE23402 & ESC\_or\_iPSC \\ 24 & GSM574081 & GSM574081 & GSE23402 & ESC\_or\_iPSC \\ 25 & GSM648497 & GSM648497 & GSE26428 & ESC\_or\_iPSC \\ 26 & GSM648498 & GSM648498 & GSE26428 & ESC\_or\_iPSC \\ 27 & GSM648499 & GSM648499 & GSE26428 & fibroblast \\ \end{tabular} \end{center} \end{table} Note, that we have thus chosen to order the samples into 2 groups in addition to the 2 batches. This tells the algorithm that we expect the biological variances to be greatest between these 2 groups. Now we run the package once again, but this time we need to modify the "sample\_info.txt" file when prompted. After editing please select "y" to run in supervised mode. <>= my_virtualArrays$iPSC_hESC_supervised <- virtualArrayExpressionSets(supervised=TRUE) @ <>= my_virtualArrays$iPSC_hESC_supervised <- virtualArrayExpressionSets(supervised=TRUE,sampleinfo=system.file("extdata","sample_info_mod.txt",package="virtualArray")) @ <<>>= dist_iPSC_hESC_supervised <- dist(t(exprs(my_virtualArrays$iPSC_hESC_supervised)), method="euclidian") @ <<>>= hc_iPSC_hESC_supervised <<- hclust(dist_iPSC_hESC_supervised, method="average") hc_iPSC_hESC_supervised$call <- NULL @ Again we will plot the hclust in color, because the nomenclature alone is not sufficient. <>= virtualArrayHclust(hc_iPSC_hESC_supervised, lab.col=pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[,6], lab=pData(my_virtualArrays$iPSC_hESC_noBatchEffect)[,5], main="batch effect removed - supervised mode",cex=0.7, xlab="sample names") @ To get a final view of the data we perform a principle component analysis. Coloring is based on batches. <>= pca_supervised <- prcomp(t(exprs(my_virtualArrays$iPSC_hESC_supervised))) plot(pca_supervised$x, pch=19, cex=2, col=c(rep("red",24),rep("blue",3),pch=17)) legend("topleft",c("GSE23402","GSE26428"),col=c("red","blue"),pch=19,cex=1) @ You can see that especially the fibroblasts cluster together more closely now. On the other hand each sample is on the same relative position, as when running in non-supervised mode. This indicates, that the batch effects have been further reduced, without losing biological information. This example shows, how "virtualArray" can be used to integrate samples from different labs generated on different platforms into one virtual array. Similarly, published datasets from two or more clinical studies could be combined into one large raw data based meta-analysis, e.g. to investigate transcriptional signatures of malignant tissues. Therefore, I hope this tool will be valuable for basic as well as clinical researchers. Now have fun using it! \section{Literature} \begin{enumerate} \item \textbf{Li C, Rabinovic A.} Adjusting batch effects in microarray expression data using empirical Bayes methods. \textbf{Biostatistics} 2007;8:118-127. \item \textbf{Guenther et al.} Chromatin structure and gene expression programs of human embryonic and induced pluripotent stem cells. \textbf{Cell Stem Cell} 2010;7(2);249-57. \item \textbf{Maekawa et al.} Direct reprogramming of somatic cells is promoted by maternal transcription factor Glis1. \textbf{Nature} 2011;474(7350);225-9. \end{enumerate} \end{document}