%\VignetteIndexEntry{Using the inSilicoMerging package} %\VignetteDepends{inSilicoDb} \documentclass{article} \usepackage{url} \usepackage{pdflscape} \usepackage{color} \newcommand{\todo}[1]{\textcolor{red}{\textbf{#1}}} \begin{document} \title{Using the inSilicoMerging package} \author{Jonatan Taminau$\footnote{\texttt{jtaminau@vub.ac.be}}$, Stijn Meganck, Cosmin Lazar} \date{CoMo, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium} \maketitle \section{Merging Gene Expression Data} An increasing amount of gene expression datasets is available through public repositories like for example GEO \cite{GEO} and ArrayExpress \cite{ArrayExpress}. Combining such data from different studies could be beneficial for the discovery of new biological insights and could increase the statistical power of gene expression analysis. However, the use of different experimentation plans, platforms and methodologies by different research groups introduces undesired batch effects in the gene expression values. This problem hinders and complicates further analysis and can even lead to incorrect conclusions \cite{Leek_2010}. Several methods to remove the batch effect but at the same time to preserve the biological variance inside the data are proposed in the last years. The inSilicoMerging package combines several of the most used methods to remove these unwanted batch effects for integrative analysis or large-scale analysis. All methods are implemented in such a way that they can be consistently used inside the Bioconductor framework. \section{Using the inSilicoMerging package} Using the inSilicoMerging package is straightforward since it mainly involves only a single function: \begin{verbatim} > merge(esets, method="NONE"); \end{verbatim} \noindent with \texttt{esets} a list of ExpressionSet objects and \texttt{method} one of the following options: \texttt{BMC}, \texttt{COMBAT}, \texttt{DWD}, \texttt{GENENORM}, \texttt{NONE} and \texttt{XPN}. Each of those methods is already extensively reported in literature but is nevertheless briefly explained in the following section.\\ \subsection{Evaluation through Visualization Tools} \noindent In order to visually inspect a merged dataset to have some direct feedback on its effect, five different visual validation methods are provided: \begin{verbatim} > plotMDS(eset, ...) > plotRLE(eset, ...) > plotDendrogram(eset, ...) > plotGeneWiseBoxPlot(eset, ...) > plotGeneWiseDensity(eset, ...) \end{verbatim} \noindent \texttt{plotMDS} creates a \emph{double-labeled} Multidimensional Scaling (MDS) plot. In this plot, all samples can be labeled by color and by symbol. This might be useful since for each sample its biological phenotype of interest and the study it originates from can be visualized simultaneously, giving an indication of the effectiveness of the used merging method. \texttt{plotRLE} creates a relative log expression (RLE) plot, which was initially proposed to measure the overall quality of a dataset but can also be used in this context. \texttt{plotDendrogram} creates a dendrogram after applying hierarchical clustering. Finally, \texttt{plotGeneWiseBoxPlot} and \texttt{plotGeneWiseDensity} provide a local visualization by looking at the boxplots/densities of genes across samples. All these methods are illustrated in the examples section.\\ \subsection{Evaluation through Quantitative Measures} \noindent In addition of the five visual inspection methods the \texttt{inSilicoMerging} package also provides six different quantitative evaluation measures: \begin{verbatim} > measureAsymetry(eset, ...) > measureGenesOverlap(eset, ...) > measureSamplesOverlap(eset, ...) > measureGenesMeanCorrCoef(eset1, eset2, ...) > measureSamplesMeanCorrCoef(eset1, eset2, ...) > measureSignificantGenesOverlap(eset1, eset2, ...) \end{verbatim} \noindent \texttt{measureAsymetry} compares the distribution of samples asymmetry, quantified by the \textit{skewness}, before and after batch removal. This index should have a value close to 0. \texttt{measureGenesOverlap} and \texttt{measureSamplesOverlap} measure the expected overlap of genes and samples between two independent studies, the higher this overlap, the better the integration process. \texttt{measureGenesMean\\CorrCoef} and \texttt{measureSamplesMeanCorrCoef} both indicate how the data is effected after batch removal. The method that least affects the data should be preferred. Note that this evaluation method does not give any clues on how effective a method is. The idea of the \texttt{measureSignificantGenesOverlap} method is that the quality of batch effect removal is proportional to the number of significant differentially expressed genes (DEGs) found in the newly combined study which are also found in all of the individual studies to be combined. \section{Different Merging Methods} Below we list, alphabetically, the merging techniques implemented in this package. Note that after using any of those methods the resulting merged dataset only contains the \textit{common} list of genes/probes between all studies. \subsubsection*{BMC} In \cite{BMC} they successfully applied a technique similar to z-score normalization for merging breast cancer datasets. They transformed the data by batch mean-centering, which means that the mean is subtracted: \begin{equation} \hat{x}_{ij}^k=x_{ij}^k-\overline{x}_{i}^k \end{equation} This technique was proposed to eliminate multiplicative bias. \subsubsection*{COMBAT} Empirical Bayes \cite{COMBAT} is a method that estimates the parameters of a model for mean and variance for each gene and then adjusts the genes in each batch to meet the assumed model. The parameters are estimated by pooling information from multiple genes in each batch. It is assumed that measured gene expression values of gene $i$ in sample $j$ of batch $k$ can be expressed as: \begin{equation} x_{ij}^k = \alpha_i + \mathbf{C}\beta_i + \gamma_i^k+\delta_i^k\epsilon_{ij}^k \end{equation} where $\alpha_i$ is the overall gene expression, $\mathbf{C}$ is a design matrix for sample conditions, $\beta_i$ is the vector of regression coefficients corresponding to $\mathbf{X}$, $\gamma_i^k$ and $\delta_i^k$ are the additive and multiplicative batch effects for gene $i$ in batch $k$ respectively and $\epsilon_{ij}^k$ are error terms. \subsubsection*{DWD} By searching for the separating hyperplane between data coming from different batches, Distance-weighted discrimination (DWD), an adaptation of Support Vector Machines (SVM), allows to remove bias by projecting the different batches on the hyperplane, calculating the batch mean $\overline{b}$ distance to the hyperplane and then subtracting the normal vector $\Delta$ of this plane multiplied by the mean \cite{DWD}. \begin{equation} \hat{x}_{ij}^k = x_{ij}^k - \overline{b}\Delta \end{equation} \subsubsection*{GENENORM} One of the simplest mathematical transformations to make datasets more comparable is z-score normalization. In this method, for each gene expression value $x_{ij}$ in each study separately all values are modified by subtracting the mean $\overline{x}_{i}$ of the gene in that dataset divided by its standard deviation $\sigma_i$: \begin{equation} \hat{x}_{ij}^k= \frac{x_{ij}^k-\overline{x}_{i}^k}{\sigma_{i.}^k} \end{equation} \subsubsection*{NONE} The most basic approach to combine two datasets is to simply \textit{paste} them together without any transformation. This can be used as a baseline against which other techniques can be compared. \subsubsection*{XPN} The basic idea behind the cross-platform normalization \cite{XPN} approach is to identify homogeneous blocks (clusters) of gene and samples in both studies that have similar expression characteristics. In XPN, a gene measurement can be considered as a scaled and shifted block mean. For a platform $k$, gene $i$ and sample $j$, the recorded gene expression is given by: %The basic idea behind the cross-platform normalization \cite{XPN} approach is to find blocks (clusters) of gene-sample pairs in both studies that have similar expression characteristics. In \texttt{XPN}, a gene measurement can be considered as a scaled and shifted block mean. For a platform $k$, gene $i$ and sample $j$: \begin{equation} \label{eq:shaba} x_{ij}^k = A^k_{\alpha^*(i),\beta_k^*(j)} b_i^k + c^k_i + \sigma^k_i\epsilon^k_{ij} \end{equation} where $A^k_{\alpha^*,\beta^*}$ is a block mean and $b^k_{i}$ and $c^k_i$ represent gene and platform specific sensitivity and offset parameters respectively. The functions $\alpha^*()$ and $\beta^*()$ map a specific gene measurement in a sample to their corresponding multi-platform cluster. The noise variables $\epsilon^k_{ij}$ are assumed independent standard gaussians. \texttt{XPN} uses an iterative scheme to update the parameters in Equation \ref{eq:shaba} until convergence to a local minimum, giving: \begin{equation} \hat{x}_{ij}^k = \hat{A}^k_{\alpha^*(i),\beta_k^*(j)} \hat{b}_i^k + \hat{c}^k_i + \hat{\sigma}^k_i\hat{\epsilon}^k_{ij} \end{equation} More details can be found in \cite{XPN}. \subsection{Merging two-by-two} \noindent Some merging technique are only reported and implemented to merge exactly two studies (e.g. \texttt{XPN} \cite{XPN} and \texttt{DWD} \cite{DWD}). In order to be able to merge any number of studies, this package added an additional step. This step combines all studies two-by-two and is called recursively on the intermediate results until only one, merged, dataset remains. Its behavior is illustrated in the following example: \begin{verbatim} list of studies = [ A ; B ; C ; D ; E ] m(X,Y) = applying merging technique 'm' on dataset 'X' and 'Y' combineByTwo: iteration 1 : [ E ; m(A,B) ; m(C,D) ] => [ E ; AB ; CD ] iteration 2 : [ CD ; m(E,AB) ] => [ CD ; EAB ] iteration 3 : [ m(CD,EAB) ] => [ CDEAB ] \end{verbatim} \section{Example} \noindent For this example we retrieve two Lung Cancer datasets using the inSilicoDb package \cite{inSilicoDb}. Both datasets were assayed on different platforms (Affymetrix Human Genome U133A Array and Affymetrix Human Genome U133 Plus 2.0 Array) and then preprocessed using fRMA \cite{fRMA}. <>= library(inSilicoDb) eset1 = getDataset("GSE19804", "GPL570", norm="FRMA", genes=TRUE); eset2 = getDataset("GSE10072", "GPL96", norm="FRMA", genes=TRUE); esets = list(eset1,eset2); @ \noindent Both studies contain normal and tumor samples and are already consistently annotated with a common \texttt{Disease} feature: <>= table(pData(eset1)[,"Disease"]); table(pData(eset2)[,"Disease"]); @ \noindent We now can simply merge both studies without applying any transformation: <>= library(inSilicoMerging); eset_NONE = merge(esets, method="NONE"); @ \noindent To further investigate the combined data we can use the \texttt{plotMDS} function to have a first visual inspection. <>= plotMDS(eset_NONE, targetAnnot="Disease", batchAnnot="Study", main="NONE (No Transformation)"); @ \noindent From this plot we can immediately notice a very strong dataset-bias (probably due to the difference in platform) while we would expect that all control samples from both studies would cluster together. Let us try another method to see if we can solve this issue: <>= eset_COMBAT = merge(esets, method="COMBAT"); plotMDS(eset_COMBAT, targetAnnot="Disease", batchAnnot="Study", main="COMBAT"); @ \noindent This clearly looks better. Both studies are mixed together and the biological phenotype of interest (tumor versus normal) is preserved in the merged dataset.\\ \noindent In a similar way we can use the other visualization methods too. Another method to look at the distances between samples is for example looking at the result of (hierarchical) clustering. This is provided by the \texttt{plotDendrogram} function:\\ <>= par(mfrow=c(1,2)) select = sample(1:ncol(eset_NONE),25); plotDendrogram(eset_NONE[,select], batchAnnot="Study", legend=FALSE, main="NONE"); plotDendrogram(eset_COMBAT[,select], batchAnnot="Study", legend=FALSE, main="COMBAT"); @ <>= ofile = "example_6.pdf"; pdf(file=ofile, paper="special", width=10, height=6); par(mfrow=c(1,2)) select = sample(1:ncol(eset_NONE),25); plotDendrogram(eset_NONE[,select], batchAnnot="Study", legend=FALSE, main="NONE"); plotDendrogram(eset_COMBAT[,select], batchAnnot="Study", legend=FALSE, main="COMBAT"); for(i in 1:1) { dev.off(); } cat("\\begin{center}\\includegraphics[width=\\textwidth]{", ofile, "}\\end{center}\n\n", sep=""); @ \noindent For clarity purposes we only select 25 (random) samples. The different samples are labeled, also for clarity purposes, by numbers corresponding to the study they originate from instead of their full sample name. We can compare the merging without transformation (\texttt{NONE}) on the left and after using the \texttt{COMBAT} method on the right.\\ \noindent To illustrate the RLE plots we again only select 25 samples. In this plot the samples are colored based on the study they originate from.\\ <>= par(mfrow=c(1,2)) select = sample(1:ncol(eset_NONE),25); plotRLE(eset_NONE[,select], batchAnnot="Study", legend=FALSE, main="NONE"); plotRLE(eset_COMBAT[,select], batchAnnot="Study", legend=FALSE, main="COMBAT"); @ <>= ofile = "example_7.pdf"; pdf(file=ofile, paper="special", width=10, height=6); par(mfrow=c(1,2)) select = sample(1:ncol(eset_NONE),25); plotRLE(eset_NONE[,select], batchAnnot ="Study", legend=FALSE, main="NONE"); plotRLE(eset_COMBAT[,select], batchAnnot ="Study", legend=FALSE, main="COMBAT"); for(i in 1:1) { dev.off(); } cat("\\begin{center}\\includegraphics[width=\\textwidth]{", ofile, "}\\end{center}\n\n", sep=""); @ \noindent In contrast to the three previous methods which illustrated the global bias between two datasets we also can obtain a very local view this time. With the following two functions, the local effect of each method on the gene level can also be illustrated. This effect is of course different for every gene. We arbitrary select the \texttt{MYL4} gene to illustrate the batch effects through gene-wise box plots and through gene density plots. <>= eset_BMC = merge(esets, method="BMC"); eset_DWD = merge(esets, method="DWD"); gene = "MYL4"; par(mfrow=c(2,2)); plotGeneWiseBoxPlot(eset_NONE, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=TRUE, main="NONE"); plotGeneWiseBoxPlot(eset_COMBAT, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=FALSE, main="COMBAT"); plotGeneWiseBoxPlot(eset_BMC, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=FALSE, main="BMC"); plotGeneWiseBoxPlot(eset_DWD, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=FALSE, main="DWD"); @ <>= eset_BMC = merge(esets, method="BMC"); eset_DWD = merge(esets, method="DWD"); ofile = "example_8.pdf"; pdf(file=ofile, paper="special", width=8, height=6); gene = "MYL4"; par(mfrow=c(2,2)); plotGeneWiseBoxPlot(eset_NONE, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=TRUE, main="NONE"); plotGeneWiseBoxPlot(eset_COMBAT, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=FALSE, main="COMBAT"); plotGeneWiseBoxPlot(eset_BMC, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=FALSE, main="BMC"); plotGeneWiseBoxPlot(eset_DWD, targetAnnot ="Disease", batchAnnot ="Study", gene=gene, legend=FALSE, main="DWD"); for(i in 1:1) { dev.off(); } cat("\\begin{center}\\includegraphics[width=\\textwidth]{", ofile, "}\\end{center}\n\n", sep=""); @ <>= gene = "MYL4"; par(mfrow=c(2,2)); plotGeneWiseDensity(eset_NONE, batchAnnot ="Study", gene=gene, legend=TRUE, main="NONE"); plotGeneWiseDensity(eset_COMBAT, batchAnnot ="Study", gene=gene, legend=FALSE, main="COMBAT"); plotGeneWiseDensity(eset_BMC, batchAnnot ="Study", gene=gene, legend=FALSE, main="BMC"); plotGeneWiseDensity(eset_DWD, batchAnnot ="Study", gene=gene, legend=FALSE, main="DWD"); @ <>= ofile = "example_9.pdf"; pdf(file=ofile, paper="special", width=8, height=6); gene = "MYL4"; par(mfrow=c(2,2)); plotGeneWiseDensity(eset_NONE, batchAnnot ="Study", gene=gene, legend=TRUE, main="NONE"); plotGeneWiseDensity(eset_COMBAT, batchAnnot ="Study", gene=gene, legend=FALSE, main="COMBAT"); plotGeneWiseDensity(eset_BMC, batchAnnot ="Study", gene=gene, legend=FALSE, main="BMC"); plotGeneWiseDensity(eset_DWD, batchAnnot ="Study", gene=gene, legend=FALSE, main="DWD"); for(i in 1:1) { dev.off(); } cat("\\begin{center}\\includegraphics[width=\\textwidth]{", ofile, "}\\end{center}\n\n", sep=""); @ \section{Conclusion} \noindent As this example illustrates, it is now straightforward to merge a number of gene expression studies by applying different existing methods. A number of simple visualization tools are provided for a first inspection of the merged dataset(s). In a similar way the quantitative measures can be applied as well. \section{Session Info} <<>>= sessionInfo() @ \bibliographystyle{plain} \bibliography{inSilicoMerging} \end{document}