%\VignetteIndexEntry{canceR Vignette} %\VignetteEngine{R.rsp::tex} %\VignetteAuthor{Karim Mezhoud} %\VignetteDepends{} %\VignetteKeywords{ A Graphical User Interface for accessing and modeling the MSKCC Cancer Genomics Data.} %\VignettePackage{canceR} %% Reduce the size of Vignette.pdf file %tools::compactPDF("/home/mezhoud/CGDS-R/canceR/vignettes/canceR.pdf", gs_quality="ebook") \documentclass[a4paper]{article} %\usepackage[OT1]{fontenc} \usepackage{zi4} \usepackage{Sweave} \usepackage{url} \usepackage[pdftex,plainpages=true,pdfpagelabels, colorlinks=true ,urlcolor= blue, linkcolor= red, citecolor=magenta]{hyperref} \usepackage{graphicx} \usepackage{amsmath} \usepackage[margin=1in]{geometry} \setkeys{Gin}{width=0.9\textwidth} % set figures width (this is the default) \begin{document} \input{canceR-concordance} %\SweaveOpts{concordance=TRUE} \title{canceR: A Graphical User Interface for accessing and modeling the Cancer Genomics Data of MSKCC} \author{Karim Mezhoud} \date{January 5, 2015} \maketitle \tableofcontents \newpage \section{Introduction} \texttt{canceR} is a graphical user friendly interface to explore, compare, and analyse all available Cancer Data (Clinical data, Gene Mutation, Gene Methylation, Gene Expression, Protein Phosphorylation, Copy Number Alteration) hosted by the Computational Biology Center (cBio) at Memorial-Sloan-Kettering Cancer Center (MSKCC). \texttt{canceR} implements functions from various packages: \begin{enumerate} \item to acces, explore and extract Genomics Cancers Data Base of MSKCC (\texttt{cgdsr},\cite{Cerami2012,Gao2013}), \item to associate phenotypes with gene expression (\texttt{phenoTest}, \cite{Planet2013}), \item to predict which biological process or pathway or immune system are significantly different under the phenotypes and which genes are associated (\texttt(GSEA-R)~\cite{Subramanian2005,Subramanian2007}), \item to predict the most up/down regulated gene sets belonging to one of MSigDB collections~\cite{Subramanian2005} (\texttt{GSEAlm},\cite{Oron2008}), \item to classify genes by diseases (\texttt{geNetClassifier},\cite{Aibar2013}), or \item to classify genes by variable or phenotype (\texttt{rpart}, \cite{Therneau2014}), \item to plot genes correlations. \end{enumerate} % The Cancer Genomics Data has open-access Web service at \url{http://www.cbioportal.org/public-portal/} for interactive exploration, vizualisation, and analysis~\cite{Cerami2012,Gao2013}, Web API, R package (\url{http://cran.r-project.org/web/packages/cgdsr/index.html}), and MATLAB toolbox (\url{http://www.mathworks.com/matlabcentral/fileexchange/31297-mskcc-cgds-cancer-genomics-toolbox}) for direct programmatic access. \section{Installation} No R packages: \begin{itemize} \item tktable \url{http://tktable.sourceforge.net/} \item BWidget \url{http://sourceforge.net/projects/tcllib/files/BWidget/} \item GnuWin32 (for Windows) \url{http://sourceforge.net/projects/getgnuwin32/?source=typ_redirect} \end{itemize} For Debian distribution (GNU/Linux) \begin{Schunk} \begin{Sinput} sudo apt-get insall ("r-cran-tcltk2","r-cran-tkrplot") sudo apt-get install('Tk-table', 'BWidget') sudo apt-get install libcurl4-openssl-dev sudo apt-get install r-cran-xml \end{Sinput} \end{Schunk} For Windows distribution LibXml2: parser for XML \url{http://sourceforge.net/projects/gnuwin32/files/} For OS X distribution X11: graphics device \\ run \texttt{R} and write theses lines in console to install dependancies. \begin{Schunk} \begin{Sinput} ## dependencies from CRAN install.packages("RCurl", "XML") install.packages(c("cgdsr", "tkrplot", "Formula", "RSvgDevice","RCurl" )) ## dependencies from Bioconductor source("http://bioconductor.org/biocLite.R") biocLite("GSEABase", "GSEAlm","geNetClassifier","Biobase", "phenoTest") biocLite("canceR") library(canceR) canceR() \end{Sinput} \end{Schunk} Get the development version from github \begin{Schunk} \begin{Sinput} devtools::install\_git("kmezhoud/canceR") \end{Sinput} \end{Schunk} \section{Starting Window} run \texttt{R} and write theses lines in console to run \texttt{canceR} package. \begin{Schunk} \begin{Sinput} library(canceR) canceR() \end{Sinput} \end{Schunk} The starting window (Figure~\ref{Fig1}) loads all available Cancer Studies (Figure~\ref{Fig1} 3) or search some ones by key word (Figure~\ref{Fig1} 4). Before to get Cancers Data (Figure~\ref{Fig1} 7), it is important to set workspace for output files (Figure~\ref{Fig1} 1). The starting window displays Help menu where user can get this vignette (Figure~\ref{Fig1} 2). \begin{figure}[!ht] \centering \includegraphics[scale=0.5]{image/starting.png} \caption{Starting Windows. 1, File Menu; 2, Help Menu; 3, Button to get all available studies; 4, Button to get only matched studies using key words; 5, list box that displays the number of studies listed in 6; 6, list box that displays the result of quering the Cancer Genomics Data Server. User can select one or multiple Studies; 7, Button to get Genetic Profiles and Clinical data for selected Studies.} \label{Fig1} \end{figure} \subsection{Setting Workspace} \texttt{canceR} package uses input files to compute models and generates output files for biological knowledges. It is important to set workspace and know the location of used files and results. The Button "Set Workspace" allows user to set easily workspace (Figure~\ref{Fig2}). User needs just to browse workspace folder or creates a new one. The others necessary folders would be created by simple pressing "Set" buttons. \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/workspace.png} \caption{Setting Workspace} \label{Fig2} \end{figure} %\subsection{Help Menu} \section{Main Window} After selecting studies and pressing on "Get Cases and Genetic Profiles" Button, the main window appears (Figure~\ref{Fig3}) and displays the progress of loading data of selected studies. The Main Window has a \texttt{Toolbar} with Menus (see following paragraphs). It is subdivised in two columns. The first column lists Cases for all selected studies. The first line of every study indicates its Index and its short description. The remain lines enumerate Cases with short description of data type and the number of samples. The second list box shows selected Cases. Similarly, the second column displays informations of Genetic Profiles. User can select a single or multiple lines with attention to correspond the Case with appropriate Genetic Profile. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/mainWindow.png} \caption{Main Window of \texttt{canceR} package. 1, Toolbar, 2, list box of loaded Cases; 3, list box of selected Cases; 4, list box of loaded Genetic Profiles; 5, list box of selected Genetic Profiles } \label{Fig3} \end{figure} %\subsection{Toolbar of the Main Window} %\label{toolbar} \subsection{\texttt{Gene List}} The first step to get genomics data is to specify what are interesting genes for user. The \texttt{Gene List} button browses folders to load Gene list file or displays examples of genes list. The genes could be in text file (.txt) with one gene by line using HUGO gene Symbol. The function removes automatically duplicate genes. \subsection{\texttt{Clinical Data}} The \texttt{Multiple Cases} button displays successively selected Cases. Results are returned in a table with row for each case and a column for each clinical attribute (Figure~\ref{Fig4}B). User could select all or some clinical data by checking dialog box (Figure~\ref{Fig4}A). For example, we select clinical attributes: \begin{itemize} \item Overall Survival months: Overall survival, in months. \item Overall Survival Status: Overall survival status, usually indicated as "LIVING" or "DECEASED". \item Disease Free Survival months: Disease free survival, in months. \item Disease Free Survival status: Disease free survival status, usually indicated as "DiseaseFree" or "Recurred/Progressed". \item Age at diagnosis: Age at diagnosis. \end{itemize} \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/clinicalData.png} \caption{Getting clinical data for Breast Invasive Carcinoma. A, Dialog Check Box to select clinical data; B, Results of quering clinical data of Breast Invasive Carcinoma (TCGA, Nature 2012). } \label{Fig4} \end{figure} \subsection{\texttt{Mutation}} User can search all mutation in gene list of all selected studies. He needs to select \texttt{All tumors samples} in Cases and \texttt{Mutations} in gentics profiles to get mutations (Figure~\ref{Fig5}). \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/Mutation1.png} \caption{Get all/specific mutations for selected Cases } \label{Fig5} \end{figure} \texttt{Mutation} function allows user to select about 15 informations corresponding to mutations~(Figure~\ref{Fig6}A). The results is a table with rows for each sample/case, and columns corresponding to the informations cheched in dialog mutation check box~(Figure~\ref{Fig6}B). \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/mutation.png} \caption{A, mutation dialog; B, mutation data} \label{Fig6} \end{figure} User can filter mutation result only for specific amino acid change (Figure~\ref{Fig7}). \begin{figure}[!ht] \centering \includegraphics[scale=0.5]{image/specificMutation.png} \caption{Filtering mutation by specific amino acid change. A, the search is done using regular express language. In this case the query specify 3 options: \texttt{D438.}, search all mutation of \texttt{D} Amino acid at position 438 \texttt{or D.*K} search the position of amino acid change of \texttt{D} to \texttt{K or D.*.} search for all \texttt{D} amino acid changes; B, Results of the merge of the 3 options filter of \texttt{D} amino acid changes.} \label{Fig7} \end{figure} \subsection{\texttt{Methylation}} User can search gene methylation and its correlation with mRNA expression. User needs to select Cases and Genetic Profiles with same methylation assay (HM450 or HM27) for the same study. Multiple Cases selection is allowed for one gene list (Figure~\ref{Fig8}). \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/methylation.png} \caption{Selecting methylation data} \label{Fig8} \end{figure} The dialog box of \texttt{methylation} function allows user to specify the threshold of the correlation rate (Figure~\ref{Fig9}A). cBioportal~\cite{Cerami2012,Gao2013} includes only methylation data from the probe with the strongest negative correlation between the methylation signal and the gene's expression. The result table (Figure~\ref{Fig9}B) lists genes with median of rate upper than 0.8. \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/methylationResult.png} \caption{A, Methylation dialog box; B, Methylation results. Correlation of silensing gene expression by methylation (HM27) with correlation rate \texttt{r > 0.8} for gene list in Breast Invasive Carcinoma (TCGA, Nature 2012).} \label{Fig9} \end{figure} \subsection{\texttt{Profiles}} The function get Profile Data depends on gene list, cases, and genetic profiles. If a \texttt{Single gene} option is done, dialog box appears to specify gene symbol (Figure~\ref{Fig10}A). The returned dialog check box allows user to choose some/all profiles data (Figure~\ref{Fig10}B). The result (table) lists some/all genetic profiles data in columns (CNA, Met, Mut, mRNA,RPPA) and all available samples in rows (Figure~\ref{Fig10}C). Oppositely, if \texttt{Multiple genes} option is done, the returned table displays genes expression for gene list (column) for all samples (rows). In the case of multiple genes, the tables are saved in \texttt{Results/ProfilesData} folder. \begin{figure}[!ht] \centering \includegraphics[scale=0.5]{image/singleProfile.png} \caption{Getting profile data of a single gene} \label{Fig10} \end{figure} \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/multipleProfiles.png} \caption{Profile data of a multiple genes} \label{Fig11} \end{figure} \subsection{\texttt{PhenoTest}} The function was implemented from package \texttt{PhenoTest}~\cite{Planet2013}. The object of this function is to predict the association between a list of phenotype variables (Survival, DFS~Status, OS~Status) and the gene expression. There are two possible formula to get associations: \begin{enumerate} \item Three variables: \emph{Survival} status (event/time as Dead-Living / 30 Months), \emph{Categorical} or \emph{ordinal} description (DSF~STATUS or Tumor stage), and \texttt{Continuous} value (DFS~MONTHS, Tumor size). \item Two variables: \emph{Categorical} or \emph{ordinal} description (DSF~STATUS or Tumor stage), and \texttt{Continuous} value (DFS~MONTHS, Tumor size). In this case user does not need to select any variables for survival variable in the phenoTest dialog box (Figure~\ref{PhenoTestpradbroad}A). \end{enumerate} The output of this function does not expect to give systematically a relevant association between all formula of the chosen variables, although in some cases it is possible to cluster a list of genes significantly regulated (gene expression) at a range of tumor size (continuous) or tumoral stage (ordinal) for recurred or DiseaseFree cases (survival and categorical). The type of variables could be explored with Clinical Data tables and selected in the phenoTest dialog box (Figure~\ref{PhenoTestpradbroad}A). The effect of both \emph{continuous}, \emph{categorical} and \emph{ordinal} phenotype variables on gene expression levels are tested via \texttt{lmFit} from \texttt{limma} package~\cite{Wettenhall2004}. Gene expression effects on \emph{survival} are tested via Cox proportional hazards model~\cite{Cox1972}, as implemented in function \texttt{coxph} from \texttt{survival} package. \begin{quote} NB: Continuous or Categories can not have more than 4 classes. \end{quote} \paragraph{Examples} \paragraph{Study: Prostate Adenocarcinoma (Broad/Cornal, Call 2013)} \begin{itemize} \item Cases: All tumor samples (57 samples), \item Genetic Profiles: mRNA expression, \item Gene list: 1021.txt file \item Survival variable: empty \item Categorical variable: Pathology Tumor Stage \item Numeric variable: Serum PSA level \item pVal adjust method: BH \item \texttt{PhenoTest} with Two variables (Figure~\ref{PhenoTestpradbroad}) \end{itemize} After running \texttt{Pheno/Exp}, \texttt{PhenoTest} function returns two tables. The first table ranks gene list by \texttt{pval} (Figure~\ref{PhenoTestpradbroad}B). The first part (red square) displays pValues of the association between gene expression and Tumor stage. The second part (blue square) displays the fold change (fc) by PSA level rang.\\ \textbf{Interpretation:} Notice that a single pValue is reported for each phenotype variable. For categorical variables these corresponds to the overall null hypothesis that there are no differences between groups.\\ In the second table, \texttt{PhenoTest} function filters only gene that has significant pval (pval <0.05, Figure~\ref{PhenoTestpradbroad}C red). Here we see that tumor stage has been categorized into 2 groups (pT2c, pT3c) and PSA level has been ranged into 2 groups (7.3-12.9, 12.9-16.7). This results shows that \texttt{ANO3} gene is significantly down regulated (negative fold change) for the two pathology tumor stages (pT2c, pT3c). \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/prad_broad-2013Results.png} \caption{PhenoTest; A, Dialog Boxused to select variables; B, Results; C, Only significant pValues.} \label{PhenoTestpradbroad} \end{figure} \paragraph{Heteroneous Clinical Data} In some cases it is possible to have digital (0-9) and character (a-z) data in the same variable. in this case phenoTest function considers it as Categorical variable~(Figure~\ref{heteroClinData}). \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/heterogeneous1.jpeg} \caption{Clinical Data heterogeneity. In PURITY ABSOLUTE column is there digital and character values. in this cas PhenoTest consider this variable as Categoric.} \label{heteroClinData} \end{figure} \paragraph{Study: Prostate Adenocarcinoma, Metastatic (Michigan, Nature 2012)} \begin{itemize} \item Cases: All tumor samples (61 samples), \item Genetic Profiles: mRNA expression \item Gene list: 1021.txt file \item Survival variable: OS MONTHS, OS STATUS \item Categorical variable: OS STATUS \item Numeric variable: Serum PSA level \item pVal adjust method: BH \item \texttt{PhenoTest} with three variables (Figure~\ref{PhenoTestprad_Michigan}) \end{itemize} In This test, Overall Survival (OS\_STATUS) was used in survival and caterogical variables. The Clinical Data does not have enougth categorical variables. Figure~\ref{PhenoTestprad_Michigan} B and C shows signicant association between 7 genes and Living Status (OS\_STATUS.Living.pval column). The two last columns show opposite regulation of the 7 genes expression in living patient with serum PSA level. The Cox proportion hazard model does not give results with survival variables (OS\_STATUS column). \begin{figure}[!ht] \centering \includegraphics[scale=0.8]{image/prad_MichiganResults.png} \caption{PhenoTest: Prostate Adenocarcinoma, Metastatic (Michigan, Nature 2012)} \label{PhenoTestprad_Michigan} \end{figure} \paragraph{Study:Lung Adenocarcinoma (TCGA, Nature, in press)} \begin{itemize} \item Cases: All Samples with mRNA expression data (230 samples), \item Genetic Profiles: mRNA expression z-Scores (RNA Seq V2 RSEM) \item Gene list: 1021.txt file \item Survival variable: empty \item Categorical variable: OS STATUS \item Numeric variable: OS\_MONTHS \item pVal adjust method: BH \item \texttt{PhenoTest} with two variables. \end{itemize} In this Lung cancer Study, the test shows significant association between living patient and 10 genes expression (Figure~\ref{PhenoTestLaungNature}). \begin{figure}[!ht] \centering \includegraphics[scale= 0.7]{image/LungNatureResults.png} \caption{PhenoTest: Lung Adenocarcinoma (TCGA, Nature, in press)} \label{PhenoTestLaungNature} \end{figure} \newpage \subsection{\texttt{GSEA-R}} Gene Set Enrichment Analysis (GSEA) is computational method that uses expression matrix of thousands of genes with phenotypes data (two biological states) and Molecular Signatures DataBase (MSigDB) to define which biological process or pathway or immune system are significantly different under the phenotypes and which genes are associated~\cite{Subramanian2005}. \subsubsection{Preprocessing of Exprimental Data} \texttt{getGCT\_CLS} function loads Profile and Clinical data of selected study and saves two files into "gct\_cls" folder (Figure~\ref{figgctcls}C). \begin{description} \item[The GCT file] contents genes expression values with genes in the rows and samples in the columns. \item[The CLS file] contents the two biological phenotypes selected from Clincical data. User needs to select clinical phenotype only with two classes. \end{description} \subsubsection{Molecular Signatures DataBase} The Molecular Signatures DataBase (MSigDB) is a collection of annotated gene sets for use with GSEA computational method. The MSigDB gene sets are divided into 7 collections (positional gene sets, curated gene sets, motif gene sets, computational gene sets, GO gene sets, oncogenic signatures, and immunological signatures). All these collections are available at (\url{http://www.broadinstitute.org/gsea/msigdb/index.jsp}). Every collections consists in a tab delimited file format (\texttt{.GMT} file) that describes gene sets. Each row shows annotation terme with associated genes. User needs to download \texttt{.gmt} file with genes Symbols and saves them into "workspace/MSigDB/" folder. The MSigDB folder is created with the file menu in the starting windows~(Figure~\ref{Fig1}). For more detail about \texttt{GCT, CLS, GMT} files, see this link (\url{http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats}). \paragraph{MSigDB Collection} \begin{description} \item[C1: Positional Gene Sets] Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene. These gene sets are helpful in identifying effects related to chromosomal deletions or amplification, epigenetic silencing, and other region effects. \item[C2: Curated Gene Sets into Pathways] Gene sets collected from various sources such as online pathway databases. \begin{description} \item[CGP: Chemical and Genetic Perturbation] Gene sets represent expression signatures of genetic and chemical perturbations. \item[CP: Reactome gene sets] Gene sets derived from the Reactome pathway database. \end{description} \item[C3: Motifs Gene Sets] Gene sets that contain genes that share: \begin{description} \item [MIR: microRNA targets] A 3'-UTR microRNA binding motif. \item [TFT: tanscription factor targets] A transcription factor binding site defined in the TRANSFAC (version 7.4, \url{http://www.gene-regulation.com/}) database. \end{description} \item[C4: Computational Gene Sets] Computational gene sets defined by mining large collections of cancer-oriented microarray data. \item[C5: GO Gene Sets] Gene sets are named by GO term (\url{http://geneontology.org/}) and contain genes annotated by that term: Biological Process, Cellular Component, and Molecular Function. \item[C6: Oncogenic Signatures] Gene sets represent signatures of cellular pathways which are often dis-regulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO. \item[C7: Immunologic Signatures] Gene sets that represent cell states and perturbations within the immune system. This resource is generated as part of the Human Immunology Project Consortium (HIPC; \url{http://www.immuneprofiling.org/}). \end{description} \subsubsection{Examples} \paragraph{Study: Uterine Corpus Endometrioid Carcinoma (TCGA, Nature 2013)} \begin{itemize} \item Cases: All Samples with mRNA, CNA, and sequencing data (232 samples), \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item MSigDB: c5.bp.v4.0.symbols \item Gene list: 1021.txt file \item Nbr of Samples: 100 \item Phenotype: DFS\_STATUS \end{itemize} Based only on Gene list the function \texttt{getGCT,CLS files} builts the \texttt{.gct} and \texttt{.cls} files and save them under the folder "/gct\_cls/". The Figure~\ref{figgctcls} shows the preporcessing steps to get \texttt{gct} and \texttt{cls} files. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/gct_cls.png} \caption{Preprocessing of \texttt{gct} and \texttt{cls} files. A, Sampling the Cases from Uterine Corpus Endometrioid Carcinoma study; B, selecting phenotype variable with only two classes; C, Displaying \texttt{cls} and \texttt{gct} files. The names of the two files are indicated in the title of the windows.} \label{figgctcls} \end{figure} For enrichment, \texttt{GSEA} function needs three files. The \texttt{gct} file with gene expression, the \texttt{cls} with phenotypes and \texttt{gmt} file with Molecular signature of Gene Sets. There are two options to load \texttt{gmt} file, from examples (Figure~\ref{figGSEA-R}A, MSigDB.gmt button) available into \texttt{canceR} package or from "workspace/MSigDB/" folder (Figure~\ref{figGSEA-R}A, browse button) . In the two ways the \texttt{gmt} files must be from Broad Institute (\url{http://www.broadinstitute.org/gsea/index.jsp}) and has gene Symbols. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/GSEA-R.png} \caption{Gene Set Enrichment Analysis of Uterine Corpus Endometrioid Carcinoma study using "DiseaseFree/Recurred" phenotypes. A, Selecting \texttt{cls}, \texttt{gmt}, \texttt{gmt} files ans setting output folder; B, Specifying the phenotype; C, displying the classes of the selected phenotype; D, Selecting Summary results files of the output and setting FDR; E, displying specific (FDR=0.25) Gene Sets (GS) involved in DiseaseFree phenotype. In this GSEA there is not significant GS involved specifically for Recurred phenotype.} \label{figGSEA-R} \end{figure} \paragraph{Study: Breast Invasive Carcinoma (TCGA, Provosional)} \begin{itemize} \item Cases: All Samples with mRNA expression data (562 samples), \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item MSigDB: c5.bp.v4.0.symbols \item Gene list: 1021.txt file \item Nbr of Samples: 100 \item Phenotype: OS\_STATUS \end{itemize} The Figure~\ref{figbreastGSEA} summarizes the gene sets enrichment analysis of Breast Invasive Carcinoma study using "Deceased/living" phenotypes. The Figure~\ref{figbreastGSEA}C shows 4 \emph{vs} 1 biological process involved respectively into Deceased/Living phenotypes. The size and the genes of appropriate GS are indicated in the second (size) and third (source) columns. The report of significant GS are saved into "/Results/GSEA/name\_of\_folder/". Every GS has a report (\texttt{.txt} file) indicating the gene list and which genes (CORE\_ENRICHMENT column: YES) are involved in the specific phenotype (DECEASED for RESPONSE\_To\_STRESS). The plots of significant GS are save with \texttt{.pdf} format file. In a heat map, expression values are represented as colors for every patient, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest). For more details about result interpretation see user guide of GSEA (\url{http://www. broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html}). \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/breastGSEA.png} \caption{Gene Sets Enrichment Analysis of Breast Invasive Carcinoma (TCGA, Provosional) study using "Deceased/Living" phenotypes. A, Selecting \texttt{cls}, \texttt{gmt}, \texttt{gmt} files ans setting output folder; B, Selecting Summary results files of the output and setting FDR; C, displying specific (FDR=0.25) Gene Sets (GS) involved in Deceased and Living phenotypes.} \label{figbreastGSEA} \end{figure} \subsubsection{\texttt{GSEA-R} Result Interpretation} The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. A positive value indicates correlation with the first phenotype and a negative value indicates correlation with the second phenotype. For continuous phenotypes (time series or PSA level), a positive value indicates correlation with the phenotype profile and a negative value indicates no correlation or inverse correlation with the profile~\cite{Subramanian2005}. The number of enriched gene sets that are significant, as indicated by a false discovery rate (FDR) of less than 25\%. Typically, these are the gene sets most likely to generate interesting hypotheses and drive further research. The number of enriched gene sets with a nominal p-value of less than 1\% and of less than 5\%. The nominal p-value is not adjusted for gene set size or multiple hypothesis testing; therefore, it is of limited value for comparing gene sets. The false discovery rate (FDR) is the estimated probability that a gene set with a given NES represents a false positive finding. For example, an FDR of 25\% indicates that the result is likely to be valid 3 out of 4 times. The GSEA analysis report highlights enrichment gene sets with an FDR of less than 25\% as those most likely to generate interesting hypotheses and drive further research, but provides analysis results for all analyzed gene sets. In general, given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, an FDR cutoff of 25\% is appropriate. However, if you have a small number of samples and use geneset permutation (rather than phenotype permutation) for your analysis, you are using a less stringent assessment of significance and would then want to use a more stringent FDR cutoff, such as 5\%~\cite{Subramanian2005}. \paragraph{Resolved limits} \begin{itemize} \item GSEA does not accept negative values from Profile data. In this case the adding of absolute of less negative value to all the matrix is done. \item GSEA does not accept missing value in \texttt{cls} file. The sampling is done only on existing phenotype information. \item GSEA needs more than 10000 to be robust but url/gcds-r package accept less that 1000 genes. \item The size of samples is between 50 and 100 \item Removing and cleaning the heterogeneity of the data frames of gene expression: space, character, empty boxes and convert them to readable form by GSEA-R function. \end{itemize} \subsection{Linear Modeling of GSEA (\texttt{GSEAlm})} \texttt{GSEAlm} is a function implemented from \texttt{GSEAlm} package~\cite{Oron2008}. \texttt{GSEAlm} function is a Linear Model inference and diagnostics for Gene Set Enrichment Analysis. It uses mRNA expression matrix of gene list, variable(s) from clinical data (phenotype) and Molecular Signature Data Base (MSigDB) that groups genes sharing common biological function, chromosomal location, pathway or regulation~\cite{Subramanian2005}. The result is a prediction of the most up/down regulated gene sets belonging to one of MSigDB collections~\cite{Subramanian2005, Liberzon2011}. The linear model assumes that the mean of the response variable has a linear relationship with the explanatory variable(s). The data and the model are used to calculate a fitted value for each observation (gene expression). It is strongly recommanded to use crude gene expression data and avoid z-score or standardized preprocessing data. we recommend to use the mRNA expression (RNA Seq V2 RSEM). Two options are available to get linear model. In the first option (phenotypes into disease) the model predicts which gene sets are modulated between patients having the same disease and different phenotypes (OS\_STATUS, DFS\_STATUS). In the second option (disease \emph{vs} disease) the model predicts which gene sets are modulated between patient having different disease. The output is saved in \texttt{/Results/GSEAlm/selected-disease-name} folder. The names of output files describe their content. Three files are saved by model. The \texttt{pVal\_MSigDB\_GeneList\_Disease.txt} file lists full results. The \texttt{down/upRegulated\_MSigDB\_GeneList\_Disease.txt} files filter only significant gene sets. The filtered result is merged in one table and displayed in the screen. \paragraph{Results interpretation:} \begin{itemize} \item NA~~NA: The gene set is unrepresentative in the used gene list \item 1~~1: No significant different of mRNA expression between the two phenotypes \end{itemize} \paragraph{Limitations:} \begin{itemize} \item Use categorical phenotype (not numeric) only with TWO classes \item Use crude and not normalised mRNA expression data \item The accuracy of the results is better with high number of genes (1000 - 10000) and 1000 permutations. This request takes a while to run. User is recommended to reduce the permutation to 100 instead 1000 and/or reduce the size of gene list. \item If two or more phenotypes are checked, only the latter is taken. \end{itemize} \subsubsection{Which Molecular Signature Data base (MSigDB) for gene list} This function matches genes list with MSigDB files selected from example or directory (Figure~\ref{whichMSigDB}) and computes the mean of matched Genes by Gene Sets as: $$ Mean_{(Gene\in GeneSet)} = \frac{ \sum_{1,1}^{I,J} Gene_i \in GeneSet_j}{\sum_{1}^{I} Gene_i} $$ With I is the number of genes in the list and J is the number of Gene Sets in the MSigDB file. \\ The returned table indicates the mean of gene number that matched for every Gene Set. For example in the first row (Figure~\ref{whichMSigDB}), there are about 10 genes that mached for every Genes Set in c2.cp.reactome.v4.0.Symbol.gmt. \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/whichMSigDB.png} \caption{Match gene list with MSigDB folder.} \label{whichMSigDB} \end{figure} \subsubsection{get SubMSigDb for genes list} The \texttt{SubMSigDb} is a subset of MSigDB generated from gene list in the expression Set (eSet). This specific subsetting reduces the time of GSEA computing. User needs to run before \texttt{eSet} from phenoTest menu. \subsubsection{\texttt{GSEAlm}: Phenotypes into Disease} \paragraph{Disease Free Status (DFS\_STATUS) into Prostate Cancer:} \begin{itemize} \item Study: Prostate Adenocarcinoma (TCGA, Provisional) \item Cases: All Samples with mRNA expression data (246 samples), \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item MSigDB: c2.cp.reactome.v4.0.symbols.gmt \item Gene list: 73.txt file \item Phenotype: DFS\_STATUS \item Permutation: 1000 \item pVal: 0.05 \end{itemize} \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/pradGSEAlm.png} \caption{Linear model of GSEA of Prostate Adenocarcinoma (TCGA, Provosional) study using "DiseaseFree/Reccured" phenotypes. A, Selecting MSigDB file; B, Selecting Phenotypes, permutation, and pVal parameters; C, displying down and up regulated gene sets in table.} \label{pradGSEAlm} \end{figure} This example uses Reactome pathways gene sets (Figure~\ref{pradGSEAlm}A, \url{http://www.reactome.org/}) to compare the gene expression of patients with reccured prostate disease versus patient with free prostate disease. The run was done with 1000 permuattion and pVal 0.05 (Figure~\ref{pradGSEAlm}B). The result (Figure~\ref{pradGSEAlm}C) shows only up regulated gene sets from Reactome pathways were observed in disease free patients. \paragraph{Copy Number Cluster Level into Stomach Adenocarcinoma:} \begin{itemize} \item Study: Stomach Adenocarcinoma (TCGA, Nature 2014) \item Cases: All Samples that have mRNA, CNA, and sequencing data (258 samples), \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item MSigDB: c5.bp.reactome.v4.0.symbols.gmt \item Gene list: 73.txt file \item Phenotype: Copy\_Number\_Cluster (Low, High) \item Permutation: 1000 \item pVal: 0.05 \end{itemize} The losses and the gains of DNA can contribute to alterations in the expression of tumor suppressor genes and oncogenes. Therefore, the identification of DNA copy number alterations in tumor genomes may help to discover critical genes associated with cancers and, eventually, to improve therapeutic approaches. In this study we test GSEAlm algorithm with the level of the Copy Number cluster using Biological Process gene sets from Reactome. The Figure~\ref{stadGSEAlm}C shows significant down/up regulated gene sets in patients with low level compared to patients with high level of copy number cluster. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/stadGSEAlm.png} \caption{Linear model of GSEA of Stomach Adenocarcinoma (TCGA, Nature 2014) study using "Low/High" level of copy number cluster phenotypes. A, Selecting MSigDB file; B, Selecting Phenotypes, permutation, and pVal parameters; C, displying down and up regulated gene sets in table.} \label{stadGSEAlm} \end{figure} This finding suggests to repeat modeling with copy number alteration profile to predict which biological process (gene set) could be with low level of copy number cluster. The Figure~\ref{stadGSEAlmCNA}D lists gene sets with low level of copy number cluster in the case of stomach adenocarcinoma. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/stadGSEAlmCNA.png} \caption{Linear model of GSEA of Stomach Adenocarcinoma (TCGA, Nature 2014) study using copy number alteration data (GISTIC algorithm) as matrix and copy number cluster level (Low/High) as phenotypes. A, Selecting genetic profile; B, Selecting MSigDB file; c, Selecting Phenotypes, permutation, and pVal parameters; D, displying gene sets with low copy number cluster level.} \label{stadGSEAlmCNA} \end{figure} \newpage \subsubsection{\texttt{GSEAlm}: Disease \emph{vs} Disease} \paragraph{Breast \emph{vs} Prostate Cancers:} In this function, the linear modeling uses the disease type as phenotype for the run. User needs to select two diseases and a gene list. \begin{itemize} \item Studies: Breast Invasive Carcinoma (TCGA, Provisional) versus Prostate Adenocarcinoma (TCGA, Provisional) \item Cases: All Samples with mRNA expression data (959/257 samples), \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item MSigDB: c2.cp.reactome.v4.0.symbols.gmt \item Gene list: 73.txt file \item Phenotype: Diseases type \item Permutation: 1000 \item pVal: 0.05 \item Samples number: 50 \end{itemize} The Figure~\ref{brstprstGSEAlm} shows the results of the linear modeling. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/BrstPrstGSEAlm.png} \caption{Linear model of GSEA of Breast Invasive carcinoma (TCGA, Provisional) and Prostate Adenocarcinoma (TCGA, Provitional) studies. A, Sampling; B, Selecting MSigDB file; c, Selecting Phenotypes, permutation, and pVal parameters; D, displying gene sets upregulated in prostate cancer.} \label{brstprstGSEAlm} \end{figure} %\paragraph{Errors} %\texttt{Error in subset.mask[, r] <- as.numeric(c(subset.class1, subset.class2)) : number of items to replace is not a multiple of replacement length} when there is blank or NA or na value in the cls or gct file. \subsection{Genes Classification using mRNA expression (\texttt{Classification})} The \texttt{Classification} menu displays two functions to rank genes by phenotypes into- and inter-diseases, depending on mRNA expression data. \subsubsection{Genes \emph{vs} Diseases (inter-diseases)} The first classifier is implemented from \texttt{geNetClassifier} package~\cite{Aibar2013}. It uses \texttt{calculateGenesRanking} function which based on Parametric Empirical Bayes method included in \texttt{EBarrays} package~\cite{Kendziorski2003}. This method implements an expectation-maximization (EM) algorithm for gene expression mixture models, which compares the patterns of differential expression across multiple conditions and provides a \emph{posterior probability}. The posterior probability is calculated for each gene-class pair, and represents how much each gene differentiates a class from the other classes; being 1 the best value, and 0 the worst. In this way, the posterior probability allows to find the genes that show significant differential expression when comparing the samples of one class versus all the other samples (One-versus-Rest comparison)~\cite{Aibar2013}. In the following examples, we would like to predict specific modulated genes by cancer type. The features of the runs are: \paragraph{Example 1: Breast \emph{vs} Glioblastoma \emph{vs} Liver \emph{vs} Lung Cancers (Figure~\ref{GenesClass}A):} \begin{itemize} \item Studies: \begin{itemize} \item Breast Invasive Carcinoma (TCGA, Provisional) \item Glioblastoma Multiforme (TCGA, Provisional) \item Liver Hepatocellular Carcinoma (TCGA, Provisional) \item Lung Squamous Cell Carcinoma (TCGA, Provisional) \end{itemize} \item Cases: All Samples with mRNA expression data, \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item Samples: 50 \item Gene list: 223.txt file \item lpThreshold: 0.95 \end{itemize} \paragraph{Example 2: Bladder \emph{vs} Breast \emph{vs} Glioblastoma \emph{vs} Lung \emph{vs} Ovarian \emph{vs} Prostate Cancers (Figure~\ref{GenesClass}A'):} \begin{itemize} \item Studies: \begin{itemize} \item Bladder Urothelial Carcinoma(TCGA, Provisional) \item Breast Invasive Carcinoma (TCGA, Provisional) \item Glioblastoma Multiforme (TCGA, Provisional) \item Lung Adenocarcinoma (TCGA, Provisional) \item Ovarian Serous Cystadenocarcinoma (TCGA, Provisional) \item Prostate Adenocarcinoma (TCGA, Provisional) \end{itemize} \item Cases: All Samples with mRNA expression data, \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item Samples: 50 \item Gene list: 73.txt file \item lpThreshold: 0.95 \end{itemize} The significant genes (dots) are over the threshold of posterior probability (Figure~\ref{GenesClass}B, B'). Their number are listed in the legend. The details are displayed in table (Figure~\ref{GenesClass}C, C') and saved in /Results/Classifier folder. \begin{figure}[!ht] \centering \includegraphics[scale=2]{image/GenesClass.png} \caption{Genes/Disease Classification. A, A' Sampling and lpThreshold choice; B, B' Plot of the posterior probabilities of the genes of Disease classes, ordering the genes according to their rank. brca: breast; gbm: Glioblastoma; lihc: Liver; lusc: Lung; blca: Bladder; luad: Lung; ov: Ovary; prad: Prostate. C, C' The table lists the significant genes ranked by class (disease). The columns are ordered as Gene Symbols, Ranking, Class, PostProb, ExprsMeanDiff, ExprsUpDw. } \label{GenesClass} \end{figure} \newpage \subsubsection{Genes \emph{vs} Phenotypes (intra-disease)} The second function of genes classification is implemented from \texttt{rpart} package~\cite{Therneau2014}. The resulting models can be represented as binary trees with gene expression profile threshold (P53 > 2.56) is in node and classes (Living/Deceased) of selected variable in branch. The root of the tree is the best gene divisor of classes. The goal is to pedict which genes combinaison (P53> 2.56 + CAD < -0.45 + MDM4 > 1.92 lead to 80\% Diseased) could be split classes in two or more groups. The tree is built by the following process: first the single variable is found which gene level splits the data into two groups. The data is separated, and then this process is applied separately to each sub-group, and so on recursively until the subgroups either reach a minimum size or until no improvement can be made~\cite{Therneau2014}. There are 3 methods of splitting rule : \texttt{classification}, \texttt{anova}, and \texttt{Poisson}. The \texttt{Classification} used either Gini or log-likelihood function rules. It is used when there are only two categories into the selected variable. The dependent variable is nominal (factor). At each splitting step, the rule tries to reduce the total impurity of the two son nodes relative to the parent node. In the \texttt{anova}, the variable is numeric and it used to predict the closest value to the true one. The method uses splitting criteria $SS_{T} -(SS_{L} + SS_{R})$, where $SS_{T} = \sum(y_{i}-\bar(y)^{2}$ is the sum of squares for the node, and $SS_{R}, SS_{L}$ are the sums of squares for the right a,d left son respectively. This is equivalent to choosing the split to maximize the between-groups sum-of-squares in a simple analysis of variance~\cite{Therneau2014}. The \texttt{poisson} splitting method attempts to extend \texttt{rpart} models to event rate data. The model in this case is $\lambda=f(x)$, where $\lambda$ is an event rate and $x$ is some set of predictors. \newpage \paragraph{Example 1: Genes classification \emph{vs} OS\_STATUS (Living/Deceased) } \begin{itemize} \item Studies: Breast Invasive Carcinoma (TCGA, Provisional) \item Cases: All Samples with mRNA expression data (526 samples), \item Genetic Profiles: mRNA expression z-score (RNA Seq V2 RSEM) \item Gene list: 223.txt file \item Variable: OS\_STATUS \item Split method: \texttt{class} \end{itemize} In this example, the main clinical endpoint of interest is the Living/Deceased Status of patient having breast cancer. We run classification to predict which genes from 223 that make the difference between deceased and living patient having breast cancer. The tree in Figure~\ref{GenePhenoClass1}B indicates that all deceased patients have the following z-score gene expression profile: RAD51<-1.091, HSPA1A>1.424, JUN> -0.4016. All remain patient are living. \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/GenePhenoClass1.png} \caption{Genes/phenotype Classification. A, Selecting variable and the splitting method; B, Tree result, In any node (ellipse or rectangle) the ratio indicates in numerator/Deceased and in the denominator/Living patients. The condition RAD51 z-score < - 1.091 descriminates 12 cases with 7 deceased and so on. The Deceased cases go to the left an the living cases go to the right. The plot could be saved as svg, png or jpg format and scaled in vertical and horizontal; C, The reading of the tree.} \label{GenePhenoClass1} \end{figure} \paragraph{Example 2: Genes regression \emph{vs} OS\_STATUS (Living/Deceased) } \begin{itemize} \item Studies: Breast Invasive Carcinoma (TCGA, Provisional) \item Cases: All Samples with mRNA expression data (526 samples), \item Genetic Profiles: mRNA expression z-score (RNA Seq V2 RSEM) \item Gene list: 223.txt file \item Variable: OS\_STATUS \item Split method: \texttt{anova} \end{itemize} When we split the same clinical data used in the first example (Living/Deceased) using regression method (\texttt{anova}), we see in the Figure~\ref{GenePhenoClass2}C the same genes and split points with further splitting nodes. For example, the gene HSPA11 has 124 patients with 109 are living. In classification method, all patient have the same predicted value (0.846, Figure~\ref{GenePhenoClass1}C) because the error (misclassification) with and without the split is identical. In the regression context the two predicted values of 16.08 and 1.84 (Figure~\ref{GenePhenoClass2}B) are different. The split has identified a nearly pure subgroup of significant size. \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/GenePhenoClass2.png} \caption{Genes/phenotype Regression. A, Selecting variable and the splitting method; B, The reading of the tree; C, Tree result, In any node (ellipse or rectangle) the ratio indicates in numerator/Deceased and in the denominator/Living patients. } \label{GenePhenoClass2} \end{figure} \paragraph{Example 3: Genes classification \emph{vs} tumor grade (grade1/2/3) } \begin{itemize} \item Studies: Uterine Corpus Endometrioid Carcinoma (TCGA, Nature 2013) \item Cases: All Samples with mRNA expression data (333 samples), \item Genetic Profiles: mRNA expression z-score (RNA Seq V2 RSEM) \item Gene list: 223.txt file \item Variable: Tumor Grade \item Split method: \texttt{class} \end{itemize} \texttt{Classification} method could work with phenotype with more than two classes. In this case, the tumor grades are splitted following this order from left to right Grade1/Grade2/Grade3. The nodes are named by the must frequent grade (Figure~\ref{GenePhenoClass3}). \begin{figure}[!ht] \centering \includegraphics[scale=1]{image/GenePhenoClass3.png} \caption{Genes/phenotype Classification. A, Selecting Tumor Grade as phenotype and the \texttt{class} as splitting method; B, The reading of the tree; C, Tree result, In any node (ellipse or rectangle) the ratio indicates in numerator/Deceased and in the denominator/Living patients. } \label{GenePhenoClass3} \end{figure} \subsection{\texttt{Plots}} Their are two plotting functions.The first one (1 Gene/ 2 Gen. Profiles) plots gene data for specified case and two genetic profiles from the same study. It associates between the two selected genetic profiles. The dialog box allows user to specify: \begin{itemize} \item Layout Skin \begin{itemize} \item cont: This is the default skin. It treats all data as being continuous %\item disc: Repuires a single gene and a single genetic profile. It is not available. \item disc\_cont: Requires two genetic profiles. The first datasetin handled as being discrete data, and the function generates a boxplot with distributions for each level of the discrete genetic profile. \item cna\_mrna\_mut : This skin plots mRNA expression level as function of CNA or DNA methylation status for given gene. Data points are colored respectively by mutation status or CNA and mutation status. \end{itemize} \item Correlation Method \begin{itemize} \item Pearson correlationis is used for parametric distribution (more than 30 samples by genetic profile) \item Spearman correlation is used for non-parametric data. \item Kendall tau is non-parametric test that mesures the rank correlation. \end{itemize} \end{itemize} \paragraph{Example: Association of P53 copy number alteration and mRNA exprssion in glioblastoma} \begin{itemize} \item Study: Glioblastoma Multiforme (TCGA, Provisional) \item Cases: All tumor Samples that have mRNA,CNA and sequencing data (135 samples), \item Genetic Profiles 1: Putative copy-number alterations from GISTIC \item Genetic Profiles 2: mRNA expression (RNA Seq V2 RSEM) \item Gene: P53 \item Skin: cna\_mrna\_mut \item correlation method: \texttt{pearson} \end{itemize} No significant pearson correlation was observed between P53 Copy Number Variation (CNV) and P53 mRNA expression (r=0.23, Figure~\ref{plot1}B). The CNV is ranged into 4 levels that derived from the copy-number analysis algorithms GISTIC or RAE, and indicate the copy-number level per gene. "-2" is a deep loss, possibly a homozygous deletion (Homdel), "-1" is a single-copy loss (heterozygous deletion: Hetloss), "0" is Diploid, "1" indicates a low-level gain, and "2" is a high-level amplification. Note that these calls are putative. \begin{figure}[!ht] \centering \includegraphics[scale=0.5]{image/plot1.png} \caption{P53: CNA status vs mRNA expression. Homdel, homozygous deletion; Hetloss, heterozygous deletion; Diploid, No modification; Gain, ADN amplification.} \label{plot1} \end{figure} The second plot function evaluates the relationship of two genes expression levels in the same study (Figure~\ref{plot2}). \paragraph{Example: MAP2K2 and ABHD17A mRNA expression (RNA Seq V2 RSEM) levels in Uterine Corpus Endometrioid Carcinoma (TCGA, Nature 2013)} \begin{itemize} \item Study: Uterine Corpus Endometrioid Carcinoma (TCGA, Nature 2013) \item Cases: All Samples with mRNA expression data (333 samples), \item Genetic Profiles: mRNA expression (RNA Seq V2 RSEM) \item Genes: MAP2K2 and ABHD17A \item correlation method: \texttt{pearson} \end{itemize} \begin{figure}[!ht] \centering \includegraphics[scale=0.5]{image/plot3.jpeg} \caption{MAP2K2 and ABHD17A mRNA expression levels correlation in Uterine Corpus Endometrioid Carcinoma} \label{plot2} \end{figure} \section{Perspectives} \begin{enumerate} \item Pairwise, Multiple comparative of Genetics Profiles \item search nearest Cancer by Matching Genetic Profile for gene list. \item Onco Query Language \end{enumerate} \bibliography{canceR} \bibliographystyle{abbrv} %tools::compactPDF("/home/mezhoud/CGDS-R/canceR/vignettes/canceR.pdf", gs_quality="ebook") \end{document}