%\VignetteIndexEntry{pwOmics} \documentclass[a4paper]{article} \usepackage[utf8]{inputenc} \usepackage[english]{babel} \usepackage{graphicx} \usepackage{enumitem} \usepackage{epstopdf} \usepackage{csquotes} \usepackage[backend=bibtex,bibencoding=ascii,style=authoryear,sorting=none]{biblatex} \usepackage{subcaption} \usepackage{hyperref} \usepackage[noae]{Sweave} \setlength{\parindent}{0pt} \title{pwOmics - Pathway-based integration of time-series omics data using public database knowledge} \author{Astrid Wachter\\ Medical Statistics, University Medical Center Göttingen, Germany} \begin{document} \maketitle \tableofcontents \section{Introduction} Characterization of biological processes can be performed in great detail with the increased generation of omics data on different functional levels of the cell. Especially interpretation of time-series omics data measured in parallel with different platforms is a complex but promising task, needing consideration of time-independent combination of omics data and additionally time-dependent signaling analysis. As each measurement technique shows a certain bias and has natural limitations in identifying full signaling responses (Yeger-Lotem, E et al., 2009), such cross-platform analysis is an up-to-date approach in order to connect biological implications on different signaling levels. Using diverse data types is expected to provide a deeper understanding of global biological functions and the underlying complex processes (Kholodenko B et al., 2012).\\ This is why computational data analysis tools for interpretation of data from proteomics and transcriptomics measurements in parallel are needed.\\ \textit{pwOmics} is a tool for pathway-based level-specific data comparison and analysis of single time point or time-series omics data measured in parallel. It provides individual analysis workflows for the different omics data sets (see Figure~\ref{fig:analysis}) and in addition enables consensus analysis of omics data. \begin{figure}[htbp] \centering \includegraphics [width=100px, height=300px, angle = 270]{updown.pdf} \caption{\textit{pwOmics} downstream and upstream analysis.} \label{fig:analysis} \end{figure} Up to this point analysis is restricted to human species. In future an expansion of the package is possible dependent on available online open access database information. \section{Databases} \label{sec:Databases} As \textit{pwOmics} is a package for data integration based on prior pathway and transcription knowledge data, it is necessary to define the databases to work with. Three different kinds of databases are necessary to do all analyses steps:\\ \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex] \item Pathway databases:\\ The user can choose from Biocarta (Nishimura, D, 2001), Reactome (Milacic, M et al., 2012; Croft, David et al., 2014), PID (Schaefer, CF et al., 2009) from the National Cancer Institute (NCI) and KEGG (Kanehisa, M et al.,2014, Kanehisa, M and Goto, S, 2000).\\ \item Protein-protein interaction (PPI) database:\\ STRING (Franceschini, A et al., 2013).\\ \item Transcription factor (TF) - target gene databases:\\ The user can choose from ChEA (Lachmann, A et al., 2010), Pazar (Portales-Casamar, E et al., 2009; Portales-Casamar, E et al., 2007) and/or decide to specify an own file e.g. based on a commercial database.\\ \end{enumerate} The pathway database information is used to identify the pathways of the differentially abundant phosphoproteins in the downstream analysis as well as upstream protein regulators of TFs in the upstream analysis. The PPI database STRING (Franceschini, A et al., 2013) was chosen to define the protein net for the consensus analysis. TF - target gene database information is necessary for the TF identification in pathways in the downstream analysis. Additionally the upstream TFs of differentially expressed genes/transcripts are identified in the upstream analysis based on this information.\\ In downstream analysis the pathway gene set information is used, whereas in the upstream analysis also the pathway topology information is exploited.\\ The database information is downloaded internally via \textit{STRINGdb} and \textit{AnnotationHub} (Morgan M et al.) package. In case the author is interested also in the metadata of the pathway database and TF - target database it can be received by \begin{Schunk} \begin{Sinput} > library(pwOmics) > library(AnnotationHub) > ah = AnnotationHub() > #pathway databases > pw = query(ah, "NIH Pathway Interaction Database") > pw[1] \end{Sinput} \begin{Soutput} AnnotationHub with 1 record # snapshotDate(): 2016-08-15 # names(): AH22329 # $dataprovider: NIH Pathway Interaction Database # $species: Homo sapiens # $rdataclass: biopax # $title: BioCarta.owl.gz # $description: BioCarta BioPax file from NCI Pathway Interaction Database # $taxonomyid: 9606 # $genome: hg19 # $sourcetype: BioPax # $sourceurl: ftp://ftp1.nci.nih.gov/pub/PID/BioPAX/BioCarta.owl.gz # $sourcelastmodifieddate: 2009-09-09 # $sourcesize: 338343 # $tags: BioCarta, BioPax, Pathway Interaction Database # retrieve record with 'object[["AH22329"]]' \end{Soutput} \begin{Sinput} > #TF-target databases > chea = query(ah, "ChEA") > chea[1] \end{Sinput} \begin{Soutput} AnnotationHub with 1 record # snapshotDate(): 2016-08-15 # names(): AH22237 # $dataprovider: ChEA # $species: NA # $rdataclass: data.frame # $title: chea-background.zip # $description: ChEA background file, containing transcription factor data t... # $taxonomyid: NA # $genome: NA # $sourcetype: Zip # $sourceurl: http://amp.pharm.mssm.edu/result/kea/chea-background.zip # $sourcelastmodifieddate: 2015-03-09 # $sourcesize: 3655103 # $tags: ChEA, Transcription Factors # retrieve record with 'object[["AH22237"]]' \end{Soutput} \begin{Sinput} > pazar = query(ah, "Pazar") > pazar[1] \end{Sinput} \begin{Soutput} AnnotationHub with 1 record # snapshotDate(): 2016-08-15 # names(): AH22238 # $dataprovider: Pazar # $species: NA # $rdataclass: GRanges # $title: pazar_ABS_20120522.csv # $description: TF - Target Gene file from pazar_ABS_20120522 # $taxonomyid: NA # $genome: NA # $sourcetype: CSV # $sourceurl: http://www.pazar.info/tftargets/pazar_ABS_20120522.csv # $sourcelastmodifieddate: 2012-06-04 # $sourcesize: 120202 # $tags: Pazar, Transcription Factors # retrieve record with 'object[["AH22238"]]' \end{Soutput} \end{Schunk} In case you want to use TF - target gene information which is not part of the mentioned databases but e.g. part of a commercial database, a user-specified file can be used for the analysis. This file should be a `.txt' file with first column transcription factors and second column target gene symbols without a header, e.g.:\\ \begin{center} \fbox{ \begin{tabular}{ l l } GATA-4 & HAMP\\ c-Jun & IL18\\ NF-kappaB & TLR2\\ MYB & LTB\\ FOXO1A & TGFBR1\\ ... & ... \end{tabular} } \end{center} The STRING PPI-information is downloaded automatically while processing and analyzing the data: The \textit{STRINGdb} package (Franceschini, A et al., 2013) is used here. \section{Example dataset} The example dataset used here for demonstration purposes comprises whole genome time course microarray data at time points 0, 1, 4, 8, 13, 18 and 24 hr after stimulation. The complementary phosphoproteomics data was measured at time points 0.25, 1, 4, 8, 13, 18 and 24 hours after stimulation. Preprocessing of data is presupposed with given lists of significant genes and phosphoproteins for each time point as logarithmized expression ratios relative to the time 0 hr controls. \section{Data pre-processing} \label{sec:Datapre} \textit{pwOmics} is a package for secondary data analysis, i.e. it needs already pre-processed data as input for the analysis. The input required is\\ \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex] \item a list of all phosphoprotein IDs measured,\\ \item a list of all gene/transcript IDs measured,\\ \item a list of differentially abundant phosphoproteins + log fold changes,\\ \item a list of differentially expressed genes/transcripts + log fold changes.\\ \end{enumerate} The IDs need to be gene symbols, both for phosphoprotein and gene/transcript data. In case time-series data is analyzed inputs 3. and 4. needs to be specified for each time point. It is absolutely necessary, that all phosphoproteins and genes/transcript in inputs 3. and 4. are part of the lists of all phosphoprotein IDs and all gene/transcript IDs, respectively. \\ The OmicsData object is the format used for data analysis in \textit{pwOmics} package. It contains a list of four main elements:\\ \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex] \item OmicsD - here the omics data set, its description and the results are stored\\ \item PathwayD - here the chosen pathway databases and the generated Biopax model is stored\\ \item TFtargetsD - here the chosen TF-target gene databases and the combined TF-target gene information is stored\\ \item Status - The status variable equals `1' in case not all information needed for the analysis is read in yet and `2' after identification of the first upstream/downstream signaling levels. As the enrichment step is not necessarily part of the analysis and dependent on the pathway database and the TF-target gene database the identification of signaling molecules in further levels might not be successful, the status variable is not used in the further analysis.\\ \end{enumerate} Thus \textit{pwOmics} reads in the omics data set provided by the user to the first element of the OmicsData object and further on stores all the results in this part as well.\\ This is why the user has to provide the omics data set in a special format: A list needs to be generated with a phosphoprotein list named `P' as first element and a gene/transcript list named `G' as second element. These lists contain as first element a data frame with all (unique) phosphoprotein IDs and gene/transcript IDs in the first column, respectively, and as second element a list with data frames for each time point of measurement. The data frames have two columns with the first one containing the differentially abundant/expressed phosphoproteins or genes/transcripts as gene symbols and the second column containing the corresponding log fold changes, e.g.:\\ \begin{Schunk} \begin{Sinput} > data(OmicsExampleData) > OmicsExampleData \end{Sinput} \end{Schunk} Generated as in the following example: \begin{verbatim} OmicsExampleData = list(P = list(allPIDs, list(PIDstp0.25, PIDstp1, PIDstp4, PIDstp8, PIDstp13, PIDstp18, PIDstp24)), G = list(allGIDs, list(GIDstp1, GIDstp4, GIDstp8, GIDstp13, GIDstp18, GIDstp24))) \end{verbatim} \begin{Schunk} \begin{Sinput} > head(OmicsExampleData$P[[2]][[1]]) \end{Sinput} \begin{Soutput} GeneSymbol X15min 1 MRPS17 0.6976049 2 RPS12 -1.0297977 3 SLC3A2 -1.2623327 4 RPL8 0.8304820 5 ACTB -2.4914461 6 ALDOA 0.8637013 \end{Soutput} \end{Schunk} In case the user only wants to analyze omics data from a single time point just one data frame has to be specified.\\ The time points do not have to be the same for phosphoprotein and gene/transcript data and need to be specified when reading in the omics data set separately via the `tp\_prots' and `tp\_genes' parameters of the `readOmics' function. \begin{Schunk} \begin{Sinput} > data_omics = readOmics(tp_prots = c(0.25, 1, 4, 8, 13, 18, 24), + tp_genes = c(1, 4, 8, 13, 18, 24), + OmicsExampleData, + PWdatabase = c("biocarta", "kegg", "nci", + "reactome"), + TFtargetdatabase = c("userspec")) \end{Sinput} \end{Schunk} If data from a single timepoint measurement should be analyzed the user simply assigns the experiment number `1' for these parameters:\\ \begin{verbatim} #for single time point data set: omics = list(P = list(allPIDs, list(PIDs_1)), G = list(allGIDs, list(GIDs_1))) data_omics = readOmics(tp_prots = c(1), tp_genes = c(1), OmicsExampleData, PWdatabase = c("biocarta", "kegg", "nci", "reactome"), TFtargetdatabase = c("userspec")) \end{verbatim} Additionally the selected databases have to be specified.\\ The stored information can be easily accessed via the following functions: \begin{Schunk} \begin{Sinput} > getOmicsTimepoints(data_omics) > head(getOmicsallProteinIDs(data_omics)) > head(getOmicsallGeneIDs(data_omics)) > head(getOmicsDataset(data_omics, writeData = FALSE)[[1]]) \end{Sinput} \end{Schunk} \section{Individual analysis} As shown in Figure~\ref{fig:analysis} the analysis is based on an individual analysis of the phosphoproteomic and the genomic/transcriptomic data. The downstream analysis and upstream analysis are described in the following subsections. Prior to that the database information has to be read in. In a first step the phosphoprotein information about downstream activating/inactivating effects should be read in. \begin{verbatim} data_omics = readPhosphodata(data_omics, phosphoreg) \end{verbatim} The parameter `phosphoreg' should be a tab-separated .txt file with two columns, but no header. The first column should include the HUGO gene symbols of the phosphoproteins with known downstream effects, the second column should include either activating (1) or inactivating (-1) downstream effect coding upon phosphorylation of the corresponding phosphoprotein. So far, there is no possibility to include multiple phosphorylation events individually. However, if no information for a phosphoprotein is available in this file, any up/downregulation match is allowed for the molecules identified in the intersection analyses. In a second step the TF- target information can be made accessible to the `OmicsData' object by: \begin{verbatim} data_omics = readTFdata(data_omics, TF_target_path) \end{verbatim} Via the `TF\_target\_path' parameter a path of a user-specified file specifying transcription factors and corresponding target genes can be given. This should be a tab-separated .txt file with two columns, but no header, giving HUGO gene symbols of transcription factors in the first column and gene symbols of corresponding target genes in a second column. This information can be used additionally to the selected database content. Thirdly, the `readPWdata' function takes the `OmicsData' object with the provided information about the omics data set and the path of the prepared `.RData' genelists from the pathway databases (see Section~\ref{sec:Databases}) or automatically generates the corresponding genelists of the pathway data if `loadgenelists = FALSE'. In this step the automatic definition of internal IDs for different pathway databases is necessary, which are stored in a new biopax model in the `OmicsData' object. \begin{verbatim} data_omicsPW = readPWdata(data_omics, loadgenelists = FALSE) \end{verbatim} As the process of generating genelists with these IDs can take some time - especially for rather big databases such as Reactome (Milacic, M et al., 2012; Croft, D et al., 2014) - the genelists for the different databases are automatically stored in the working directory and can be reused in another analysis when the corresponding path containing these files is given to the `readPWdata' function as loadgenelists parameter. \begin{verbatim} data_omics = readPWdata(data_omics, loadgenelists = "Genelist_reactome.RData") \end{verbatim} \subsection{Downstream analysis} The downstream analysis is starting with the provided phosphoproteomic data (either single time point data or time-series data). The first step is the identification of downstream regulation influence of phosphoproteins based on the file read in by `readPhosphodata' function. \begin{verbatim} data_omics = identifyPR(data_omics) \end{verbatim} The second step is the identification of the pathways in which the differentially abundant phosphoproteins play a role. \textit{pwOmics} performs this searching step on the basis of the provided phosphoproteomic data set and the selected pathway database(s). As this function exploits pathway knowledge, it is necessary to be in the working directory of the genelists that were generate by function `readPWdata'. \begin{verbatim} data_omics = identifyPWs(data_omics) \end{verbatim} Following the workflow the next step is the identification of the transcription factors in these pathways, which is done with the information provided by the chosen TF-target gene database. \begin{verbatim} data_omics = identifyPWTFTGs(data_omics) \end{verbatim} For use of this function the working directory should still contain the previously generated genelists.\\ The results of the downstream analysis can be easily accessed by the following functions: \begin{verbatim} getDS_PWs(data_omics) getDS_TFs(data_omics) getDS_TGs(data_omics) #Access biopax model generated newly on basis of selected #pathway databases: getBiopaxModel(data_omics) \end{verbatim} \subsection{Upstream analysis} The upstream analysis is starting with the provided gene/transcript data (either single time point data or time-series data). It first of all identifies the upstream TFs of the differentially expressed genes/transcripts. This step is done with the provided/selected information of TF-target gene pairs. Given this information, the identification of upstream TFs can be done: \begin{verbatim} data_omics = identifyTFs(data_omics) \end{verbatim} Upstream of the TFs the potential regulator proteins can be identified with the following function: \begin{verbatim} data_omics = identifyRsofTFs(data_omics, noTFs_inPW = 1, order_neighbors = 10) \end{verbatim} The identification of potential upstream regulators is done in the following way:\\ \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex] \item Identification of the pathways the previously identified TFs are part of.\\ \item Selection of pathways according to the user-specified parameter `noTFs\_inPW': Only those pathways are considered in further analysis with at least this number of TFs present in the pathway.\\ \item Upstream regulators are identified for these TFs. This is done by finding first for each TF the pathway neighborhood according to the user-specified parameter `order\_neighbors'. This parameter specifies the order of the identified pathway neighborhood. Then the intersection of all identified neighborhoods for all TFs in a pathway is determined. The resulting pathway node set is defined here as the set of potential regulator proteins.\\ \end{enumerate} In case the pathways under consideration do not have pathway components in the downloaded biopax model, this will be indicated with a warning. This warning can be ignored by the user in regard to the following analysis steps. The results of the upstream analysis can be accessed with the following functions: \begin{verbatim} getUS_TFs(data_omics) getUS_PWs(data_omics) getUS_regulators(data_omics) \end{verbatim} \section{Consensus analysis} The consensus analysis combines the results from upstream and downstream analysis by constituting in particular the comparative analysis of the results of the two different data sets on each cellular level individually. The intersection analysis thus simply compares the results of the separate upstream and downstream analyses. The static consensus analysis allows to determine static consensus graphs for each time point measured in parallel. Finally, the consensus-based dynamic analysis provides the user with one final dynamic network obtained from the data changes over time based on dynamic bayesian network inference. The consensus-based dynamic analysis is only conductable with time-series data sets measured for both phosphoproteome and genome/transcriptome data in parallel. \subsection{Intersection analysis} The intersection analysis of \textit{pwOmics} is a simple comparative analysis of the results of upstream and downstream analysis. Thus, it enables a comparison of single time point data and time-series data, the latter also for non-corresponding time points in the different data sets. The comparison is possible on the three different functional levels considered in this package: On the pathway level, the transcription factor level and gene/transcript level. The parameters `updown' and `phospho' enable to define, whether the direction of regulation should be considered in the comparison and whether the phosphosite information should be considered. The consensus molecule sets (C) for each of the molecular levels (protein level, transcription factor level and gene/transcript level) are defined as: \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex] \item if phosphoprotein downstream activity is induced:\\ $C = (X'\uparrow \cap ~X \uparrow) \cup (X'\downarrow \cap ~X\downarrow)$ \\ \item if phosphoprotein downstream activity is inhibited:\\ $C = (X'\uparrow \cap ~X \downarrow) \cup (X'\downarrow \cap ~X\uparrow)$ \\ \item if no information is available for a specific phosphoprotein:\\ $C = (X'\uparrow \cap ~X \uparrow) \cup (X'\downarrow \cap ~X\downarrow) \cup (X'\uparrow \cap ~X \downarrow) \cup (X'\downarrow \cap ~X\uparrow)$\\ \end{enumerate} with $\uparrow$ = upregulated and $\downarrow$ = downregulated and $X$ = molecules in downstream analysis, $X'$ = molecules in upstream analysis. \begin{verbatim} getProteinIntersection(data_omics, tp_prot = 4, tp_genes = 18, updown = TRUE, phospho = TRUE) getTFIntersection(data_omics, tp_prot = 13, tp_genes = 13, updown = TRUE, phospho = TRUE) getGenesIntersection(data_omics, tp_prot = 24, tp_genes = 24, updown = TRUE, phospho = TRUE) \end{verbatim} These functions not only enable a comparison of the same timepoints on the distinct levels, but for time-series data sets also for non-matching time points:\\ With the time resolution of measuring omics data in most cases being pre-defined by expected signaling changes and financial limitations the potential in the interpretation of the results is strongly confined to the experimental design decisions. Thus, measured signaling changes, which naturally consist of a superposition of diverse time-scales of transcriptional and translational processes and comprehend diverse frequency patterns (Yosef, N and Regev A(2011)), are dependent on the sampling. This means for some of the signaling axes it might be the case, that \begin{itemize} \item changes are not detected at all as their rate is too high, \item some are represented in the data and \item some might be so slow that their change is not considered significant and thus are excluded from analysis. \end{itemize} As the corresponding signaling changes are not expected to be seen at the same time point in phosphoproteome data and gene/transcript data it is necessary to enable also the comparison of non-corresponding time points.\\ The option to compare non-corresponding time points cannot account for the changes not captured during measurement, however, it gives the possibility to consider also the time needed for regulatory control mechanisms in the interpretation of the measurement results.\\ In case the user wants to compare the corresponding time points on the three levels simultaneously he can do so by using the following function: \begin{verbatim} gettpIntersection(data_omics, updown = TRUE, phospho = TRUE) \end{verbatim} The parameters `updown' and `phospho' allow to select whether different regulation directions in the data sets (phosphoproteome vs. transcriptome) should be filtered out in the comparison, and whether the downstream regulation influence of phosphoproteins read in with the `readPhosphodata' function should be considered in the comparison. \subsection{Static consensus analysis} The static consensus analysis goes one step ahead and integrates the results gained from the comparative analysis of the corresponding time points to a consensus net for each time point. The change of this consensus net over time gives a first insight into the changes seen statically at the different time points. However, the static consensus nets do not yet include information gathered over time - as it is the case for the consensus-based dynamic analysis (see Section~\ref{sec:DynCons}). This is why the static consensus analysis is also applicable for single time point measurements.\\ The static consensus analysis is conducted by generation of a Steiner tree (Kleinberg, J and Tardos E, 2006) on basis of consensus proteins and TFs identified in downstream and upstream analysis for each corresponding time point. The underlying graph used is the protein-protein-interaction (PPI) graph from the STRING database reduced to the connected nodes. The consensus proteins and TFs are considered as terminal nodes and are connected via a shortest path-based approximation of the Steiner tree algorithm (Takahashi, H and A Matsuyama, 1980; Sadeghi, A and Fröhlich, H, 2013) across the reduced PPI-STRING-graph. Subsequently knowledge of TF-target gene pairs from the chosen database is used to expand the graph with matching genes/transcripts from both upstream and downstream analysis. In case the consensus graph contains corresponding proteins and genes/transcripts, feedback loops are added automatically. \begin{verbatim} consensusGraphs = staticConsensusNet(data_omics, updown = TRUE, phospho = TRUE) \end{verbatim} The parameters `updown' and `phospho' give the same filtering options as in the intersection analysis. \subsection{Consensus-based dynamic analysis} \label{sec:DynCons} Unlike the static consensus analysis, the consensus-based dynamic analysis takes into consideration also the signaling changes over time by applying dynamic bayesian network inference. The packages used for the consensus-based dynamic analysis are \textit{longitudinal} (Opgen-Rhein, R and Strimmer, K, 2006; Opgen-Rhein R and Strimmer K, 2006) to adjust the format of the data and the actual network inference part is done via the \textit{ebdbNet} (Rau, A et al., 2010) package. This package includes an iterative empirical Bayesian procedure with a Kalman filter estimating the posterior distributions of the network parameters. The defined prior structure of the network is used for a straightforward estimation of hyperparameters via an expectation maximization (EM)-like algorithm and the dimension of the hidden states are determined via the singular value decomposition (SVD) of a block-Hankel matrix.\\ The nodes included into the network inference step are nodes which are part of the static consensus graphs from corresponding time points of the two different measurement types, i.e.\\ \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex] \item proteins identified in upstream and downstream analysis at the same time points,\\ \item Steiner nodes identified via static consensus analysis,\\ \item TFs identified in upstream and downstream analysis at the same time points and\\ \item genes/transcripts identified in upstream and downstream analysis at the same time points.\\ \end{enumerate} However, all measured time points of these nodes are taken into consideration. To apply dynamic network inference a reasonable number of measurements needs to be available. As in most cases of parallel phosphoprotein and gene/transcript measurements only very few corresponding time steps are available it is necessary to artificially introduce additional time steps. This is done by generating smoothing splines applied on the log fold changes provided by the user under the simplifying assumption of a gradual change of signaling between the different time points.\\ This assumption, however, has to be applied consciously and carefully, as there might be higher frequency signaling components superimposed (see for a comprehensive analysis of temporal dynamics of gene expression (Yosef N and Regev A, 2011). In theory a signal has to be sampled two times its maximal frequency in order to be able to reconstruct it exactly from time discrete measurements (Nyquist-Shannon sampling theorem (Shannon, CA and Weaver W, 1949; Nyquist, H, 1928). This means only exact interpretation of those signaling axes are possible that have a frequency which is smaller than half of the sampling frequency. However, under certain preconditions on signal structure and the sampling operator reconstruction of the original signal can be done with a lower sampling rate (Blumensath, T and Michael ED, 2009). This is an interesting starting point for a more comprehensive dynamic analysis of the expected signals and the sampling needed for an extensive data mining of omics data sets measured in parallel, but exceeds the scope of this package. \\ The number of time points generated additionally via smoothing splines is based on simulation results of \textit{ebdbNet} analysis for median area under the curve (AUC) values of receiver operating characteristic (ROC) curves: In their results it was shown that a plateau at around 50 to 75 time points was reached. Thus in \textit{pwOmics} 50 time points are predicted with smoothing splines in order to apply dynamic bayesian network inference on omics data sets measured in parallel.\\ After generation of these time points a block-Hankel matrix of autocovariances is constructed based on the time series abundance/expression data. For this the user needs to provide the \textit{laghankel} parameter, specifying the maximum relevant time lag to be used in constructing the block-Hankel matrix. With a singular value decomposition (see function `hankel' of \textit{ebdbNet} package) the number of hidden states can be determined. Here, the user can specify the \textit{cutoffhankel} parameter to choose the cutoff to determine the desired percent of total variance explained by the singular values. Additional parameters on convergence criteria and iterations performed can be specified. For further details the user is referred to (Rau, A et al., 2010). \begin{verbatim} library(ebdbNet) library(longitudinal) dynInferredNet = consDynamicNet(data_omics, consensusGraphs, laghankel = 3, cutoffhankel = 0.9) \end{verbatim} \section{Time profile clustering} An additional analysis option is clustering of co-regulation patterns over time. It provides information about the signaling molecules with common dynamic behaviour and thus allows to draw conclusions in terms of signaling chronology. Time profile clustering is performed as soft clustering based on the \textit{Mfuzz} package (Futschik, M, 2012). The advantage of this clustering method is that a protein, TF or gene/transcript can be assigned to several clusters, thus reducing the sensitivity to noise and the information loss hard clustering exhibits. It is implemented as fuzzy c-means algorithm (Hathaway, R and Bezdek, J, 1986) and iteratively optimizes the objective function to minimize the variation of objects within the clusters. The user needs to provide a `min.std' threshold parameter if proteins or genes/transcripts with a low standard deviation should be excluded. In addition the maximum number of cluster centers which should be tested in the `minimum distance between cluster centroid test' has to by given. This number is used as initial number to determine the data-specific maximal cluster number based on the number of distinct data points. For more details see (Futschik, M, 2012) and (Schwämmle, V and Jensen ON, 2010). \begin{verbatim} library(Mfuzz) fuzzyClusters = clusterTimeProfiles(dynInferredNet, min.std = 0, ncenters = 12) fuzzyClusters$size fuzzyClusters$cluster \end{verbatim} \section{Identification of signaling axes} Based on the provided data it is also possible to follow the signal through the different cellular levels individually, such that individual signaling axes can be identified. The user can seach for individual axes downstream of a certain phosphoprotein by using the `findSignalingAxes' function and indicating the time point of interest. The time point needs to comply with the ones read in in the `readOmics' function. \begin{verbatim} MAPK1_axis1 = findSignalingAxes(data_omics, phosphoprot = "MAPK1", tpDS = 1) MAPK1_axis18 = findSignalingAxes(data_omics, phosphoprot = "MAPK1", tpDS = 18) BLNK_axis13 = findSignalingAxes(data_omics, phosphoprot = "BLNK", tpDS = 13) \end{verbatim} The output is a list of pathways that are identified for this time point downstream of the phosphoprotein of interest. Sublists give information on the transcription factors in these pathways, the direction of regulation (up/down), the target genes and their matching transcripts from the transcriptome data set. Based on these results the user can produce a summarized results table containing those target genes of the axis, which are matching the transcriptome data, the direction of their regulation coming from the phosphoproteome data set (up/down), the direction of regulation coming from the transcriptome data set (up/down) and the summarized complying information: \begin{verbatim} MAPK1_transcripts = get_matching_transcripts(data_omics, MAPK1_axis1) \end{verbatim} If the user is interested in an overall analysis of all signaling axes in the data sets, he can apply: \begin{verbatim} generate_DSSignalingBase(data_omics, timepoints = c(0.25, 1, 4, 8, 13, 18, 24)) \end{verbatim} with chosen time points. This function creates a folder structure for each consensus protein in the working directory, which is filled with .RData objects and .csv tables that carry the information for the individual time points of interest. \section{Visualization} To complement the results from the different comparisons and analyses (accessible via the `get...' functions) the \textit{pwOmics} package provides visualization functions for the different analyses. The consensus graphs of the static analysis for one or more corresponding time points can be plotted with the following function (see Figure~\ref{fig:Consgraph1} and Figure~\ref{fig:Consgraph2}): \begin{verbatim} plotConsensusGraph(consensusGraphs, data_omics) \end{verbatim} \begin{figure}[htbp] \centering \includegraphics [trim={3cm 2cm 0cm 2cm},clip]{ConsensusGraphs_tp1.pdf} \caption{\textit{pwOmics} static consensus graph: Time point 1 hr.} \label{fig:Consgraph1} \end{figure} \begin{figure}[htbp] \centering \includegraphics [trim={3cm 2cm 0cm 2cm},clip]{ConsensusGraphs_tp24.pdf} \caption{\textit{pwOmics} static consensus graph: Time point 24 hrs.} \label{fig:Consgraph2} \end{figure} The consensus-based dynamic analysis result can be visualized as follows (see Figure~\ref{fig:DynConsgraph}): \begin{verbatim} plotConsDynNet(dynInferredNet, sig.level = 0.7) \end{verbatim} Here, the parameter `sig.level' is the significance level used as cutoff for plotting edges in the network and has to be specified in the range between 0 and 1. Furthermore the user can indicate if unconnected nodes should be removed and provide additional \textit{igraph} (Csardi, G and Nepusz T, 2006) layout parameters.\\ \begin{figure}[htbp] \centering \includegraphics{dynConsensusNet_2.pdf} \caption{\textit{pwOmics} dynamic network graph.} \label{fig:DynConsgraph} \end{figure} However, as the user can access the networks easily the \textit{tkplot} function from the \textit{igraph} R package is a nice interactive graph drawing alternative. In addition plot parameters can be easily changed as the result networks are of class `igraph'. In order to plot the results from time profile clustering (see Figure~\ref{fig:cluster}) the following function can be used: \begin{verbatim} plotTimeProfileClusters(fuzzyClusters) \end{verbatim} The different colours represent the different clusters. The legend is only shown if the number of genes and proteins is not too large. Otherwise the user can easily access this information by having a look to the output of the `clusterTimeProfiles' function which provides information about cluster centers, the number of data points in each cluster of the closest hard clustering, cluster indices, and additional parameters explained in detail in the `mfuzz' documentation. In the legend the attachments `\_g' and `\_p', respectively, indicate, if the node originally derives from phosphoprotein or gene/transcript measurements. \begin{figure}[htbp] \centering \includegraphics{TimeProfiles.pdf} \caption{\textit{pwOmics} time profile clusters.} \label{fig:cluster} \end{figure} It is furthermore possible to investigate temporal correlations of phosphoprotein and transcriptome measurements. \begin{verbatim} temp_correlations(consensusGraphs[[1]], timepointsprot = c(1,4,8,13,18,24), timepointstrans = c(1,4,8,13,18,24), foldername = "~/TempCorr_", trans_sign = "~/transcriptome_signif.txt", trans_sign_names = c("FC_1", "FC_4", "FC_8", "FC_13", "FC_18", "FC_24"), phospho_sign = "~/phospho_anno.txt", phospho_sign_names = c("rat_1", "rat_4","rat_8","rat_13", "rat_18", "rat_24")) \end{verbatim} This function needs as input the static consensus graph with the consensus proteins of interest, the protein and transcriptome measurement time points for which correlation should be investigated, the name of the folder that should be generated as results folder, the phosphoproteome and transcriptome measurement files and the corresponding column names for the selected time points. It is necessary that the gene symbols read in in the `readOmics' function are reused here. In the tab-delimited phosphoproteome measurement file (.txt) additional phosphoprotein information can be stored in columns `Amino.acid', `Position' and `Multiplicity'. This allows to compare correlations for individual phosphorylations. \begin{figure}[htbp] \centering \includegraphics [width = 350pt, height = 280pt, trim={0cm 5cm 0cm 0cm},clip]{TempCorr_MAPK14_downstream.pdf} \caption{Temporal correlation of MAPK14 with transcripts identified to be resulted downstream.} \end{figure} \section{References} \begin{itemize} \item Yeger-Lotem, Esti et al. (2009). “Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity”. In: Nature Genetics 41.3, pp. 316–323. issn: 1546-1718. doi: 10.1038/ng. 337. \item Boris Kholodenko et al. (2012). “Computational Approaches for Analyzing Information Flow in Biological Networks”. In: Science Signaling 5, re1. \item Nishimura, Darryl (2001). “BioCarta”. In: Biotech Software \& Internet Report 2.3, pp. 117–120. issn: 1527-9162. doi: 10.1089/152791601750294344. url: http://online.liebertpub.com/doi/abs/10.1089/152791601750294344 (visited on 01/02/2015). \item Milacic, Marija et al. (2012). “Annotating cancer variants and anti-cancer therapeutics in reactome”. In: Cancers 4.4, pp. 1180–1211. issn: 2072-6694. doi: 10.3390/cancers4041180. \item Croft, David et al. (2014). “The Reactome pathway knowledgebase”. In: Nucleic Acids Research 42.Database issue, pp. D472–477. issn: 1362-4962. doi: 10.1093/nar/gkt1102. \item Schaefer, Carl F. et al. (2009). “PID: the Pathway Interaction Database”. In: Nucleic Acids Research 37.Database issue, pp. D674–679. issn: 1362- 4962. doi: 10.1093/nar/gkn653. \item Kanehisa, Minoru et al. (2014). “Data, information, knowledge and principle: back to metabolism in KEGG”. In: Nucleic Acids Research 42.Database issue, pp. D199–205. issn: 1362-4962. doi: 10.1093/nar/gkt1076. \item Kanehisa, M. and S. Goto (2000). “KEGG: kyoto encyclopedia of genes and genomes”. In: Nucleic Acids Research 28.1, pp. 27–30. issn: 0305-1048. 19 \item Franceschini, Andrea et al. (2013). “STRING v9.1: protein-protein interaction networks, with increased coverage and integration”. In: Nucleic Acids Research 41. Database issue, pp. D808–D815. issn: 0305-1048. doi: 10.1093/ nar / gks1094. url: http : / / www . ncbi . nlm . nih . gov / pmc / articles / PMC3531103/ (visited on 01/02/2015). \item Lachmann, Alexander et al. (2010). “ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments”. In: Bioinformatics (Oxford, England) 26.19, pp. 2438–2444. issn: 1367-4811. doi: 10.1093/bioinformatics/btq466. \item Portales-Casamar, Elodie et al. (2009). “The PAZAR database of gene regu- latory information coupled to the ORCA toolkit for the study of regulatory sequences”. In: Nucleic Acids Research 37.suppl 1, pp. D54–D60. issn: 0305-1048, 1362-4962. doi: 10.1093/nar/gkn783. \item Portales-Casamar, Elodie et al. (2007). “PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation”. In: Genome Biology 8.10, R207. issn: 1465-6906. doi: 10.1186/gb- 2007- 8- 10- r207. \item Morgan M Carlson M, Tenenbaum D and Arora S. AnnotationHub: Client to access AnnotationHub resources. R package version 2.22.0. \item Yosef, Nir and Aviv Regev (2011). “Impulse control: Temporal dynamics in gene transcription”. In: Cell 144.6, pp. 886–896. issn: 0092-8674. doi: 10. 1016/j.cell.2011.02.015. url: http://www.ncbi.nlm.nih.gov/pmc/ articles/PMC3148525/ (visited on 01/03/2015). \item Kleinberg, J and E Tardos (2006). Algorithm Design. Pearson, Boston, MA. \item Takahashi, H and A Matsuyama (1980). “An approximate solution for the Steiner problem in graphs”. In: Math. Jap. 24, pp. 573–577. \item Sadeghi, Afshin and Holger Fröhlich (2013). “Steiner tree methods for optimal sub-network identification: an empirical study”. In: BMC Bioinformatics 14, p. 144. \item Opgen-Rhein, R and K Strimmer (2006). “Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data”. In: Proceedings of the 4th International Workshop on Computational Systems Biology, WCSB 2006, pp. 73–76. \item Rainer Opgen-Rhein, Korbinian Strimmer (2006). “Inferring gene dependency networks from genomic longitudinal data: a functional data approach”. In: REVSTAT – Statistical Journal Volume 4.1, pp. 53–65. issn: 1645-6726. \item Rau, Andrea et al. (2010). “An empirical Bayesian method for estimating biological networks from temporal microarray data”. In: Statistical Applications in Genetics and Molecular Biology 9.1. \item Shannon, CA and W Weaver (1949). The mathematical theory of communication. University of Illinois Press. \item Nyquist, H (1928). “Certain topics in telegraph transmission theory”. In: Transactions of the A. I. E. E. pp. 617–644. \item Blumensath, Thomas and Michael E. Davies (2009). “Sampling Theorems for Signals From the Union of Finite-Dimensional Linear Subspaces”. In: IEEE Transactions on Information Theory 55.4, pp. 1872–1882. \item Futschik, Matthias (2012). Mfuzz: Soft clustering of time series gene expression data. R package version 2.22.0. url: http://itb.biologie.hu-berlin. de/~futschik/software/R/Mfuzz/. \item Hathaway, R and Bezdek, J (1986). "Local convergence of the fuzzy c-means algorithm". In: Pattern Recognition 19, pp. 477-480. \item Schwämmle, Veit and Ole Nørregaard Jensen (2010). “A simple and fast method to determine the parameters for fuzzy c–means cluster analysis”. In: Bioinformatics 26.22, pp. 2841–2848. issn: 1367-4803, 1460-2059. doi: 10. 1093/bioinformatics/btq534. url: http://bioinformatics.oxfordjournals. org/content/26/22/2841 (visited on 01/03/2015). \item Csardi, Gabor and Tamas Nepusz (2006). “The igraph software package for complex network research”. In: InterJournal Complex Systems, p. 1695. url: http://igraph.org. \end{itemize} \newpage \section{Session Information} <<>>= sessionInfo() @ \end{document}