--- title: "" author: "" date: "" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{SIMMS} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} %\SweaveUTF8 --- ## Subnetwork Integration for Multi-modal Signatures (SIMMS) ### Author(s): Syed Haider, Paul C. Boutros ### Maintainer: Syed Haider (Syed.Haider@icr.ac.uk) ### Dated: `r Sys.Date()` ## Table of Contents * [Overview](#Overview) * [Networks database](#Networks-database) * [Molecular profiles and clinical data](#Molecular-profiles-and-clinical-data) * [Example R-script](#Example-R-script) * [Output](#Output) ## Overview SIMMS enables intergration of molecular profiles with functional networks such as protein-protein interaction networks. This document shows how to use SIMMS from a simple use case of integrating mRNA abundance data with pathways derived protein-protein networks, through to more sophisticated use cases of integrating multiple molecular profiles such as mRNA abundance, DNA copy-number and DNA-methylation data. Note, SIMMS requires patient outcome data as dependent variable. In current implementation, it works with survival data `Survival time, Survival status`. Please setup the dataset metadata directories before using SIMMS. There are a couple of directories that one needs to setup and ensure the contents are in correct format. These directories contain: - Networks database (optional) - Molecular profiles and clinical data The structure and format of these inputs is described below. ## Networks database SIMMS rely on functionally or computationally derived networks in order to meaningfully integrate molecular profiles. By default, SIMMS package has a built in database of pathway derived protein-protein interaction networks extracted from (a dated version) Reactome, BioCarta and NCI-PID. This database can be accessed through various SIMMS functions using the parameter setting `networks.database = "default"`. If you would like to create your own networks database for subsequent use with SIMMS, please note that the networks database is organised in two files: pathway_based_sub_networks.txt pathway_based_networks_flattened.txt File `pathway_based_sub_networks.txt` contains all the subnetworks. For instance a hypothetical subnetwork of five genes `ERBB2, EGFR, MKI67, ESR1, PGR` with four interactions `(PGR-ESR1, EGFR-ERBB2, EGFR-PGR, MKI67-ESR1)`, and another hypothetical subnetwork of three genes PDK1, AKT3 and PIK3CA with two interactions `(PDK1-PIK3CA, AKT3-PIK3CA)` using Entrez Gene IDs would look like (tab-delimited): #ID=0001 NAME=Module.1 ID=0001 5241 2099 ID=0001 1956 2064 ID=0001 1956 5241 ID=0001 4288 2099 #ID=0002 NAME=Module.2 ID=0002 5163 5290 ID=0002 10000 5290 Please note that the only Entrez IDs are supported in this file, which should match the Entrez IDs format in the molecular data files (except additional \_at suffix in the molecular data files). File `pathway_based_networks_flattened.txt` is a non-redundant, without self-interactions, quick lookup table of all the pairwise interactions present in the file `pathway_based_sub_networks.txt`. This should not contain the subnetwork name and ID annotations. For example, the contents of this file for the above-mentioned subnetworks would look like (tab-delimited): GeneID1 GeneID2 5241 2099 1956 2064 1956 5241 4288 2099 5163 5290 10000 5290 Please note that only Entrez IDs are supported in this file, which should match the Entrez IDs format in the molecular data files (except additional \_at suffix in the molecular data files). If you have a gene list without known interactions, and would like to try Node-only model (model 2), you can create a subnetwork with random interactions in `pathway_based_sub_networks.txt` as long as all the features (genes) are present and connected. Same goes for `pathway_based_networks_flattened.txt`. The two database files are placed inside a directory you wish to call your database. For example: `MyNetworkDB`, and this directory should be placed under: > SIMMS/inst/programdata/networkdb/ You would need to build and install SIMMS package again and your networks database should be available to use through relevant SIMMS' functions with parameter setting `networks.database = "MyNetworkDB"` ## Molecular profiles and clinical data The main input to SIMMS is molecular and clinical data. Currently tested datatypes are mRNA abundance, DNA copy-number and DNA methylation datasets. mRNA abundance and DNA methylation profiles are expected to be in continuous scale, while DNA copy-number profiles are expected to be gene copy-number calls `(-2,-1,0,1,2) or (-1,0,1)` representing deletions (-ve), neutral (0), and gains/amplifications (+ve) calls. Given that SIMMS support both continuous and categorical/ordinal profiles, any datatype can be used as input. The table below shows the supported/unsupported datatypes and respective examples: ```{r xtable, echo=FALSE, results="asis"} library(xtable); x <- matrix( data = c( "{ *x ∈ ℝ* : *x ≥ 0* }", "0, 2, 1.37, 7.04, 9.68", "log2 mRNA or miRNA abundance", "{ *x ∈ ℝ* : *x ≥ 0* }", "0.1, 0.38, 0.78, 0.22, 0.98", "DNA methylation beta values", "{ *x ∈ ℤ* }", "-2, -1, 0, 1, 2", "copy-number calls. Reference group = 0 (Neutral/Diploid)", "{ *x ∈ ℤ* : *x ≥ 0* }", "0, 1, 2, 3", "mutation data. Reference group = 0 (Wildtype)", "{ *x ∈ ℤ* : *x ≠ 0* }", "-2, -1, 1, 2", "Unsupported due to missing reference group 0", "{ *x ∉ ℝ* }", "WT, Mutant, Gain, Deleted", "Unsupported due to alphabets" ), nrow = 6, ncol = 3, byrow = TRUE, dimnames = list( NULL, c("Data type", "Example data", "Example profile") ) ); tab <- xtable(x); print(tab, type = "html", include.rownames = FALSE); ``` Molecular profiles are strictly tab-delimited, and are expected to be in feature (gene) by sample (patient) matrices. Genes should be represented using Entrez IDs followed by \_at suffix. For mRNA abundance data, if you have pre-processed your data with [BrainArray Entrez CDFs](http://brainarray.mbni.med.umich.edu/brainarray/Database/CustomCDF/genomic_curated_CDF.asp) Entrez CDF, your data should be SIMMS compatible by default. Otherwise, please map your dataset features e.g genes to Entrez IDs followed by \_at suffix (e.g 5290_at). Clinical profiles are strictly tab-delimited, sample (patient) by annotation matrices. The annotation columns must contain survival columns `Survival time, Survival status, Survival time unit`. The row names should match the column names in the molecular profiles' matrices. These columns in clinical annotation file could be called anything as long as these are correctly identified through the metadata file described below. Both molecular and clinical annotations are read by SIMMS through a tab-delimited metadata file called `datasets.txt`. An example of the contents of this file are shown below: dataset mRNA cna methylation annotations survstat survtime survtime.unit Breastdata1 mRNA_abundance.txt CNA.txt DNA_methylation.txt patient_annotation.txt e.os t.os t.os.unit Breastdata2 mRNA_abundance.txt CNA.txt DNA_methylation.txt patient_annotation.txt e.os t.os t.os.unit Here, each row contain an entry for a dataset e.g Breastdata1 and Breastdata2, each having its own directory on the filesystem. mRNA, cna and methylation columns contain the file names of each of the these datatypes. annotations column contains the names of the annotation files for each dataset. All the datatypes and annotation files for a given dataset must be inside the dataset directory (e.g Breastdata1/). survstat, survtime, and survtime.unit columns contain the column names of `Survival status, Survival time and Survival time unit`, respectively. These column names are expected to match the column names in the annotations files. annotation data file must also have a column `Tumour` with possible values of `Yes|No`. All analyses will be limited to those samples with `Yes` in the `Tumour` column. Further subsetting of molecular and annotation datasets is enable using parameter `subset` in SIMMS' functions `?derive.network.features` & `?prepare.training.validation.datasets`. For convenience sake, an example test dataset `testdata` containing metadata file `datasets.txt`, mRNA abundance profiles and respective clinical data is bundled with SIMMS package, and can be found under: > SIMMS/inst/programdata/testdata/ There is no need to drop your datasets inside the package, or need to rebuild the package. You can just point to your datasets through SIMMS package keeping them anywhere on your filesystem. ## Example R-script Lets start with a simple case. We have two mRNA abundance datasets; `Breastdata1, Breastdata2`. We would like to identify prognostic biomarkers using Breastdata1 and validate on Breastdata2. Assuming these two datasets are setup correctly (as described in the previous section), a typical workflow would be: ```{r, results = "hide", message = FALSE, eval = TRUE} options("warn" = -1); # load SIMMS library library("SIMMS"); # path of the data directory containing Breastdata1/ and Breastdata2/ subdirectories data.directory <- SIMMS::get.program.defaults(networks.database = "test")[["test.data.dir"]]; # path of the directory where results will be stored output.directory <- tempdir(); # molecular profiles to be used data.types <- c("mRNA"); # feature selection datasets # name of the dataset directory containing mRNA abundance and annotation profiles of training dataset/s feature.selection.datasets <- c("Breastdata1"); # model training datasets, ideally same as feature selection datasets # name of the dataset directory containing mRNA abundance and annotation profiles of training dataset/s training.datasets <- feature.selection.datasets; # validation datasets # name of the dataset directory containing mRNA abundance and annotation profiles of validation dataset/s validation.datasets <- c("Breastdata2"); # all the possible P value thresholds that one may consider applying to feature selection process. # its the P value of univariate (genewise) Cox model statistics feature.selection.p.thresholds <- c(0.5); # one of the P values above, to be used for subsequent analysis. Not a vector for performance reasons feature.selection.p.threshold <- 0.5; # names of the learning algorithms to be used for the final multivarite model learning.algorithms <- c("backward", "forward", "glm", "randomforest"); # top features to be used for model selection (Backwards elimination, Forward selection, GLM, Random survival forest) # you can try a number of different model selection runs by specifying a vector of top n features top.n.features <- c(5); # truncate survival truncate.survival <- 10; # calculate per feature univariate coefficients in training sets derive.network.features( data.directory = data.directory, output.directory = output.directory, data.types = data.types, feature.selection.datasets = feature.selection.datasets, feature.selection.p.thresholds = feature.selection.p.thresholds, networks.database = "test", # or "default" for Reactome/BioCarta/NCI-PID truncate.survival = truncate.survival ); # calculate per-subnetwork scores in both training and validation sets prepare.training.validation.datasets( data.directory = data.directory, output.directory = output.directory, data.types = data.types, p.threshold = feature.selection.p.threshold, feature.selection.datasets = feature.selection.datasets, datasets = c(training.datasets, validation.datasets), networks.database = "test", # or "default" for Reactome/BioCarta/NCI-PID truncate.survival = truncate.survival ); # iterate over varying top n features, identify and validate survival models for (top.n in top.n.features) { # create classifier assessing univariate prognostic power of subnetwork modules (Train and Validate) ret <- create.classifier.univariate( data.directory = data.directory, output.directory = output.directory, feature.selection.datasets = feature.selection.datasets, feature.selection.p.threshold = feature.selection.p.threshold, training.datasets = training.datasets, validation.datasets = validation.datasets, top.n.features = top.n ); # create a multivariate classifier (Train and Validate) ret <- create.classifier.multivariate( data.directory = data.directory, output.directory = output.directory, feature.selection.datasets = feature.selection.datasets, feature.selection.p.threshold = feature.selection.p.threshold, training.datasets = training.datasets, validation.datasets = validation.datasets, learning.algorithms = learning.algorithms, top.n.features = top.n ); # perform Kaplan-Meier analysis and generate plots create.survivalplots( data.directory = data.directory, output.directory = output.directory, training.datasets = training.datasets, validation.datasets = validation.datasets, top.n.features = top.n, learning.algorithms = learning.algorithms, truncate.survival = truncate.survival, survtime.cutoffs = c(5), main.title = FALSE, KM.plotting.fun = "create.KM.plot", resolution = 100 ); } ``` Please note that a number of parameters such as `output.directory, training.datasets and validations.datasets` are repeatedly passed in various methods. This may look mindless, however, this is because none of the methods return objects that are passed over to the next method/s. The rationale behind this was to keep the memory footprint to bare-minimum by avoiding large R objects passed around. This particular feature also facilitates restarting from any step after deliberate or accidental closure of R environment as everything can be read again from the filesystem. The only compromise is the data footprint on the filesystem. The above R-script will generate two sub-directories under the `output.directory` path: > output/ and > graphs/ ## Output SIMMS creates two output directories; `output/ and graphs/`. The contents of these directories are described below: ### output/ > coxph\_nodes\_\_`$training.datasets`\_\_datatype\_`$data.types`.txt > coxph\_edges\_coef\_\_`$training.datasets`\_\_datatype\_`$data.types`.txt > coxph\_edges\_P\_\_`$training.datasets`\_\_datatype\_`$data.types`.txt Contains per feature (nodes) univariate Cox proportional hazards model results, and per interaction (gene-gene edge) Cox proportional hazards model results > top\_subnets\_score\_\_TRAINING\_`$training.datasets`\_\_model\_`1,2,3`\_\_PV_`$feature.selection.p.thresholds`.txt Contains per subnetwork scores and is used for subsequent feature selection. Models 1, 2 and 3 refers to Node+Interaction, Node-only and Interaction-only models. Nodel-only (Model-2) is generally the most interpretable and robust model > riskscores\_uv\_\_`all_training,all_validation`\_\_TRAINING\_`$training.datasets`\_\_model\_`1,2,3`\_\_top_`$top.n`.txt Contains per sample risk scores for each subnetwork, along with Survival time and Survival status. Sample by risk score matrix. First two columns contain survival data > riskgroups\_uv\_\_`all_training,all_validation`\_\_TRAINING\_`$training.datasets`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains per sample risk groups for each subnetwork, along with Survival time and Survival status. Sample by risk group matrix. First two columns contain survival data. Risk groups are median-dichotomised (training set) risk scores > riskscores\_\_`all_training,all_validation`\_\_TRAINING\_`$training.datasets`\_\_`backward,forward,glm,randomforest`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains per sample risk scores estimated by the multivariate Cox proportional hazards model, along with Survival time and Survival status > riskgroups\_\_`all_training,all_validation`\_\_TRAINING\_`$training.datasets`\_\_`backward,forward,glm,randomforest`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains per sample risk group (along with Survival time and Survival status) generated by median dichotomising risk scores (training set) estimated through the multivariate Cox proportional hazards model > coxph\_\_`all_training,all_validation`\_\_TRAINING\_`$training.datasets`\_\_`backward,forward,glm,randomforest`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains univariate Cox proportional hazards model results with risk group (derived from multivariate Cox model) as the explanatory variable > beta\_\_TRAINING\_`$training.datasets`\_\_`backward,forward`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains the fitted coefficients (betas) of the final model following backward elimination and forward selection (separately) > beta\_\_TRAINING\_`$training.datasets`\_\_`glm`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains the fitted coefficients (betas) of the final model following GLMs (LASSO, Ridge or Elastic Nets) driven feature selection > vimp\_\_TRAINING\_`$training.datasets`\_\_`randomforest`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains the variable importance of random forest. > OOB\_error\_\_TRAINING\_`$training.datasets`\_\_`randomforest`\_\_model\_`1,2,3`\_\_top\_`$top.n`.txt Contains OOB error rate against number of trees to help identify stablisation point for 'rf.ntree' parameter ### graphs/ > KM\_uv\_\_`all_training,all_validation`\_\_TRAINING\_`$training.datasets`\_\_model\_`1,2,3`\_\_`SubnetworkName`\_\_truncated\_`$truncate.survival`.png Kaplan-Meier analysis of risk groups generated for each subnetwork through univariate Cox proportional hazards model. These will be generated if parameter `plot.univariate.data` was set to `TRUE` in `create.survivalplots()` as default value is set to `FALSE`. > KM\_\_`all_training,all_validation`\_\_TRAINING\_`$traning.datasets`\_\_`backward,forward,glm,randomforest`\_\_model\_`1,2,3`\_\_top\_`$top.n`\_\_truncated\_`$truncate.survival`.png Kaplan-Meier analysis of risk groups generated through multivariate Cox proportional hazards model