---
title: "2 - Introduction: Input data"
output: 
  html_document:
    self_contained: true
    number_sections: no
    theme: flatly
    highlight: tango
    mathjax: default
    toc: true
    toc_float: true
    toc_depth: 2
    css: style.css
  
bibliography: bibliography.bib    
vignette: >
  %\VignetteIndexEntry{"2 - Introduction: Input data"}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, fig.height=6,echo=FALSE, message=FALSE, warning=FALSE, include=TRUE}
library(ELMER.data)
library(ELMER)
data(elmer.data.example)
data(LUSC_meth_refined)
data(LUSC_RNA_refined)
library(DT)
```

<br>

# Input data

A Multi Assay Experiment object [@mae2017] is the input for all main functions of `r BiocStyle::Biocpkg("ELMER")` and can be generated by `createMAE` function. 

To perform `r BiocStyle::Biocpkg("ELMER")` analyses, the [Multi Assay Experiment](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) needs:

- a DNA methylation matrix or SummarizedExperiment object from HM450K or EPIC platform for multiple samples;
- a gene expression matrix or SummarizedExperiment object for the same samples;
- a matrix mapping DNA methylation samples to gene expression samples 
- a matrix with samples metadata (i.e. clinical data, molecular  subtype information). 

If TCGA data are used, the the last two matrices will be automatically generated.
Based on the genome of reference selected, metadata for the DNA methylation probes, such as genomic coordinates, will be added from  [Wanding Zhou annotation](http://zwdzwd.github.io/InfiniumAnnotation) [@zhou2016comprehensive]; 
and metadata for gene annotation  will be added from ensemble database [@yates2015ensembl] using [biomaRt](http://bioconductor.org/packages/biomaRt/)
[@durinck2009mapping].


## DNA methylation data

DNA methylation data feeding to `r BiocStyle::Biocpkg("ELMER")` should be a matrix of DNA methylation
beta ($\beta$) value for samples (column) and probes (row) processed from row HM450K 
array data. If TCGA data is used, processed data from GDC website will be downloaded 
and automatically transformed to the matrix by `r BiocStyle::Biocpkg("ELMER")`. The processed TCGA 
DNA methylation data were calculated as $\frac{M}{(M+U)}$, where M represents the methylated 
allele intensity and U the unmethylated allele intensity. Beta values range from 0 to 1, 
reflecting the fraction of methylated alleles at each CpG in the each tumor; beta values 
close to 0 indicates low levels of DNA methylation and beta values close to 1 
indicates high levels of DNA methylation.

If user have raw HM450K data, these data can be processed by `r BiocStyle::Biocpkg("Methylumi")` 
or `r BiocStyle::Biocpkg("minfi")` generating DNA methylation beta ($\beta$) value for each CpG site 
and multiple samples. The `getBeta` function in `r BiocStyle::Biocpkg("minfi")` can be used to generate a matrix 
of DNA methylation beta ($\beta$) value to feed in `r BiocStyle::Biocpkg("ELMER")`. And we recommend to 
save this matrix as `meth.rda` since  `createMAE` 
can read in files by specifying their path which will help to reduce memory usage.

```{r}
# Example of DNA methylation data input
datatable(Meth[1:10, 1:10], 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = TRUE)
```

## Gene expression data

Gene expresion data feeding to `r BiocStyle::Biocpkg("ELMER")` should be a matrix of gene expression
values for samples (column) and genes (row). Gene expression value can be generated
from different platforms: array or RNA-seq. The row data should be processed by other
software to get gene or transcript level gene expression calls such as mapping by 
[tophat](https://ccb.jhu.edu/software/tophat/index.shtml), 
calling expression value by [cufflink](https://github.com/cole-trapnell-lab/cufflinks), 
[RSEM](https://github.com/deweylab/RSEM) or 
[GenomeStudio](http://www.illumina.com/techniques/microarrays/array-data-analysis-experimental-design/genomestudio.html) for expression array. 
It is recommended to normalize expression data making gene expression 
comparable across samples such as quantile normalization. User can refer TCGA RNA-seq
analysis pipeline to do generate comparable gene expression data. Then transform the 
gene expression values from each sample to the matrix for feeding into `r BiocStyle::Biocpkg("ELMER")`.
If users want to use TCGA data, `r BiocStyle::Biocpkg("ELMER")` has functions to download the 
RNA-Seq Quantification data (HTSeq-FPKM-UQ)  from GDC website and transform the data to the matrix for feeding 
into `r BiocStyle::Biocpkg("ELMER")`. It is recommended to save this matrix as `RNA.rda` since `createMAE` 
can read in files by specifying the path of files which will help to reduce memory usage.

```{r}
# Example of Gene expression data input
datatable(GeneExp[1:10, 1:2], 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = TRUE)
```


## Sample information

Sample information should be stored as a data.frame object containing sample ID, 
group labels (control and experiment). Sample ID and groups labels are required.
Other information for each sample can be added to this data.frame object.
When TCGA data were used, samples information will be automatically generated by 
`createMAE`  function by specifying option `TCGA=TRUE`. A columns name `TN` will
create the groups Tumor and Normal using the following samples to each group:

Tumor samples are:   

*   Primary solid Tumor
*   Recurrent Solid Tumor
*   Primary Blood Derived Cancer - Peripheral Blood
*   Recurrent Blood Derived Cancer - Bone Marrow
*   Additional - New Primary
*   Metastatic
*   Additional Metastatic
*   Human Tumor Original Cells
*   Primary Blood Derived Cancer - Bone Marrow

Normal samples:

* Blood Derived Normal
* Solid Tissue Normal
* Buccal Cell Normal
* EBV Immortalized Normal
* Bone Marrow Normal

```{r, message=FALSE}
library(MultiAssayExperiment)
data <- createMAE(
  exp = GeneExp, 
  met = Meth,
  met.platform = "450K",
  genome = "hg19",
  save = FALSE,
  TCGA = TRUE
)
data
as.data.frame(colData(data)[,c("patient","definition","TN")]) %>% 
  datatable(options = list(scrollX = TRUE,pageLength = 5)) 

# Adding sample information for non TCGA samples
# You should have two objects with one for DNA methylation and 
# one for gene expression. They should have the same number of samples and the names of the 
# sample in the gene expression object and in hte DNA methylation matrix 
# should be the same
not.tcga.exp <- GeneExp # 234 samples
colnames(not.tcga.exp) <- substr(colnames(not.tcga.exp),1,15)
not.tcga.met <- Meth # 268 samples
colnames(not.tcga.met) <- substr(colnames(not.tcga.met),1,15)
# Number of samples in both objects (234)
table(colnames(not.tcga.met) %in% colnames(not.tcga.exp)) 

# Our sample information must have as row names the samples information
phenotype.data <- data.frame(row.names = colnames(not.tcga.exp), 
                             primary = colnames(not.tcga.exp), 
                             group = c(rep("group1", ncol(GeneExp)/2),
                                       rep("group2", ncol(GeneExp)/2)))

data.hg19 <- createMAE(exp = not.tcga.exp, 
                       met =  not.tcga.met, 
                       TCGA = FALSE, 
                       met.platform = "450K",
                       genome = "hg19", 
                       colData = phenotype.data)
data.hg19

# The samples that does not have data for both DNA methylation and Gene exprssion will be removed even for the phenotype data
phenotype.data <- data.frame(row.names = colnames(not.tcga.met), 
                             primary = colnames(not.tcga.met), 
                             group = c(rep("group1", ncol(Meth)/4),
                                       rep("group2", ncol(Meth)/4),
                                       rep("group3", ncol(Meth)/4),
                                       rep("group4", ncol(Meth)/4)))

data.hg38 <- createMAE(exp = not.tcga.exp, 
                       met =  not.tcga.met, 
                       TCGA = FALSE, 
                       save = FALSE,
                       met.platform = "450K",
                       genome = "hg38", 
                       colData = phenotype.data)
data.hg38
as.data.frame(colData(data.hg38)[1:20,]) %>%
  datatable(options = list(scrollX = TRUE,pageLength = 5)) 
```

## Probe information

Probe information is stored as a GRanges object containing the coordinates 
of each probe on the DNA methylation array and names of each probe. 
The default probe information is fetching from  [Wanding Zhou annotation](http://zwdzwd.github.io/InfiniumAnnotation) [@zhou2016comprehensive]

```{r, message=FALSE}
library(SummarizedExperiment, quietly = TRUE)
rowRanges(getMet(data))[1:3,1:8]
```

## Gene information

Gene information is stored as a GRanges object containing coordinates of 
each gene, gene id, gene symbol and gene isoform id. The default gene information 
is the ensembl gene annotation fetched from `r BiocStyle::Biocpkg("biomaRt")` by `r BiocStyle::Biocpkg("ELMER")` 
function.

```{r}
rowRanges(getExp(data))
```

# Multi Assay Experiment object 

A Multi Assay Experiment object from the `r BiocStyle::Biocpkg("MultiAssayExperiment")` 
package is the input for multiple main functions of `r BiocStyle::Biocpkg("ELMER")`. 
It contains the above components and making a Multi Assay Experiment object by `createMAE` function will keep each
component consistent with each other. For example, althougth DNA methylation and gene 
expression matrixes have different rows (probe for DNA methylation and gene id for gene
expression), the column (samples) order should be same in the two matrixes. The `createMAE`
function will keep them consistent when it generates the  Multi Assay Experiment object.


```{r, message=FALSE}
data <- createMAE(exp = GeneExp, 
                  met = Meth,
                  genome = "hg19",
                  save = FALSE,
                  met.platform = "450K",
                  TCGA = TRUE)

# For TGCA data 1-12 represents the patient and 1-15 represents the sample ID (i.e. primary solid tumor samples )
all(substr(colnames(getExp(data)),1,15) == substr(colnames(getMet(data)),1,15))

# See sample information for data
as.data.frame(colData(data)) %>% datatable(options = list(scrollX = TRUE)) 

# See sample names for each experiment
as.data.frame(sampleMap(data)) %>% datatable(options = list(scrollX = TRUE)) 
```

You can also use your own data and annotations to create  Multi Assay Experiment object. 

```{r, message=FALSE}
# NON TCGA example: matrices has different column names
gene.exp <- S4Vectors::DataFrame(sample1.exp = c("ENSG00000141510"=2.3,"ENSG00000171862"=5.4),
                                 sample2.exp = c("ENSG00000141510"=1.6,"ENSG00000171862"=2.3))
dna.met <- S4Vectors::DataFrame(sample1.met = c("cg14324200"=0.5,"cg23867494"=0.1),
                                sample2.met =  c("cg14324200"=0.3,"cg23867494"=0.9))
sample.info <- S4Vectors::DataFrame(primary =  c("sample1","sample2"), 
                                    sample.type = c("Normal", "Tumor"))
sampleMap <- S4Vectors::DataFrame(
  assay = c("Gene expression","DNA methylation","Gene expression","DNA methylation"),
  primary = c("sample1","sample1","sample2","sample2"), 
  colname = c("sample1.exp","sample1.met","sample2.exp","sample2.met"))
mae <- createMAE(exp = gene.exp, 
                 met = dna.met, 
                 sampleMap = sampleMap, 
                 met.platform ="450K",
                 colData = sample.info, 
                 genome = "hg38") 
# You can also use sample Mapping and Sample information tables from a tsv file
# You can use the createTSVTemplates function to create the tsv files
readr::write_tsv(as.data.frame(sampleMap), path = "sampleMap.tsv")
readr::write_tsv(as.data.frame(sample.info), path = "sample.info.tsv")
mae <- createMAE(exp = gene.exp, 
                 met = dna.met, 
                 sampleMap = "sampleMap.tsv", 
                 met.platform ="450K",
                 colData = "sample.info.tsv", 
                 genome = "hg38") 
mae

# NON TCGA example: matrices has same column names
gene.exp <- S4Vectors::DataFrame(sample1 = c("ENSG00000141510"=2.3,"ENSG00000171862"=5.4),
                                 sample2 = c("ENSG00000141510"=1.6,"ENSG00000171862"=2.3))
dna.met <- S4Vectors::DataFrame(sample1 = c("cg14324200"=0.5,"cg23867494"=0.1),
                                sample2=  c("cg14324200"=0.3,"cg23867494"=0.9))
sample.info <- S4Vectors::DataFrame(primary =  c("sample1","sample2"), 
                                    sample.type = c("Normal", "Tumor"))
sampleMap <- S4Vectors::DataFrame(
  assay = c("Gene expression","DNA methylation","Gene expression","DNA methylation"),
  primary = c("sample1","sample1","sample2","sample2"), 
  colname = c("sample1","sample1","sample2","sample2")
)
mae <- createMAE(exp = gene.exp, 
                 met = dna.met, 
                 sampleMap = sampleMap, 
                 met.platform ="450K",
                 colData = sample.info, 
                 genome = "hg38") 
mae
```

# Bibliography