---
title: "deconvR : Simulation and Deconvolution of Omic Profiles"
authors: 
  - name: Irem B. Gunduz, Veronika Ebenal, Altuna Akalin
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r Sys.Date()`"
package: "`r pkg_ver('deconvR')`"
vignette: >
  %\VignetteIndexEntry{deconvRVignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(tidy = FALSE,
                      cache = FALSE,
                      dev = "png",
                      message = FALSE, 
                      error = FALSE,
                      warning = FALSE)
BiocStyle::markdown()
library(knitr)
library(deconvR)
library(doParallel)
library(dplyr)

cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
```

# Introduction

Recent studies associated the differences of cell-type proportions may be
correlated to certain phenotypes, such as cancer. Therefore, the demand for the
development of computational methods to predict cell type proportions increased.
Hereby, we developed `deconvR`, a collection of functions designed for analyzing
deconvolution of the bulk sample(s) using an atlas of reference omic signature
profiles and a user-selected model. We wanted to give users an option to extend
their reference atlas. Users can create new reference atlases using
`findSignatures` or extend their atlas by adding more cell types. Additionally, we
included `BSMeth2Probe` to make mapping whole-genome bisulfite sequencing data
to their probe IDs easier. So users can map WGBS methylation data (as in
**methylKit** or **GRanges** object format) to probe IDs, and the results of
this mapping can be used as the bulk samples in the deconvolution. We also
included a comprehensive DNA methylation atlas of 25 different cell types to use
in the main function `deconvolute`. `deconvolute` allows the user to specify
their desired deconvolution model (non-negative least squares regression,
support vector regression, quadratic programming, or robust linear regression),
and returns a dataframe which contains predicted cell-type proportions of bulk
methylation profiles, as well as  partial R-squared values for each sample.

As an another option, users can generate a simulated table of a desired number
of samples, with either user-specified or random origin proportions using
`simulateCellMix`. `simulateCellMix` returns a second data frame called
`proportions`, which contains the actual cell-type proportions of the simulated
sample. It can be used for testing the accuracy of the deconvolution by
comparing these actual proportions to the proportions predicted by
`deconvolute`.

`deconvolute` returns partial R-squares, to check if deconvolution brings
advantages on top of the basic bimodal profiles. The reference matrix usually
follows a bimodal distribution in the case of methylation, and taking the
average of  the rows of methylation matrix might give a pretty similar profile
to the bulk methylation profile you are trying to deconvolute. If the
deconvolution is advantageous, partial R-squared is expect to be high.

# Installation

The deconvR package can be installed from Bioconductor with:

``` {r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("deconvR")
```

# Data

## Comprehensive Human Methylome Reference Atlas

The comprehensive human methylome reference atlas created by Moss et al. ^[Moss,
J. et al.  (2018). Comprehensive human cell-type methylation atlas reveals
origins of circulating cell-free DNA in health and disease. Nature
communications, 9(1), 1-12. <https://doi.org/10.1038/s41467-018-07466-6>] can be
used as the reference atlas parameter for several functions in this package.
This atlas was modified to remove duplicate CpG loci before being included in
the package as the dataframe. The dataframe is composed of 25 human cell types
and roughly 6000 CpG loci, identified by their Illumina Probe ID. For each cell
type and CpG locus, a methylation value between 0 and 1 is provided. This value
represents the fraction of methylated bases of the CpG locus. The atlas
therefore provides a unique methylation pattern for each cell type and can be
directly used as `reference` in `deconvolute` and `simulateCellMix`, and `atlas`
in `findSignatures`. Below is an example dataframe to illustrate the `atlas`
format.

``` {r, message = FALSE, output.lines=10}
library(deconvR) 

data("HumanCellTypeMethAtlas")
head(HumanCellTypeMethAtlas[,1:5])
```

## Illumina Infinium MethylationEPIC v1.0 B5 Manifest Probes (hg38)

The **GRanges** object `IlluminaMethEpicB5ProbeIDs` contains the Illumina probe
IDs of 400000 genomic loci (identified using the "seqnames", "ranges", and
"strand" values). This object is based off of the Infinium MethylationEPIC v1.0
B5 Manifest data. Unnecessary columns were removed and rows were truncated to
reduce file size before converting the file to a **GRanges** object. It can be
used directly as `probe_id_locations` in `BSmeth2Probe`.

``` {r, message = FALSE, output.lines=10}
data("IlluminaMethEpicB5ProbeIDs")
head(IlluminaMethEpicB5ProbeIDs)
```

# Example Workflow For Whole Genome Bisulfate Sequencing Data

## Expanding Reference Atlas

As mentioned in the introduction section, users can extend their reference atlas
to incorporate new data. Or may combine different reference atlases to construct
a more comprehensive one.  This can be done using the `findSignatures` function.
In this example, since we don't have any additional reference atlas, we will add
simulated data as a new cell type to reference atlas for example purposes.
First, ensure that `atlas` (the signature matrix to be extended) and `samples`
(the new data to be added to the signature matrix) are compliant with the
function requirements. Below illustrates the `samples` format.

``` {r, message = FALSE, output.lines=10}
samples <- simulateCellMix(3,reference = HumanCellTypeMethAtlas)$simulated
head(samples)
```

`sampleMeta` should include all sample names in `samples`, and specify the
origins they should be mapped to when added to `atlas`.

``` {r, message = FALSE, output.lines=10}
sampleMeta <- data.table("Experiment_accession" = colnames(samples)[-1],
                         "Biosample_term_name" = "new cell type")
head(sampleMeta)
```

Use `findSignatures` to extend the matrix.

``` {r, output.lines=10}
extended_matrix <- findSignatures(samples = samples, 
                                 sampleMeta = sampleMeta, 
                                 atlas = HumanCellTypeMethAtlas,
                                 IDs = "IDs")
head(extended_matrix)
```

WGBS methylation data first needs to be mapped to probes using `BSmeth2Probe`
before being deconvoluted. The methylation data `WGBS_data` in `BSmeth2Probe`
may be either a **GRanges** object or a **methylKit** object.

Format of `WGBS_data` as **GRanges** object:

``` {r, message = FALSE, output.lines=10}
load(system.file("extdata", "WGBS_GRanges.rda",
                                     package = "deconvR"))
head(WGBS_GRanges)
```

or as **methylKit** object:

``` {r, message = FALSE, output.lines=10}
head(methylKit::methRead(system.file("extdata", "test1.myCpG.txt", 
                                     package = "methylKit"), 
                         sample.id="test", assembly="hg18", 
                         treatment=1, context="CpG", mincov = 0))
```

`probe_id_locations` contains information needed to map cellular loci to probe IDs

``` {r, message = FALSE, output.lines=10}
data("IlluminaMethEpicB5ProbeIDs")
head(IlluminaMethEpicB5ProbeIDs)
```

Use `BSmeth2Probe` to map WGBS data to probe IDs.

``` {r, output.lines=10}
mapped_WGBS_data <- BSmeth2Probe(probe_id_locations = IlluminaMethEpicB5ProbeIDs, 
                                 WGBS_data = WGBS_GRanges,
                                 multipleMapping = TRUE,
                                 cutoff = 10)
head(mapped_WGBS_data)
```

This mapped data can now be used in `deconvolute`. Here we will deconvolute it
using the previously extended signature matrix as the reference atlas.

``` {r}
deconvolution <- deconvolute(reference = HumanCellTypeMethAtlas, 
                             bulk = mapped_WGBS_data)
deconvolution$proportions
```

## Constructing tissue specific CpG signature matrix

Alternatively, users can set *tissueSpecCpGs* as **TRUE** to construct tissue
based methylation signature matrix by using the reference atlas. Here, we
used simulated samples to construct tissue specific signature matrix since we 
don't have tissue specific samples. 

``` {r, output.lines=10}
data("HumanCellTypeMethAtlas")
exampleSamples <- simulateCellMix(1,reference = HumanCellTypeMethAtlas)$simulated
exampleMeta <- data.table("Experiment_accession" = "example_sample",
                          "Biosample_term_name" = "example_cell_type")
colnames(exampleSamples) <- c("CpGs", "example_sample")
colnames(HumanCellTypeMethAtlas)[1] <- c("CpGs")

signatures <- findSignatures(
  samples = exampleSamples,
  sampleMeta = exampleMeta,
  atlas = HumanCellTypeMethAtlas,
  IDs = "CpGs", K = 100, tissueSpecCpGs = TRUE)

print(head(signatures[[2]]))
```

## Constructing tissue specific DMPs

Alternatively, users can set *tissueSpecDMPs* as **TRUE** to obtain tissue based
DMPs by using the reference atlas. Here, we used simulated samples since we 
don't have tissue specific samples. Note that both *tissueSpecCpGs* and *tissueSpecDMPs*
can't be *TRUE* at the same time.

``` {r, output.lines=10}
data("HumanCellTypeMethAtlas")
exampleSamples <- simulateCellMix(1,reference = HumanCellTypeMethAtlas)$simulated
exampleMeta <- data.table("Experiment_accession" = "example_sample",
                          "Biosample_term_name" = "example_cell_type")
colnames(exampleSamples) <- c("CpGs", "example_sample")
colnames(HumanCellTypeMethAtlas)[1] <- c("CpGs")

signatures <- findSignatures(
  samples = exampleSamples,
  sampleMeta = exampleMeta,
  atlas = HumanCellTypeMethAtlas,
  IDs = "CpGs", tissueSpecDMPs = TRUE)

print(head(signatures[[2]]))
```

# Example Workflow For RNA Sequencing Data

It is possible to use RNA-seq data for deconvolution via **deconvR** package.
Beware that you have to set `IDs` column that contains `Gene names` to run
**deconvR** functions. Therefore you can simulate bulk RNA-seq data via
`simulateCellMix` and, extend RNA-seq reference atlas via `findSignatures`.


# sessionInfo

```{r }
sessionInfo()
stopCluster(cl)
```