---
author: "Jack Fu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{Title of your vignette}
  %\usepackage[UTF-8]{inputenc}
---

## Overview

MDTS provides the necessary infrastructure to take raw bam files of targeted 
sequencing trios to produce de novo deletion calls of high sensitivity and low 
false positives. Our method benefits tremendously from pooling information from 
across many trios to determine regions of interest, normalize read-depth, and to
filter candidate deletion signals.

MDTS is broken down into the following major steps:

1. Calculating the MDTS bins
2. Determinig the read-depth of the bins for each sample
3. Normalizing the read-depth
4. Creating the Minimum Distance for each trio
5. Call de novo deletions and filter

## Raw Data

Preprocessed sample data is included with the package `MDTS`, but should you be 
interested in the raw files, they can be found in the github repo 
`jmf47/MDTSdata`. It includes:

1. 21 simulated bam files (7 trios)
      + Sequencing read length of 100bp 
      + 1 true de novo deletion in family F4
2. A bw file that includes the 100mer mappability of chr1
3. A pedigree file 
      + Records the trio kinships 
      + Records the file paths to the raw bam files of each sample
      
```{r, echo=F, message=F, warning=F}
library(MDTS); library(BSgenome.Hsapiens.UCSC.hg19)
setwd(system.file("extdata", package="MDTS"))
load('pD.RData')
pD
```

## 1. Calculating MDTS Bins

One innovation of our method is that the bins to determine sequencing read depth
is calculated based on the empirical capture. These bins can be significantly 
different from the standard practice of using probe design coordinates to create
bins. Furthermore, our binning process allows the bins to be smaller in regions 
of high capture, and vice verse in low capture. This dynamic scaling of the bin 
size makes efficient use of varying depths of coverage - allowing us to call 
deletions with finer resolution in areas of higher coverage.

The general approach is first to examine the mapped coverage of a subset of the 
full dataset across the entire genome. Basepairs that are covered passed a 
certain threshold in at least one sample is selected as a proto-region. 
Proto-regions are subdvided into non-overlapping bins such that the median 
number of reads falling into each bin meets a specified level spcified by the 
*medianCoverage* parameter. The choice of *medianCoverage* is a tradeoff between
sensitivity and false positives. A larger *medianCoverage* decreases sensitivity
and false positives. For our publication, we used *medianCoverage=160*, and for 
this example we used *medianCoverage=150*.

Information on GC content and mappability of the bins are also calculated at 
this stage. As a result, MDTS requires a BSgenome object that is consistent with
the reference annotation the bam files were aligned to, as well as a mappability
bigwig that contains mappability information in windows consistent with 
sequencing read length. In our example, we require `BSgenome.Hsapiesn.UCSC.hg19`
and 100mer mappability.

** In general you will need a bw file that contains the information for all
autosomes 1-22 (unlike the bw file only for chr1 included in the example). **
This is the link to some hg19 mappability bw files:
http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability

This portion of the vignette example uses raw data from `MDTSData`. However 
the resulting bins are included in the `data` directory of MDTS as a `.RData`.

```{r, echo=FALSE, message=FALSE}
library(MDTS)
```

```{r, eval=F, warning=F}
library(MDTS); library(BSgenome.Hsapiens.UCSC.hg19)
# Using the raw data from MDTSData
devtools::install_github("jmf47/MDTSData")
setwd(system.file("data", package="MDTSData")) 

# Importing the pedigree file that includes information on where to locate the 
# raw bam files
pD <- getMetaData("pD.ped")

# Information on the GC content and mappability to estimate GC and mappability 
# for the MDTS bins
genome <- BSgenome.Hsapiens.UCSC.hg19; map_file = "chr1.map.bw"

# This command now subsets 5 samples to determine MDTS bins
# pD is the metaData matrix from getMetaData()
# n is the number of samples to examine to calculate the bins
# readLength is the sequencing read length
# minimumCoverage is the minimum read depth for a location to be included 
#     in a proto region
# medianCoverage is the median number of reads across the n samples in a bin
bins <- calcBins(metaData=pD, n=5, readLength=100, minimumCoverage=5, 
                medianCoverage=150, genome=genome, mappabilityFile=map_file)
```

## 2. Calculating coverage of MDTS bins

Given a set of dynamic MDTS bins, we can proceed to calculate the number of 
reads that fall into these bins for the entirety of our sample. We organize the 
number of reads as a matrix, where each column is a sample, and each row 
corresponds to a bin. This portion of the vignette is a continuation of the 
usage of the raw data from `MDTSdata` above. However the resulting `counts` 
matrix is also shipped as a `rda` in `MDTS`.

```{r, eval=F}
# pD is the phenotype matrix
# bins is the previously calculated MDTS bins
# rl is the sequencing read length
counts = calcCounts(pD, bins, rl=100)
```

The MDTS bins, raw counts, and pedigree files are included with `MDTS` and can 
be loaded as follows:
```{r, message=FALSE, warning=F}
load(system.file("extdata", 'bins.RData', package = "MDTS"))
load(system.file("extdata", 'counts.RData', package = "MDTS"))
load(system.file("extdata", 'pD.RData', package = "MDTS"))
```
The MDTS bins are
```{r}
bins
```
The count matrix where each column is a sample and each row is a bin:
```{r}
head(counts)
```

## 3. Normalizing the read-depth

After obtaining the raw read-depth matrix, MDTS calculates a vector M score 
matrix of the same organization - each row corresponds to a row and each column 
a sample. The M score is based on a log2 transformation, followed by median 
polish, and GC and mappability adjust via a loess smoother. The resulting M 
scores have critical values where (0, -1, <-4) are generally consistent with 2, 
1, and 0 copy numbers [barring Copy Number Polymorphisms].

```{r, warning=F}
# counts is the raw read depth of [MDTS bins x samples]
# bins is the previously calculated MDTS bins
mCounts <- normalizeCounts(counts, bins)
```

## 4. Creating the Minimum Distance for each trio

The second advantage of MDTS is the use of Minimum Distances for candidate de 
novo deletion identification. The assumption underlying the development of the 
Minimum Distance is that it is vastly more likely for a deletion to be inherited
than de novo if the proband shares a deletion with one of the parents in the 
trio. Based on that assumption, the Minimum Distance is calculated per family - 
bin combination. For each bin, the Minimum Distance is smallest (in absolute 
terms) difference between the proband's M score and both parents'. 

For example, if the M scores are (-1, -1, 0) for a proband and the 2 parents 
respectively, a situation consistent with an inherited deletion in the proband 
from parent 1, the Minimum Distance of the 2 possible pairwise comparisons is 0.
On the other hand, if the M scores are (-1, 0, 0) for the proband and the 2 
parents, the Minimum Distance stands out at -1, and is consistent with a de novo
deletion. 

To calculate the Minimum Distance of the dataset, MDTS takes as input the M 
score matrix. The output matrix is organized where each row corresponds still to
a bin, but each column now refers to a trio.

```{r, warning=F}
# mCounts is the normalized read depth of [MDTS bins x samples]
# bins is the previously calculated MDTS bins
# pD is the phenotype matrix
md <- calcMD(mCounts, pD)
```

## 5. Call de novo deletions and filter

Using the Minimum Distance matrix calculated above, MDTS uses the tried and true
Circular Binary Segmentation method to infer deletion states. Inferred candidate
de novo deletions are further filtered to remove likely false positive signals 
that arose out of regions of highly variable read-depth (generally indicative of
sequence artifacts or polymorphic events).

```{r, warning=FALSE, message=FALSE, warning=F}
# md is the Minimum Distance of [MDTS bins x trio]
# bins is the previously calculated MDTS bins
# mCounts is the normalized read depth of [MDTS bins x samples]
cbs <- segmentMD(md=md, bins=bins)
denovo <- denovoDeletions(cbs, mCounts, bins)
```

In our example, the output is a single detected de novo deletion in family F4:

```{r}
denovo
```

No signals were picked up from the CNP region or elsewhere.

## 6. SessionInfo

```{r}
sessionInfo()
```