---
title: "Methrix tutorial"
author: "CompEpigen"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    self_contained: yes
    highlight: pygments
vignette: >
  %\VignetteIndexEntry{Methrix tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(tinytex.verbose = TRUE)
```

<img src="logo_large_hexagon.gif" height="100" width="100" style="position:absolute;top:30px;right:0px;" />

## Introduction

`Methrix` provides set of function which allows easy importing of various flavors of bedgraphs generated by methylation callers, and many downstream analysis to be performed on large matrices.

This vignette describes basic usage of the package intended to process several large [bedgraph](https://genome.ucsc.edu/goldenPath/help/bedgraph.html) files in R. In addition, a detailed exemplary complete data analysis with steps from reading in to annotation and differential methylation calling can be found in our [WGBS best practices workflow](https://compepigen.github.io/methrix_docs/articles/methrix.html)


## Overview and usage functions of the package

![](overview.png)

### Installation

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#Installing stable version from BioConductor
BiocManager::install("methrix")

#Installing developmental version from GitHub
BiocManager::install("CompEpigen/methrix")
```

***NOTE***

Installation from BioConductor requires the BioC and R versions to be the newest. This arises from the restrictions imposed by BioConductor community which might cause package incompatibilities with the earlier versions of R (for e.g; R < 4.0). In that case installing from GitHub might be easier since it is much more merciful with regards to versions.

## Reading bedgraph files

`read_bedgraphs` function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for chromosome names, start position and other required fields. There are also presets available to import `bedgraphs` from most common programs such as `Bismark`, `MethylDackel`, and `MethylcTools`.

```{r, message=FALSE, warning=FALSE, eval=TRUE}
#Load library
library(methrix)
```

```{r, eval=FALSE}
#Genome of your preference to work with
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

library(BiocManager)

if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
}
library(BSgenome.Hsapiens.UCSC.hg19) 
```



```{r}
#Example bedgraph files
bdg_files <- list.files(
  path = system.file('extdata', package = 'methrix'),
  pattern = "*bedGraph\\.gz$",
  full.names = TRUE
)

print(basename(bdg_files))

#Generate some sample annotation table
sample_anno <- data.frame(
  row.names = gsub(
    pattern = "\\.bedGraph\\.gz$",
    replacement = "",
    x = basename(bdg_files)
  ),
  Condition = c("cancer", 'cancer', "normal", "normal"),
  Pair = c("pair1", "pair2", "pair1", "pair2"),
  stringsAsFactors = FALSE
)

print(sample_anno)
```

We can import bedgraph files with the function `read_bedgraphs` which reads in the bedgraphs, adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class `methrix` which in turn inherits [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html) class. `methrix` object contains 'methylation' and 'coverage' matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis.

```{r, warning=FALSE, eval=TRUE}
#First extract genome wide CpGs from the desired reference genome
hg19_cpgs <- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
```


```{r, eval=TRUE}
#Read the files 
meth <- methrix::read_bedgraphs(
  files = bdg_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 3,
  U_idx = 4,
  stranded = FALSE,
  zero_based = FALSE, 
  collapse_strands = FALSE, 
  coldata = sample_anno
)
```


Note: Use the argument `pipeline` if your bedgraphs are generated with "Bismark", "MethylDeckal", or "MethylcTools". This will automatically figure out the file formats for you, and you dont have to use the arguments `chr_idx` `start_idx` and so..

```{r, eval=TRUE}
#Typing meth shows basic summary.
meth
```


## HTML QC report
Get basic summary statistics of the `methrix` object with `methrix_report` function which produces an interactive html report
```{r, eval=FALSE}
methrix::methrix_report(meth = meth, output_dir = tempdir())
```

Click [here](https://compepigen.github.io/methrix_docs/articles/raw_report.html) for an example report.

## Filtering
### Remove uncovered loci
Usual task in analysis involves removing uncovered CpGs. i.e, those loci which are not covered across all sample (in other words covered only in subset of samples resulting `NA` for rest of the samples ).
```{r, eval=TRUE}
meth = methrix::remove_uncovered(m = meth)
```


```{r, eval=TRUE}
meth
```

### Remove SNPs
One can also remove CpG sites overlaping with common SNPs based on minor allele frequencies.

```{r, eval=FALSE}
if(!require(MafDb.1Kgenomes.phase3.hs37d5)) {
BiocManager::install("MafDb.1Kgenomes.phase3.hs37d5")} 
if(!require(GenomicScores)) {
BiocManager::install("GenomicScores")} 
```


```{r, eval=TRUE}
library(MafDb.1Kgenomes.phase3.hs37d5)
library(GenomicScores)

meth_snps_filtered <- methrix::remove_snps(m = meth)
```



## Basic operations
### Extract methylation/coverage matrices
```{r}
#Example data bundled, same as the previously generated meth 
data("methrix_data")

#Coverage matrix
coverage_mat <- methrix::get_matrix(m = methrix_data, type = "C")
head(coverage_mat)
```

```{r}
#Methylation matrix
meth_mat <- methrix::get_matrix(m = methrix_data, type = "M")
head(meth_mat)
```

```{r}
#If you prefer you can attach loci info to the matrix and output in GRanges format
meth_mat_with_loci <- methrix::get_matrix(m = methrix_data, type = "M", add_loci = TRUE, in_granges = TRUE)
meth_mat_with_loci
```

### Coverage filter
Furthermore if you prefer you can filter sites based on coverage conditions.
```{r}
#e.g; Retain all loci which are covered at-least in two sample by 3 or more reads
methrix::coverage_filter(m = methrix_data, cov_thr = 3, min_samples = 2)
```

## Subset operations
Subset operations in `methrix` make use of `data.table`s [fast binary search](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html) which is several orders faster than `bsseq` or other similar packages.

### Subset by chromosome
```{r}
#Retain sites only from chromosme chr21
methrix::subset_methrix(m = methrix_data, contigs = "chr21")
```

### Subset by genomic regions
Regions can be data.table or [GRanges](https://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html#granges) format.
```{r}
#e.g; Retain sites only in TP53 loci 
target_loci <- GenomicRanges::GRanges("chr21:27867971-27868103")

print(target_loci)

methrix::subset_methrix(m = methrix_data, regions = target_loci)

```

### Subset by samples
```{r}
methrix::subset_methrix(m = methrix_data, samples = "C1")

#Or you could use [] operator to subset by index
methrix_data[,1]
```

## Summary statsitcis 
### Basic summaries
```{r}
meth_stats <- get_stats(m = methrix_data)
print(meth_stats)
```

```{r}
#Draw mean coverage per sample
plot_stats(plot_dat = meth_stats, what = "C", stat = "mean")
#Draw mean methylation per sample
plot_stats(plot_dat = meth_stats, what = "M", stat = "mean")
```

## PCA
```{r}
mpca <- methrix_pca(m = methrix_data, do_plot = FALSE)

#Plot PCA results
plot_pca(pca_res = mpca, show_labels = TRUE)

#Color code by an annotation
plot_pca(pca_res = mpca, m = methrix_data, col_anno = "Condition")
```

## Plotting
### Methylation
```{r}
#Violin plots
methrix::plot_violin(m = methrix_data)
```

### Coverage
```{r}
methrix::plot_coverage(m = methrix_data, type = "dens")
```

## Converting methrix to BSseq
If you prefer to work with [bsseq](https://bioconductor.org/packages/release/bioc/vignettes/bsseq/inst/doc/bsseq.html#3_using_objects_of_class_bsseq) object, you can generate `bsseq` object from methrix with the `methrix2bsseq`. 
```{r eval=FALSE}
if(!require(bsseq)) {
BiocManager::install("bsseq")}

```

```{r}
library(bsseq)
bs_seq <- methrix::methrix2bsseq(m = methrix_data)

bs_seq
```

## SessionInfo
```{r}
sessionInfo()
```