---
title: "metagene: a package to produce metagene plots"
output:
BiocStyle::html_document:
toc: true
vignette: >
%\VignetteIndexEntry{Introduction to metagene}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r style, echo = FALSE, results = 'asis', message = FALSE}
BiocStyle::markdown()
library(knitr)
```
**Package**: `r Biocpkg("metagene")`
**Modified**: 18 september, 2015
**Compiled**: `r date()`
**License**: `r packageDescription("metagene")[["License"]]`
# Introduction
This package produces metagene-like plots to compare the behavior of
DNA-interacting proteins at selected groups of features. A typical analysis
can be done in viscinity of transcription start sites (TSS) of genes or at any
regions of interest (such as enhancers). Multiple combinations of group of
features and/or group of bam files can be compared in a single analysis.
Bootstraping analysis is used to compare the groups and locate regions with
statistically different enrichment profiles. In order to increase the
sensitivity of the analysis, alignment data is used instead of peaks produced
with peak callers (i.e.: MACS2 or PICS). The metagene package uses bootstrap
to obtain a better estimation of the mean enrichment and the confidence
interval for every group of samples.
This vignette will introduce all the main features of the metagene package.
# Loading the metagene package
```{r libraryLoad, message = FALSE}
library(metagene)
```
# Inputs
## Alignment files (BAM files)
There is no hard limit in the number of BAM files that can be included in an
analysis (but with too many BAM files, memory may become an issue). BAM files
must be indexed. For instance, if you use a file names `file.bam`, a file
named `file.bam.bai` or `file.bai`must be present in the same directory.
The path (relative or absolute) to the BAM files must be in a vector:
```{r bamFiles}
bam_files <- get_demo_bam_files()
bam_files
```
For this demo, we have 2 samples (each with 2 replicates). It is also possible
to use a named vector to add your own names to each BAM files:
```{r namedBamFiles}
named_bam_files <- bam_files
names(named_bam_files) <- letters[seq_along(bam_files)]
named_bam_files
```
Using named BAM files can simplify the use of the metagene helper functions and
the creation of the design.
## Genomic regions
### BED files
To compare custom regions of interest, it is possible to use a list of one or
more BED files.
```{r regionsArgument}
regions <- get_demo_regions()
regions
```
The name of the files (without the extension) will be used to name each groups.
`metagene` also support the
[narrowPeak](https://genome.ucsc.edu/FAQ/FAQformat.html#format12)
and the [broadPeak](https://genome.ucsc.edu/FAQ/FAQformat.html#format13).
### GRanges or GRangesList objects - Regions
As an alternative to a list of BED files, `GRanges` or `GRangesList` objects
can be used.
### Available datasets
Some common datasets are already available with the `metagene` package:
* `promoters_hg19`
* `promoters_hg18`
* `promoters_mm10`
* `promoters_mm9`
```{r showDatasets}
data(promoters_hg19)
promoters_hg19
```
For more details about each datasets, please refer to their documentation
(i.e.:`?promoters_hg19`).
## Design groups
A design group contains a set of BAM files that, when put together, represent
a logical analysis. Furthermore, a design group contains the relationship
between every BAM files present. Samples (with or without replicates) and
controls can be assigned to a same design group. There can be as many groups
as necessary. A BAM file can be assigned to more than one group.
To represent the relationship between every BAM files, design groups must have
the following columns:
* The list of paths to every BAM files related to an analysis.
* One column per group of files (replicates and/or controls).
There is two possible way to create design groups, by reading a file or by
directly creating a design object in R.
### Design groups from a file
Design groups can be loaded into the metagene package by using a text file. As
the relationship between BAM files as to be specified, the following columns
are mandatory:
* First column: The list of paths (absolute or relative) to every BAM files
for all the design groups, the BAM filenames or the BAM names (if a named BAM.
file was used).
* Following columns: One column per design group (replicates and/or controls).
The column can take only three values:
+ 0: ignore file
+ 1: input
+ 2: control
The file must also contain a header. It is recommanded to use `Samples` for the
name of the first column, but the value is not checked. The other columns in
the design file will be used for naming design groups, and must be unique.
```{r designFile}
fileDesign <- system.file("extdata/design.txt", package="metagene")
design <- read.table(fileDesign, header=TRUE, stringsAsFactors=FALSE)
design$Samples <- paste(system.file("extdata", package="metagene"),
design$Samples, sep="/")
kable(design)
```
### Design groups from R
It is not obligatory to use a design file, you can create the design
`data.frame` using your prefered method (as long as the restrictions on the
values mentioned previously are respected).
For instance, the previous design data.frame could have been create directly
in R:
```{r alternateDesign}
design <- data.frame(Samples = c("align1_rep1.bam", "align1_rep2.bam",
"align2_rep1.bam", "align2_rep2.bam", "ctrl.bam"),
align1 = c(1,1,0,0,2), align2 = c(0,0,1,1,2))
design$Samples <- paste0(system.file("extdata", package="metagene"), "/",
design$Samples)
kable(design)
```
# Analysis steps
A typical metagene analysis will consist steps:
* Extraction the read count of every BAM files in selected regions.
* Conversion in coverage.
* Noise removal
* Normalization of the coverage values.
* Table production.
* Data frame production.
* Generation of the metagene plot.
## Minimal analysis
A minimal metagene analysis can be performed in 2 steps:
1. Initialization (the `new` function).
2. `plot`
```{r minimalAnalysis}
regions <- get_demo_regions()
bam_files <- get_demo_bam_files()
# Initialization
mg <- metagene$new(regions = regions, bam_files = bam_files)
# Plotting
mg$plot(title = "Demo metagene plot")
```
As you can see, it is not mandatory to explicitly call each step of the
metagene analysis. For instance, in the previous example, the `plot` function
call the other steps automatically with default values (the next section will
describe the steps in more details).
In this specific case, the plot is messy since by default
`r Biocpkg("metagene")` will produce a curve for each possible combinations of
BAM file and regions. Since we have `r length(bam_files)` BAM files and
`r length(regions)` regions, this gives us
`r length(bam_files) * length(regions)` curves.
If we want more control on how every step of the analysis are performed, we
have to call each functions directly.
## Complete analysis
In order to fully control every step of a metagene analysis, it is important to
understand how a complete analysis is performed. If we are satisfied with the
default values, it is not mandatory to explicitly call every step (as was shown
in the previous section).
### Initialization
During this step, the coverages for every regions specified are extracted from
every BAM files. More specifically, a new `GRanges` is created by combining
all the regions specified with the `regions` param of the `new` function.
```{r initialization}
regions <- get_demo_regions()
bam_files <- get_demo_bam_files()
mg <- metagene$new(regions = regions, bam_files = bam_files)
```
### Producing the table
To produce the table, coverages (produced from Genomics regions (.BED),
Alignment Files (.BAM) and Design Sheet) was treated for noise removal
and normalized. Furthermore, to reduce the computation time during the
following steps, the positions are also binned. Regions, designs, bins,
associated values and orientation of strands are pulled into a data.table
called 'table' and accessible thanks to the getter `get_table`.
We can control the size of the bins with the `bin_count` argument. By
default, a `bin_count` of 100 will be used during this step.
```{r showProduceTable}
mg$produce_table()
```
We can also use the design we produced earlier to remove background signal and
combine replicates:
```{r produceTableDesign}
mg$produce_table(design = design)
```
### Producing the `data.frame`
The metagene plot are produced using the `ggplot2` package, which require a
`data.frame` as input. During this step, the values of the ribbon are
calculated. Metagene uses "bootstrap" to obtain a better estimation of the
mean of enrichment for every positions in each groups.
```{r produceDataFrame}
mg$produce_data_frame()
```
### Plotting
During this step, metagene will use the `data.frame` to plot the calculated
values using `ggplot2`. We show a subset of the regions by using the
`region_names` and `design_names` parameter. The `region_names` correspond to
the names of the regions used during the initialization. The `design_name`
will vary depending if a design was added. If no design was added, this param
correspond to the BAM name or BAM filenames. Otherwise, we have to use the
names of the columns from the design.
```{r showPlot}
mg$plot(region_names = "list1", title = "Demo plot subset")
```
# Manipulating the `metagene` objects
## Getters
Multiple getters functions are available to access the data that is stored in a
`metagene` object.
### `get_table`
To get the data.table containing regions, designs, bins, values at bins and
orientation of strands.
```{r getTable}
mg <- get_demo_metagene()
mg$produce_table()
mg$get_table()
```
### `get_matrices`
To get the data.table as matrices (the former data structure)
```{r getMatrices}
mg <- get_demo_metagene()
mg$produce_table()
m <- mg$get_matrices()
# m$list1$ctrl$input to access to region 'list1' and 'ctrl' design
```
### `get_data_frame`
get_data_frame = function(region_names = NULL, design_names = NULL)
To get the data.frame containing regions and design
```{r getDataFrame}
mg <- get_demo_metagene()
mg$produce_table()
mg$produce_data_frame()
mg$get_data_frame()
```
### `get_params`
The various parameters used during the initialization of the `metagene` object,
the production of the table and the production of the plot are saved and can
be accessed with the `get_params` function:
```{r getParams}
mg <- get_demo_metagene()
mg$get_params()
```
### `get_design`
To get the design that was used to produce the last version of the table,
you can use the `get_design` function:
```{r getDesign}
mg$produce_table(design = get_demo_design())
## Alternatively, it is also possible to add a design without producing the
## table:
mg$add_design(get_demo_design())
mg$get_design()
```
### `get_bam_count`
To get the number of aligned read in a BAM file, you can use the
`get_bam_count`
function:
```{r getBamCount}
mg$get_bam_count(bam_files[1])
```
### `get_regions`
To get all the regions, you can use the `get_regions` function:
```{r getRegions}
mg$get_regions()
```
It is also possible to extract a subset of the regions with the `get_regions`
function:
```{r getRegionsSubset}
mg$get_regions(region_names = c(regions[1]))
```
### `get_raw_coverages`
To get the coverages produced during the initialization of the `metagene`
object, you can use the `get_raw_coverages` function. Please note that to save
space, metagene will only extract the coverages in the regions provided.
```{r getRawCoverages}
coverages <- mg$get_raw_coverages()
coverages[[1]]
length(coverages)
```
It is also possible to extract a subset of all the coverages by providing the
filenames:
```{r getRawCoveragesSubset}
coverages <- mg$get_raw_coverages(filenames = bam_files[1:2])
length(coverages)
```
### `get_normalized_coverages`
The `get_normalized_coverages` function works exactly like the
`get_raw_coverages` function except that it returns the coverages in read per
million aligned (RPM).
## Chaining functions
Every function of metagene (except for the getters) invisibly return a pointer
to itself. This means that the functions can be chained:
```{r showChain}
rg <- get_demo_regions()
bam <- get_demo_bam_files()
d <- get_demo_design()
title <- "Show chain"
mg <- metagene$new(rg, bam)$produce_table(design = d)$plot(title = title)
```
## Copying a metagene object
To copy a metagene object, you have to use the `clone` function:
```{r copyMetagene}
mg_copy <- mg$clone()
```
# Managing large datasets
While `metagene` try to reduce it's memory usage, it's possible to run into
memory limits when working with multiple large datasets (especially when there
is a lot of regions with a large width).
One way to avoid this is to analyse each dataset seperately and then merge just
before producing the metagene plot:
```{r memory, collapse=TRUE}
mg1 <- metagene$new(bam_files = bam_files, regions = regions[1])
mg1$produce_data_frame()
mg2 <- metagene$new(bam_files = bam_files, regions = regions[2])
mg2$produce_data_frame()
```
Then you can extract the `data.frame`s and combine them with `rbind`:
```{r extractDF}
df1 <- mg1$get_data_frame()
df2 <- mg2$get_data_frame()
df <- rbind(df1, df2)
```
Finally, you can use the `plot_metagene` function to produce the metagene plot:
```{r plotMetagene}
p <- plot_metagene(df)
p + ggplot2::ggtitle("Managing large datasets")
```
# Comparing profiles with permutations
It is possible to compare two metagene profiles using the `permutation_test`
function provided with the `metagene` package. Please note that the permutation
tests functionality is still in development and is expected to change in future
releases.
The first step is to decide which profiles we want to compare and extract the
corresponding tables :
```{r extract_subtables}
tab <- mg$get_table()
tab0 <- tab[which(tab$region == "list1"),]
tab1 <- tab0[which(tab0$design == "align1"),]
tab2 <- tab0[which(tab0$design == "align2"),]
```
Then we defined to function to use to compare the two profiles. For this, a
companion package of `metagene` named `r Biocpkg("similaRpeak")` provides
multiple metrics.
For this example, we will prepare a function to calculate the
RATIO_NORMALIZED_INTERSECT between two profiles:
```{r similaRpeak}
library(similaRpeak)
perm_fun <- function(profile1, profile2) {
sim <- similarity(profile1, profile2)
sim[["metrics"]][["RATIO_NORMALIZED_INTERSECT"]]
}
```
We then compare our two profiles using this metric:
```{r calculateRNI}
ratio_normalized_intersect <-
perm_fun(tab1[, .(moy=mean(value)), by=bin]$moy,
tab2[, .(moy=mean(value)), by=bin]$moy)
ratio_normalized_intersect
```
To check if this value is significant, we can permute the two tables that
were used to produce the profile and calculate their
RATIO_NORMALIZED_INTERSECT:
```{r permTest}
permutation_results <- permutation_test(tab1, tab2, sample_size = 50,
sample_count = 1000, FUN = perm_fun)
```
Finally, we check how often the calculated value is greater than the results of
the permutations:
```{r perm_pval}
sum(ratio_normalized_intersect >= permutation_results) /
length(permutation_results)
```