---
title: "The OGRE user guide"
author: "Sven Berres"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
abstract: >
OGRE calculates overlap between user defined annotated genomic region
datasets. Any regions can be supplied such as public annotations (genes),
genetic variation (SNPs, mutations), regulatory elements (TFBS, promoters,
CpG islands) and basically all types of NGS output from sequencing
experiments. After overlap calculation, key numbers help analyse the extend
of overlaps which can also be visualized at a genomic level. To start OGRE's
GUI use function SHREC() in your R console. Find additional information and
tutorials on [github](https://github.com/svenbioinf/OGRE/).
OGRE package version: `r packageVersion("OGRE")`
output:
rmarkdown::html_document:
highlight: pygments
toc: true
fig_width: 5
vignette: >
%\VignetteIndexEntry{OGRE}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
%\VignetteBuilder{knitr}
---
```{r echo=FALSE, out.width='30%',fig.align='center'}
knitr::include_graphics('./logo.png')
```
```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(tidy = FALSE,
cache = FALSE,
dev = "png",
message = FALSE, error = FALSE, warning = TRUE)
vigDir=getwd()
if(dir.exists("../inst/extdata")){
knitr::opts_knit$set(root.dir = "../inst/extdata")
}else{
knitr::opts_knit$set(root.dir = "../extdata")
}
```
# Installation
Install OGRE using Bioconductor's package installer.
```{r message=FALSE, warning=FALSE,eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("OGRE")
```
Load the OGRE package:
```{r message=FALSE, warning=FALSE}
library(OGRE)
```
# Quick start- load datasets from hard drive
To start up OGRE you have to generate an `OGREDataSet` that is used to store your
datasets and additional information about the analysis that you are conducting.
Query and subjects files can be conveniently stored in their own folders as
GenomicRanges objects in form of stored .rds / .RDS files. We point OGRE to the
correct location by supplying a path for each folder with the character vectors
`queryFolder` and `subjectFolder`. In this vignette we are using lightweight
query and subject example data sets to show OGRE's functionality.
```{r InitializeOGRE, message=TRUE, echo=TRUE,warning=FALSE}
myQueryFolder <- file.path(system.file('extdata', package = 'OGRE'),"query")
mySubjectFolder <- file.path(system.file('extdata', package = 'OGRE'),"subject")
myOGRE <- OGREDataSetFromDir(queryFolder=myQueryFolder,
subjectFolder=mySubjectFolder)
```
By monitoring OGRE's metadata information you can make sure the input paths you
supplied are stored correctly.
```{r checkMetadata, message=TRUE, echo=TRUE,warning=TRUE}
metadata(myOGRE)
```
Query and subject datasets are read by `loadAnnotations()` and stored in the
`OGREDataSet` as `GRanges` objects.
We are going to read in the following example datasets:
* query "genes" (242 Protein coding genes)
* subject "CGI" (365 CpG islands)
* subject "TFBS" (48761 Transcription factor binding sites)
```{r loadAnnotations, message=TRUE, echo=TRUE,warning=TRUE}
myOGRE <- loadAnnotations(myOGRE)
```
OGRE uses your dataset file names to label query and subjects internally, we can
check these names by using the `names()` function since every `OGREDataSet` is a
`GRangesList`.
```{r checkDatasetNames, message=TRUE, echo=TRUE,warning=TRUE}
names(myOGRE)
```
Let's have a look at the stored datasets:
```{r checkGRanges, message=TRUE, echo=TRUE,warning=TRUE}
myOGRE
```
To find overlaps between your query and subject datasets we call `fOverlaps()`.
Internally OGRE makes use of the `GenomicRanges` package to calculate full and
partial overlap as schematically shown.
```{r echo=FALSE, out.width='70%',fig.align='center'}
knitr::include_graphics(file.path(vigDir,'overlap.png'))
```
Any existing subject - query hits are then listed in `detailDT` and stored as a
`data.table`.
```{r fOverlaps, message=TRUE, echo=TRUE,warning=TRUE}
myOGRE <- fOverlaps(myOGRE)
head(metadata(myOGRE)$detailDT,n=2)
```
The summary plot provides us with useful information about the number of
overlaps between your datasets.
```{r sumPlot, echo=TRUE, message=TRUE, warning=TRUE,fig.dim = c(30/2.54, 20/2.54)}
myOGRE <- sumPlot(myOGRE)
metadata(myOGRE)$barplot_summary
```
Using the `Gviz` visualization each query can be displayed with all overlapping
subject elements. Choose labels for all region tracks by supplying a
`trackRegionLabels` vector. Plots are stored in the same location as your
dataset files.
```{r gvizPlot, message=TRUE, echo=TRUE,warning=FALSE}
myOGRE <- gvizPlot(myOGRE,"ENSG00000142168",showPlot = TRUE,
trackRegionLabels = setNames(c("name","name"),c("genes","CGI")))
```
The overlap distribution can be generated with `summarizeOverlap(myOGRE)` and outputs
a table with informative statistics such as minimum, lower quantile, mean, median,
upper quantile, and maximum number of overlaps per region and per dataset.
Overlap distribution can also be displayed as histograms using `plotHist(myOGRE)`
and accessed by `metadata(myOGRE)$hist` and `metadata(myOGRE)$summaryDT`.
Two tables / plots are generated. The first one showing numbers for regions with
and without overlap and the second one showing numbers only for regions with overlap
by excluding all others.
Next, we generate an histogram with the number of TFBS per gene (x-axis, log scale) and the
TFBS frequency (y-axis). When focusing only on regions with overlap, we see that
genes have on average (median) 54 TFBS overlaps (black dashed line).
```{r summarizeOverlap, message=TRUE, echo=TRUE,warning=FALSE}
myOGRE <- summarizeOverlap(myOGRE)
myOGRE <- plotHist(myOGRE)
metadata(myOGRE)$summaryDT
metadata(myOGRE)$hist$TFBS
```
It is possible to create an average coverage profile of all gene-TFBS overlaps,
split in 100 bins, which represent gene bodies of all 242 genes. Both, forward
and reverse coding genes are arranged on the x-Axis and peaks indicate an TFBS
overlap enrichment. Overlap coverage is calculated as the sum of all gene TFBS
overlaps in 5'-3'direction. Generated plots can be
accessed by `metadata(myOGRE)$covPlot$TFBS` and the resulting profile shows
an accumulation of TFBS around gene start and end positions.
```{r coveragePlot, message=TRUE, echo=TRUE,warning=FALSE}
myOGRE <- covPlot(myOGRE)
metadata(myOGRE)$covPlot$TFBS$plot
```
# Quick start- load datasets from AnnotationHub
AnnotationHub offers a wide range of annotated datasets which can be manually
aquired but need some parsing to work with OGRE as detailed in vignette
section [Frequently Asked Questions](#FAQ)(FAQ).
For convenience `addDataSetFromHub()` adds one of the predefined human
datasets of `listPredefinedDataSets()` to an OGREDataSet. Those are taken from
AnnotationHub and are ready to use for OGRE.
We start by creating an empty OGREDataSet and attaching one dataset after another,
whereby one query and two subjects are added. The datasets are now ready for
further analysis.
```{r datasetsFromAnnotationHub,eval=FALSE}
myOGRE <- OGREDataSet()
listPredefinedDataSets()
myOGRE <- addDataSetFromHub(myOGRE,"protCodingGenes","query")
myOGRE <- addDataSetFromHub(myOGRE,"CGI","subject")
myOGRE <- addDataSetFromHub(myOGRE,"TFBS","subject")
names(myOGRE)
```
As you can see, the three datasets proteinCodingGenes, CGI and TFBS are stored within OGRE. You can then continue with overlap analysis using `fOverlaps()`.
# Quick start- load user defined GenomicRanges (GRanges) datasets
To offer more flexibility `addGRanges()` enables the user to attach additional
datasets to OGRE in form of GenomicRanges objects. Again we start by creating an
empty OGREDataSet and generate an example GenomicRanges object which is then added to
your OGREDataSet either as "query" or "subject".
```{r datasetsUserGRanges,eval=FALSE}
myOGRE <- OGREDataSet()
myGRanges <- makeExampleGRanges()
myOGRE <- addGRanges(myOGRE,myGRanges,"query")
```
# Frequently asked questions
## How to add additional datasets from AnnotationHub?
Use `AnnotationHub()` to connect to AnnotationHub. Each dataset is stored
under a unique ID and can be accessed in a list like fashion i.e. `aH[["AH5086"]]`.
Queries like `c("GRanges","Homo sapiens", "CpG")` enable browsing through datasets.
In this case we are searching for human CpG islands ranges stored as GenomicRanges
objects. For more information refer to `?AnnotationHub`
To make those datasets compatible with OGRE additional parsing is needed as
stated in [How to add custom GenomicRanges datasets?]
```{r additionalAnnotationHub,eval=FALSE}
aH <- AnnotationHub()
aH[["AH5086"]]
q <- query(aH, c("GRanges","Homo sapiens", "CpG"))
```
## How to add custom GenomicRanges datasets?
Any GenomicRanges datasets can be added that fulfill basic compatibility requirements.
GenomicRanges objects must:
* Originate from a common genome build i.e. "HG19"
Use `GenomeInfoDb::genome()` on any GenomicRanges object to get/set genome information
* Contain the same set of chromosomes i.e. chr1 != 1 or chrM != MT
Use `GenomeInfoDb::seqinfo()` on any GenomicRanges object to get/set chromosome information
* Contain a "name" and a (unique) "ID" column
Use `S4Vectors::mcols()` on any GenomicRanges object to get/set metadata information
## How to add datasets stored as .gff files?
Datasets from external sources are often stored as .gff (v2&v3) files. Once those
files exist in the query or subject folder and their attribute columns contain
"ID" and "name" information, OGRE tries to load them. Working example .gff
files can be found on [OGRE's github page](https://github.com/svenbioinf/OGRE) in
folder: inst/extdata/gffTest.
```{r addgff,eval=FALSE}
myOGRE <- OGREDataSetFromDir(queryFolder = "pathToQueryFolder",
subjectFolder = "pathToSubjectFolder")
myOGRE <- loadAnnotations(myOGRE)
```
## How to add datasets stored as tabular files?
Datasets stored as tabular files like .csv or .bed may need some preprocessing for
them work with OGRE. We recommend reading them in with `read.table()` or
`data.table::fread()` to obtain a data frame object. After making sure the dataset
complies with the requirements in section [How to add custom GenomicRanges datasets?],
`GenomicRanges::makeGRangesFromDataFrame()` offers a convenient way to generate
GenomicRanges object from data frames.
## What type of overlaps are reported?
Both, partial overlap, where only a part of two (or more) regions are overlapping
and complete overlap, where one region is completely overlapped by another, are
reported.
## How to change dataset names?
OGRE automatically infers dataset names based on your file names. You can either
change your file names before you start OGRE or you can use
`names(myOGRE) <- c("NewName1", "NewName2","...")` after you read in your datasets.
# Session info
```{r sessionInfo}
sessionInfo()
```