---
output:
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
package: TENET.AnnotationHub
title: "Using the TENET.AnnotationHub datasets"
author: Rhie Lab at the University of Southern California
date: "`r Sys.Date()`"
abstract: >
    This vignette describes the basic usage of the TENET.AnnotationHub package,
    which contains datasets for use in the TENET package in the form of
    GenomicRanges objects representing putative enhancer, promoter, and open
    chromatin regions from a variety of sources. See our GitHub repository
    (<https://github.com/rhielab/TENET.AnnotationHub>) for more information and
    to view the manifest files for each of the datasets in this package. All
    included datasets are aligned to the human hg38 genome.
vignette: >
    %\VignetteIndexEntry{Using the TENET.AnnotationHub datasets}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    \usepackage[utf8]{inputenc}
---

\RaggedRight

```{r echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4, tibble.print_max = 4)
```

# Introduction

The TENET.AnnotationHub package contains 9 GenomicRanges datasets for use in the
TENET package, which are all aligned to the hg38 human genome. These datasets
include regions of putative enhancers, promoters, and open chromatin such as the
ENCODE Registry of cCREs V3 datasets, consensus datasets derived from a wide
variety of tissues, cells, and patient samples from sources including Roadmap
Epigenomics ChromHMM annotations, FANTOM5 putative enhancers, the ENCODE DNaseI
Hypersensitive Site Master List, and TCGA tumor samples, as well as datasets
relevant to 10 unique cancer types (BLCA, BRCA, COAD, ESCA, HNSC, KIRP, LIHC,
LUAD, LUSC, and THCA) we personally curated from hundreds of GEO datasets and
relevant TCGA tumor samples (see Mullen et al).

Manifests for the last two categories of datasets, which are derived from a
variety of different sources instead of a single source, are available in the
data-raw subdirectory of the package GitHub repository
(<https://github.com/rhielab/TENET.AnnotationHub/tree/devel/data-raw>). These
manifests detail, among other information, the ENCODE/GEO experiments where the
files originate.

The raw .bed.gz and .narrowPeak.gz files we downloaded/processed, respectively,
are available in a separate TENET.AnnotationHub_files repository at
<https://github.com/rhielab/TENET.AnnotationHub_files>, which also contains
copies of the manifests for the datasets containing these files.

# Acquiring and installing TENET.AnnotationHub

R 4.4 or a newer version is required.

On Ubuntu 24.04, installation was successful in a fresh R environment after
adding the official R Ubuntu repository using the instructions at
<https://cran.r-project.org/bin/linux/ubuntu/> and running the following command
in a terminal:

`sudo apt-get install r-base-dev libcurl4-openssl-dev libfreetype-dev libfribidi-dev libfontconfig-dev libharfbuzz-dev libtiff-dev libxml2-dev libssl-dev`

No dependencies other than R are required on macOS or Windows.

Two versions of this package are available.

To install the stable version from Bioconductor, start R and run:

```{r eval = FALSE}
## Install BiocManager, which is required to install packages from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("TENET.AnnotationHub")
```
The development version containing the most recent updates is available from our
GitHub repository (<https://github.com/rhielab/TENET.AnnotationHub>).

To install the development version from GitHub, start R and run:

```{r eval = FALSE}
## Install prerequisite packages to install the development version from GitHub
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}

BiocManager::install("rhielab/TENET.AnnotationHub")
```

# Loading TENET.AnnotationHub

To load the TENET.AnnotationHub package, start R and run:

```{r message = FALSE}
library(TENET.AnnotationHub)
```

# Using the included datasets

Wrapper functions are provided to allow easy access to all included datasets.
Usage of each wrapper function is demonstrated below.

# Included datasets

## `ENCODE_dELS_regions`

A GRanges object containing regions of candidate cis-regulatory
elements with distal enhancer-like signatures as identified by the ENCODE
SCREEN project. These consist of regions with high H3K27ac and DNase signal,
but low H3K4me3 signal, and located more than 2kb from GENCODE transcription
start sites. **Citation:** ENCODE Project Consortium; Moore JE, Purcaro MJ,
Pratt HE, et al. Expanded encyclopaedias of DNA elements in the human and
mouse genomes. Nature. 2020 Jul;583(7818):699-710. doi:
10.1038/s41586-020-2493-4. Epub 2020 Jul 29. Erratum in: Nature. 2022
May;605(7909):E3. PMID: 32728249; PMCID: PMC7410828.

This dataset consists of 786,756 ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
ENCODE_dELS_regions(metadata = TRUE)

## Retrieve the object itself
ENCODE_dELS_regions()
```

## `ENCODE_pELS_regions`

A GenomicRanges dataset of proximal enhancer-like elements from the ENCODE
A GRanges object containing regions of candidate cis-regulatory
elements with proximal enhancer-like signatures as identified by the ENCODE
SCREEN project. These consist of regions with high H3K27ac and DNase signal,
but low H3K4me3 signal, and located 2kb or less from GENCODE transcription
start sites. **Citation:** ENCODE Project Consortium; Moore JE, Purcaro MJ,
Pratt HE, et al. Expanded encyclopaedias of DNA elements in the human and
mouse genomes. Nature. 2020 Jul;583(7818):699-710. doi:
10.1038/s41586-020-2493-4. Epub 2020 Jul 29. Erratum in: Nature. 2022
May;605(7909):E3. PMID: 32728249; PMCID: PMC7410828.

This dataset consists of 171,292 ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
ENCODE_pELS_regions(metadata = TRUE)

## Retrieve the object itself
ENCODE_pELS_regions()
```

## `ENCODE_PLS_regions`

A GRanges object containing regions of candidate cis-regulatory
elements with promoter-like signatures as identified by the ENCODE SCREEN
project. These consist of regions with high H3K4me3 and DNase signal, and
located within 200 bp of a GENCODE transcription start site. **Citation:**
ENCODE Project Consortium; Moore JE, Purcaro MJ, Pratt HE, et al. Expanded
encyclopaedias of DNA elements in the human and mouse genomes. Nature.
2020 Jul;583(7818):699-710. doi: 10.1038/s41586-020-2493-4. Epub 2020 Jul 29.
Erratum in: Nature. 2022 May;605(7909):E3. PMID: 32728249; PMCID: PMC7410828.

This dataset consists of 40,734 ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
ENCODE_PLS_regions(metadata = TRUE)

## Retrieve the object itself
ENCODE_PLS_regions()
```

## `TENET_10_cancer_panel_enhancer_regions`

A composite GRanges object containing regions of putative
enhancer elements from 10 different cancer types (BRCA, BLCA, COAD, ESCA,
HNSC, KIRP, LIHC, LUAD, LUSC, and THCA) primarily for use in the TENET
Bioconductor package. This dataset is composed of H3K27ac and H3K4me1 peaks
from ChIP-seq datasets collected from Cistrome.org and processed using the
ENCODE pipelines. For additional information on component datasets, see the
manifest file hosted at
<https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_10_cancer_panel_enhancer_regions_manifest.tsv>.

The peaks within each cancer type are reduced, but the final dataset with peaks
across all 10 cancer types is not reduced, and consists of 4,798,784 ranges,
with a single metadata column `TYPE`, which lists which of the ten cancer types
each range represents.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_10_cancer_panel_enhancer_regions(metadata = TRUE)

## Retrieve the object itself
TENET_10_cancer_panel_enhancer_regions()
```

## `TENET_10_cancer_panel_open_chromatin_regions`

A composite GRanges object containing regions of open chromatin
from 10 different cancer types (BRCA, BLCA, COAD, ESCA, HNSC, KIRP, LIHC,
LUAD, LUSC, and THCA) primarily for use in the TENET Bioconductor package.
This dataset is composed of peaks from DNase I and ATAC-seq datasets
collected from Cistrome.org and processed using the ENCODE guidelines, along
with additional TCGA ATAC-seq peaks from cancer samples of these ten types.
For additional information on component datasets, see the manifest file
hosted at
<https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_10_cancer_panel_open_chromatin_regions_manifest.tsv>.

The peaks within each cancer type are reduced, but the final dataset with peaks
across all 10 cancer types is not reduced, and consists of 7,514,441 ranges,
with a single metadata column `TYPE`, which lists which of the ten cancer types
each range represents.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_10_cancer_panel_open_chromatin_regions(metadata = TRUE)

## Retrieve the object itself
TENET_10_cancer_panel_open_chromatin_regions()
```

## `TENET_10_cancer_panel_promoter_regions`

A composite GRanges object containing regions of putative
promoter elements from 10 different cancer types (BRCA, BLCA, COAD, ESCA,
HNSC, KIRP, LIHC, LUAD, LUSC, and THCA) primarily for use in the TENET
Bioconductor package. This dataset is composed of H3K27me3 peaks from
ChIP-seq datasets collected from Cistrome.org and processed using the ENCODE
guidelines. For additional information on component datasets, see the
manifest file hosted at
<https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_10_cancer_panel_promoter_regions_manifest.tsv>.

The peaks within each cancer type are reduced, but the final dataset with peaks
across all 10 cancer types is not reduced, and consists of 2,627,647 ranges,
with a single metadata column `TYPE`, which lists which of the ten cancer types
each range represents.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_10_cancer_panel_promoter_regions(metadata = TRUE)

## Retrieve the object itself
TENET_10_cancer_panel_promoter_regions()
```

## `TENET_consensus_enhancer_regions`

A composite GRanges object containing regions of putative
enhancer elements from a variety of sources, primarily for use in the TENET
Bioconductor package. This dataset is composed of regions of strong enhancers
as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model
based on 98 reference epigenomes, lifted over to the hg38 genome (the
following 4 states represent strong enhancers: 7: Genic enhancer1, 8: Genic
enhancer2, 9: Active Enhancer 1, and 10: Active Enhancer 2), as well as
regions of human permissive enhancers identified by the FANTOM5 project in
phase 1 and phase 2. For additional information on component datasets, see
the manifest file hosted at
<https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv>.
**Citations:** Roadmap Epigenomics Consortium; Kundaje A, Meuleman W,
Ernst J, et al. Integrative analysis of 111 reference human epigenomes.
Nature. 2015 Feb 19;518(7539):317-30. doi: 10.1038/nature14248. PMID:
25693563; PMCID: PMC4530010. Lizio M, Harshbarger J, Shimoji H, et al.
Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome
Biol 16(1), 22 (2015). Abugessaisa I, Ramilowski JA, Lizio M, et al. FANTOM
enters 20th year: expansion of transcriptomic atlases and functional
annotation of non-coding RNAs. Nucleic Acids Res. 2021 Jan
8;49(D1):D892-D898. doi: 10.1093/nar/gkaa1054. PMID: 33211864; PMCID:
PMC7779024.

This dataset consists of 403,602 reduced ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_consensus_enhancer_regions(metadata = TRUE)

## Retrieve the object itself
TENET_consensus_enhancer_regions()
```

## `TENET_consensus_open_chromatin_regions`

A composite GRanges object containing regions of open chromatin
from a variety of sources, primarily for use in the TENET Bioconductor
package. This dataset is composed of DNase I hypersensitive regions from the
master list compiled from 125 cell types by ENCODE, lifted over to the hg38
genome, along with TCGA ATAC-seq peaks from 410 cancer samples of 23 cancer
types. For additional information on component datasets, see the manifest
file hosted at
<https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv>.
**Citations:** ENCODE Project Consortium. An
integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep
6;489(7414):57-74. doi: 10.1038/nature11247. PMID: 22955616; PMCID:
PMC3439153. Thurman RE, Rynes E, Humbert R, et al. The accessible chromatin
landscape of the human genome. Nature. 2012 Sep 6;489(7414):75-82. doi:
10.1038/nature11232. PMID: 22955617; PMCID: PMC3721348. Corces MR, Granja JM,
Shams S, et al. The chromatin accessibility landscape of primary human
cancers. Science. 2018 Oct 26;362(6413):eaav1898. doi:
10.1126/science.aav1898. PMID: 30361341; PMCID: PMC6408149.

This dataset consists of 2,525,827 reduced ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_consensus_open_chromatin_regions(metadata = TRUE)

## Retrieve the object itself
TENET_consensus_open_chromatin_regions()
```

## `TENET_consensus_promoter_regions`

A composite GRanges object containing regions of putative
promoter elements from a variety of sources, primarily for use in the TENET
Bioconductor package. This dataset is composed of regions flanking
transcription start sites as annotated by the Roadmap Epigenomics ChromHMM
expanded 18-state model based on 98 reference epigenomes, lifted over to the
hg38 genome (the following 4 states represent regions flanking transcription
start sites: 1: Active TSS, 2: Flanking TSS, 3: Flanking TSS Upstream, and
4: Flanking TSS Downstream). For additional information on component
datasets, see the manifest file hosted at
<https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv>.
**Citation:** Roadmap Epigenomics Consortium;
Kundaje A, Meuleman W, Ernst J, et al. Integrative analysis of 111 reference
human epigenomes. Nature. 2015 Feb 19;518(7539):317-30. doi:
10.1038/nature14248. PMID: 25693563; PMCID: PMC4530010.

This dataset consists of 361,315 reduced ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_consensus_promoter_regions(metadata = TRUE)

## Retrieve the object itself
TENET_consensus_promoter_regions()
```

# Session info

```{r}
sessionInfo()
```