---
output:
    pdf_document:
        toc: true
        includes:
            in_header: "formatting-fixes.tex"
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
package: TENET.AnnotationHub
title: "Using the TENET.AnnotationHub datasets"
author: Rhie Lab at the University of Southern California
date: "`r Sys.Date()`"
abstract: >
    This vignette describes the basic usage of the TENET.AnnotationHub package,
    which contains datasets for use in the TENET package in the form of
    GenomicRanges objects representing putative enhancer, promoter, and open
    chromatin regions from a variety of sources. See our GitHub repository ()
    for more information and to view the manifest files for each of the
    datasets in this package. All included datasets are aligned to the human
    hg38 genome.
vignette: >
    %\VignetteIndexEntry{Using the TENET.AnnotationHub datasets}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    \usepackage[utf8]{inputenc}
---

\RaggedRight

```{r echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4, tibble.print_max = 4)
```

# Introduction

The TENET.AnnotationHub package contains 9 GenomicRanges datasets for use in the TENET package, which are all aligned to the hg38 human genome. These datasets include regions of putative enhancers, promoters, and open chromatin such as the ENCODE Registry of cCREs V3 datasets, consensus datasets derived from a wide variety of tissues, cells, and patient samples from sources including Roadmap Epigenomics ChromHMM annotations, FANTOM5 putative enhancers, the ENCODE DNaseI Hypersensitive Site Master List, and TCGA tumor samples, as well as datasets relevant to 10 unique cancer types (BLCA, BRCA, COAD, ESCA, HNSC, KIRP, LIHC, LUAD, LUSC, and THCA) we personally curated from hundreds of GEO datasets and relevant TCGA tumor samples (see Mullen et al).
Manifests for the last two categories of datasets, which are derived from a variety of different sources instead of a single source, are available in the data-raw subdirectory of the package GitHub repository (). These manifests detail, among other information, the ENCODE/GEO experiments where the files originate. The raw .bed.gz and .narrowPeak.gz files we downloaded/processed, respectively, are available in a separate TENET.AnnotationHub_files repository at , which also contains copies of the manifests for the datasets containing these files.

# Acquiring and installing TENET.AnnotationHub

R 4.4 or a newer version is required.

On Ubuntu 22.04, installation was successful in a fresh R environment after adding the official R Ubuntu repository using the instructions at and running the following command in a terminal:

`sudo apt-get install r-base-core r-base-dev libcurl4-openssl-dev libfreetype6-dev libfribidi-dev libfontconfig1-dev libharfbuzz-dev libtiff5-dev libxml2-dev`

No dependencies other than R are required on macOS or Windows.

Two versions of this package are available.

To install the stable version from Bioconductor, start R and run:

```{r eval = FALSE}
## Install BiocManager, which is required to install packages from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("TENET.AnnotationHub")
```

The development version containing the most recent updates is available from our GitHub repository ().
To install the development version from GitHub, start R and run:

```{r eval = FALSE}
## Install prerequisite packages to install the development version from GitHub
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}

BiocManager::install("rhielab/TENET.AnnotationHub")
```

# Loading TENET.AnnotationHub

To load the TENET.AnnotationHub package, start R and run:

```{r message = FALSE}
library(TENET.AnnotationHub)
```

# Using the included datasets

Wrapper functions are provided to allow easy access to all included datasets. Usage of each wrapper function is demonstrated below.

# Included datasets

## `ENCODE_dELS_regions`

A GRanges object containing regions of candidate cis-regulatory elements with distal enhancer-like signatures as identified by the ENCODE SCREEN project. These consist of regions with high H3K27ac and DNase signal, but low H3K4me3 signal, and located more than 2kb from GENCODE transcription start sites.

**Citation:** ENCODE Project Consortium; Moore JE, Purcaro MJ, Pratt HE, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020 Jul;583(7818):699-710. doi: 10.1038/s41586-020-2493-4. Epub 2020 Jul 29. Erratum in: Nature. 2022 May;605(7909):E3. PMID: 32728249; PMCID: PMC7410828.

This dataset consists of 786,756 ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
ENCODE_dELS_regions(metadata = TRUE)

## Retrieve the object itself
ENCODE_dELS_regions()
```

## `ENCODE_pELS_regions`

A GRanges object containing regions of candidate cis-regulatory elements with proximal enhancer-like signatures as identified by the ENCODE SCREEN project. These consist of regions with high H3K27ac and DNase signal, but low H3K4me3 signal, and located 2kb or less from GENCODE transcription start sites.

**Citation:** ENCODE Project Consortium; Moore JE, Purcaro MJ, Pratt HE, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020 Jul;583(7818):699-710. doi: 10.1038/s41586-020-2493-4. Epub 2020 Jul 29. Erratum in: Nature. 2022 May;605(7909):E3. PMID: 32728249; PMCID: PMC7410828.

This dataset consists of 171,292 ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
ENCODE_pELS_regions(metadata = TRUE)

## Retrieve the object itself
ENCODE_pELS_regions()
```

## `ENCODE_PLS_regions`

A GRanges object containing regions of candidate cis-regulatory elements with promoter-like signatures as identified by the ENCODE SCREEN project. These consist of regions with high H3K4me3 and DNase signal, and located within 200 bp of a GENCODE transcription start site.

**Citation:** ENCODE Project Consortium; Moore JE, Purcaro MJ, Pratt HE, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020 Jul;583(7818):699-710. doi: 10.1038/s41586-020-2493-4. Epub 2020 Jul 29. Erratum in: Nature. 2022 May;605(7909):E3. PMID: 32728249; PMCID: PMC7410828.

This dataset consists of 40,734 ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
ENCODE_PLS_regions(metadata = TRUE)

## Retrieve the object itself
ENCODE_PLS_regions()
```

## `TENET_10_cancer_panel_enhancer_regions`

A composite GRanges object containing regions of putative enhancer elements from 10 different cancer types (BRCA, BLCA, COAD, ESCA, HNSC, KIRP, LIHC, LUAD, LUSC, and THCA) primarily for use in the TENET Bioconductor package. This dataset is composed of H3K27ac and H3K4me1 peaks from ChIP-seq datasets collected from Cistrome.org and processed using the ENCODE pipelines. For additional information on component datasets, see the manifest file hosted at .
The peaks within each cancer type are reduced, but the final dataset with peaks across all 10 cancer types is not reduced, and consists of 4,798,784 ranges, with a single metadata column `TYPE`, which lists which of the ten cancer types each range represents.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_10_cancer_panel_enhancer_regions(metadata = TRUE)

## Retrieve the object itself
TENET_10_cancer_panel_enhancer_regions()
```

## `TENET_10_cancer_panel_open_chromatin_regions`

A composite GRanges object containing regions of open chromatin from 10 different cancer types (BRCA, BLCA, COAD, ESCA, HNSC, KIRP, LIHC, LUAD, LUSC, and THCA) primarily for use in the TENET Bioconductor package. This dataset is composed of peaks from DNase I and ATAC-seq datasets collected from Cistrome.org and processed using the ENCODE guidelines, along with additional TCGA ATAC-seq peaks from cancer samples of these ten types. For additional information on component datasets, see the manifest file hosted at .

The peaks within each cancer type are reduced, but the final dataset with peaks across all 10 cancer types is not reduced, and consists of 7,514,441 ranges, with a single metadata column `TYPE`, which lists which of the ten cancer types each range represents.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_10_cancer_panel_open_chromatin_regions(metadata = TRUE)

## Retrieve the object itself
TENET_10_cancer_panel_open_chromatin_regions()
```

## `TENET_10_cancer_panel_promoter_regions`

A composite GRanges object containing regions of putative promoter elements from 10 different cancer types (BRCA, BLCA, COAD, ESCA, HNSC, KIRP, LIHC, LUAD, LUSC, and THCA) primarily for use in the TENET Bioconductor package. This dataset is composed of H3K4me3 peaks from ChIP-seq datasets collected from Cistrome.org and processed using the ENCODE guidelines. For additional information on component datasets, see the manifest file hosted at .
The peaks within each cancer type are reduced, but the final dataset with peaks across all 10 cancer types is not reduced, and consists of 2,627,647 ranges, with a single metadata column `TYPE`, which lists which of the ten cancer types each range represents.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_10_cancer_panel_promoter_regions(metadata = TRUE)

## Retrieve the object itself
TENET_10_cancer_panel_promoter_regions()
```

## `TENET_consensus_enhancer_regions`

A composite GRanges object containing regions of putative enhancer elements from a variety of sources, primarily for use in the TENET Bioconductor package. This dataset is composed of regions of strong enhancers as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome (the following 4 states represent strong enhancers: 7: Genic enhancer1, 8: Genic enhancer2, 9: Active Enhancer 1, and 10: Active Enhancer 2), as well as regions of human permissive enhancers identified by the FANTOM5 project in phase 1 and phase 2. For additional information on component datasets, see the manifest file hosted at .

**Citations:**

Roadmap Epigenomics Consortium; Kundaje A, Meuleman W, Ernst J, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015 Feb 19;518(7539):317-30. doi: 10.1038/nature14248. PMID: 25693563; PMCID: PMC4530010.

Lizio M, Harshbarger J, Shimoji H, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol 16(1), 22 (2015).

Abugessaisa I, Ramilowski JA, Lizio M, et al. FANTOM enters 20th year: expansion of transcriptomic atlases and functional annotation of non-coding RNAs. Nucleic Acids Res. 2021 Jan 8;49(D1):D892-D898. doi: 10.1093/nar/gkaa1054. PMID: 33211864; PMCID: PMC7779024.

This dataset consists of 403,602 reduced ranges and has no metadata columns.
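
The dataset descriptions distinguish between ranges that are "reduced" (overlapping ranges merged into one) within each cancer type and ranges reduced across a whole dataset. The sketch below illustrates that distinction with `GenomicRanges::reduce()`. It uses a small set of made-up ranges, not data from this package, and the object names (`toy`, `per_type`, `across_types`) are hypothetical; it is not necessarily how the package datasets were built.

```r
library(GenomicRanges)

## Toy ranges (NOT taken from the package): three BRCA ranges and one LUAD
## range on chr1, with a TYPE metadata column like the 10 cancer panel datasets
toy <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(
        start = c(100, 150, 400, 120),
        end = c(200, 250, 500, 300)
    ),
    TYPE = c("BRCA", "BRCA", "BRCA", "LUAD")
)

## Reduce within each type: the two overlapping BRCA ranges merge, while the
## LUAD range stays separate even though it overlaps them
per_type_list <- reduce(split(toy, toy$TYPE))
per_type <- unlist(per_type_list, use.names = FALSE)
per_type$TYPE <- rep(names(per_type_list), lengths(per_type_list))

## Reduce across all types: every overlapping range merges, and the TYPE
## column is dropped
across_types <- reduce(toy)

length(per_type) # 3 ranges: two BRCA, one LUAD
length(across_types) # 2 ranges
```

To work with a single cancer type from one of the panel datasets, subsetting on the `TYPE` column in the same way (for example, `x[x$TYPE == "BRCA"]` for a GRanges object `x`) should be sufficient.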

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_consensus_enhancer_regions(metadata = TRUE)

## Retrieve the object itself
TENET_consensus_enhancer_regions()
```

## `TENET_consensus_open_chromatin_regions`

A composite GRanges object containing regions of open chromatin from a variety of sources, primarily for use in the TENET Bioconductor package. This dataset is composed of DNase I hypersensitive regions from the master list compiled from 125 cell types by ENCODE, lifted over to the hg38 genome, along with TCGA ATAC-seq peaks from 410 cancer samples of 23 cancer types. For additional information on component datasets, see the manifest file hosted at .

**Citations:**

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74. doi: 10.1038/nature11247. PMID: 22955616; PMCID: PMC3439153.

Thurman RE, Rynes E, Humbert R, et al. The accessible chromatin landscape of the human genome. Nature. 2012 Sep 6;489(7414):75-82. doi: 10.1038/nature11232. PMID: 22955617; PMCID: PMC3721348.

Corces MR, Granja JM, Shams S, et al. The chromatin accessibility landscape of primary human cancers. Science. 2018 Oct 26;362(6413):eaav1898. doi: 10.1126/science.aav1898. PMID: 30361341; PMCID: PMC6408149.

This dataset consists of 2,525,827 reduced ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_consensus_open_chromatin_regions(metadata = TRUE)

## Retrieve the object itself
TENET_consensus_open_chromatin_regions()
```

## `TENET_consensus_promoter_regions`

A composite GRanges object containing regions of putative promoter elements from a variety of sources, primarily for use in the TENET Bioconductor package.
This dataset is composed of regions flanking transcription start sites as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome (the following 4 states represent regions flanking transcription start sites: 1: Active TSS, 2: Flanking TSS, 3: Flanking TSS Upstream, and 4: Flanking TSS Downstream). For additional information on component datasets, see the manifest file hosted at .

**Citation:** Roadmap Epigenomics Consortium; Kundaje A, Meuleman W, Ernst J, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015 Feb 19;518(7539):317-30. doi: 10.1038/nature14248. PMID: 25693563; PMCID: PMC4530010.

This dataset consists of 361,315 reduced ranges and has no metadata columns.

```{r}
## Retrieve the AnnotationHub metadata for the object
TENET_consensus_promoter_regions(metadata = TRUE)

## Retrieve the object itself
TENET_consensus_promoter_regions()
```

# Session info

```{r}
sessionInfo()
```