--- title: "extraChIPs: Range-Based operations" author: - name: Steve Pederson affiliation: Dame Roma Mitchell Cancer Research Laboratories, Adelaide Medical School, University of Adelaide email: stephen.pederson.au@gmail.com package: extraChIPs resource_files: - vignettes/stitched.png output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Range-Based Operations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE} knitr::opts_chunk$set(message = FALSE, warning = FALSE) ``` # Introduction This package is designed to enable the Gene Regulatory Analysis using Variable IP ([GRAVI](https://github.com/steveped/GRAVI)) workflow, as a method for detecting differential binding in ChIP-Seq datasets. As a workflow focussed on data integration, most functions provided by the package `extraChIPs` are designed to enable comparison across datasets. This vignette looks primarily at functions which work with `GenomicRanges` objects. # Installation In order to use the package `extraChIPs` and follow this vignette, we recommend using the package `BiocManager` hosted on CRAN. Once this is installed, the additional packages required for this vignette (`tidyverse`, `plyranges` and `Gviz`) can also be installed. ```{r install, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") BiocManager::install(c("tidyverse", "plyranges", "Gviz")) BiocManager::install("extraChIPs") ``` # Coercion The advent of the `tidyverse` has led to `tibble` objects becoming a common alternative to `data.frame` or `DataFrame` objects. Simple functions within `extraChIP` enable coercion from `GRanges`, `GInteractions` and `DataFrame` objects to tibble objects, with list columns correctly handled. By default these coercion functions will coerce `GRanges` elements to a character vector. Similarly, `GRanges` represented as a character column can be coerced to the ranges element of a `GRanges` object. First let's coerce from a `tibble` (or S3 `data.frame`) to a `GRanges` ```{r gr} library(tidyverse) library(extraChIPs) set.seed(73) df <- tibble( range = c("chr1:1-10:+", "chr1:5-10:+", "chr1:5-6:+"), gene_id = "gene1", tx_id = paste0("transcript", 1:3), score = runif(3) ) df gr <- colToRanges(df, "range") gr ``` Coercion back to a `tibble` will place the ranges as a character column by default. However, this can be turned off and the conventional coercion from `as.data.frame` will instead be applied, internally wrapped in `as_tibble()` ```{r as-tibble} as_tibble(gr) as_tibble(gr, rangeAsChar = FALSE) ``` A simple feature which may be useful for printing gene names using `rmarkdown` is contained in `collapseGenes()`. Here a character vector of gene names is collapsed into a `glue` object of length `, with gene names rendered in italics by default. ```{r collapse-gene, results='asis'} gn <- c("Gene1", "Gene2", "Gene3") collapseGenes(gn) ``` # Simple Operations Retaining `mcols()` The standard set operations implemented in the package `GenomicRanges` will always drop the `mcols` element by default. The `extraChIPs` functions `reduceMC()`, `setdiffMC()`, `intersectMC()` and `unionMC()` all produce the same output as their similarly-named functions, however, the `mcols()` elements in the query object are also returned. Where required, columns are coerced into `CompressedList` columns. This can particularly useful when needed to propagate the information contained in the initial ranges through to subsequent analytic steps ## Simplifying single `GRanges` objects The classical approach to defining TSS regions for a set of transcripts would be to use the function `resize`()`, setting the width as 1. ```{r tss} tss <- resize(gr, width = 1) tss ``` As we can see, two transcripts start at position 5, so we may choose to reduce this, which would lose the `mcols` element. The alternative `reduceMC()` will retain all `mcols`. ```{r reduce} GenomicRanges::reduce(tss) reduceMC(tss) ``` By default, this function will attempt to coerce `mcols` to a new `mcol` of the appropriate type, however, when multiple values are inevitable such as for the `tx_id` column above, these will be coerced to a `CompressedList`. The simplification of the multiple values seen in the `gene_id` can also be turned off if desired should repeated values be important for downstream analysis. ```{r reduce-no-simplify} reduceMC(tss, simplify = FALSE) ``` This allows for simple integration with `tidyverse` nesting strategies. ```{r reduce-unnest} reduceMC(tss, simplify = FALSE) %>% as_tibble() %>% unnest(everything()) ``` Whilst `reduceMC` relies on the range-reduction as implemented in `GenomicRanges::reduce()`, some alternative approaches are included, such as `chopMC()`, which finds identical ranges and nests the mcols element as `CompressedList` objects. ```{r chop} chopMC(tss) ``` In the case of the object `tss`, this output is identical to `reduceMC()`, however, given that there are no identical ranges in `gr` the two functions would behave very differently on that object. A final operation for simplifying `GRanges` objects would be `distinctMC()` which is a wrapper to `dplyr]::distinct` incorporating both the range and `mcols`. The columns to search can be called using `` approaches as detailed in the manual. ```{r distinct} distinctMC(tss) distinctMC(tss, gene_id) ``` ## Set Operations with Two `GRanges` Objects Whilst `reduce/reduceMC` is applied to a single `GRanges` object, the set operation functions `intersect`, `setdiff` and `union` are valuable approaches for comparing ranges. Using the `*MC()` functions will retain `mcols` elements from the *query range*. ```{r setopts} peaks <- GRanges(c("chr1:1-5", "chr1:9-12:*")) peaks$peak_id <- c("peak1", "peak2") GenomicRanges::intersect(gr, peaks, ignore.strand = TRUE) intersectMC(gr, peaks, ignore.strand = TRUE) setdiffMC(gr, peaks, ignore.strand = TRUE) unionMC(gr, peaks, ignore.strand = TRUE) ``` There is a performance overhead to preparation of mcols as `CompressedList` objects and all `mcols` in the query object will be returned. If only wishing to retain a subset of `mcols`, these should be selected prior to passing to these functions. ```{r setopts-plyranges} library(plyranges) gr %>% select(tx_id) %>% intersectMC(peaks, ignore.strand = TRUE) ``` ## Overlapping proportions Whilst the functions `findOverlaps()` and `overlapsAny()` are extremely useful, the addition of `propOverlap()` returns a numeric vector with the proportion of each query range (`x`) which overlaps any range in the subject range (`y`) ```{r prop-overlaps} propOverlap(gr, peaks) ``` This is also extended to enable comparison across multiple features and to classify each peak by which features that it overlaps the most strongly. ```{r best-overlap} bestOverlap(gr, peaks, var = "peak_id") ``` # More Complex Operations ## Range Partitioning In addition to standard set operations, one set of ranges can be used to partition another set of ranges, returning `mcols` from both ranges. Ranges from the query range (`x`) are returned after being partitioned by the ranges in the subject range (`y`). Subject ranges used for partitioning *must be non-overlapping*, and if overlapping ranges are provided, these will be reduced prior to partitioning. This enables the identification of specific ranges from the query range (`x`) which overlap ranges from the subject range (`y`) Under this approach, `mcols` from both `query` and `subject` ranges will be returned to enable the clear ranges which are common and distinct within the two sets of ranges. ```{r partition-ranges} partitionRanges(gr, peaks) ``` Whilst this shares some similarity with `intersectMC()` the additional capabilities provide greater flexibility. ```{r partition-ranges-subset} partitionRanges(gr, peaks) %>% subset(is.na(peak_id)) ``` ## Stitching Ranges Using the function `stitchRanges()` we are able to join together sets of nearby ranges, but with the option of placing clear barriers between ranges, across which ranges cannot be joined. This may be useful if joining enhancers to form putative super-enhancers, but explicitly omitting defined promoter regions. ```{r stitch-ranges} enh <- GRanges(c("chr1:1-10", "chr1:101-110", "chr1:181-200")) prom <- GRanges("chr1:150:+") se <- stitchRanges(enh, exclude = prom, maxgap = 100) se ``` As a visualisation (below) ranges within 100bp were stitched together, however the region defined as a 'promoter' acted as a barrier and ranges were not stitched together across this barrier. ```{r plot-stitched, echo = FALSE, fig.height=3, fig.width=8} knitr::include_graphics("stitched.png") ``` # Session Info ```{r} sessionInfo() ```