1 Synopsis

seq - set - vis

seqsetvis enables the visualization and analysis of multiple genomic datasets. Although seqsetvis was designed for the comparison of mulitple ChIP-seq datasets, this package is domain-agnostic and allows the processing of multiple genomic coordinate files (bed-like files) and signal files (bigwig files).

seqsetvis is ideal for the following tasks:

  1. Feature assessment and comparison.
  2. Differential enrichment or expression results assessment.
  3. Replicate comparison and reproducibility analysis.
  4. Chromatin state inspection.
  5. Clustering of histone modification and/or transcription factor enrichment at features such as promoters or enhancers.

seqsetvis accepts and outputs the Bioconductor standard GRanges object while leveraging the power and speed of the data.table package (an enhancement of data.frame) with the flexibility of ggplot2 to provide users with visualizations that are simultaneously publication ready while still being highly customizable. By default, data.table is used entirely internally so users unfamiliar with its use can safely ignore it. That said, fans of data.table can use seqsetvis in a data.table exclusive mode if they wish.

2 Features

  • Mostly it “just works” with minimal assumed knowledge of R or other Bioconductor packages.
  • All plots produced are ggplot2 objects.
    • easily customized according to user preferences by appending ggplot2 functions.
    • powerful plot arrangement with with cowplot, gridExtra, etc.
  • Although various plots are implemented, the underlying data are simple in structure and easily accessible so you can do what you want with them.
    • membership table for feature overlaps is just a logical matrix.
    • signal data is stored in a tidy formatted data.table, ideal for ggplot2 visualization.

Although initially designed to focus on peak call sets, seqsetvis accepts any data in GRanges format and is therefore flexible enough to work dowstream of many genomics tools and packages. Indeed, set comparison visualizations are genomics agnostic and also accept generic data types defining sets.

3 Functions

  • ssv* - most primary user facing functions are prefixed by ssv
    • ssvOverlapIntervalSets - derives sets from a list of GRanges
    • ssvMakeMembTable - coerces various ways of defining sets to a membership table
    • ssvFactorizeMembTable - converts a multi column membership table into a single factor.
    • ssvFetch* - retrieves properly formatted data from .bigwig and .bam files
    • ssvFeature* - visualizations of set counts and overlaps
    • ssvSignal* - visualizations of signal mapped to genomic coordinates in sets context
  • utility functions
    • easyLoad_* - loads files into a list of GRanges for input into ssv* functions
    • miscellaneous functions such as wrappers for data.table functions.

4 Installation and Loading

4.1 From Bioconductor

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("seqsetvis")

4.2 Load the library

library(seqsetvis)

4.3 Load optional useful libraries

library(GenomicRanges)
library(data.table)
library(cowplot)
theme_set(cowplot::theme_cowplot())

5 Defining sets

seqsetvis has several ways of accepting set information. Since seqsetvis was originally intended to focus on comparing sets of peak calls, a list of GRanges or a GRangesList is directly supported but there are more generic alternatives.

5.1 Overlapping a list of GRanges

Many functions in seqsetvis require some kind of sets definition as input. Sets can be defined in one of several ways.
ssvOverlapIntervalSets() creates a single GRanges from a list of GRanges with a logical membership table as metadata columns defining sets.

olaps = ssvOverlapIntervalSets(CTCF_in_10a_narrowPeak_grs)
head(olaps)
## GRanges object with 6 ranges and 3 metadata columns:
##     seqnames            ranges strand | MCF10A_CTCF MCF10AT1_CTCF MCF10CA1_CTCF
##        <Rle>         <IRanges>  <Rle> |   <logical>     <logical>     <logical>
##   1     chr1 17507898-17508319      * |       FALSE          TRUE         FALSE
##   2     chr1 40855546-40855866      * |        TRUE         FALSE         FALSE
##   3     chr1 53444333-53445008      * |        TRUE          TRUE          TRUE
##   4     chr1 64137020-64137657      * |        TRUE          TRUE          TRUE
##   5     chr1 65075339-65076014      * |        TRUE          TRUE         FALSE
##   6     chr1 92818291-92818802      * |        TRUE          TRUE          TRUE
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

ssvOverlapIntervalSets() also accepts a GRangesList if that’s more natural.

olaps_fromGRangesList = ssvOverlapIntervalSets(
    GenomicRanges::GRangesList(CTCF_in_10a_narrowPeak_grs))

The membership data contained in the metadata columns of olaps is
the data structure seqsetvis set overlap functions use internally.
The S4 method ssvMakeMembTable() handles the necessary conversions and can be extended to accept a new class of data as long as it converts to an already accepted data type and calls ssvMakeMembTable again.

For instance, ssvMakeMembTable(olaps) extracts the membership table from the mcols of a single GRanges object.

head(ssvMakeMembTable(olaps))
##   MCF10A_CTCF MCF10AT1_CTCF MCF10CA1_CTCF
## 1       FALSE          TRUE         FALSE
## 2        TRUE         FALSE         FALSE
## 3        TRUE          TRUE          TRUE
## 4        TRUE          TRUE          TRUE
## 5        TRUE          TRUE         FALSE
## 6        TRUE          TRUE          TRUE

5.2 Overlapping a list of more generic data

ssvMakeMembTable() can handle sets defined by a list of vectors.
These happen to be numeric but nearly anything that as.character() works on should be acceptable.

my_set_list = list(1:3, 2:3, 3:6)
ssvMakeMembTable(my_set_list)
##   set_A set_B set_C
## 1  TRUE FALSE FALSE
## 2  TRUE  TRUE FALSE
## 3  TRUE  TRUE  TRUE
## 4 FALSE FALSE  TRUE
## 5 FALSE FALSE  TRUE
## 6 FALSE FALSE  TRUE

Supplying a named list is recommended

names(my_set_list) = c("first", "second", "third")
ssvMakeMembTable(my_set_list)
##   first second third
## 1  TRUE  FALSE FALSE
## 2  TRUE   TRUE FALSE
## 3  TRUE   TRUE  TRUE
## 4 FALSE  FALSE  TRUE
## 5 FALSE  FALSE  TRUE
## 6 FALSE  FALSE  TRUE

A list of character vectors works just as well.

my_set_list_char = lapply(my_set_list, function(x)letters[x])
ssvMakeMembTable(my_set_list_char)
##   first second third
## a  TRUE  FALSE FALSE
## b  TRUE   TRUE FALSE
## c  TRUE   TRUE  TRUE
## d FALSE  FALSE  TRUE
## e FALSE  FALSE  TRUE
## f FALSE  FALSE  TRUE

Notice the rownames changed as the items changed.

6 Visualization of set overlaps, ssvFeature*

The ssvFeature* family of functions all accept a valid set definition and output a ggplot.

6.1 Barplot

Barplots are useful for for conveying the size of each set/group.

ssvFeatureBars(olaps)

6.2 Pie chart

Pie charts are an alternative to barplots but are generally more difficult to interpret.

ssvFeaturePie(olaps)

6.3 Venn diagram

Venn diagrams show the number of items falling into each overlap category. At release, seqsetvis is limited to displaying 3 sets via venn diagram. This will be increased in a future update.

ssvFeatureVenn(olaps)

6.4 Euler diagram

Often confused with Venn diagrams, Euler diagrams attempt to convey the size of each set and the degree to which the overlap with circles or ellipses of varying sizes. For data of any appreciable complexity, Euler diagrams are approximations by neccessity.

There is no hard limit to the number of sets seqsetvis can show in an Euler diagram but computation time increases rapidly along with the inaccuracy of approximation.

ssvFeatureEuler(olaps)

6.5 Binary Heatmap

An alternative to Venn/Euler diagrams, this “heatmap” contains one column per set and one row per item with item membership per set mapped to color. In this example black indicates TRUE for membership hence the top 2/3 is black for all three sets.

ssvFeatureBinaryHeatmap(olaps)

7 Visualization of genomic signal, ssvSignal*

The ssvSignal* family of functions all take as input continuous signal data mapped to the genome. This type of data is generally stored in bedGraph, wiggle, or bigWig format. At this time, seqsetvis directly supports the bigwig format and creating pileups from indexed .bam files.
This is accomplished via the ssvFetchBigwig() and ssvFetchBam() functions.

If you don’t have .bigWig (or .bigwig or .bw) files you should be able to easily generate some using rtracklayer export.bw() alongside import.bedGraph() or import.wig() as appropriate.

7.1 Loading bigwig data

bigwig_files = c(
    system.file("extdata", "MCF10A_CTCF_FE_random100.bw", 
                package = "seqsetvis"),
    system.file("extdata", "MCF10AT1_CTCF_FE_random100.bw", 
                package = "seqsetvis"),
    system.file("extdata", "MCF10CA1_CTCF_FE_random100.bw", 
                package = "seqsetvis")
)
names(bigwig_files) = sub("_FE_random100.bw", "", basename(bigwig_files))
# names(bigwig_files) = letters[1:3]
olap_gr = CTCF_in_10a_overlaps_gr
target_size = quantile(width(olap_gr), .75)
window_size = 50
target_size = round(target_size / window_size) * window_size
olap_gr = resize(olap_gr, target_size, fix = "center")
bw_gr = ssvFetchBigwig(bigwig_files, olap_gr, win_size = window_size)

bw_gr
olap_gr = CTCF_in_10a_overlaps_gr
bw_gr = CTCF_in_10a_profiles_gr

Overlap information must be transformed into a factor tied to id to be useful.

olap_groups = ssvFactorizeMembTable(mcols(olap_gr))

7.2 Line plots

Line plots are perhaps the most natural and common type of visualization for this type of continuous genomic data. Many genome browsers and other BioC packages support this type of visualization. These other approaches are very good at show lot’s of different data of multiple types at single regions. In contrast, seqsetvis is focused on bringing in data from multiple regions simultaneously.

7.2.1 Individual line plots

Basic line plot of a subset of regions

# facet labels will display better if split into multiple lines
bw_gr$facet_label = sub("_", "\n", bw_gr$sample)
ssvSignalLineplot(bw_data = subset(bw_gr, id %in% 1:12), facet_ = "facet_label")