%\VignetteIndexEntry{soggi} %\VignettePackage{soGGi} %\VignetteEngine{knitr::knitr} % To compile this document % library('knitr'); rm(list=ls()); knit('soggi.Rnw') \documentclass[12pt]{article} <>= library("knitr") opts_chunk$set(tidy=FALSE,dev="png",fig.show="hide", fig.width=4,fig.height=4.5, message=FALSE) @ <>= BiocStyle::latex() @ <>= library("soGGi") @ \author{Thomas Carroll$^{1*}$\\[1em] \small{$^{1}$ Bioinformatics Facility, MRC Clincal Sciences Centre;} \\ \small{\texttt{$^*$thomas.carroll (at)imperial.ac.uk}}} \title{Visualisation, transformations and arithmetic operations for grouped genomic intervals} \begin{document} <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ \maketitle \begin{abstract} The soGGi package provides tools to summarise sequence data, genomic signal and motif occurrence over grouped genomic intervals as well to perform complex subsetting, arithmetic operations and transformations on these genomic summaries prior to visualisation. As with other Bioconductor packages such as CoverageView and seqPlots. soGGi plots average signal across groups of genomic regions. soGGI provides flexibity in both its data aquisition and visualisation. Single or paired-end BAM files, bigWigs, rleLists and PWM matrices can be provided as input alongside a GRanges object or BED file location. soGGi can plot summarises across actual size and normalised features such as genomic intervals of differing length ( as with genes). The use of normalised size plots for genes can however obsure high resolution events around the TSS. To address this, combination plots can be created within soGGI allowing for fine detail at the edges of normalised regions. Arithmetic operation and transformations can be easily performed on soGGi objects allowing for rapid operations between profiles such as the substraction of input signal or quantile normalisation of replicates within a group. soGGi integrates the ggplot2 package and add functionality to rapidly subset, facet and colour profiles by their overlaps with GenomicRanges objects or grouping by metadata column IDs. The plotting, arithmetic and tranformation functions within soGGi allow for rapid evaluation of groups of genomic intervals and provide a toolset for user-defined analysis of these summaries over groups. \vspace{1em} \end{abstract} <>= options(digits=3, width=80, prompt=" ", continue=" ") @ \newpage \tableofcontents \section{Standard workflow} \subsection{The soggi function} The \Rfunction{regionPlot()} function is used to summarise signal, reads or PWM occurrence over grouped genomic intervals. Input can be BAM, bigWig or a PWM matrix and the regions to summarise over a character string of path to BED file or a GRanges object. The full set of soggi function arguments can be found in help pages. In this example, signal coverage is summarised from a BAM file using the defaults. The default style of region plot is to produce a normalised by size region plot. <>= library(soGGi) chipExample <- regionPlot("pathToBAM/mybam.bam",myGRangesObject,format="bam") @ A pre-computed data set is included in the package containing averages profiles created with command above for DNAse, Pol2, H3k9ac and H3k3me3. The object itself cantains all counts along interval region windows in assays slots and information of the samples in exptData slot accessible by \Rfunction{assays()} and \Rfunction{exptData} respectively. <>= library(soGGi) data(chipExampleBig) chipExampleBig @ This object contains 10 sets of profiles for 200 genes. The object can be subset using [[ to select samples of interest. <>= chipExampleBig[[1]] chipExampleBig$highdnase @ Similarly profile objects can be concatenated or bound together using c and rbind. <>= c(chipExampleBig[[1]],chipExampleBig[[2]]) rbind(chipExampleBig[[1]],chipExampleBig[[2]]) @ \subsection{Plotting profiles} The \Rfunction{plotRegion()} function is used to produce profile plots. \Rfunction{plotRegion()} uses ggplot2 to generate plots and so returned object can be highly customisable using ggplot2 methods. <>= plotRegion(chipExampleBig[[3]]) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quicka-1.png} \caption{ \textbf{Example profile plot.} The plot generated by plotRegion() function on a single sample soGGi object. The x-axis shows the normalised length and the y-axis shows the coverage in windows. A Clear peak around the TSS can be observed for this Pol2 ChIP-seq profile. } \label{Example-1} \end{figure} When dealing with objects with multiple samples, the arguments groupBy and colourBy specify whether to facet or colour by Sample/Group respectively. <>= library(ggplot2) plotRegion(chipExampleBig,colourBy="Sample", groupBy="Sample", freeScale=TRUE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick2-1.png} \caption{ \textbf{Multi-Sample profile plot.} Here multiple samples are plotted simultaneously. Enrichment around TSS can be seen for all profiles with H3k9ac showing more enrichment in the gene body. } \label{multiplot} \end{figure} Here some samples can be seen to be noisy. Windsorisation can be applied when plotting using the outliers argument. <>= library(ggplot2) plotRegion(chipExampleBig,colourBy="Sample", outliers=0.01, groupBy="Sample",freeScale=TRUE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick2b-1.png} \caption{ \textbf{Multi-Sample profile plot with windsorisation.} This multi-sample plot has windsorisation applied to outliers. The resulting plot is smoother than that seen in figure 2. } \label{mutliplot_Win} \end{figure} The \Rfunction{plotRegion()} can also be used to group genomic intervals while plotting using the \Rfunction{gts} argument. The \Rfunction{gts} argument either takes a GRanges object or a list of character vectors and the \Rfunction{summariseBy} argument to specify metadata to use. <>= library(GenomicRanges) subsetsCharacter <- list(first25 = (as.vector(rowRanges(chipExampleBig[[1]])$name[1:25])), fourth25 = as.vector(rowRanges(chipExampleBig[[1]])$name[76:100])) subsetsGRanges <- GRangesList(low=(rowRanges(chipExampleBig[[1]])[1:25]), high=rowRanges(chipExampleBig[[2]])[76:100]) plotRegion(chipExampleBig[[1]],gts=subsetsCharacter,summariseBy = "name") plotRegion(chipExampleBig[[1]],gts=subsetsGRanges) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick3-1.png} \caption{ \textbf{DNAse grouped genomic intervals plot.} This plots shows the plot for DNAse signal across normalised regions with separate profiles for each group of genomic intervals defined in gts. .} \label{gtsarguments} \end{figure} \newpage \section{Transformations and arithmetic operations} \subsection{Simple arithmetic operations on grouped profiles} Common arithmetic operations and tranformations can be used with soGGi profile objects allowing for further analysis post summarisation and iteratively over visualisations. Here we summarise RNApol2 high and low and compare between replicates. <>= pol_Profiles <- c((chipExampleBig$highPol+chipExampleBig$midPol) , (chipExampleBig$highPol_Rep2+chipExampleBig$midPol_Rep2)) plotRegion(pol_Profiles,colourBy="Sample",outliers=0.01, groupBy="Sample", freeScale=TRUE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick4-1.png} \caption{ \textbf{Plotting arithmetic results.} This plot demonstrates the ability to plot the profiles generated by the results of arithmetic operations. Here data is combined within replicate number and results plotted. } \label{arithmeticplotting} \end{figure} Common normalisations, log transformations and other mathematical functions such as mean() are also implemented to allow for the comparison within and between profiles. In this example the profiles are log2 transformed with zeros being assigned the minimum value for that region. <>= log2Profiles <- log2(chipExampleBig) plotRegion(log2Profiles,colourBy="Sample",outliers=0.01, groupBy="Sample",freeScale=TRUE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick6-1.png} \caption{ \textbf{Plotting log2 transformed profiles.} This plots shows the resulting profiles after log2 transformation of ChIP-data across antibodies. } \label{log2plotting} \end{figure} From this log2 transformed data we can look at the difference between Pol2 profiles <>= log2Polhigh <- mean(log2Profiles$highPol, log2Profiles$highPol_Rep2) log2Polmid <- mean(log2Profiles$midPol, log2Profiles$midPol_Rep2) diffPol <- log2Polhigh-log2Polmid diffh3k9ac <- log2Profiles$highk9ac-log2Profiles$midk9ac plotRegion(c(diffPol,diffh3k9ac),colourBy="Sample",outliers=0.01, groupBy="Sample", freeScale=TRUE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick6b-1.png} \caption{ \textbf{Plotting differentials.} In this plot the log2 difference between high and low samples for H3k9ac and Pol2 replicates is shown. } \label{diffplot} \end{figure} Quantile normalisation of allow windows in regions between samples can allow for better better visual comparison of changes between conditions when dealing with larger numbers of replicates. Here for demonstration we apply it two samples but with real data higher sample numbers would be recommended. <>= normHighPol <- normalise(c(chipExampleBig$highPol, chipExampleBig$highPol_Rep2), method="quantile",normFactors = 1) normMidPol <- normalise(c(chipExampleBig$midPol, chipExampleBig$midPol_Rep2), method="quantile",normFactors = 1) normPol <-c(normHighPol$highPol, normHighPol$highPol_Rep2, normMidPol$midPol, normMidPol$midPol_Rep2) plotRegion(normPol,colourBy="Sample",outliers=0.01, groupBy="Sample", freeScale=TRUE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick7-1.png} \caption{ \textbf{Quantile normalisation.} In this toy example, data has been quantiled normalised within groups and the results plotted. This demonstrates the uniformity in data following quantile normalisation. } \label{quantilenormplot} \end{figure} %-------------------------------------------------- \section{Creating GRanges combinations for plotting} %-------------------------------------------------- A common operation in the analysis of summaries over genomic intervals is to compare between different sets of grouped genomic intervals. soGGi includes helper functions to deal with grouped genomic intervals. The \Rfunction{groupByOverlaps()} function creates all combinations of grouped genomic intervals from GRangesLists and so is useful to evaluate how summaries change over subclasses of genomic intervals (such as over co-occuring peak sets) The \Rfunction{findconsensusRegions()} and \Rfunction{summitPipeline()} functions identifies consensus regions between GRanges objects and re-summits consensus region sets respectively. This approach has been previously implemented to identify reproducible peak sets between biological replciates. \subsection{Grouping genomic interval sets and plotting results} In this example, two antibodies for the transcription factor Ikaros are used to plot the signal over common and unique peaks for each antibody. First the peaks sets are defined using \Rfunction{groupByOverlaps()} <>= data(ik_Example) ik_Example peakSetCombinations <- groupByOverlaps(ik_Example) peakSetCombinations @ The output from \Rfunction{groupByOverlaps()} can then be used to subset precomputed profiles of HA and Endogenous ChIP signal. Here we apply a log2 transformation to the data before plotting and cleaning up profile with windsorisation <>= data(ik_Profiles) ik_Profiles log2Ik_Profiles <- log2(ik_Profiles) plotRegion(log2Ik_Profiles,outliers=0.01,gts=peakSetCombinations, groupBy="Group",colourBy="Sample", freeScale=TRUE) plotRegion(log2Ik_Profiles[[1]] - log2Ik_Profiles[[2]] ,outliers=0.01,gts=peakSetCombinations, groupBy="Group", colourBy="Sample", freeScale=FALSE) @ \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick10-1.png} \caption{ \textbf{Signal over common and unique Ikaros peaks.} This plot shows that, as expected, common and unique peaks show different profiles for the Ikaros antibodies. } \label{IkExample} \end{figure} This confirms that common and unique peaksets have different levels of the separate antibody signals. This can be better demonstrated by subtracting to signal sets from each other and re-plotting over groups as seen in the final example. \begin{figure} \centering \includegraphics[width=1\textwidth]{figure/quick10-2.png} \caption{ \textbf{Differential Ikaros profiles over common and unique sets.} The plot of difference in log2 HA and Endogenous Ikaros signal over peaks shows the expected difference in Ikaros antibody signaland uniformity of signal of common peaks. } \label{DiffIkaros} \end{figure} \newpage <>= toLatex(sessionInfo()) @ \end{document}