--- title: "segmentSeq: methods for identifying small RNA loci from high-throughput sequencing data" author: Thomas J. Hardcastle date: "`r format(Sys.time(), '%d %B, %Y')`" package: segmentSeq output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{segmentSeq: small RNA locus detection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction High-throughput sequencing technologies allow the production of large volumes of short sequences, which can be aligned to the genome to create a set of *matches* to the genome. By looking for regions of the genome which to which there are high densities of matches, we can infer a segmentation of the genome into regions of biological significance. The methods we propose allows the simultaneous segmentation of data from multiple samples, taking into account replicate data, in order to create a consensus segmentation. This has obvious applications in a number of classes of sequencing experiments, particularly in the discovery of small RNA loci and novel mRNA transcriptome discovery. We approach the problem by considering a large set of potential *segments* upon the genome and counting the number of tags that match to that segment in multiple sequencing experiments (that may or may not contain replication). We then adapt the empirical Bayesian methods implemented in the ``baySeq`` package [Hardcastle2010](#Hardcastle2010) to establish, for a given segment, the likelihood that the count data in that segment is similar to background levels, or that it is similar to the regions to the left or right of that segment. We then rank all the potential segments in order of increasing likelihood of similarity and reject those segments for which there is a high likelihood of similarity with the background or the regions to the left or right of the segment. This gives us a large list of overlapping segments. We reduce this list to identify non-overlapping loci by choosing, for a set of overlapping segments, the segment which has the lowest likelihood of similarity with either background or the regions to the left or right of that segment and rejecting all other segments that overlap with this segment. For fuller details of the method, see Hardcastle *et al.* [Hardcastle2011](#Hardcastle2011). # Preparation We begin by loading the ``segmentSeq`` package. ```{r, message=FALSE} library(segmentSeq) ``` Note that because the experiments that ``segmentSeq`` is designed to analyse are usually massive, we should use (if possible) parallel processing as implemented by the ``parallel`` package. If using this approach, we need to begin by define a *cluster*. The following command will use eight processors on a single machine; see the help page for ``makeCluster`` for more information. If we don't want to parallelise, we can proceed anyway with a ``NULL`` cluster. ```{r results='hide', eval = TRUE} if(FALSE) { # set to FALSE if you don't want parallelisation numCores <- min(8, detectCores()) cl <- makeCluster(numCores) } else { cl <- NULL } ``` ```{r echo = FALSE, results='hide'} set.seed(1) ``` The ``readGeneric`` function is able to read in tab-delimited files which have appropriate column names, and create an ``alignmentData`` object. Alternatively, if the appropriate column names are not present, we can specify which columns to use for the data. In either case, to use this function we pass a character vector of files, together with information on which data are to be treated as replicates to the function. We also need to define the lengths of the chromosome and specifiy the chromosome names as a character. The data here, drawn from text files in the 'data' directory of the ``segmentSeq`` package are taken from the first million bases of an alignment to chromosome 1 and the first five hundred thousand bases of an alignment to chromosome 2 of *Arabidopsis thaliana* in a sequencing experiment where libraries 'SL9' and 'SL10' are replicates, as are 'SL26' and 'SL32'. Libraries 'SL9' and 'SL10' are sequenced from an Argonaute 6 IP, while 'SL26' and 'SL32' are an Argonaute 4 IP. A similar function, ``readBAM`` performs the same operation on files in the BAM format. Please consult the help page for further details. ```{r } datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c("AGO6", "AGO6", "AGO4", "AGO4") aD <- readGeneric(files = libfiles, dir = datadir, replicates = replicates, libnames = libnames, polyLength = 10, header = TRUE, gap = 200) aD ``` Next, we process this ``alignmentData`` object to produce a ``segData`` object. This ``segData`` object contains a set of potential segments on the genome defined by the start and end points of regions of overlapping alignments in the ``alignmentData`` object. It then evaluates the number of tags that hit in each of these segments. ```{r } sD <- processAD(aD, cl = cl) sD ``` ```{r echo = FALSE, results = 'hide'} if(nrow(sD) != 1452) stop("sD object is the wrong size (should have 1452 rows). Failure.") ``` We can now construct a segment map from these potential segments. # Segmentation by heuristic methods A fast method of segmentation can be achieved by exploiting the bimodality of the densities of small RNAs in the potential segments. In this approach, we assign each potential segment to one of two clusters for each replicate group, either as a segment or a null based on the density of sequence tags within that segment. We then combine these clusterings for each replicate group to gain a consensus segmentation map. ```{r } hS <- heuristicSeg(sD = sD, aD = aD, RKPM = 1000, largeness = 1e8, getLikes = TRUE, cl = cl) ``` ```{r echo = FALSE, results = 'hide'} if(nrow(hS) != 507) stop("hS object is the wrong size (should have 507 rows). Failure.") if(any(abs(colSums(exp(hS@locLikelihoods)) - c(88, 208)) > 2)) stop("hS object contains wrong number of loci. Likely failure.") ``` # Segmentation by empirical Bayesian methods A more refined approach to the problem uses an existing segment map (or, if not provided, a segment map defined by the ``hS`` function) to acquire empirical distributions on the density of sequence tags within a segment. We can then estimate posterior likelihoods for each potential segment as being either a true segment or a null. We then identify all potential segments in the with a posterior likelihood of being a segment greater than some value 'lociCutoff' and containing no subregion with a posterior likelihood of being a null greater than 'nullCutoff'. We then greedily select the longest segments satisfying these criteria that do not overlap with any other such segments in defining our segmentation map. ```{r } cS <- classifySeg(sD = sD, aD = aD, cD = hS, cl = cl) cS ``` ```{r echo = FALSE, results = 'hide'} if(abs(nrow(cS) - 64) > 2) stop("cS object is the wrong size (should have ~142 rows). Likely failure.") if(any(abs(colSums(exp(cS@locLikelihoods)) - c(29,36)) > 2)) stop("cS object contains wrong number of loci. Likely failure.") ``` By one of these methods, we finally acquire an annotated ``lociData`` object, with the annotations describing the co-ordinates of each segment. We can use this ``lociData`` object, in combination with the ``alignmentData`` object, to plot the segmented genome. ```{r fig = FALSE, height = 10, width = 12, fig.cap="The segmented genome (first $10^5$ bases of chromosome 1)."} par(mfrow = c(2,1), mar = c(2,6,2,2)) plotGenome(aD, hS, chr = ">Chr1", limits = c(1, 1e5), showNumber = FALSE, cap = 50) plotGenome(aD, cS, chr = ">Chr1", limits = c(1, 1e5), showNumber = FALSE, cap = 50) ``` Given the calculated likelihoods, we can filter the segmented genome by controlling on likelihood, false discovery rate, or familywise error rate ```{r } loci <- selectLoci(cS, FDR = 0.05) loci ``` The ``lociData`` objects can now be examined for differential expression with the ``baySeq`` package. First we define the possible models of differential expression on the data. In this case, the models are of non-differential expression and pairwise differential expression. ```{r } groups(cS) <- list(NDE = c(1,1,1,1), DE = c("AGO6", "AGO6", "AGO4", "AGO4")) ``` Then we get empirical distributions on the parameter space of the data. ```{r } cS <- getPriors(cS, cl = cl) ``` Then we get the posterior likelihoods of the data conforming to each model. Since the ``cS`` object contains null regions as well as true loci, we will use the ``nullData = TRUE`` option to distinguish between non-differentially expressed loci and non-expressed regions. By default, the loci likelihoods calculated earlier will be used to weight the initial parameter fit in order to detect null data. ```{r } cS <- getLikelihoods(cS, nullData = TRUE, cl = cl) ``` We can examine the highest likelihood non-expressed ('null') regions ```{r } topCounts(cS, NULL, number = 3) ``` The highest likelihood expressed but non-differentially expressed regions ```{r } topCounts(cS, "NDE", number = 3) ``` And the highest likelihood differentially expressed regions ```{r } topCounts(cS, "DE", number = 3) ``` Finally, to be a good citizen, we stop the cluster we started earlier: ```{r stopCluster} if(!is.null(cl)) stopCluster(cl) ``` # Bibliography Thomas J. Hardcastle and Krystyna A. Kelly. *baySeq: Empirical Bayesian Methods For Identifying Differential Expression In Sequence Count Data.* BMC Bioinformatics (2010). Thomas J. Hardcastle and Krystyna A. Kelly and David C. Baulcombe. *Identifying small RNA loci from high-throughput sequencing data.* Bioinformatics (2012). # Session Info ```{r } sessionInfo() ```