%% LyX 2.0.1 created this file. For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[a4paper,english]{scrartcl}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{url}
\usepackage[unicode=true,pdfusetitle,
 bookmarks=true,bookmarksnumbered=false,bookmarksopen=false,
 breaklinks=true,pdfborder={0 0 0},backref=false,colorlinks=false]{hyperref}
\usepackage{breakurl}
\makeatletter

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\special{papersize=\the\paperwidth,\the\paperheight}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
<<echo=FALSE>>=
if(exists(".orig.enc")) options(encoding = .orig.enc)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\VignetteIndexEntry{Introduction to the TSSi package: Identification of Transcription Start Sites}
%\VignettePackage{TSSi}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rvar}[1]{{\textit{\textsf{#1}}}}

%% avoid single lines
\widowpenalty=10000
\clubpenalty=10000

%% format captions
\usepackage[small,bf,margin=.5cm]{caption}

\makeatother

\begin{document}

\title{Introduction to the \Rpackage{TSSi} package:\\
Identification of Transcription Start Sites}
\author{Julian Gehring, Clemens Kreutz}

\maketitle

<<echo=FALSE>>=
set.seed(1)
options(width=65, SweaveHooks=list(fig=function() par(mar=c(5.1, 5.1, 4.1, 1.5))))
@

\begin{abstract}
Along with the advances in high-throughput sequencing, the detection
of transcription start sites \emph{(TSS)} using CAP-capture
techniques has evolved recently.  While many experimental
applications exist, the analysis of such data remains non-trivial.
To address this, the \Rpackage{TSSi} package offers flexible
statistical preprocessing of CAP-capture data and an automated
identification of start sites.
\end{abstract}

\section{Introduction}

High-throughput sequencing has become an essential experimental
approach for investigating genomes and transcriptional processes.
While cDNA sequencing (\emph{RNA-seq}) using random priming and/or
fragmentation of cDNA results in a shallow distribution of reads,
typically biased towards the 3' end, approaches like CAP-capture
enrich 5' ends and yield more clearly distinguishable peaks around
the transcription start sites.

Predicting the location of transcription start sites \emph{(TSS)} is
hampered by the existence of alternative start sites, since their
number within a transcribed region is unknown.  In addition, the
measurements contain false positive counts.  Therefore, only counts
that are significantly larger than the expected number of background
reads should be predicted as TSS.  The number of false positive reads
increases in regions of transcriptional activity, and such reads do
not map to random positions: on the one hand, they seem to occur in a
sequence-dependent manner and therefore cluster at certain genomic
positions; on the other hand, they are detected more frequently than
would be expected from real TSS alone.  Since no error model
describing such noise is currently available, the \Rpackage{TSSi}
package implements a heuristic approach for an automated and flexible
prediction of TSS.

\section{Data set}

To demonstrate the functionality and usage of this package,
experimental CAP-capture data obtained with Solexa sequencing is
used.  The reads were mapped to the genome, such that the number of
sequenced 5' ends and their positions in the genome are available%
\footnote{The \emph{sequencing workflow} at the Bioconductor website
describes basic steps and tools for the import and processing of
sequencing data
(\url{http://bioconductor.org/help/workflows/high-throughput-sequencing/}).%
}.  The data frame \Rvar{physcoCounts} contains information about the
chromosome, the strand, the 5' position, and the total number of
mapped reads.  Additionally, regions based on existing annotation are
provided, which are used here to divide the data into independent
subsets for the analysis.

<<>>=
library(TSSi)
@

<<>>=
data(physcoCounts)
head(physcoCounts)
table(physcoCounts$chromosome, physcoCounts$strand)
@
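A table of this shape is typically not produced by \Rpackage{TSSi}
itself, but derived beforehand from the mapped reads.  As a minimal
sketch, assuming hypothetical vectors \Rvar{chr}, \Rvar{str}, and
\Rvar{pos5} holding the chromosome, strand, and 5' position of each
mapped read, the tabulation could be performed in base R:

<<eval=FALSE>>=
## hypothetical input: one entry per mapped read
readInfo <- data.frame(chromosome=chr, strand=str, start=pos5)
## count the reads whose 5' ends map to the same genomic position
countTable <- aggregate(list(counts=rep(1L, nrow(readInfo))),
                        readInfo, FUN=sum)
@

A \Rvar{region} column, as used for the segmentation below, would be
added analogously from existing annotation.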
\section{Segment read data}

As a first step in the analysis, the reads are passed to the
\Rmethod{segmentizeCounts} method.  Here, the data is divided into
\emph{segments}, based on the information about the chromosomes,
strands, and regions of the reads, and the subsequent analysis is
performed for each segment independently.  The segmented data is
returned as an object of the class \Rclass{TssData}.

<<>>=
attach(physcoCounts)
x <- segmentizeCounts(counts=counts, start=start, chr=chromosome,
                      region=region, strand=strand)
detach(physcoCounts)
x
@

The segments and the associated read data are accessible through
several \Rmethod{get} methods.  Data of an individual segment can be
referred to by either its name or an index.

<<>>=
segments(x)
names(x)
@

<<>>=
head(reads(x, 3))
head(start(x, 3))
head(start(x, names(x)[3]))
@

\section{Normalization}

The normalization reduces the noise by shrinking the counts towards
zero.  This step is intended to eliminate false positive counts and
to make further analyses more robust by reducing the impact of large
counts.  Such a shrinkage or regularization procedure constitutes a
well-established strategy in statistics for making predictions
conservative, that is, for reducing the number of false positive
predictions.  To shrink isolated counts more strongly than counts in
regions of strong transcriptional activity, the dependence between
measurements at consecutive genomic positions is taken into account
by evaluating the differences between adjacent count estimates.

The computation can be performed with an approximation based on the
frequency of all reads, or fitted explicitly for each segment.  On
platforms supporting the \Rpackage{multicore} package, the fitting
can be spread over multiple processor cores in order to decrease
computation time.

<<>>=
yRatio <- normalizeCounts(x)
@

<<>>=
yFit <- normalizeCounts(x, fit=TRUE)
yFit
head(reads(yFit, 3))
@

<<fig=TRUE>>=
plot(yFit, 3)
@

\section{Identifying transcription start sites}

After normalization of the count data, an iterative algorithm is
applied for each segment to identify the TSS.  The expected number of
false positive counts is initialized with a default value given by
the read frequency in the whole data set.  The position with the
largest counts is identified as a TSS if its expected transcription
level exceeds the expected number of false positive reads by at least
one read.  The transcription levels of all TSS are calculated by
assigning each count to its nearest neighboring TSS.  Then, the
expected number of false positive reads is updated by convolution
with exponential kernels.  The decay rates \Rfunarg{tau} in 3'
direction and towards the 5' end can be chosen differently, to
account for the fact that false positive counts are preferentially
found in 5' direction of a TSS.
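As an illustration of the general form of such a kernel (a sketch for
orientation, not necessarily the exact implementation used
internally), the weight assigned to a position at signed distance $d$
from a TSS, with $d>0$ in 3' direction, decays exponentially on both
sides,
\[
k(d) \;\propto\; \left\{ \begin{array}{ll}
  e^{-d/\tau_{3'}} & \textrm{for } d \geq 0,\\
  e^{\,d/\tau_{5'}} & \textrm{for } d < 0,
\end{array} \right.
\]
with separate decay rates $\tau_{3'}$ and $\tau_{5'}$ corresponding
to the \Rfunarg{tau} parameters described above.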
This procedure is iterated as long as the set of identified TSS
grows.

<<>>=
z <- identifyStartSites(yFit)
z
head(segments(z))
head(tss(z, 3))
head(reads(z, 3))
@

<<fig=TRUE>>=
plot(z, 3)
@

\section{Visualizing and customizing figures}

The \Rmethod{plot} method allows for a simple but powerful
visualization and customization of the produced figures.  For each
element of the figure, all graphical parameters can be set by
supplying them in the form of named lists.  In the following example,
the plotting of the threshold and the ratio estimates is omitted, and
the representation of some components is adapted.  For a detailed
description of the individual settings, please refer to the
\Rmethod{plot} documentation of this package.

<<fig=TRUE>>=
plot(z, 4, ratio=FALSE, threshold=FALSE, baseline=FALSE,
     expect=TRUE, expectArgs=list(type="l"), extend=TRUE,
     countsArgs=list(type="h", col="darkgray", pch=NA),
     plotArgs=list(xlab="Genomic position",
                   main="TSS for segment 's1_-_155'"))
@

\section{Converting and exporting results}

While the get methods \Rmethod{reads}, \Rmethod{segments}, and
\Rmethod{tss} provide simple access to the relevant results, such
data can also be represented within the framework provided by the
\Rpackage{IRanges} package.  Converting the data to an object of
class \Rclass{RangedData} allows for a standard representation and an
interface to other formats, for example through the
\Rpackage{rtracklayer} package.

<<>>=
readsRd <- readsAsRangedData(z)
segmentsRd <- segmentsAsRangedData(z)
tssRd <- tssAsRangedData(z)
tssRd
@

<<eval=FALSE>>=
library(rtracklayer)
tmpFile <- tempfile()
export.gff3(readsRd, paste(tmpFile, "gff", sep="."))
export.bed(segmentsRd, paste(tmpFile, "bed", sep="."))
export.bed(tssRd, paste(tmpFile, "bed", sep="."))
@

\newpage{}

\section*{Session info}

<<results=tex, echo=FALSE>>=
toLatex(sessionInfo(), locale=FALSE)
@

\end{document}