%\VignetteIndexEntry{An introduction to ShortRead} %\VignetteDepends{} %\VignetteKeywords{Short read, I/0, quality assessment} %\VignettePackage{ShortRead} \documentclass[]{article} \usepackage{times} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\R}{\textsf{R}} \newcommand{\ShortRead}{\Rpackage{ShortRead}} \title{An Introduction to \Rpackage{ShortRead}} \author{Martin Morgan} \date{25 July, 2008} \begin{document} \maketitle <>= options(width=60) @ <>= library(ShortRead) @ The \Rpackage{ShortRead} package aims to provide key functionality for input, quality assurance, and basic manipulation of `short read' DNA sequences such as those produced by Solexa, 454, Helicos, SOLiD, and related technologies. This vignette introduces key functionality. The package is still very much in development. Support is most fully developed for Solexa; contributions from the community are welcome. \section{A first workflow} This section walks through a simple work flow. It outlines the hierarchy of files produced by Solexa. It then illustrates a common way for reading short read data into \R{}. \subsection{\Rclass{SolexaPath}: navigating Solexa output} \Rclass{SolexaPath} provides functionality to navigate files produced by Solexa Genome Analyzer pipeline software. A typical way to start a \ShortRead{} session is to point to the root of the output file hierarchy. The \ShortRead{} package includes a very small subset of files emulating this hierarchy. The root is found at <>= exptPath <- system.file("extdata", package="ShortRead") @ %% Usually \Rcode{exptPath} would be a location on the users' file system. Key components of the hierarchy are parsed into \R{} with <>= sp <- SolexaPath(exptPath) sp @ %% \Rfunction{SolexaPath} scans the directory hierarchy to identifying useful directories. For instance, image intensity files are in the `Firecrest' directory, while summary and alignment files are in the analysis directory <>= imageAnalysisPath(sp) analysisPath(sp) @ %% Most functionality in \ShortRead{} uses \Rcode{baseCallPath} or \Rcode{analysisPath}. Solexa documentation provides details of file content. \Rfunction{SolexaPath} accepts additional arguments that allow individual file paths to be specified. Many functions for Solexa data input `know' where appropriate files are located,. Specifying \Rcode{sp} is often sufficient for identifying the desired directory path. Examples of this are illustrated below, with for instance \Rfunction{readAligned} and \Rfunction{readFastq}. Displaying an object, e.g., \Robject{sp}, provides hints at how to access information in the object, e.g., \Rfunction{analysisPath}. This is a convention in \ShortRead{}. \subsection{\Rfunction{readAligned}: reading aligned data into \R{}} Solexa \texttt{s\_N\_export.txt} files (\texttt{\_N\_} is a placeholder for the lane identifier) represent one place to start working the short read data in \R{}. These files result from running ANALYSIS eland\_extended in the Solexa Genome Analyzer. The files contain information on all reads, including alignment information for those reads successfully aligned to the genome. \ShortRead{} parses additional alignment files, including MAQ binary and text (\texttt{mapview}) files. \ShortRead{} flexibly parses many other Solexa files; aligned reads represent just one entry point. To read a single \texttt{s\_N\_export.txt} file into \R{}, for instance from lane 2, use the command <>= aln <- readAligned(sp, "s_2_export.txt") aln @ \Rfunction{readAligned} illustrates the convention used for identifying files for input into \R{} and used by \ShortRead{}. The function takes a directory path and a pattern (as a regular expression, similar to the \R{} function \Rfunction{list.files}) of file names to match in the directory. Usually, all files matching the pattern are read into a \emph{single} \R{} object; this behavior is desirable for several of the input functions in \ShortRead{}. In the present case the usual expectation is that a single \texttt{s\_N\_export.txt} file will be read into a single \R{} object, so the \Rfunarg{pattern} argument will identify a single file. Currently supported alignment files include: \begin{description} \item[SolexaExport] the Solexa `export' file format described in the Solexa pipeline version 0.3 documentation. \item[MAQMapview] the MAQ `mapview' format described at \url{http://maq.sourceforge.net/maq-manpage.shtml#5}, as viewed 3 May, 2008. \item[MAQMap] the MAQ binary `map' format described at \url{http://maq.sourceforge.net/maq-manpage.shtml#5}, as viewed 3 May, 2008. \end{description} Other parser contributions are welcome. Paired end read support is not yet available. \subsection{Exploring \ShortRead{} objects} \Robject{aln} is an object of \Sexpr{class(aln)} class. It contains short reads and their (calibrated) qualities: <>= sread(aln) quality(aln) @ The short reads are stored as a \Rclass{DNAStringSet} class. This class is defined in \Rpackage{Biostrings}. It represents DNA sequence data relatively efficiently. There are a a number of very useful methods defined for \Rclass{DNAStringSet}. Some of these methods are illustrated in this vignette. Other methods are described in the help pages and vignettes of the \Rpackage{Biostrings} package. Qualities are represented as \Sexpr{class(quality(aln))}-class objects. The qualities in the \Robject{aln} object returned by \Rfunction{readAligned} are of class \Rclass{BStringSet}. The \Rclass{BStringSet} class is also defined in \Rpackage{Biostrings}, and shares many methods with those of \Rclass{DNAStringSet}. The \Robject{aln} object contains additional information about alignments. Some of this additional information is expected from any alignment, whether generated by Solexa or other software. For example, \Robject{aln} contains the particular sequence within a target (e.g., chromosomes in a genome assembly), the position (e.g., base pair coordinate), and strand to which the alignment was made, and the quality of the alignment. The display of \Robject{aln} suggests how to access this information. For instance, the strand to which alignments are made can be extracted (as a factor with three levels; the level \Rcode{""} corresponds to unaligned reads) and tabulated using familiar \R{} functions. <>= whichStrand <- strand(aln) class(whichStrand) levels(whichStrand) table(whichStrand) @ %% This shows that about \Sexpr{format(100*sum(whichStrand=="")/ length(whichStrand))} percent of reads were not aligned (level \Rcode{""}). The \Robject{aln} object contains information in addition to that expected of all alignments. This information is accessible using \Rfunction{alignData}: <>= alignData(aln) @ %% Users familiar with the \Rclass{ExpressionSet} class in \Rpackage{Biobase} will recognize this as an \Rclass{AnnotatedDataFrame}-like object, containing a data frame with rows for each short read. The \Rclass{AlignedDataFrame} contains additional meta data about the meaning of each column. For instance, data extracted from the Solexa export file includes: <>= varMetadata(alignData(aln)) @ %% Guides to the precise meaning of this data are on the help page for the \Rclass{AlignedRead} class, and in the manufacturer manuals. Simple information about the alignments can be found by querying this object. For instance, unaligned reads have \Rcode{NA} as their position, and reads passing Solexa `filtering' (their base purity and chastity criteria) have a component of their auxiliary \Robject{alignData} set to \Rcode{"Y"}. Thus the fraction of unaligned reads and reads passing filtering are <>= mapped <- !is.na(position(aln)) filtered <- alignData(aln)[["filtering"]] =="Y" sum(!mapped) / length(aln) sum(filtered) / length(aln) @ Extracting the reads that passed filtering but were unmapped is accomplished with <>= failedAlign <- aln[filtered & !mapped] failedAlign @ %% Alternatively, we can extract just the short reads, and select the subset of those that failed filtering. <>= failedReads <- sread(aln)[filtered & !mapped] @ \subsection{Quality assessment} The \Rfunction{qa} function provides a convenient way to summarize read and alignment quality. One way of obtaining quality assessment results is <>= qaSummary <- qa(sp) @ %% The \Robject{qa} object is a list-like structure. As invoked above and currently implemented, \Rfunction{qa} visits all \texttt{s\_N\_export.txt} files in the appropriate directory. It extracts useful information from the files, and summarizes the results into a nested list-like structure. Evaluating \Rfunction{qa} for a single lane can take several minutes. For this reason a common use case is to evaluate \Rfunction{qa} and save the result to disk for later use, e.g., <>= save(qaSummary, file="/path/to/file.rda") @ %% A feature of \ShortRead{} is the use of \Rpackage{Rmpi} and coarse-grained parallel processing when available. Thus commands such as <>= library(Rmpi) mpi.spawn.Rslaves(nsl=8) qaSummary <- qa(sp) mpi.close.Rslaves() @ %% will distribute the task of processing each lane to each of the \Rpackage{Rmpi} workers. With \Rpackage{Rmpi}, all 8 lanes of a Solexa experiment are processed in the time take to process a single lane. The elements of the quality assessment list are suggested by the output: <>= qaSummary @ %% For instance, the count of reads in each lane is summarized in the \Robject{readCounts} element, and can be displayed as <>= qaSummary[["readCounts"]] qaSummary[["baseCalls"]] @ %% The \Robject{readCounts} element contains a data frame with 1 row and 3 columns (these dimensions are indicated in the parenthetical annotation of \Robject{readCounts} in the output of \Rcode{qaSummary}). The rows represent different lanes. The columns indicated the number of reads, the number of reads surviving the Solexa filtering criteria, and the number of reads aligned to the reference genome for the lane. The \Robject{baseCalls} element summarizes base calls in the unfiltered reads. Other elements of \Robject{qaSummary} are more complicated, and their interpretation is not directly obvious. Instead, a common use is to forward the results of \Rfunction{qa} to a report generator. <>= report(qaSummary, dest="/path/to/qa_report.pdf") @ %% The report includes \R{} code that can be used to understand how \Sexpr{class(qaSummary)}-class objects can be processed. \section{Using \Rpackage{ShortRead} for data exploration} \subsection{Data I/O} \ShortRead{} provides a variety of methods to read data into \R{}, in addition to \Rfunction{readAligned}. \subsubsection{\Rfunction{readXStringColumns}} \Rfunction{readXStringColumns} reads a column of DNA or other sequence-like data. For instance, the Solexa files \texttt{s\_N\_export.txt} contain lines with the following information: <>= pattern <- "s_2_export.txt" fl <- file.path(analysisPath(sp), pattern) strsplit(readLines(fl, n=1), "\t") length(readLines(fl)) @ % Column 9 is the read, and column 10 the ASCII-encoded Solexa Fastq quality score; there are 1000 lines (i.e., 1000 reads) in this sample file. Suppose the task is to read column 9 as a \Rclass{DNAStringSet} and column 10 as a \Rclass{BStringSet}. \Rclass{DNAStringSet} is a class that contains IUPAC-encoded DNA strings (IUPAC code allows for nucleotide ambiguity); \Rclass{BStringSet} is a class that contains any character with ASCII code 0 through 255. Both of these classes are defined in the \Rpackage{Biostrings} package. \Rfunction{readXStringColumns} allows us to read in columns of text as these classes. Important arguments for \Rfunction{readXStringColumns} are the \Rfunarg{dirPath} in which to look for files, the \Rfunarg{pattern} of files to parse, and the \Rfunarg{colClasses} of the columns to be parsed. The \Rfunarg{dirPath} and \Rfunarg{pattern} arguments are like \Rfunarg{list.files}. \Rfunarg{colClasses} is like the corresponding argument to \Rfunction{read.table}: it is a \Rclass{list} specifying the class of each column to be read, or \Robject{NULL} if the column is to be ignored. In our case there are 21 columns, and we would like to read in columns 9 and 10. Hence <>= colClasses <- rep(list(NULL), 21) colClasses[9:10] <- c("DNAString", "BString") names(colClasses)[9:10] <- c("read", "quality") @ % We use the class of the underlying object, \Rclass{DNAString} or \Rclass{BString}, rather than the class of the object we will create, e.g., \Rclass{DNAStringSet}. Applying names to \Robject{colClasses} is not required but makes subsequent manipulation easy. We are now ready to read our file <>= cols <- readXStringColumns(analysisPath(sp), pattern, colClasses) cols @ % The file is parsed and appropriate data objects created. A feature of \Rfunction{readXStringColumns}, and other input functions in the \Rpackage{ShortRead} package is that all files matching \Rfunarg{pattern} in the specified \Rfunarg{dirPath} will be read into a single object. This provides a convenient way to, for instance, parse all tiles in a Solexa lane into a single \Rclass{DNAStringSet} object. There are several advantages to reading columns as \Rclass{XStringSet} objects. These are more compact than the corresponding character representation: <>= object.size(cols$read) object.size(as.character(cols$read)) @ % They are read in much more quickly. And the \Rclass{DNAStringSet} and related classes are used extensively in \Rpackage{ShortRead}, \Rpackage{Biostrings}, \Rpackage{BSgenome} and other packages relevant to short read technology. \subsubsection{\Rfunction{readFastq}} \Rfunction{readXStringColumns} should be considered a `low-level' function providing easy access to columns of data. Another flexible input function is \Rfunction{readFastq}. Fastq files combine reads and their base qualities in four-line records such as the following: <>= fqpattern <- "s_1_sequence.txt" fl <- file.path(analysisPath(sp), fqpattern) readLines(fl, 4) @ % The first and third lines are an identifier (encoding the machine, run, lane, tile, x and y coordinates of the cluster that gave rise to the read, in this case). The second line is the read, and the fourth line the per-base quality. Files of this sort can be read in as <>= fq <- readFastq(sp, fqpattern) fq @ % This resulting object (of class \Sexpr{class(fq)}) contains the short reads, their qualities, and the identifiers: <>= reads <- sread(fq) qualities <- quality(fq) class(qualities) id(fq) @ % Notice that the class of the qualities is \Rclass{SFastqQuality}, to indicate that these are quality scores derived using the Solexa convention, rather than ordinary \Rclass{BStringSet} objects. The object has essential operations for convenient manipulation, for instance simultaneously forming the subset of all three components: <>= fq[1:5] @ \subsubsection{Additional input functions} \ShortRead{} includes additional functions to facilitate input. For instance, \Rfunction{readPrb} reads Solexa \texttt{\_prb.txt} files. These files contain base-specific quality information, and \Rfunction{readPrb} returns an \Rclass{SFastqQuality}-class object representing the fastq-encoded base-specific quality scores of all reads. Additional files can be parsed using standard \R{} input methods. For instance, the \texttt{s\_N\_LLLL\_int.txt} files in the \Rfunction{imageAnalysisPath} directory contain lines, one line per read, of nucleotide intensities. Each line contain lane, tile, X and Y coordinate information, followed by quadruplets of intensity values. There are as many quadruplets as there are cycles. Each quadruplet represents the intensity of the \texttt{A}, \texttt{C}, \texttt{G}, and \texttt{T} nucleotide at the corresponding cycle. Thus <>= intFile <- list.files(imageAnalysisPath(sp), "s_1_0001_int.txt", full=TRUE) strsplit(readLines(intFile, 1), "\t")[[1]][1:6] intDf <- read.table(intFile) dim(intDf) @ %% An interesting exercise is to display the intensities at cycle 2 (below) and to compare these to cycle, e.g., 30. <>= c2 <- intDf[,2*4 + 1:4] colnames(c2) <- c("A", "C", "G", "T") print(splom(c2, pch=".", cex=3)) @ \subsection{Sorting} Short reads can be sorted, or the permutation required to bring the short read into lexicographic order, using \Rfunction{srsort} and \Rfunction{srorder}. These functions are different from \Rfunction{sort} and \Rfunction{order} because the result is independent of the locale, and operate quickly on \Rclass{DNAStringSet} and \Rclass{BStringSet} objects. The function \Rfunction{srduplicated} identifies duplicate reads. This function returns a logical vector much like \Rfunction{duplicated}. The negation of the result from \Rfunction{srduplicated} is useful to create a collection of unique reads. An experimental scenario where This might be useful is when sample preparation involves PCR. In this case, replicate reads may be due to artifacts of sample preparation, rather than differential representation of sequence in the sample prior to PCR. \subsection{Summarizing read occurrence} The \Rfunction{tables} function summarizes read occurrences, for instance, <>= tbls <- tables(aln) names(tbls) tbls$top[1:5] head(tbls$distribution) @ %% The \Robject{top} component returned by \Robject{tables} is a list tallying the most commonly occurring sequences in the short reads. Knowledgeable readers will recognize the top-occurring read as a close match to one of the manufacturer adapters. The \Robject{distribution} component returned by \Robject{tables} is a data frame that summarizes how many reads (e.g., \Sexpr{tbls[["distribution"]][1,"nReads"]}) are represented exactly \Sexpr{tbls[["distribution"]][1,"nOccurrences"]} times. \subsection{Finding near matches to short sequences} Facilities exist for finding reads that are near matches to specific sequences, e.g., manufacturer adapter or primer sequences. \Rfunction{srdistance} reports the edit distance between each read and a (short!) reference sequence. To find reads close to the most common read in the example above, one might <>= dist <- srdistance(sread(aln), names(tbls$top)[1])[[1]] table(dist)[1:10] @ %% `Near' matches can be filtered from the alignment, e.g., <>= alnSubset <- aln[dist>4] @ A different strategy can be used to tally or eliminate reads that are predominantly of a single nucleotide. \Rfunction{alphabetFrequency} calculates the frequency of each nucleotide (in DNA strings) or letter (for other string sets) in each read. Thus one could identify and eliminate reads with more than 30 A nucleotides with <>= countA <- alphabetFrequency(sread(aln))[,"A"] alnNoPolyA <- aln[countA < 30] @ %% \Rfunction{alphabetFrequency}, which simply counts nucleotides, is much faster than \Rfunction{srdistance}, which performs full pairwise alignemnt of each read to the subject. Users wanting to use \R{} for whole-genome alignments or more flexible pairwise aligment are encouraged to investigate the \Rpackage{Biostrings} package, especially the \Rclass{PDict} class and \Rfunction{matchPDict} and \Rfunction{pairwiseAlignment} functions. \subsection{\Rfunction{pileup}} \Rfunction{pileup} provides a convenient way to summarize where reads align on reference sequences. \Rfunction{pileup} uses alignments obtained from other sources, e.g., \Rfunction{readAligned} or the \Robject{PDict} family of functions in \Rpackage{Biostrings}. \section{Advance features} \subsection{The \Rfunarg{pattern} argument to input functions} Most \ShortRead{} input functions are designed to accept a directory path argument, and a \Rfunarg{pattern} argument. The latter is a grep-like pattern (as used by, e.g., \Rfunction{list.files}). Many input functions are implemented so that all files matching the pattern are read into a single large input object. Thus the \texttt{s\_N\_LLLL\_seq.txt} files consist of four numeric columns and a fifth column corresponding to the short read. The following code illustrates the file structure and inputs the final column as a \Rclass{DNAStringSet}: <>= seqFls <- list.files(baseCallPath(sp), "_seq.txt", full=TRUE) strsplit(readLines(seqFls[[1]], 1), "\t") colClasses <- c(rep(list(NULL), 4), "DNAString") reads <- readXStringColumns(baseCallPath(sp), "s_1_0001_seq.txt", colClasses=colClasses) @ %% The more general pattern <>= reads <- readXStringColumns(baseCallPath(sp), "s_1_.*_seq.txt", colClasses=colClasses) @ %% inputs all lane 1 tile files into a single \Rclass{DNAStringSet} object. \subsection{\Rfunction{srapply}} Solexa and other short read technologies often include many files, e.g., 1 \texttt{s\_L\_NNNN\_int.txt} file per tile, 300 tiles per lane, 8 lanes per flow cell for 2400 \texttt{s\_L\_NNNN\_int.txt} files per flow cell. A natural way to extract information from these is to write short functions, e.g., to find the average intensity per base at cycle 12. <>= calcInt <- function(filename, cycle, verbose=FALSE) { if (verbose) cat("calcInt", filename, cycle, "\n") c12 <- read.table(filename)[,cycle*4 + 1:4] colnames(c12) <- c("A", "C", "G", "T") colMeans(c12) } @ %% One way to apply this function to all intensity files in a Solexa run is <>= intFls <- list.files(imageAnalysisPath(sp), ".*_int.txt", full=TRUE) lres <- lapply(intFls, calcInt, cycle=12) @ %% The files are generally large and numerous, so even simple calculations consume significant computational resources. The \Rfunction{srapply} function is meant to provide a transparent way to perform calculations like this distributed over multiple nodes of an MPI cluster. Thus <>= srres <- srapply(intFls, calcInt, cycle=12) identical(lres, srres) @ %% evaluates the function as \Rfunction{lapply}, whereas <>= library(Rmpi) mpi.spawn.Rslaves(nsl=16) srres <- srapply(intFls, calcInt, cycle=12) mpi.close.Rslaves() @ %% distributes the calculation over available workers, resulting in a speedup approximately inversely proportional to the number of available compute nodes. \section{Conclusions and directions for development} \ShortRead{} provides tools for reading, manipulation, and quality assessment of short read data. Current facilities in \ShortRead{} emphasize processing of single-end Solexa data. Development priorities in the near term include expanded facilities for importing key file types from additional manufactures, more extensive quality assessment methodologies, and development of infrastructure for paired-end reads. \end{document}