\name{readAligned}
\alias{readAligned}
\alias{readAligned,character-method}
\title{Read aligned reads and their quality scores into R representations}
\description{
  \code{readAligned} reads all aligned read files in a directory \code{dirPath} whose file name matches \code{pattern}, returning a compact internal representation of the alignments, sequences, and quality scores in the files. Methods read all files into a single R object; a typical use is to restrict input to a single aligned read file.
}
\usage{
readAligned(dirPath, pattern=character(0), ...)
}
\arguments{
  \item{dirPath}{A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of aligned read files to be input.}
  \item{pattern}{The (\code{\link{grep}}-style) pattern describing file names to be read. The default (\code{character(0)}) results in (attempted) input of all files in the directory.}
  \item{...}{Additional arguments, used by methods. When \code{dirPath} is a character vector, the argument \code{type} must be provided. Possible values for \code{type} and their meaning are described below. Most methods implement \code{filter=srFilter()}, allowing objects of class \code{\linkS4class{SRFilter}} to selectively return aligned reads.}
}
\details{
  There is no standard aligned read file format; methods parse particular file types.

  The \code{readAligned,character-method} interprets file types based on an additional \code{type} argument. Supported types are:
  \itemize{
    \item{\code{type="SolexaExport"}}{
      This type parses \code{.*_export.txt} files following the documentation in the Solexa Genome Alignment software manual, version 0.3.0. These files consist of the following columns; consult Solexa documentation for precise descriptions.
      If parsed, values can be retrieved from \code{\linkS4class{AlignedRead}} as follows:
      \describe{
        \item{Machine}{Ignored}
        \item{Run number}{stored in \code{alignData}}
        \item{Lane}{stored in \code{alignData}}
        \item{Tile}{stored in \code{alignData}}
        \item{X}{stored in \code{alignData}}
        \item{Y}{stored in \code{alignData}}
        \item{Index string}{Ignored}
        \item{Read number}{Ignored}
        \item{Read}{\code{sread}}
        \item{Quality}{\code{quality}}
        \item{Match chromosome}{\code{chromosome}}
        \item{Match contig}{Ignored}
        \item{Match position}{\code{position}}
        \item{Match strand}{\code{strand}}
        \item{Match description}{Ignored}
        \item{Single-read alignment score}{\code{alignQuality}}
        \item{Paired-read alignment score}{Ignored}
        \item{Partner chromosome}{Ignored}
        \item{Partner contig}{Ignored}
        \item{Partner offset}{Ignored}
        \item{Partner strand}{Ignored}
        \item{Filtering}{\code{alignData}}
      }
      Paired read columns are not interpreted. The resulting \code{\linkS4class{AlignedRead}} object does \emph{not} contain a meaningful \code{id}; instead, use information from \code{alignData} to identify reads.

      Different interfaces to reading alignment files are described in \code{\linkS4class{SolexaPath}} and \code{\linkS4class{SolexaSet}}.
    }
    \item{\code{type="SolexaPrealign"}}{See SolexaRealign}
    \item{\code{type="SolexaAlign"}}{See SolexaRealign}
    \item{\code{type="SolexaRealign"}}{
      These types parse \code{s_L_TTTT_prealign.txt}, \code{s_L_TTTT_align.txt} or \code{s_L_TTTT_realign.txt} files produced by default and eland analyses. From the Solexa documentation, \code{align} corresponds to unfiltered first-pass alignments, \code{prealign} adjusts alignments for error rates (when available), and \code{realign} filters alignments to exclude clusters failing to pass quality criteria.

      Because base quality scores are not stored with alignments, the object returned by \code{readAligned} scores all base qualities as \code{-32}.
If parsed, values can be retrieved from \code{\linkS4class{AlignedRead}} as follows: \describe{ \item{Sequence}{stored in \code{sread}} \item{Best score}{stored in \code{alignQuality}} \item{Number of hits}{stored in \code{alignData}} \item{Target position}{stored in \code{position}} \item{Strand}{stored in \code{strand}} \item{Target sequence}{Ignored; parse using \code{\link{readXStringColumns}}} \item{Next best score}{stored in \code{alignData}} } } \item{\code{type="SolexaResult"}}{ This parses \code{s_L_eland_results.txt} files, an intermediate format that does not contain read or alignment quality scores. Because base quality scores are not stored with alignments, the object returned by \code{readAligned} scores all base qualities as \code{-32}. Columns of this file type can be retrieved from \code{\linkS4class{AlignedRead}} as follows (description of columns is from Table 19, Genome Analyzer Pipeline Software User Guide, Revision A, January 2008): \describe{ \item{Id}{Not parsed} \item{Sequence}{stored in \code{sread}} \item{Type of match code}{Stored in \code{alignData} as \code{matchCode}. 
          Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).}
        \item{Number of exact matches}{stored in \code{alignData} as \code{nExactMatch}}
        \item{Number of 1-error mismatches}{stored in \code{alignData} as \code{nOneMismatch}}
        \item{Number of 2-error mismatches}{stored in \code{alignData} as \code{nTwoMismatch}}
        \item{Genome file of match}{stored in \code{chromosome}}
        \item{Position}{stored in \code{position}}
        \item{Strand}{(direction of match) stored in \code{strand}}
        \item{\sQuote{N} treatment}{stored in \code{alignData}, as \code{NCharacterTreatment}. \sQuote{.} indicates treatment of \sQuote{N} was not applicable; \sQuote{D} indicates treatment as deletion; \sQuote{|} indicates treatment as insertion}
        \item{Substitution error}{stored in \code{alignData} as \code{mismatchDetailOne} and \code{mismatchDetailTwo}. Present only for unique inexact matches at one or two positions. Position and type of first substitution error, e.g., 11A represents 11 matches with 12th base an A in reference but not read. The reference manual cited below lists only one field (\code{mismatchDetailOne}), but two are present in files seen in the wild.}
      }
    }
    \item{\code{type="MAQMap", records=-1L}}{Parse binary \code{map} files produced by MAQ. See details in the next section. The \code{records} option determines how many lines are read; \code{-1L} (the default) means that all records are input.}
    \item{\code{type="MAQMapShort", records=-1L}}{The same as \code{type="MAQMap"} but for map files made with Maq prior to version 0.7.0.
      (These files use a different maximum read length [64 instead of 128], and are hence incompatible with newer Maq map files.)}
    \item{\code{type="MAQMapview"}}{
      Parse alignment files created by MAQ's \sQuote{mapview} command. Interpretation of columns is based on the description in the MAQ manual, specifically
      \preformatted{
...each line consists of read name, chromosome, position,
strand, insert size from the outer coordinates of a pair,
paired flag, mapping quality, single-end mapping quality,
alternative mapping quality, number of mismatches of the
best hit, sum of qualities of mismatched bases of the best
hit, number of 0-mismatch hits of the first 24bp, number
of 1-mismatch hits of the first 24bp on the reference,
length of the read, read sequence and its quality.
      }
      The read name, read sequence, and quality are read as \code{XStringSet} objects. Chromosome and strand are read as \code{factor}s. Position and mapping quality are both \code{numeric}. These fields are mapped to their corresponding representation in \code{AlignedRead} objects.

      Number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, and number of 1-mismatch hits of the first 24bp are represented in the \code{AlignedRead} object as components of \code{alignData}.

      Remaining fields are currently ignored.
    }
    \item{\code{type="Bowtie"}}{
      Parse alignment files created with the Bowtie alignment algorithm. Parsed columns can be retrieved from \code{\linkS4class{AlignedRead}} as follows:
      \describe{
        \item{Identifier}{\code{id}}
        \item{Strand}{\code{strand}}
        \item{Chromosome}{\code{chromosome}}
        \item{Position}{\code{position}; see comment below}
        \item{Read}{\code{sread}; see comment below}
        \item{Read quality}{\code{quality}; see comments below}
        \item{Bowtie reserved}{ignored}
        \item{Alignment mismatch locations}{\code{alignData}}
      }
      This method includes the argument \code{qualityType} to specify how quality scores are encoded.
Bowtie quality scores are \sQuote{Solexa}-like by default, with \code{qualityType='SFastqQuality'}, but can be specified as \sQuote{Phred}-like, with \code{qualityType='FastqQuality'}. Bowtie outputs positions that are 0-offset from the left-most end of the \code{+} strand. \code{ShortRead} parses position information to be 1-offset from the left-most end of the \code{+} strand. Bowtie outputs reads aligned to the \code{-} strand as their reverse complement, and reverses the quality score string of these reads. \code{ShortRead} parses these to their original sequence and orientation. } \item{\code{type="SOAP"}}{ Parse alignment files created with the SOAP alignment algorithm. Parsed columns can be retrieved from \code{\linkS4class{AlignedRead}} as follows: \describe{ \item{id}{\code{id}} \item{seq}{\code{sread}; see comment below} \item{qual}{\code{quality}; see comment below} \item{number of hits}{\code{alignData}} \item{a/b}{\code{alignData} (\code{pairedEnd})} \item{length}{\code{alignData} (\code{alignedLength})} \item{+/-}{\code{strand}} \item{chr}{\code{chromosome}} \item{location}{\code{position}; see comment below} \item{types}{\code{alignData} (\code{typeOfHit}: integer portion; \code{hitDetail}: text portion)} } This method includes the argument \code{qualityType} to specify how quality scores are encoded. It is unclear from SOAP documentation what the quality score is; the default is \sQuote{Solexa}-like, with \code{qualityType='SFastqQuality'}, but can be specified as \sQuote{Phred}-like, with \code{qualityType='FastqQuality'}. SOAP outputs positions that are 1-offset from the left-most end of the \code{+} strand. \code{ShortRead} preserves this representation. SOAP reads aligned to the \code{-} strand are reported by SOAP as their reverse complement, with the quality string of these reads reversed. \code{ShortRead} parses these to their original sequence and orientation. 
    }
  }
}
\value{
  A single R object (e.g., \code{\linkS4class{AlignedRead}}) containing alignments, sequences and qualities of all files in \code{dirPath} matching \code{pattern}. The order in which files are read is not guaranteed.
}
\seealso{
  A \code{\linkS4class{AlignedRead}} object.

  Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.

  The MAQ reference manual, \url{http://maq.sourceforge.net/maq-manpage.shtml#5}, 3 May, 2008.

  The Bowtie reference manual, \url{http://bowtie-bio.sourceforge.net}, 28 October, 2008.

  The SOAP reference manual, \url{http://soap.genomics.org.cn/soap1}, 16 December, 2008.
}
\author{
  Martin Morgan, Simon Anders (MAQ map)
}
\examples{
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
readAligned(ap, "s_2_export.txt", "SolexaExport")
## PhageAlign
readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign")
## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
readAligned(dirPath, type="MAQMapview")

## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"), strandFilter("+"))
readAligned(sp, "s_2_export.txt", filter=filt)
}
\keyword{manip}