%\VignetteIndexEntry{Searching biological sequences}
%\VignettePackage{DECIPHER}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{underscore}
\usepackage{enumerate}
\usepackage{graphics}
\usepackage{wrapfig}
\usepackage{cite}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in
\setlength{\parindent}{1cm}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{{\textsf{R}}}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Searching Biological Sequences for Research}
\author{Erik S. Wright}
\date{\today}
\maketitle

\newenvironment{centerfig}
{\begin{figure}[htp]\centering}
{\end{figure}}
\renewcommand{\indent}{\hspace*{\tindent}}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\SweaveOpts{keep.source=TRUE}

<<echo=FALSE>>=
options(continue=" ")
options(width=80)
@

\tableofcontents

%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------

Sequence searching is an essential part of biology research. The word \textit{research} even originates from a word in Old French meaning `to search'. Yet, the sheer number of biological sequences to comb through can make (re)search feel like finding a needle in a haystack. To avoid heading out on a wild goose chase, it's important to master the ins and outs of searching. The goal of this vignette is to help you leave no stone unturned as you scout out homologous sequences. Mixed metaphors aside, understanding how to properly use \Rpackage{DECIPHER}'s search functions is critical for optimizing performance. The search functions' versatility makes them highly customizable, but this also places the onus of understanding on the user. This vignette is intended to bring you up to speed on how to best apply the search functions in your research.

%------------------------------------------------------------
\section{Getting Started}
%------------------------------------------------------------

Fast sequence searching is performed by breaking the query and target sequences into k-mers. Behind the scenes, the search function has to rapidly find significant paths among the many k-mer matches shared by both sequences. This process is especially difficult in the presence of many spurious matches and repeats (Fig. \ref{f1}). The \Rfunction{IndexSeqs} and \Rfunction{SearchIndex} functions introduced in this vignette employ many techniques to increase the accuracy and speed of search, such as masking problematic regions. The examples below illustrate how to best apply these techniques to solve different search problems, including searching translated sequences, extending k-mer matches, aligning search hits, and mapping reads.
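To make the idea of k-mer decomposition concrete, here is a minimal sketch in base \R{} (purely illustrative, and not \Rpackage{DECIPHER}'s internal implementation) that enumerates the overlapping k-mers of a made-up toy sequence:

<<eval=FALSE>>=
seq <- "MSEQVENCE" # a made-up toy sequence
K <- 5 # k-mer length
# enumerate all overlapping K-mers of seq
substring(seq, 1:(nchar(seq) - K + 1), K:nchar(seq))
@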
\begin{wrapfigure}{r}{0.5\textwidth}
\includegraphics[width=0.5\textwidth]{SearchMatches}
\caption{\label{f1} K-mer matches between a query and target sequence colored by their length. The presence of repeats can be seen in rectangular blocks with many k-mer matches.}
\end{wrapfigure}

\newlength\tindent
\setlength{\tindent}{\parindent}
\setlength{\parindent}{0pt}

\subsection{Startup}

To get started we need to load the \Rpackage{DECIPHER} package, which automatically loads a few other required packages.

<<>>=
library(DECIPHER)
@

Help for a function can be accessed through:

\begin{Schunk}
\begin{Sinput}
> ? SearchIndex
\end{Sinput}
\end{Schunk}

Once \Rpackage{DECIPHER} is installed, the code in this tutorial can be obtained via:

\begin{Schunk}
\begin{Sinput}
> browseVignettes("DECIPHER")
\end{Sinput}
\end{Schunk}

\setlength{\parindent}{1cm}

\subsection{Gathering the evidence}

There are umpteen reasons to search through biological sequences. For the purposes of this vignette, we are going to focus on finding homologous proteins in a genome. In this case, our pattern (query) is the protein sequence and the subject (target) is a 6-frame translation of the genome. Feel free to follow along with your own (nucleotide or protein) sequences or use those in the vignette:

<<>>=
# specify the path to your file of pattern (query) sequences:
fas1 <- "<<path to pattern FASTA file>>"
# OR use the example protein sequences:
fas1 <- system.file("extdata",
	"PlanctobacteriaNamedGenes.fas.gz",
	package="DECIPHER")
# read the sequences into memory
pattern <- readAAStringSet(fas1)
pattern
@

Protein search is more accurate than nucleotide search, so we are going to import a genome and perform 6-frame translation to get the subject sequences. Feel free to carry on without translating the sequences if you are searching nucleotides or would otherwise prefer to skip translation. Note that \Rfunction{SearchIndex} only searches the nucleotides in the direction they are provided, so if you desire to search both strands then you will need to combine the sequences with their reverse complement, as shown below with \Rfunction{reverseComplement}.

<<>>=
# specify the path to your file of subject (target) sequences:
fas2 <- "<<path to subject FASTA file>>"
# OR use the example subject genome:
fas2 <- system.file("extdata",
	"Chlamydia_trachomatis_NC_000117.fas.gz",
	package="DECIPHER")
# read the sequences into memory
genome <- readDNAStringSet(fas2)
genome
genome <- c(genome, reverseComplement(genome)) # two strands
subject <- subseq(rep(genome, each=3), rep(1:3, 2)) # six frames
subject <- suppressWarnings(translate(subject)) # 6-frame translation
subject
@
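As an optional aside, we could label the six frames now so that subject indices are easier to interpret later. This labeling is purely illustrative and is not required by the search functions; it assumes the ordering created above (three forward frames followed by three reverse complement frames):

<<eval=FALSE>>=
# name each frame: 3 forward frames followed by 3 reverse frames
names(subject) <- paste0(rep(c("fwd", "rev"), each=3),
	"_frame", rep(1:3, 2))
@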
%------------------------------------------------------------
\section{Searching for hits between pattern and subject sequences}
%------------------------------------------------------------

Once the sequences are imported, we need to build an \Rclass{InvertedIndex} object from the \code{subject} sequences. We can accomplish this with \Rfunction{IndexSeqs} by specifying the k-mer length (\Rfunarg{K}). If you don't know what value to use for \Rfunarg{K}, then you can specify \Rfunarg{sensitivity}, \Rfunarg{percentIdentity}, and \Rfunarg{patternLength} in lieu of \Rfunarg{K}. Here, we want to ensure we find 99\% (0.99) of sequences with at least 70\% identity to a pattern with 300 or more residues. Note that \Rfunarg{sensitivity} is defined as a fraction, whereas \Rfunarg{percentIdentity} is defined as a percentage.

<<>>=
index <- IndexSeqs(subject,
	sensitivity=0.99,
	percentIdentity=70,
	patternLength=300,
	processors=1)
index
@

Printing the \code{index} shows that we created an \Rclass{InvertedIndex} object containing over 1 million 5-mers in a reduced amino acid alphabet with 12 symbols. Before we can find homologous hits to our protein sequences with \Rfunction{SearchIndex}, we must decide how many hits we desire. The default is to return the best hit (above \Rfunarg{minScore}) per subject sequence, but this may return too many hits. Instead, we could limit the number of hits by specifying a positive value for \Rfunarg{perPatternLimit} or \Rfunarg{perSubjectLimit} (where \code{0} means unlimited). Note that we do not need to specify a value for \Rfunarg{minScore}, because it is automatically set based on the size of the \Rclass{InvertedIndex}.

<<>>=
hits <- SearchIndex(pattern,
	index,
	perPatternLimit=100,
	processors=1)
dim(hits)
head(hits)
@

The result of our search is a \Rclass{data.frame} with four columns: Pattern (the index in \code{pattern}), Subject (the index in \code{subject}), Score, and Position (of k-mer matches). The Position column is optional and can be disabled by setting \Rfunarg{scoreOnly} to \code{TRUE}. We can take a closer look at the number of hits per protein, their scores, and their locations (Fig. \ref{f2}):

\begin{centerfig}
<<fig=TRUE>>=
layout(matrix(1:4, nrow=2))
hist(hits$Score,
	breaks=100, xlab="Score", main="Distribution of scores")
plot(NA, xlim=c(0, max(width(subject))), ylim=c(1, 6),
	xlab="Genome position", ylab="Genome frame",
	main="Location of k-mer matches")
segments(sapply(hits$Position, `[`, i=3), # subject start (first match)
	hits$Subject,
	sapply(hits$Position, `[`, i=4), # subject end (first match)
	hits$Subject)
plot(hits$Score,
	sapply(hits$Position, function(x) sum(x[2,] - x[1,] + 1)),
	xlab="Score", ylab="Sum of k-mer matches",
	main="Matches versus score", log="xy")
plot(table(tabulate(hits$Pattern, nbins=length(pattern))),
	xlab="Hits per pattern sequence", ylab="Frequency",
	main="Number of hits per query")
@
\caption{\label{f2} Summary of hits found between a set of proteins and the genome's 6-frame translation.}
\end{centerfig}

\clearpage

The calculated \textit{Score} for each search hit is defined as the negative log-odds of observing the hit by chance. We see that most scores were near zero, but there were many high-scoring hits. Hits tended to be clustered along specific frames of the genome, with some genome regions devoid of hits. Also, most pattern (protein) sequences were found at zero or one location in the genome. As expected, a hit's score is correlated with the length of its k-mer matches, although greater separation between matches lowers the score. One protein was found many more times than all the others. We can easily figure out which protein was found the most times:

<<>>=
w <- which.max(tabulate(hits$Pattern))
hits[hits$Pattern == w,]
names(pattern)[w]
@

Likely these hits are to multiple paralogous genes on the genome, as can be seen from the wide distribution of scores.
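For a closer look at these hits, we can summarize the distribution of their scores; a wide range is consistent with a mixture of close and more distant paralogs. This is an optional, purely illustrative check, and the exact values depend on the dataset:

<<eval=FALSE>>=
# summarize the spread of scores for the most frequently hit protein
summary(hits$Score[hits$Pattern == w])
@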
%------------------------------------------------------------
\section{Aligning the search hits between pattern and subject}
%------------------------------------------------------------

So far, we've identified the location and score of search hits without alignment. Aligning the hits would provide us with their local start and stop boundaries, percent identity, and the locations of any insertions or deletions. Thankfully, alignment is elementary once we've completed our search.

<<>>=
aligned <- AlignPairs(pattern=pattern,
	subject=subject,
	pairs=hits,
	processors=1)
head(aligned)
@

The \Rfunction{AlignPairs} function returns a \Rclass{data.frame} containing the Pattern (i.e., \code{pattern} index), Subject (i.e., \code{subject} index), their start and end positions, the number of matched and mismatched positions, the alignment length and its score, as well as the location of any gaps in the \Rfunarg{pattern} or \Rfunarg{subject}. We can use this information to calculate a percent identity, which can be defined in a couple of different ways (Fig. \ref{f3}).

\begin{centerfig}
<<fig=TRUE>>=
PID1 <- aligned$Matches/(aligned$Matches + aligned$Mismatches)
PID2 <- aligned$Matches/aligned$AlignmentLength
layout(matrix(1:4, ncol=2))
plot(hits$Score, PID2,
	xlab="Hit score", ylab="Matches / (Aligned length)")
plot(hits$Score, aligned$Score,
	xlab="Hit score", ylab="Aligned score")
plot(aligned$Score, PID1,
	xlab="Aligned score", ylab="Matches / (Matches + Mismatches)")
plot(PID1, PID2,
	xlab="Matches / (Matches + Mismatches)",
	ylab="Matches / (Aligned length)")
@
\caption{\label{f3} Scatterplots of different scores and methods of formulating percent identity.}
\end{centerfig}

\clearpage

By default, \Rfunction{AlignPairs} gives us everything we need to align the sequences except the alignments themselves. If needed, we can easily request that the alignments be included in the output.

<<>>=
alignments <- AlignPairs(pattern=pattern,
	subject=subject,
	pairs=hits,
	type="both",
	processors=1)
patterns <- alignments[[2]]
subjects <- alignments[[3]]
c(patterns[1], subjects[1]) # view the first pairwise alignment
@

%------------------------------------------------------------
\section{Calibrating an expect value (E-value) from hit scores}
%------------------------------------------------------------

\Rfunction{SearchIndex} returns hits with scores above \Rfunarg{minScore}. However, it is often useful to compute an expect value (E-value) representing the number of times we expect to see a hit scoring at least as high in a database of the same size. A lesser-known fact is that E-values are a function of the substitution matrix, \Rfunarg{gapOpening} penalty, \Rfunarg{gapExtension} penalty, and other search parameters, so E-values must be empirically determined. There are two straightforward ways to calibrate E-values: (1) create an equivalent database of random sequences with composition matched to the input, or (2) search for the reverse of the sequences under the assumption that reverse hits are unexpected (i.e., false positives). The second approach is more conservative, because we will find more hits than expected if its underlying assumption is not true. Here, we will try the second approach:

<<>>=
revhits <- SearchIndex(reverse(pattern), # reverse the query
	index, # keep the same target database
	minScore=10, # set low to get many hits
	processors=1)
dim(revhits)
@

Next, our goal is to fit the distribution of background scores (i.e., reverse hits), which is reasonably well-modeled by an exponential distribution. We will bin the reverse hits' scores into intervals of one score unit between \code{10} and \code{100}. Then we will use the fact that, for an exponential distribution with parameter \textit{rate}, the probability of a score falling between $a$ and $b$ is $e^{-rate \cdot a} - e^{-rate \cdot b}$ to estimate the expected number of background hits in each score bin.
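Concretely, if background scores follow an exponential distribution with parameter \textit{rate}, then the expected number of background hits per query falling in the score bin $[a, b)$ is approximately
\[
N \left(e^{-rate \cdot a} - e^{-rate \cdot b}\right),
\]
where $N$ is the number of subject sequences. This is the quantity computed inside the fitting function below.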
Note that there is an outlying point violating the assumption that reversed sequences should not have strong hits (Fig. \ref{f4}). We can use the \textit{sum of absolute error} (L1 norm) rather than the \textit{sum of squared error} (L2 norm) to make the fit more robust to such outliers. We will perform the fit in log-space so that points spanning many orders of magnitude are weighted comparably.

\begin{centerfig}
<<fig=TRUE>>=
X <- 10:100 # score bins
Y <- tabulate(.bincode(revhits$Score, X), length(X) - 1)
Y <- Y/length(pattern) # average per query
w <- which(Y > 0) # needed to fit in log-space
plot(X[w], Y[w],
	log="y",
	xlab="Score",
	ylab="Average false positives per query")
fit <- function(rate) # expected versus observed counts per bin
	sum(abs(log((exp(-X[w]*rate) - exp(-X[w + 1]*rate))*length(subject)) -
		log(Y[w])))
o <- optimize(fit, c(0.01, 2)) # optimize rate
lines(X[-length(X)],
	(exp(-X[-length(X)]*o$minimum) -
		exp(-(X[-1])*o$minimum))*length(subject))
rate <- o$minimum
print(rate)
@
\caption{\label{f4} Fitting an exponential distribution to the score background.}
\end{centerfig}

\clearpage

Now that we've optimized the \textit{rate} parameter, it is feasible to convert our original scores into E-values. We are interested in the number of false positive hits expected across all queries at every value of \code{Score}. This differs from the standard definition of an E-value, which is defined on a per-query basis. However, since we are performing multiple queries, it is preferable to apply a multiple testing correction for the number of searches. We can convert our original scores to E-values, as well as define score thresholds for a given number of acceptable false positives across all \code{pattern} sequences:

<<>>=
# convert each Score to an E-value
Evalue <- exp(-rate*hits$Score)*length(subject)*length(pattern)
# determine the minimum Score for up to 1 false positive hit expected
log(1/length(subject)/length(pattern))/-rate
@

As can be seen, for this particular combination of dataset and parameters, a score threshold of \code{22} is sufficient to permit only one expected false positive across all queries. Since the E-value is a function of the dataset's size and the specific search parameters, you should calibrate it for each set of searches performed. Once you have calibrated the \textit{rate}, it is straightforward to keep only those hits that are statistically significant:

<<>>=
# determine the minimum Score for 0.05 (total) false positive hits expected
threshold <- log(0.05/length(subject)/length(pattern))/-rate
hits <- hits[hits$Score > threshold,]
dim(hits)
@

%------------------------------------------------------------
\section{Maximizing search sensitivity to find distant hits}
%------------------------------------------------------------

We've already employed some strategies to improve search sensitivity: choosing a small value for the k-mer length and step size, searching amino acids rather than nucleotides, and masking low-complexity regions and repeats. Although k-mer search is very fast, sometimes k-mers alone are insufficient to find distant homologs. In these cases, search sensitivity can be improved by providing the \code{subject} (target) sequences, which causes \Rfunction{SearchIndex} to extend k-mer matches to their left and right. This is as simple as adding a single argument:

<<>>=
# include the target sequences to increase search sensitivity
hits <- SearchIndex(pattern,
	index,
	subject, # optional parameter
	perPatternLimit=100,
	processors=1)
dim(hits)
head(hits)
@

We can see that the search took longer when providing \code{subject} sequences, but the number of hits also increased.
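As a rough way to quantify this speed/sensitivity trade-off on your own data, you could time both searches. This is an optional sketch; absolute timings depend on your hardware and dataset:

<<eval=FALSE>>=
# compare elapsed time without and with extension of k-mer matches
system.time(SearchIndex(pattern, index, processors=1))
system.time(SearchIndex(pattern, index, subject, processors=1))
@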
The \Rfunarg{dropScore} parameter, which controls the degree of extension, can be adjusted to balance sensitivity and speed. In this manner, high sensitivity can be achieved by providing \code{subject} sequences in conjunction with a low value of k-mer length and \Rfunarg{dropScore}.

%------------------------------------------------------------
\section{Mapping long reads to a genome}
%------------------------------------------------------------

Read alignment is one of the most common bioinformatics tasks. The higher error rates of long-read sequencing inspired new mapping approaches with greater error tolerance than traditional short-read mapping algorithms. \Rpackage{DECIPHER}'s search functions are well suited to locating the positions of significant matches in a genome. As an example, we will try mapping error-prone long reads to a genome. There are two ways to search both strands with nucleotide sequences: (1) search the forward and reverse complement of the query (pattern) sequences with \Rfunction{SearchIndex}, or (2) build an inverted index from the forward and reverse complement of the target (subject) sequences with \Rfunction{IndexSeqs}. The first approach will usually have a lower memory footprint, but the second approach only requires a single call to \Rfunction{SearchIndex}. Long reads could overlap with repeat-rich or low-complexity regions of the genome, so we will disable masking of those regions.

<<>>=
index <- IndexSeqs(genome,
	K=11,
	maskRepeats=FALSE,
	maskLCRs=FALSE,
	processors=1)
index
@

Here, we will use a set of simulated long reads with known mapping positions on the genome. We can see that the header of each sequence contains the strand and starting location of the read on the genome.

<<>>=
fas3 <- system.file("extdata",
	"Simulated_ONT_Long_Reads.fas.gz",
	package="DECIPHER")
reads <- readDNAStringSet(fas3)
head(names(reads))
@

In this case, we are only interested in unambiguous mappings, so we will compare the top two hits for each sequence. We can subtract the score of the second best hit from the score of the top hit to approximate how well a read mapped to a single location on the genome. Also, we will disable masking of the reads in the same manner as we did for the genome above.

<<>>=
maps <- SearchIndex(reads,
	index,
	perPatternLimit=2, # two hits per read
	perSubjectLimit=0, # unlimited
	maskRepeats=FALSE,
	maskLCRs=FALSE,
	processors=1)
dim(maps)
head(maps)
# when > 1 hit, subtract the 2nd highest score from the top hit
o <- order(maps$Pattern, maps$Score, decreasing=TRUE)
w <- which(duplicated(maps$Pattern[o]))
maps$Score[o[w - 1]] <- maps$Score[o[w - 1]] - maps$Score[o[w]]
maps <- maps[-o[w],] # remove 2nd best hits
nrow(maps) == length(reads)
@

We can see that all of the simulated reads mapped to the first subject sequence, which was expected because only reads on the forward strand were simulated. Now, it is possible to compare the predicted mapping locations with the expected mapping locations. The ``Position'' column contains matrices with four rows giving the start and end positions of anchors in the pattern and subject, respectively. The start of the predicted mapping location can be calculated from the first column of each matrix and compared to the actual mapping location given in each sequence's header. For the first anchor, the predicted start of the read on the genome is the anchor's subject start minus its pattern start plus one (i.e., \code{x[3] - x[1] + 1}). Since the genome is over a million nucleotides long, any read that maps near the expected position is a positive result.
<<>>=
pos <- sapply(maps$Position, function(x) x[3] - x[1] + 1)
offset <- abs(pos - as.integer(names(reads)))
table(offset)
maps$Score[which.max(offset)]
@

All reads mapped near the correct location except one, which was far from its actual mapping location. This read had a mapping score of zero, because it mapped equally well to multiple locations on the genome. Our strategy of subtracting the second best scores worked as intended to eliminate this ambiguously mapped read. We could impose a score cutoff after subtraction to eliminate ambiguously mapped reads. In practice, we would see reads mapped to both strands of the genome, and we could use \Rfunction{AlignPairs} to obtain more information if desired.

%------------------------------------------------------------
\section{Recommended search settings}
%------------------------------------------------------------

It is best to tune function parameters for your particular search task. Most default parameters can be left alone. Shown below are recommended alternative settings for common use cases to serve as a starting point:

\begin{center}
\begin{tabular}{||c|c|c c c c||}
\hline
Function(s) & Parameter & Long read mapping & Short read mapping & Nucleotide search & Protein search \\ [0.5ex]
\hline\hline
\Rfunction{IndexSeqs} & \Rfunarg{K} & 11 (10 to 12) & 12 (11 to 13) & 9 (8 to 10) & 5 (4 to 6) \\
\hline
\Rfunction{IndexSeqs} & \Rfunarg{maskRepeats} & \code{FALSE} & \code{FALSE} & & \\
\hline
\Rfunction{IndexSeqs} & \Rfunarg{maskLCRs} & \code{FALSE} & \code{FALSE} & & \\
\hline
\Rfunction{SearchIndex} & \Rfunarg{subject} & & & supply & supply \\
\hline
\Rfunction{SearchIndex} & \Rfunarg{perSubjectLimit} & \code{2} & \code{2} & \code{1} or more & \code{1} or more \\
\hline
\Rfunction{SearchIndex} & \Rfunarg{perPatternLimit} & \code{2} & \code{2} & typically $\gg$ \code{1} & typically $\gg$ \code{1} \\
\hline
All & \Rfunarg{processors} & \code{NULL} & \code{NULL} & \code{NULL} & \code{NULL} \\
\hline
\end{tabular}
\end{center}

Note that none of these functions removes duplicate sequences. If your \Rfunarg{pattern} or \Rfunarg{subject} sequences contain many similar or identical sequences, reducing redundancy with \Rfunction{unique} or \Rfunction{Clusterize} may improve speed.

%------------------------------------------------------------
\section{Session Information}
%------------------------------------------------------------

All of the output in this vignette was produced under the following conditions:

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo(), locale=FALSE)
@

\end{document}