%\VignetteIndexEntry{Upsize Your Clustering with Clusterize} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \setlength{\parindent}{1cm} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Upsize your clustering with Clusterize} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) set.seed(123) @ \tableofcontents %------------------------------------------------------------ \section{Introduction to supersized clustering} %------------------------------------------------------------ You may have found yourself in a familiar predicament for many bioinformaticians: you have a lot of sequences and you need to downs\emph{ize} before you can get going. You may also theor\emph{ize} that this must be an easy problem to solve \textemdash given sequences, output clusters. But what can you util\emph{ize} to solve this problem? This vignette will familiar\emph{ize} you with the \Rfunction{Clusterize} function in the \Rpackage{DECIPHER} package. Clusterize will revolution\emph{ize} all your clustering needs! \newline \smallskip \noindent Why \Rfunction{Clusterize}?: \begin{itemize} \item Scalability - \Rfunction{Clusterize} will linear\emph{ize} the search space so that many sequences can be clustered in a reasonable amount of time. \item Simplicity - Although you can individual\emph{ize} \Rfunction{Clusterize}, the defaults are straightforward and should meet most of your needs. \item Accuracy - \Rfunction{Clusterize} will maxim\emph{ize} your ability to extract biologically meaningful results from your sequences. \end{itemize} \noindent This vignette will summar\emph{ize} the use of \Rfunction{Clusterize} to cluster DNA, RNA, or protein sequences. %------------------------------------------------------------ \section{Getting started with Clusterize} %------------------------------------------------------------ \newlength\tindent \setlength{\tindent}{\parindent} \setlength{\parindent}{0pt} To get started we need to load the \Rpackage{DECIPHER} package, which automatically mobil\emph{ize} a few other required packages. <>= library(DECIPHER) @ There's no need to memor\emph{ize} the inputs to \Rfunction{Clusterize}, because its help page can be accessed through: \begin{Schunk} \begin{Sinput} > ? Clusterize \end{Sinput} \end{Schunk} Note that, while it's easy to fantas\emph{ize} about using \Rfunction{Clusterize}, if you only have a moderate number of \textbf{homologous} sequences (<< 100k) then it's more accurate to use \Rfunction{TreeLine} with a distance matrix created from a multiple sequence alignment. This function provides hierarchical clustering (i.e., single-linkage, UPGMA, or complete-linkage) that is impossible to critic\emph{ize} as inexact. %------------------------------------------------------------ \section{Optimize your inputs to Clusterize} %------------------------------------------------------------ \setlength{\parindent}{1cm} \Rfunction{Clusterize} requires that you first digit\emph{ize} your sequences by loading them into memory. For the purpose of this vignette, we will capital\emph{ize} on the fact that \Rpackage{DECIPHER} already includes some built-in sets of sequences. <>= # specify the path to your file of sequences: fas <- "<>" # OR use the example DNA sequences: fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER") # read the sequences into memory dna <- readDNAStringSet(fas) dna @ The \Rfunction{Clusterize} algorithm will general\emph{ize} to nucleotide or protein sequences, so we must choose which we are going to use. Here, we hypothes\emph{ize} that weaker similarities can be detected between proteins and, therefore, decide to use the translated coding (amino acid) sequences. If you wish to cluster at high similarity, you could also strateg\emph{ize} that nucleotide sequences would be better because there would be more nucleotide than amino acid differences. <>= aa <- translate(dna) aa seqs <- aa # could also cluster the nucleotides length(seqs) @ Now you can choose how to parameter\emph{ize} the function, with the main arguments being \Rfunarg{myXStringSet} and \Rfunarg{cutoff}. In this case, we will initial\emph{ize} \Rfunarg{cutoff} at \code{seq(0.5, 0, -0.1)} to cluster sequences from 50\% to 100\% similarity by 10\%'s. It is important to recogn\emph{ize} that \Rfunarg{cutoff}s can be provided in \emph{ascending} or \emph{descending} order and, when \emph{descending}, groups at each \Rfunarg{cutoff} will be nested within the previous \Rfunarg{cutoff}'s groups. We must also choose whether to custom\emph{ize} the calculation of distance. The defaults will penal\emph{ize} gaps as single events, such that each consecutive set of gaps (i.e., insertion or deletion) is considered equivalent to one mismatch. If you want to standard\emph{ize} the definition of distance to be the same as most other clustering programs then set: \Rfunarg{penalizeGapLetterMatches} to \code{TRUE} (i.e., every gap position is a mismatch), \Rfunarg{method} to \code{"shortest"}, \Rfunarg{minCoverage} to \code{0}, and \Rfunarg{includeTerminalGaps} to \code{TRUE}. It is possible to rational\emph{ize} many different measures of distance -- see the \Rfunction{DistanceMatrix} function for more information about alternative distance parameterizations. \begin{figure} \begin{center} \includegraphics[height=0.6\textheight,keepaspectratio]{ClusteringParameters} \end{center} \caption{\label{f0} The most important parameters (in \textbf{bold}) to custom\emph{ize} your use of \Rfunction{Clusterize}.} \end{figure} \clearpage We can further personal\emph{ize} the inputs as desired. The main function argument to emphas\emph{ize} is \Rfunarg{processors}, which controls whether the function is parallelized on multiple computer threads (if \Rpackage{DECIPHER}) was built with OpenMP enabled). Setting \Rfunarg{processors} to a value greater than \code{1} will speed up clustering considerably, especially for large s\emph{ize} clustering problems. Once we are ready, it's time to run \Rfunction{Clusterize} and wait for the output to material\emph{ize}! <>= clusters <- Clusterize(seqs, cutoff=seq(0.5, 0, -0.1), processors=1) class(clusters) colnames(clusters) str(clusters) apply(clusters, 2, max) # number of clusters per cutoff apply(clusters, 2, function(x) which.max(table(x))) # max sizes @ Notice that \Rfunction{Clusterize} will character\emph{ize} the clustering based on how many clustered pairs came from relatedness sorting versus rare k-mers, and \Rfunction{Clusterize} will predict the effectiveness of clustering. Depending on the input sequences, the percentage of clusters originating from relatedness sorting will equal\emph{ize} with the number originating from rare k-mers, but more commonly clusters will originate from one source or the other. The clustering effectiveness formal\emph{ize}s the concept of ``inexact'' clustering by approximating the fraction of possible sequence pairs that were correctly clustered together. You can incentiv\emph{ize} a higher clustering effectiveness by increasing \Rfunarg{maxPhase3} at the expense of (proportionally) longer run times. We can now real\emph{ize} our objective of decreasing the number of sequences. Here, we will priorit\emph{ize} keeping only the longest diverse sequences. <>= o <- order(clusters[[2]], width(seqs), decreasing=TRUE) # 40% cutoff o <- o[!duplicated(clusters[[2]])] aa[o] dna[o] @ %------------------------------------------------------------ \section{Visualize the output of Clusterize} %------------------------------------------------------------ We can scrutin\emph{ize} the clusters by selecting them and looking at their multiple sequence alignment: <>= t <- table(clusters[[1]]) # select the clusters at a cutoff t <- sort(t, decreasing=TRUE) head(t) w <- which(clusters[[1]] == names(t[1])) AlignSeqs(seqs[w], verbose=FALSE) @ It's possible to util\emph{ize} the \Rfunction{heatmap} function to view the clustering results. \begin{centerfig} <>= aligned_seqs <- AlignSeqs(seqs, verbose=FALSE) d <- DistanceMatrix(aligned_seqs, verbose=FALSE) tree <- TreeLine(myDistMatrix=d, method="UPGMA", verbose=FALSE) heatmap(as.matrix(clusters), scale="column", Colv=NA, Rowv=tree) @ \caption{\label{f1} Visualization of the clustering.} \end{centerfig} As can be seen in Figure \ref{f1}, \Rfunction{Clusterize} will organ\emph{ize} its clusters such that each new cluster is within the previous cluster when \Rfunarg{cutoff} is provided in descending order. We can also see that sequences from the same species tend to cluster together, which is an alternative way to systemat\emph{ize} sequences without clustering. \clearpage %------------------------------------------------------------ \section{Specialize clustering for your goals} %------------------------------------------------------------ The most common use of clustering is to categor\emph{ize} sequences into groups sharing similarity above a threshold and pick one representative sequence per group. These settings empitom\emph{ize} this typical user scenario: <>= c1 <- Clusterize(dna, cutoff=0.2, invertCenters=TRUE, processors=1) w <- which(c1 < 0 & !duplicated(c1)) dna[w] # select cluster representatives (negative cluster numbers) @ By default, \Rfunction{Clusterize} will cluster sequences with linkage to the representative sequence in each group, but it is also possible to tell \Rfunction{Clusterize} to minim\emph{ize} the number of clusters by establishing linkage to any sequence in the cluster (i.e., single-linkage). This is often how we conceptual\emph{ize} natural groupings and, therefore, may better match alternative classification systems such as taxonomy: <>= c2 <- Clusterize(dna, cutoff=0.2, singleLinkage=TRUE, processors=1) max(abs(c1)) # center-linkage max(c2) # single-linkage (fewer clusters, but broader clusters) @ It is possible to synthes\emph{ize} a plot showing a cross tabulation of taxonomy and cluster number. We may ideal\emph{ize} the clustering as matching taxonomic labels (\ref{f2}), but this is not exactly the case. \begin{centerfig} <>= genus <- sapply(strsplit(names(dna), " "), `[`, 1) t <- table(genus, c2[[1]]) heatmap(sqrt(t), scale="none", Rowv=NA, col=hcl.colors(100)) @ \caption{\label{f2} Another visualization of the clustering.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Resize to fit within less memory} %------------------------------------------------------------ What should you do if you have more sequences than you can cluster on your mids\emph{ize} computer? If there are far fewer clusters than sequences (e.g., \Rfunarg{cutoff} is high) then it is likely possible to res\emph{ize} the clustering problem. This is accomplished by processing the sequences in batches that miniatur\emph{ize} the memory footprint and are at least as large as the final number of clusters. The number of sequences processed per batch is critical to atom\emph{ize} the problem appropriately while limiting redundant computations. Although not ideal from a speed perspective, the results will not jeopard\emph{ize} accuracy relative to as if there was sufficient memory available to process all sequences in one batch. <>= batchSize <- 2e2 # normally a large number (e.g., 1e6 or 1e7) o <- order(width(seqs), decreasing=TRUE) # process largest to smallest c3 <- integer(length(seqs)) # cluster numbers repeat { m <- which(c3 < 0) # existing cluster representatives m <- m[!duplicated(c3[m])] # remove redundant sequences if (length(m) >= batchSize) stop("batchSize is too small") w <- head(c(m, o[c3[o] == 0L]), batchSize) if (!any(c3[w] == 0L)) { if (any(c3[-w] == 0L)) stop("batchSize is too small") break # done } m <- m[match(abs(c3[-w]), abs(c3[m]))] c3[w] <- Clusterize(seqs[w], cutoff=0.05, invertCenters=TRUE)[[1]] c3[-w] <- ifelse(is.na(c3[m]), 0L, abs(c3[m])) } table(abs(c3)) # cluster sizes @ %------------------------------------------------------------ \section{Clustering both nucleotide strands} %------------------------------------------------------------ Sometimes the input sequences are present in different orientations and it is necessary to harmon\emph{ize} the clusterings from both strands. Without trying to hyperbol\emph{ize} how easy this is to do, here's an example of clustering both strands: <>= # simulate half of strands having opposite orientation s <- sample(c(TRUE, FALSE), length(dna), replace=TRUE) dna[s] <- reverseComplement(dna[s]) # cluster both strands at the same time clus <- Clusterize(c(dna, reverseComplement(dna)), cutoff=0.2, processors=1) clus <- match(clus[[1]], clus[[1]]) # renumber clusters ascending # if needed, reorient all clustered sequences to have the same orientation strand <- clus[seq_len(length(clus)/2)] >= clus[-seq_len(length(clus)/2)] dna[strand] <- reverseComplement(dna[strand]) # renumber clusters across both strands and compare to original clustering clus <- pmin(clus[seq_len(length(clus)/2)], clus[-seq_len(length(clus)/2)]) org <- match(abs(c1[[1]]), abs(c1[[1]])) # renumber original clustering mean(clus == org) # some differences expected due to algorithm stochasticity # verify the largest cluster is now back in the same orientation dna[clus == which.max(tabulate(clus))] @ %------------------------------------------------------------ \section{Finalize your use of Clusterize} %------------------------------------------------------------ Notably, \Rfunction{Clusterize} is a stochastic algorithm, meaning it will random\emph{ize} which sequences are selected during pre-sorting. Even though the clusters will typically stabil\emph{ize} with enough iterations, you can set the random number seed (before every run) to guarantee reproducibility of the clusters: <>= set.seed(123) # initialize the random number generator clusters <- Clusterize(seqs, cutoff=0.1, processors=1) set.seed(NULL) # reset the seed @ Now you know how to util\emph{ize} \Rfunction{Clusterize} to cluster sequences. To public\emph{ize} your results for others to reproduce, make sure to provide your random number seed and version number: <>= toLatex(sessionInfo(), locale=FALSE) @ \end{document}