%\VignetteIndexEntry{Upsize Your Clustering with Clusterize} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \setlength{\parindent}{1cm} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Upsize your clustering with Clusterize} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, 0.3, 0.1)))) set.seed(123) @ \tableofcontents %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ You may have found yourself in a familiar predicament for many bioinformaticians: you have a lot of sequences and you need to downs\emph{ize} before you can get going. You may also theor\emph{ize} that this must be an easy problem to solve \textemdash given sequences, output clusters. But what can you util\emph{ize} to solve this problem? This vignette will familiar\emph{ize} you with the \Rfunction{Clusterize} function in the \Rpackage{DECIPHER} package. Clusterize will revolution\emph{ize} all your clustering needs! \\ Why \Rfunction{Clusterize}? \begin{itemize} \item Scalability - \Rfunction{Clusterize} will linear\emph{ize} the search space so that many sequences can be clustered in a reasonable amount of time. \item Simplicity - Although you can individual\emph{ize} \Rfunction{Clusterize}, the defaults are straightforward and should meet most of your needs. \item Accuracy - \Rfunction{Clusterize} will maxim\emph{ize} your ability to extract biologically meaningful results from your sequences. \end{itemize} This vignette will summar\emph{ize} the use of \Rfunction{Clusterize} to cluster DNA, RNA, or protein sequences. %------------------------------------------------------------ \section{Getting Started} %------------------------------------------------------------ \newlength\tindent \setlength{\tindent}{\parindent} \setlength{\parindent}{0pt} To get started we need to load the \Rpackage{DECIPHER} package, which automatically mobil\emph{ize} a few other required packages. <>= library(DECIPHER) @ There's no need to memor\emph{ize} the inputs to \Rfunction{Clusterize}, because its help page can be accessed through: \begin{Schunk} \begin{Sinput} > ? Clusterize \end{Sinput} \end{Schunk} %------------------------------------------------------------ \section{Optimize your inputs to Clusterize} %------------------------------------------------------------ \setlength{\parindent}{1cm} \Rfunction{Clusterize} requires that you first digit\emph{ize} your sequences by loading them into memory. For the purpose of this vignette, we will capital\emph{ize} on the fact that \Rpackage{DECIPHER} already includes some built-in sets of sequences. <>= # specify the path to your file of sequences: fas <- "<>" # OR use the example DNA sequences: fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER") # read the sequences into memory dna <- readDNAStringSet(fas) dna @ The \Rfunction{Clusterize} algorithm will general\emph{ize} to nucleotide or protein sequences, so we must choose which we are going to use. Here, we hypothes\emph{ize} that weaker similarities can be detected between proteins and, therefore, decide to use the translated coding (amino acid) sequences. If you wish to cluster at high similarity, you could also strateg\emph{ize} that nucleotide sequences would be better because there would be more nucleotide than amino acid differences. <>= aa <- translate(dna) aa seqs <- aa # could also cluster the nucleotides @ Now you can choose how to parameter\emph{ize} the function, with the main arguments being \Rfunarg{myXStringSet} and \Rfunarg{cutoff}. In this case, we will initial\emph{ize} \Rfunarg{cutoff} at \code{seq(0.5, 0, -0.1)} to cluster sequences from 50\% to 100\% similarity by 10\%'s. It is important to recogn\emph{ize} that \Rfunarg{cutoff}s can be provided in \emph{ascending} or \emph{descending} order and, when \emph{descending}, groups at each \Rfunarg{cutoff} will be nested within the previous \Rfunarg{cutoff}'s groups. We must also choose whether to custom\emph{ize} the calculation of distance. The defaults will penal\emph{ize} gaps as single events, such that each consecutive set of gaps (i.e., insertion or deletion) is considered equivalent to one mismatch. If you want to standard\emph{ize} the definition of distance to be the same as most other clustering programs then set: \Rfunarg{penalizeGapLetterMatches} to \code{TRUE} (i.e., every gap position is a mismatch), \Rfunarg{method} to \code{"shortest"}, \Rfunarg{minCoverage} to \code{0}, and \Rfunarg{includeTerminalGaps} to \code{TRUE}. It is possible to rational\emph{ize} many different measures of distance -- see the \Rfunction{DistanceMatrix} function for more information about alternative parameterizations. We can further personal\emph{ize} the inputs as desired. The main function argument to emphas\emph{ize} is \Rfunarg{processors}, which controls whether the function is parallelized on multiple computer threads (if \Rpackage{DECIPHER}) was built with OpenMP enabled). Setting \Rfunarg{processors} to a value greater than \code{1} will speed up clustering considerably, especially for large s\emph{ize} clustering problems. Once we are ready, it's time to run \Rfunction{Clusterize} and wait for the output to material\emph{ize}! <>= clusters <- Clusterize(seqs, cutoff=seq(0.5, 0, -0.1), processors=1) class(clusters) colnames(clusters) str(clusters) apply(clusters, 2, max) # number of clusters per cutoff @ Notice that \Rfunction{Clusterize} will character\emph{ize} the clustering based on how many clustered pairs came from relatedness sorting versus rare k-mers, and \Rfunction{Clusterize} will predict the effectiveness of clustering. Depending on the input sequences, the percentage of clusters originating from relatedness sorting will equal\emph{ize} with the number originating from rare k-mers, but more commonly clusters will originate from one source or the other. The clustering effectiveness formal\emph{ize}s the concept of ``inexact'' clustering by approximating the fraction of possible sequence pairs that were correctly clustered together. You can incentiv\emph{ize} a higher clustering effectiveness by increasing \Rfunarg{maxPhase3} at the expense of (proportionally) longer run times. We can now real\emph{ize} our objective of decreasing the number of sequences. Here, we will priorit\emph{ize} keeping only the longest diverse sequences. <>= o <- order(clusters[[2]], width(seqs), decreasing=TRUE) # 40% cutoff o <- o[!duplicated(clusters[[2]])] aa[o] dna[o] @ %------------------------------------------------------------ \section{Visualize the output of Clusterize} %------------------------------------------------------------ We can scrutin\emph{ize} the clusters by selecting them and looking at their multiple sequence alignment: <>= t <- table(clusters[[1]]) # select the clusters at a cutoff t <- sort(t, decreasing=TRUE) head(t) w <- which(clusters[[1]] == names(t[1])) AlignSeqs(seqs[w], verbose=FALSE) @ It's possible to util\emph{ize} the \Rfunction{heatmap} function to view the clustering results. \begin{centerfig} <>= aligned_seqs <- AlignSeqs(seqs, verbose=FALSE) d <- DistanceMatrix(aligned_seqs, verbose=FALSE) tree <- TreeLine(myDistMatrix=d, method="UPGMA", verbose=FALSE) heatmap(as.matrix(clusters), scale="column", Colv=NA, Rowv=tree) @ \caption{\label{f1} Visualization of the clustering.} \end{centerfig} As can be seen in Figure \ref{f1}, \Rfunction{Clusterize} will organ\emph{ize} its clusters such that each new cluster is within the previous cluster when \Rfunarg{cutoff} is provided in descending order. We can also see that sequences from the same species tend to cluster together, which is an alternative way to systemat\emph{ize} sequences without clustering. \clearpage %------------------------------------------------------------ \section{Specialize clustering for your goals} %------------------------------------------------------------ The most common use of clustering is to categor\emph{ize} sequences into groups sharing similarity above a threshold and pick one representative sequence per group. These settings empitom\emph{ize} this typical user scenario: <>= c1 <- Clusterize(dna, cutoff=0.2, invertCenters=TRUE, processors=1) w <- which(c1 < 0 & !duplicated(c1)) dna[w] # select cluster representatives (negative cluster numbers) @ By default, \Rfunction{Clusterize} will cluster sequences with linkage to the representative sequence in each group, but it is also possible to tell \Rfunction{Clusterize} to minim\emph{ize} the number of clusters by establishing linkage to any sequence in the cluster (i.e., single-linkage): <>= c2 <- Clusterize(dna, cutoff=0.2, singleLinkage=TRUE, processors=1) max(abs(c1)) # center-linkage max(c2) # single-linkage (fewer clusters, but broader clusters) @ It is possible to synthes\emph{ize} a plot showing a cross tabulation of taxonomy and cluster number. We may ideal\emph{ize} the clustering as matching taxonomic labels (\ref{f2}), but this is not exactly the case. \begin{centerfig} <>= genus <- sapply(strsplit(names(dna), " "), `[`, 1) t <- table(genus, c2[[1]]) heatmap(sqrt(t), scale="none", Rowv=NA, col=hcl.colors(100)) @ \caption{\label{f2} Another visualization of the clustering.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Finalize your use of Clusterize} %------------------------------------------------------------ Notably, \Rfunction{Clusterize} is a stochastic algorithm, meaning it will random\emph{ize} which sequences are selected during pre-sorting. Even though the clusters will typically stabil\emph{ize} with enough iterations, you can set the random number seed (before every run) to guarantee reproducibility of the clusters: <>= set.seed(123) # initialize the random number generator clusters <- Clusterize(seqs, cutoff=0.1, processors=1) set.seed(NULL) # reset the seed @ Now you know how to util\emph{ize} \Rfunction{Clusterize} to cluster sequences. %------------------------------------------------------------ \section{Session Information} %------------------------------------------------------------ All of the output in this vignette was produced under the following conditions: <>= toLatex(sessionInfo(), locale=FALSE) @ \end{document}