%\VignetteIndexEntry{Upsize your clustering with Clusterize} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \setlength{\parindent}{1cm} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Upsize your clustering with Clusterize} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, 0.3, 0.1)))) set.seed(123) @ \tableofcontents %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ You may have found yourself in a familiar predicament for many bioinformaticians: you have a lot of sequences and you need to downs\emph{ize} before you can get going. You may also theor\emph{ize} that this must be an easy problem to solve \textemdash given sequences, output clusters. But what can you util\emph{ize} to solve this problem? This vignette will familiar\emph{ize} you with the \Rfunction{Clusterize} function in the \Rpackage{DECIPHER} package. Clusterize will revolution\emph{ize} all your clustering needs! \\ Why \Rfunction{Clusterize}? \begin{itemize} \item Scalability - \Rfunction{Clusterize} will linear\emph{ize} the search space so that many sequences can be clustered in a reasonable amount of time. \item Simplicity - Although you can individual\emph{ize} \Rfunction{Clusterize}, the defaults are straightforward and should meet most of your needs. \item Accuracy - \Rfunction{Clusterize} will maxim\emph{ize} your ability to extract biologically meaningful results from your sequences. \end{itemize} This vignette will summar\emph{ize} the use of \Rfunction{Clusterize} to cluster DNA, RNA, or protein sequences. %------------------------------------------------------------ \section{Getting Started} %------------------------------------------------------------ \newlength\tindent \setlength{\tindent}{\parindent} \setlength{\parindent}{0pt} To get started we need to load the \Rpackage{DECIPHER} package, which automatically mobil\emph{ize} a few other required packages. <>= library(DECIPHER) @ There's no need to memor\emph{ize} the inputs to \Rfunction{Clusterize}, because its help page can be accessed through: \begin{Schunk} \begin{Sinput} > ? Clusterize \end{Sinput} \end{Schunk} %------------------------------------------------------------ \section{Optimize your inputs to Clusterize} %------------------------------------------------------------ \setlength{\parindent}{1cm} \Rfunction{Clusterize} requires that you first digit\emph{ize} your sequences by loading them into memory. For the purpose of this vignette, we will capital\emph{ize} on the fact that \Rpackage{DECIPHER} already includes some built-in sets of sequences. <>= # specify the path to your file of sequences: fas <- "<>" # OR use the example DNA sequences: fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER") # read the sequences into memory dna <- readDNAStringSet(fas) dna @ The \Rfunction{Clusterize} algorithm will general\emph{ize} to nucleotide or protein sequences, so we must choose which we are going to use. Here, we hypothes\emph{ize} that weaker similarities can be detected between proteins and, therefore, decide to use the translated coding (amino acid) sequences. If you wish to cluster at high similarity, you could also strateg\emph{ize} that nucleotide sequences would be better because there would be more nucleotide than amino acid substitutions. <>= aa <- translate(dna) aa seqs <- aa # could also cluster the nucleotides @ Now you can choose how to parameter\emph{ize} the function, with the only required arguments being \Rfunarg{myXStringSet} and \Rfunarg{cutoff}. In this case, we will initial\emph{ize} \Rfunarg{cutoff} at \code{seq(0.5, 0, -0.1)} to cluster sequences from 50\% to 100\% similarity by 10\%'s. It is important to recogn\emph{ize} that \Rfunarg{cutoff}s can be provided in \emph{ascending} or \emph{descending} order and, when \emph{descending}, groups at each \Rfunarg{cutoff} will be nested within the previous \Rfunarg{cutoff}'s groups. We must also choose whether to custom\emph{ize} the calculation of distance. The defaults will penal\emph{ize} gaps as single events, such that each consecutive set of gaps (i.e., insertion or deletion) is considered equivalent to one mismatch. If you want to standard\emph{ize} the definition of distance to be the same as most other clustering programs then set: \Rfunarg{penalizeGapLetterMatches} to \code{TRUE} (i.e., every gap position is a mismatch), \Rfunarg{method} to \code{"shortest"}, \Rfunarg{minCoverage} to \code{0}, and \Rfunarg{includeTerminalGaps} to \code{TRUE}. We can further personal\emph{ize} the inputs as desired. The main function argument to emphas\emph{ize} is \Rfunarg{processors}, which controls whether the function is parallelized on multiple computer threads (if \Rpackage{DECIPHER}) was built with OpenMP enabled). Setting \Rfunarg{processors} to a value greater than \code{1} will speed up clustering considerably, especially for large s\emph{ize} clustering problems. Once we are ready, it's time to run \Rfunction{Clusterize} and wait for the output to material\emph{ize}! <>= clusters <- Clusterize(seqs, cutoff=seq(0.5, 0, -0.1), processors=1) class(clusters) colnames(clusters) str(clusters) apply(clusters, 2, max) # number of clusters per cutoff @ We can now real\emph{ize} our objective of decreasing the number of sequences. Here, we will priorit\emph{ize} keeping only the longest diverse sequences. <>= o <- order(clusters[[2]], width(seqs), decreasing=TRUE) # 40\% cutoff o <- o[!duplicated(clusters[[2]])] aa[o] dna[o] @ %------------------------------------------------------------ \section{Visualize the output of Clusterize} %------------------------------------------------------------ We can scrutin\emph{ize} the clusters by selecting them and looking at their multiple sequence alignment: <>= t <- table(clusters[[1]]) # select the clusters at a cutoff t <- sort(t, decreasing=TRUE) head(t) w <- which(clusters[[1]] == names(t[1])) AlignSeqs(seqs[w], verbose=FALSE) @ It's possible to util\emph{ize} the \Rfunction{heatmap} function to view the clustering results. \begin{centerfig} <>= heatmap(as.matrix(clusters), scale="column", Colv=NA) @ \caption{\label{f1} Visualization of the clustering.} \end{centerfig} As can be seen in Figure \ref{f1}, \Rfunction{Clusterize} will organ\emph{ize} its clusters such that each new cluster is within the previous cluster when \Rfunarg{cutoff} is provided in descending order. We can also see that sequences from the same species tend to cluster together, which is another way to categor\emph{ize} sequences without clustering. \clearpage %------------------------------------------------------------ \section{Finalize your use of Clusterize} %------------------------------------------------------------ Notably, \Rfunction{Clusterize} is a stochastic algorithm, meaning it will random\emph{ize} which sequences are selected during pre-sorting. Even though the clusters will typically stabil\emph{ize} with enough iterations, you can set the random number seed to guarantee reproducibility of the clusters: <>= set.seed(123) # initialize the random number generator clusters <- Clusterize(seqs, cutoff=seq(0.5, 0, -0.1)) set.seed(NULL) # reset the seed @ Now you know how to util\emph{ize} \Rfunction{Clusterize} to cluster sequences. %------------------------------------------------------------ \section{Session Information} %------------------------------------------------------------ All of the output in this vignette was produced under the following conditions: <>= toLatex(sessionInfo(), locale=FALSE) @ \end{document}