%\VignetteIndexEntry{Growing Phylogenetic Trees with TreeLine} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\C}{{\textsf{C}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Growing phylogenetic trees with TreeLine} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) @ \tableofcontents %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ This document describes how to grow phylogenetic trees using the \Rfunction{TreeLine} function in the \Rpackage{DECIPHER} package. \Rfunction{TreeLine} takes as input a set of aligned nucleotide or amino acid sequences and returns a phylogenetic tree (i.e., \Rclass{dendrogram} object) as output. This vignette focuses on building maximum likelihood (ML) and maximum parsimony (MP) phylogenetic trees starting from sequences, but \Rfunction{TreeLine} can also be used to build additive trees from a distance matrix. Why is the function called \Rfunction{TreeLine}? The goal of \Rfunction{TreeLine} is to find the most likely/parsimonious tree for a given sequence alignment. There are often many trees with nearly maximal likelihood/parsimony. Therefore, \Rfunction{TreeLine} seeks to find a tree as close as possible to the treeline, analogous to how no trees can grow above the treeline on a mountain. Why use \Rfunction{TreeLine} versus other programs? The \Rfunction{TreeLine} function is designed to return an excellent phylogenetic tree with minimal user intervention. Many tree building programs have a large set of complex options for niche applications. In contrast, \Rfunction{TreeLine} simply builds a great tree when relying on its defaults. This vignette is intended to get you started and introduce additional options/functions that might be useful. TreeLine uses multi-start optimization followed by hill-climbing to find the highest trees on the likelihood or parsimony landscapes. Since TreeLine is a stochastic optimizer, it optimizes many trees to prevent luck from influencing the final result. The main difference from most other approaches to tree optimization is that TreeLine heavily uses past trees to generate new trees as it optimizes. With any luck it'll find the treeline! %------------------------------------------------------------ \section{Performance Considerations} %------------------------------------------------------------ Finding a tree with very high likelihood/parsimony is no easy feat. \Rfunction{TreeLine} systematically optimizes hundreds to thousands of candidate trees before returning the best one. This takes time, but there are things you can do to make it go faster. \begin{itemize} \item Only use the sequences you need: \Rfunction{TreeLine} scales approximately quadratically with the number of sequences. Hence, limiting the number of sequences is a worthwhile consideration. In particular, always eliminate redundant sequences, as shown below, and remove any sequences that are not necessary. This concern is shared for all tree building programs, and \Rfunction{TreeLine} is no exception. \item Choose a model: Automatic model selection is a useful feature, but frequently this time-consuming step can be skipped. For most modestly large sets of nucleotide sequences, the \code{"GTR+G4"} model will be automatically selected. Typical amino acid sequences will tend to pick the \code{"LG+G4"} or \code{"WAG+G4"} models, unless the sequences are from a particular origin (e.g., mitochondria). Pre-selecting a subset of the available \Robject{MODELS} and supplying this as the \Rfunarg{model} argument can save considerable time. \item Set a timeout: The \code{maxTime} argument specifies the (approximate) maximum number of hours you are willing to let \Rfunction{TreeLine} run. If you are concerned about the code running too long then simply specify this argument. \item Compile with OpenMP support: Significant speed-ups can be achieved with multi-threading using OpenMP. See the ``Getting Started DECIPHERing'' vignette for how to do this on your computer platform. Then you only need to set the argument \code{processors=NULL} and \Rfunction{TreeLine} will use all available processors. \item Compile for SIMD support: \Rfunction{TreeLine} is configured to make use of SIMD operations, which are available on some processors. The easiest way to enable SIMD is to add `` -O3 -march=native'' to the end of \code{PKG_CFLAGS} in the ``DECIPHER/src/MAKEVARS'' text file. Then, after recompiling, there may be an automatic speed-up on systems with SIMD support. Note that enabling SIMD makes the compiled code non-portable, so the code always needs to be compiled on the hardware being used. \end{itemize} %------------------------------------------------------------ \section{Growing a Phylogenetic Tree} %------------------------------------------------------------ \Rfunction{TreeLine} takes as input a multiple sequence alignment when constructing a maximum likelihood or maximum parsimony phylogenetic tree. Multiple sequence alignments can be constructed from a set of (unaligned) sequences using \Rfunction{AlignSeqs} or related functions. \Rfunction{TreeLine} will optimize trees for amino acid (i.e., \code{AAStringSet}) or nucleotide (i.e., \code{DNAStringSet} or \code{RNAStringSet}) sequences. Here, we are going to use a set of sequences that is included with \Rpackage{DECIPHER}. These sequences are from the internal transcribed spacer (ITS) between the 16S and 23S ribosomal RNA genes in several \textit{Streptomyces} species. <>= library(DECIPHER) # specify the path to your sequence file: fas <- "<>" # OR find the example sequence file used in this tutorial: fas <- system.file("extdata", "Streptomyces_ITS_aligned.fas", package="DECIPHER") seqs <- readDNAStringSet(fas) # use readAAStringSet for amino acid sequences seqs # the aligned sequences @ Many of these sequences are redundant or from the same genome. We can de-replicate the sequences to accelerate tree building: <>= seqs <- unique(seqs) # remove duplicated sequences ns <- gsub("^.*Streptomyces( subsp\\. | sp\\. | | sp_)([^ ]+).*$", "\\2", names(seqs)) names(seqs) <- ns # name by species (or any other preferred names) seqs <- seqs[!duplicated(ns)] # remove redundant sequences from the same species seqs @ Now, it's time to try our luck at finding the most likely tree. Here, we will set a stringent time limit (0.01 hours) to make this example faster, although longer time limits (e.g., 24 hours) are advised because setting very short time limits leaves the result partly up to luck. Note that \Rfunction{TreeLine} automatically selects a substitution model based on Akaike information criterion (by default). It is possible to specify specific model(s) (e.g., \code{model="GTR+G4"}) to limit the possible selections and test your luck with fewer models. Also, since \Rfunction{TreeLine} is a stochastic optimizer, it is critical to always set the random number seed for reproducibility. You can pick any lucky number, and if you ever wonder how much you pushed your luck, you can try running again from a different random number seed to see how much the result came down to luck of the draw. Note that setting a time limit, as done below with \Rfunarg{maxTime}, negates the purpose of setting a seed -- never set a time limit if reproducibility is desired or you'll have no such luck. \begin{centerfig} <>= set.seed(123) # set the random number seed tree <- TreeLine(seqs, method="ML", model="GTR+G4", reconstruct=TRUE, maxTime=0.01) set.seed(NULL) # reset seed plot(tree) @ \caption{\label{f1} Maximum likelihood tree showing the relationships between \textit{Streptomyces} species.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Plotting Branch Support Values} %------------------------------------------------------------ Maybe it was just beginner's luck, but we already have a reasonable looking starting tree! \Rfunction{TreeLine} automatically returns a variety of information about the tree that can be accessed with the \Rfunction{attributes} and \Rfunction{attr} functions: <>= #attributes(tree) # view all attributes attr(tree, "members") # number of leaves below this (root) node attr(tree, "height") # height of the node (in this case, the midpoint root) attr(tree, "state") # ancestral state reconstruction (if reconstruct=TRUE) attr(tree, "siteLnLs") # LnL for every alignment column (site) attr(tree, "score") # best score (in this case, the -LnL) attr(tree, "model") # either the specified or automatically select transition model attr(tree, "parameters") # the free model parameters (or NA if unoptimized) attr(tree, "midpoint") # center of the edge (for plotting) @ The tree is (virtually) rooted at its midpoint by default. For maximum likelihood trees, all internal nodes include aBayes branch support values \cite{Anisimova:2011}. These are given as probabilities that can be used in plotting on top of each edge. We can also italicize the leaf labels (species names). \begin{centerfig} <>= plot(dendrapply(tree, function(x) { s <- attr(x, "probability") # choose "probability" (aBayes) or "support" if (!is.null(s) && !is.na(s)) { s <- formatC(as.numeric(s), digits=2, format="f") attr(x, "edgetext") <- paste(s, "\n") } attr(x, "edgePar") <- list(p.col=NA, p.lwd=1e-5, t.col="#CC55AA", t.cex=0.7) if (is.leaf(x)) attr(x, "nodePar") <- list(lab.font=3, pch=NA) x }), horiz=TRUE, yaxt='n') # add a scale bar (placed manually) arrows(0, 0, 0.4, 0, code=3, angle=90, len=0.05, xpd=TRUE) text(0.2, 0, "0.4 subs./site", pos=3, xpd=TRUE) @ \caption{\label{f2} Tree with (aBayes) support probabilities at each internal node.} \end{centerfig} \clearpage We lucked out because maximum likelihood and maximum parsimony trees both provide branch supports in the form of the fraction of optimized trees that contained a given partition (branch). These are accessible from the ``support'' attribute. As luck would have it, support values and (aBayes) probabilities are correlated, but support tends to be more conservative. \begin{centerfig} <>= getSupports <- function(x) { if (is.leaf(x)) { NULL } else if (is.null(attr(x, "support"))) { rbind(getSupports(x[[1]]), getSupports(x[[2]])) } else { rbind(cbind(attr(x, "support"), attr(x, "probability"), attr(x, "members")), getSupports(x[[1]]), getSupports(x[[2]])) } } support <- getSupports(tree) plot(support[, 1], support[, 2], cex=log10(support[, 3]), xlab="Support", ylab="aBayes probability", asp=1) abline(a=0, b=1, lty=2) # line of identity (y=x) @ \caption{\label{f3} Comparison of aBayes probabilities and branch support values.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Ancestral State Reconstruction} %------------------------------------------------------------ We're in luck \textemdash one of the advantages of maximum likelihood and maximum parsimony tree building methods is that they automatically predict states at each internal node on the tree \cite{Joy:2016}. This feature is enabled when \Rfunarg{reconstruct} is set to \code{TRUE}. These character states can be used by the function \Rfunction{MapCharacters} to determine state transitions along each edge of the tree. This information enables us to plot the total number of substitutions occurring along each edge. The state transitions can be accessed along each edge by querying a new ``change'' attribute. \begin{centerfig} <>= new_tree <- MapCharacters(tree, labelEdges=TRUE) plot(new_tree, edgePar=list(p.col=NA, p.lwd=1e-5, t.col="#55CC99", t.cex=0.7)) attr(new_tree[[1]], "change") # state changes on first branch left of (virtual) root @ \caption{\label{f4} Edges labeled with the number of state transitions.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Exporting the Tree} %------------------------------------------------------------ We've had a run of good luck with this tree, so we'd better save it before our luck runs out! The functions \Rfunction{ReadDendrogram} and \Rfunction{WriteDendrogram} will import and export trees in Newick file format. If we leave the \Rfunarg{file} argument blank then it will print the output to the console for our viewing: <>= WriteDendrogram(tree, file="") @ To keep up our lucky streak, we should probably include any model parameters in the output along with the tree. Luckily, Newick format supports square brackets (i.e., ``[]'') for comments, which we can append to the end of the file for good luck: <>= params <- attr(tree, "parameters") cat("[", paste(names(params), params, sep="=", collapse=","), "]", sep="", append=TRUE, file="") @ %------------------------------------------------------------ \section{Session Information} %------------------------------------------------------------ All of the output in this vignette was produced under the following conditions: <>= toLatex(sessionInfo(), locale=FALSE) @ \begin{thebibliography}{} \bibitem{Anisimova:2011} {Anisimova, M.}, {Gil, M.}, {Dufayard, J.}, {Dessimoz, C.}, \& {Gascuel, O.} \newblock Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. \newblock {\em Syst Biol.}, 60(5), 685-699. \bibitem{Joy:2016} {Joy, J.}, {Liang, R.}, {McCloskey, R.}, {Nguyen, T.}, \& {Poon, A.} \newblock Ancestral Reconstruction. \newblock {\em PLoS Comp. Biol.}, 12(7), e1004763. \end{thebibliography} \end{document}