%\VignetteIndexEntry{Growing Phylogenetic Trees with Treeline} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\C}{{\textsf{C}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Growing phylogenetic trees with Treeline} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) @ \tableofcontents %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ This document describes how to grow phylogenetic trees using the \Rfunction{Treeline} function in the \Rpackage{DECIPHER} package. \Rfunction{Treeline} takes as input a set of aligned nucleotide or amino acid sequences and returns a phylogenetic tree (i.e., \Rclass{dendrogram} object) as output. This vignette focuses on optimizing, balanced minimum evolution (ME), maximum likelihood (ML), and maximum parsimony (MP) phylogenetic trees starting from sequences. Why is the function called \Rfunction{Treeline}? The goal of \Rfunction{Treeline} is to find the best tree according to an optimality criterion. There are often many trees near the optimum. Therefore, \Rfunction{Treeline} seeks to find a tree as close as possible to the Treeline, analogous to how trees cannot grow above the treeline on a mountain. Why use \Rfunction{Treeline} versus other programs? The \Rfunction{Treeline} function is designed to return an excellent phylogenetic tree with minimal user intervention. Many tree building programs have a large set of complex options for niche applications. In contrast, \Rfunction{Treeline} simply builds a great tree by default. This vignette is intended to get you started and introduce additional options/functions that might be useful. Treeline uses multi-start optimization followed by hill-climbing to find the highest trees on the optimality landscape. Since Treeline is a stochastic optimizer, it optimizes many trees to prevent chance from influencing the final result. With any luck it'll find the Treeline! %------------------------------------------------------------ \section{Performance Considerations} %------------------------------------------------------------ Finding an optimal tree is no easy feat. \Rfunction{Treeline} systematically optimizes hundreds of candidate trees before returning the best one. This takes time, but there are things you can do to make it go faster. \begin{itemize} \item Only use the sequences you need: \Rfunction{Treeline}'s runtime scales approximately quadratically with the number of sequences. Hence, limiting the number of sequences is a worthwhile consideration. In particular, always eliminate redundant sequences, as shown in the example below, and remove any sequences that are not necessary. \item Set a timeout: The \Rfunarg{maxTime} argument specifies the (approximate) maximum number of hours you are willing to let \Rfunction{Treeline} run. If you are concerned about the code running for too long then simply specify this argument. \item Compile with OpenMP support: Significant speed-ups can be achieved with multi-threading using OpenMP, particularly for ML and MP \Rfunarg{method}s. See the ``Getting Started DECIPHERing'' vignette for how to enable OpenMP on your computer. Then you only need to set the argument \code{processors=NULL} and \Rfunction{Treeline} will use all available processors. \item Compile for SIMD support: \Rfunction{Treeline} is configured to make use of SIMD operations, which are available on some processors. The easiest way to enable SIMD is to add a line with ``CFLAGS += -O3 -march=native'' to your $\sim$/.R/Makevars text file. Then, after recompiling, there may be an automatic speed-up on systems with SIMD support. Note that enabling SIMD makes the compiled code non-portable, so the code always needs to be compiled on the hardware being used. \item For ML, choose a model: Automatic model selection is a useful feature, but frequently this time-consuming step can be skipped. For most modestly large sets of nucleotide sequences, the \code{"GTR+G4"} model will be automatically selected. Typical amino acid sequences will tend to pick the \code{"LG+G4"} or \code{"WAG+G4"} models, unless the sequences are from a particular origin (e.g., mitochondria). Pre-selecting a subset of the available \Robject{MODELS} and supplying this as the \Rfunarg{model} argument can save considerable time. \end{itemize} %------------------------------------------------------------ \section{Growing a Phylogenetic Tree} %------------------------------------------------------------ \Rfunction{Treeline} takes as input a multiple sequence alignment and/or a distance matrix. All distance-based methods (including ME) only require specification of \code{myDistMatrix} but will generate a distance matrix using \Rfunction{DistanceMatrix} if \code{myXStringSet} is provided instead. The character-based methods (i.e., ML and MP) require a multiple sequence alignment and will generate a distance matrix to construct the first candidate tree unless one is provided. Multiple sequence alignments can be constructed from a set of (unaligned) sequences using \Rfunction{AlignSeqs} or related functions. \Rfunction{Treeline} will optimize trees for amino acid (i.e., \code{AAStringSet}) or nucleotide (i.e., \code{DNAStringSet} or \code{RNAStringSet}) sequences. Here, we are going to use a set of sequences that is included with \Rpackage{DECIPHER}. These sequences are from the internal transcribed spacer (ITS) between the 16S and 23S ribosomal RNA genes in several \textit{Streptomyces} species. <>= library(DECIPHER) # specify the path to your sequence file: fas <- "<>" # OR find the example sequence file used in this tutorial: fas <- system.file("extdata", "Streptomyces_ITS_aligned.fas", package="DECIPHER") seqs <- readDNAStringSet(fas) # use readAAStringSet for amino acid sequences seqs # the aligned sequences @ Many of these sequences are redundant or from the same genome. We can de-replicate the sequences to accelerate tree building: <>= seqs <- unique(seqs) # remove duplicated sequences ns <- gsub("^.*Streptomyces( subsp\\. | sp\\. | | sp_)([^ ]+).*$", "\\2", names(seqs)) names(seqs) <- ns # name by species (or any other preferred names) seqs <- seqs[!duplicated(ns)] # remove redundant sequences from the same species seqs @ Now, it's time to try our luck at finding the most likely tree. Here, we will set a stringent time limit (0.01 hours) to make this example faster, although longer time limits (e.g., 24 hours) are advised because setting very short time limits leaves the result partly up to luck. Here, it is necessary to choose a \Rfunarg{method} for optimizing the tree. The default \Rfunarg{method} is \code{"ME"} because it is fast and performs best on empirical datasets. For maximum parsimony, set \Rfunarg{method} to \code{"MP"} and (optionally) specify a \Rfunarg{costMatrix}. For maximum likelihood, set \Rfunarg{method} to \code{"ML"}, which requires a model of sequence evolution. Note that \Rfunction{Treeline} automatically selects the best \Rfunarg{model} according to Akaike information criterion (by default). It is possible to choose specific model(s) (e.g., \code{model="GTR+G4"}) to limit the possible selections and test your luck with fewer models. Also, since \Rfunction{Treeline} is a stochastic optimizer, it is critical to always set the random number seed for reproducibility. You can pick any lucky number, and if you ever wonder how much you pushed your luck, you can try running again from a different random number seed to see how much the result came down to luck of the draw. Note that setting a time limit, as done below with \Rfunarg{maxTime}, negates the purpose of setting a seed -- never set a time limit if reproducibility is desired or you'll have no such luck. \begin{centerfig} <>= set.seed(123) # set the random number seed tree <- Treeline(seqs, method="ML", model="GTR+G4", reconstruct=TRUE, maxTime=0.01) set.seed(NULL) # reset seed plot(tree) @ \caption{\label{f1} Maximum likelihood tree showing the relationships between \textit{Streptomyces} species.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Ancestral State Reconstruction} %------------------------------------------------------------ We're in luck \textemdash since we set \Rfunarg{reconstruct} to \code{TRUE} \Rfunction{Treeline} automatically predict states at each internal node on the tree \cite{Joy:2016}. These character states can be used by the function \Rfunction{MapCharacters} to determine state transitions along each edge of the tree. This information enables us to plot the total number of substitutions occurring along each edge. The state transitions can be accessed along each edge by querying a new ``change'' attribute. \begin{centerfig} <>= new_tree <- MapCharacters(tree, labelEdges=TRUE) plot(new_tree, edgePar=list(p.col=NA, p.border=NA, t.col="#55CC99", t.cex=0.7)) attr(new_tree[[1]], "change") # state changes on first branch left of (virtual) root @ \caption{\label{f2} Edges labeled with the number of state transitions.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Plotting Branch Support Values} %------------------------------------------------------------ Maybe it was just beginner's luck, but we already have a reasonable looking starting tree! \Rfunction{Treeline} automatically returns a variety of information about the tree that can be accessed with the \Rfunction{attributes} and \Rfunction{attr} functions: <>= #attributes(tree) # view all attributes attr(tree, "members") # number of leaves below this (root) node attr(tree, "height") # height of the node (in this case, the midpoint root) attr(tree, "state") # ancestral state reconstruction (if reconstruct=TRUE) head(attr(tree, "siteLnLs")) # LnL for every alignment column (site) attr(tree, "score") # best score (in this case, the -LnL) attr(tree, "model") # either the specified or automatically select transition model attr(tree, "parameters") # the free model parameters (or NA if unoptimized) attr(tree, "midpoint") # center of the edge (for plotting) @ The tree is (virtually) rooted at its midpoint by default. For maximum likelihood trees, all internal nodes include aBayes branch support values \cite{Anisimova:2011}. These are given as probabilities that can be used in plotting on top of each edge. We can also italicize the leaf labels (species names). \begin{centerfig} <>= plot(dendrapply(tree, function(x) { s <- attr(x, "probability") # choose "probability" (aBayes) or "support" if (!is.null(s) && !is.na(s)) { s <- formatC(as.numeric(s), digits=2, format="f") attr(x, "edgetext") <- paste(s, "\n") } attr(x, "edgePar") <- list(p.col=NA, p.border=NA, t.col="#CC55AA", t.cex=0.7) if (is.leaf(x)) attr(x, "nodePar") <- list(lab.font=3, pch=NA) x }), horiz=TRUE, yaxt='n') # add a scale bar (placed manually) arrows(0, 0, 0.4, 0, code=3, angle=90, len=0.05, xpd=TRUE) text(0.2, 0, "0.4 subs./site", pos=3, xpd=TRUE) @ \caption{\label{f3} Tree with aBayes probabilities at each internal node.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Calculating bootstrap support values} %------------------------------------------------------------ The aBayes probabilities are a good proxy for whether a partition in the tree is correct \cite{Ecker:2024}, but they are only available for maximum likelihood trees. For the other trees we need to make our own luck by bootstrapping the alignment. The idea behind bootstrapping is to resample columns (sites) of the alignment with replacement and determine whether each partition was found in the original tree. Repeating this process allows us to measure the level of support for each branch. \begin{centerfig} <>= reps <- 100 # number of bootstrap replicates tree1 <- Treeline(seqs, verbose=FALSE, processors=1L) partitions <- function(x) { if (is.leaf(x)) return(NULL) x0 <- paste(sort(unlist(x)), collapse=" ") x1 <- partitions(x[[1]]) x2 <- partitions(x[[2]]) return(list(x0, x1, x2)) } pBar <- txtProgressBar() bootstraps <- vector("list", reps) for (i in seq_len(reps)) { r <- sample(width(seqs)[1], replace=TRUE) at <- IRanges(r, width=1) seqs2 <- extractAt(seqs, at) seqs2 <- lapply(seqs2, unlist) seqs2 <- DNAStringSet(seqs2) temp <- Treeline(seqs2, verbose=FALSE) bootstraps[[i]] <- unlist(partitions(temp)) setTxtProgressBar(pBar, i/reps) } close(pBar) bootstraps <- table(unlist(bootstraps)) original <- unlist(partitions(tree1)) hits <- bootstraps[original] names(hits) <- original w <- which(is.na(hits)) if (length(w) > 0) hits[w] <- 0 hits <- round(hits/reps*100) labelEdges <- function(x) { if (is.null(attributes(x)$leaf)) { part <- paste(sort(unlist(x)), collapse=" ") attr(x, "edgetext") <- as.character(hits[part]) } return(x) } tree2 <- dendrapply(tree1, labelEdges) attr(tree2, "edgetext") <- NULL plot(tree2, edgePar=list(t.cex=0.5), nodePar=list(lab.cex=0.7, pch=NA)) @ \caption{\label{f4} Tree with bootstrap support probabilities at each internal node.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Exporting the Tree} %------------------------------------------------------------ We've had a run of good luck with this tree, so we'd better save it before our luck runs out! The functions \Rfunction{ReadDendrogram} and \Rfunction{WriteDendrogram} will import and export trees in Newick file format. If we leave the \Rfunarg{file} argument blank then it will print the output to the console for our viewing: <>= WriteDendrogram(tree, file="") @ To keep up our lucky streak, we should probably include any model parameters in the output along with the tree. Luckily, Newick format supports square brackets (i.e., ``[]'') for comments, which we can append to the end of the file for good luck: <>= params <- attr(tree, "parameters") cat("[", paste(names(params), params, sep="=", collapse=","), "]", sep="", append=TRUE, file="") @ %------------------------------------------------------------ \section{Session Information} %------------------------------------------------------------ All of the output in this vignette was produced under the following conditions: <>= toLatex(sessionInfo(), locale=FALSE) @ \begin{thebibliography}{} \bibitem{Anisimova:2011} {Anisimova, M.}, {Gil, M.}, {Dufayard, J.}, {Dessimoz, C.}, \& {Gascuel, O.} \newblock Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. \newblock {\em Syst Biol.}, 60(5), 685-699. \bibitem{Ecker:2024} {Ecker, N.}, {Huchon, D.}, {Mansour, Y.}, {Mayrose, I.}, \& {Pupko, T.} \newblock A machine-learning-based alternative to phylogenetic bootstrap. \newblock {\em Bioinformatics}, 40, i208-i217. \bibitem{Joy:2016} {Joy, J.}, {Liang, R.}, {McCloskey, R.}, {Nguyen, T.}, \& {Poon, A.} \newblock Ancestral Reconstruction. \newblock {\em PLoS Comp. Biol.}, 12(7), e1004763. \end{thebibliography} \end{document}