--- title: "Importing trees with data" author: "Guangchuang Yu\\ School of Public Health, The University of Hong Kong" date: "`r Sys.Date()`" bibliography: treeio.bib biblio-style: apalike output: prettydoc::html_pretty: toc: true theme: cayman highlight: github pdf_document: toc: true vignette: > %\VignetteIndexEntry{01 Importing trees with data} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} %\VignetteEncoding{UTF-8} --- ```{r style, echo=FALSE, results="asis", message=FALSE} knitr::opts_chunk$set(tidy = FALSE, message = FALSE) ``` ```{r echo=FALSE, results="hide", message=FALSE} library(tidyr) library(dplyr) library(tidytree) library(ggplot2) library("treeio") CRANpkg <- function (pkg) { cran <- "https://CRAN.R-project.org/package" fmt <- "[%s](%s=%s)" sprintf(fmt, pkg, cran, pkg) } Biocpkg <- function (pkg) { sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg) } ``` # Introduction Phylogenetic trees are commonly used to present evolutionary relationships of species. Information associated with taxon species/strains may be further analyzed in the context of the evolutionary history depicted by the phylogenetic tree. For example, host information of the influenza virus strains in the tree could be studied to understand host range of a virus linage. Moreover, such meta-data (*e.g.*, isolation host, time, location, *etc.*) directly associated with taxon strains are also often subjected to further evolutionary or comparative phylogenetic models and analyses, to infer their dynamics associated with the evolutionary or transmission processes of the virus. All these meta-data or other phenotypic or experimental data are stored either as the annotation data associated with the nodes or branches, and are often produced in inconsistent format by different analysis programs. Getting trees in to R is still limited. Newick and Nexus can be imported by several packages, including `r CRANpkg("ape")`, `r CRANpkg("phylobase")`. NeXML format can be parsed by `r CRANpkg("RNeXML")`. However, analysis results from widely used software packages in this field are not well supported. SIMMAP output can be parsed by `r CRANpkg("phyext2")` and `r CRANpkg("phytools")`. Although [PHYLOCH](http://www.christophheibl.de/Rpackages.html) can import BEAST and MrBayes output, only internal node attributes were parsed and tip attributes were ignore. Many other software outputs are mainly required programming expertise to import the tree with associated data. Linking external data, including experimental and clinical data, to phylogeny is another obstacle for evolution biologists. The [treeio](https://bioconductor.org/packages/treeio/) package defines base classes and functions for phylogenetic tree input and output. It is an infrastructure that enables evolutionary evidences that inferred by commonly used software packages to be used in `R`. For instance, *d~N~/d~S~* values or ancestral sequences inferred by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) [@yang_paml_2007], clade support values (posterior) inferred by [BEAST](http://beast2.org/) [@bouckaert_beast_2014] and short read placement by [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) [@berger_EPA_2011] and [pplacer](http://matsen.fhcrc.org/pplacer/) [@matsen_pplacer_2010]. These evolutionary evidences can be further analyzed in `R` and used to annotate phylogenetic tree using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]. The growth of analysis tools and models available introduces a challenge to integrate different varieties of data and analysis results from different sources for an integral analysis on the the same phylogenetic tree background. The [treeio](https://bioconductor.org/packages/treeio/) package provides a `merge_tree` function to allow combining tree data obtained from different sources. In addition, [treeio](https://bioconductor.org/packages/treeio/) also enables external data to be linked to phylogenetic tree structure. # Getting tree data from evolutionary analysis result To fill the gap that most of the tree formats or software outputs cannot be easily parsed with the same software/platform, [treeio](https://bioconductor.org/packages/treeio/) implemented several functions for parsing various tree file formats and outputs of common evolutionary analysis software. Not only the tree structure can be parsed but also the associated data and evolutionary inferences, including NHX annotation, clock rate inferences (from [BEAST](http://beast2.org/) or [r8s](http://ginger.ucdavis.edu/r8s) [@sanderson_r8s:_2003] programs), snynonymous and non-synonymous substitutions (from CodeML), and ancestral sequence construction (from [HyPhy](https://veg.github.io/hyphy-site/), [BaseML](http://abacus.gene.ucl.ac.uk/software/paml.html) or [CodeML](http://abacus.gene.ucl.ac.uk/software/paml.html)), *etc.*. Currently, [treeio](https://bioconductor.org/packages/treeio/) is able to read the following file formats: Newick, Nexus, New Hampshire eXtended format (NHX), jplace and Phylip as well as the data outputs from the following analysis programs: [BEAST](http://beast2.org/), [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html), [HyPhy](https://veg.github.io/hyphy-site/), [MrBayes](http://nbisweden.github.io/MrBayes/), [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html), [PHYLDOG](https://pbil.univ-lyon1.fr/software/phyldog/), [pplacer](http://matsen.fhcrc.org/pplacer/), [r8s](http://ginger.ucdavis.edu/r8s), [RAxML](http://evomics.org/learning/phylogenetics/raxml/) and [RevBayes](https://revbayes.github.io/intro.html). ```{r treeio-function, echo=F, message=FALSE} ff <- matrix(c( "read.beast" , "parsing output of BEAST", "read.codeml" , "parsing output of CodeML (rst and mlc files)", "read.codeml_mlc" , "parsing mlc file (output of CodeML)", "read.hyphy" , "parsing output of HYPHY", "read.jplace" , "parsing jplace file including output of EPA and pplacer", "read.mrbayes" , "parsing output of MrBayes", "read.newick" , "parsing newick string, with ability to parse node label as support values", "read.nhx" , "parsing NHX file including output of PHYLDOG and RevBayes", "read.paml_rst" , "parsing rst file (output of BaseML or CodeML)", "read.phylip" , "parsing phylip file (phylip alignment + newick string)", "read.r8s" , "parsing output of r8s", "read.raxml" , "parsing output of RAxML" ), ncol=2, byrow=TRUE) ff <- as.data.frame(ff) colnames(ff) <- c("Parser function", "Description") knitr::kable(ff, caption = "Parser functions defined in treeio", booktabs = T) ``` After parsing, storage of the tree structure with associated data is made through a S4 class, treedata, defined in the [treeio](https://bioconductor.org/packages/treeio/) package. These parsed data are mapped to the tree branches and nodes inside `treedata` object, so that they can be efficiently used to visually annotate the tree using [ggtree](https://bioconductor.org/packages/ggtree/) package [@yu_ggtree:_2017]. [treeio](https://bioconductor.org/packages/treeio/) provides functions to merge these phylogeny-associated data for comparison and further analysis. ## Parsing BEAST output ```{r} file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio") beast <- read.beast(file) beast ``` Since _`%`_ is not a valid character in _`names`_, all the feature names that contain _`x%`_ will convert to _`0.x`_. For example, _`length_95%_HPD`_ will be changed to _`length_0.95_HPD`_. The _`get.fields`_ method return all available features that can be used for annotation. ```{r} get.fields(beast) ``` ## Parsing MrBayes output ```{r} file <- system.file("extdata/MrBayes", "Gq_nxs.tre", package="treeio") read.mrbayes(file) ``` ## Parsing PAML output The `read.paml_rst` function can parse *rst* file from [BASEML](http://abacus.gene.ucl.ac.uk/software/paml.html) and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html). The only difference is the space in the sequences. For [BASEML](http://abacus.gene.ucl.ac.uk/software/paml.html), each ten bases are separated by one space, while for [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html), each three bases (triplet) are separated by one space. ```{r fig.width=12, fig.height=10, warning=FALSE, fig.align="center"} brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio") brst <- read.paml_rst(brstfile) brst ``` Similarly, we can parse the *rst* file from [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html). ```{r} crstfile <- system.file("extdata/PAML_Codeml", "rst", package="treeio") ## type can be one of "Marginal" or "Joint" crst <- read.paml_rst(crstfile, type = "Joint") crst ``` Ancestral sequences inferred by [BASEML](http://abacus.gene.ucl.ac.uk/software/paml.html) or [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) via marginal or joint ML reconstruction methods will be stored in the S4 object and mapped to tree nodes. [treeio](https://bioconductor.org/packages/treeio/) will automatically determine the substitutions between the sequences at the both ends of each branch. Amino acid substitution will also be determined by translating nucleotide sequences to amino acid sequences. These computed substitutions will also be stored in the S4 object. [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) infers selection pressure and estimated *d~N~/d~S~*, *d~N~* and *d~S~*. These information are stored in output file *mlc*, which can be parsed by `read.codeml_mlc` function. ```{r} mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio") mlc <- read.codeml_mlc(mlcfile) mlc ``` In previous session, we separately parsed *rst* and *mlc* files. However, they can also be parsed together using `read.codeml` function. ```{r} ## tree can be one of "rst" or "mlc" to specify ## using tree from which file as base tree in the object ml <- read.codeml(crstfile, mlcfile, tree = "mlc") ml ``` All the features in both *rst* and *mlc* files were imported into a single S4 object and hence are available for further annotation and visualization. For example, we can annotate and display both *d~N~/d~S~* (from *mlc* file) and amino acid substitutions (derived from *rst* file) on the same phylogenetic tree. ## Parsing HyPhy output Ancestral sequences inferred by [HyPhy](https://veg.github.io/hyphy-site/) are stored in the Nexus output file, which contains the tree topology and ancestral sequences. To parse this data file, users can use the `read.hyphy.seq` function. ```{r warning=FALSE} ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="treeio") read.hyphy.seq(ancseq) ``` To map the sequences on the tree, user shall also provide an internal-node-labelled tree. If users want to determine substitution, they need also provide tip sequences. ```{r warning=FALSE} nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="treeio") tipfas <- system.file("extdata", "pa.fas", package="treeio") hy <- read.hyphy(nwk, ancseq, tipfas) hy ``` ## Parsing r8s output [r8s](http://loco.biosci.arizona.edu/r8s/) uses parametric, semiparametric and nonparametric methods to relax molecular clock to allow better estimations of divergence times and evolution rates [@@sanderson_r8s:_2003]. It outputs three trees in log file, namely *TREE*, *RATO* and *PHYLO* for time tree, rate tree and absolute substitution tree respectively. Time tree is scaled by divergence time, rate tree is scaled by substitution rate and absolute substitution tree is scaled by absolute number of substitution. After parsing the file, all these three trees are stored in a *multiPhylo* object. ```{r fig.width=4, fig.height=6, width=60, warning=FALSE, fig.align="center"} r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="treeio")) r8s ``` ## Parsing output of RAxML bootstraping analysis [RAxML](http://evomics.org/learning/phylogenetics/raxml/) bootstraping analysis output a Newick tree text that is not standard as it stores bootstrap values inside square brackets after branch lengths. This file usually cannot be parsed by traditional Newick parser, such as `ape::read.tree`. The function `read.raxml` can read such file and stored the bootstrap as an additional features, which can be used to display on the tree or used to color tree branches, *etc.*. ```{r fig.width=12, fig.height=10, width=60, warning=FALSE, fig.align="center"} raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="treeio") raxml <- read.raxml(raxml_file) raxml ``` ## Parsing NHX tree NHX (New Hampshire eXtended) format is an extension of Newick by introducing NHX tags. NHX is commonly used in phylogenetics software (including [PHYLDOG](http://pbil.univ-lyon1.fr/software/phyldog/) [@boussau_genome-scale_2013], [RevBayes](http://revbayes.github.io/intro.html) [@hohna_probabilistic_2014]) for storing statistical inferences. The following codes imported a NHX tree with associated data inferred by PHYLDOG. ```{r} nhxfile <- system.file("extdata/NHX", "phyldog.nhx", package="treeio") nhx <- read.nhx(nhxfile) nhx ``` ## Parsing Phylip tree Phylip format contains multiple sequence alignment of taxa in Phylip sequence format with corresponding Newick tree text that was built from taxon sequences. Sequence alignment can be sorted based on the tree structure and displayed at the right hand side of the tree using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]. ```{r} phyfile <- system.file("extdata", "sample.phy", package="treeio") phylip <- read.phylip(phyfile) phylip ``` ## Parsing EPA and pplacer output [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) [@berger_EPA_2011] and [PPLACER](http://matsen.fhcrc.org/pplacer/) [@matsen_pplacer_2010] have common output file format, `jplace`, which can be parsed by `read.jplace()` function. ```{r} jpf <- system.file("extdata/EPA.jplace", package="treeio") jp <- read.jplace(jpf) print(jp) ``` The number of evolutionary placement on each branch will be calculated and stored as the *nplace* feature, which can be mapped to line size and/or color using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]. ## Parsing jtree format{#jtree} The *jtree* is a JSON based format that was defined in this [treeio](https://bioconductor.org/packages/treeio/) package to support [tree data inter change](Exporter.html#jtree). Phylogenetic tree with associated data can be exported to a single *jtree* file using `write.jtree` function. The *jtree* can be easily parsed using any JSON parser. The *jtree* format contains three keys: tree, data and metadata. The tree value contains tree text extended from Newick tree format by putting the edge number in curly braces after branch length. The data value contains node/branch-specific data, while metadata value contains additional meta information. ```{r} jtree_file <- tempfile(fileext = '.jtree') write.jtree(beast, file = jtree_file) read.jtree(file = jtree_file) ``` # Linking external data to phylogeny In addition to analysis findings that are associated with the tree as we showed above, there is a wide range of heterogeneous data, including phenotypic data, experimental data and clinical data *etc.*, that need to be integrated and linked to phylogeny. For example, in the study of viral evolution, tree nodes may associated with epidemiological information, such as location, age and subtype. Functional annotations may need to be mapped on gene trees for comparative genomics studies. To facilitate data integration, [treeio](https://bioconductor.org/packages/treeio) provides `full_join` method to link external data to phylogeny and stored in `treedata` object. Here are examples of linking external data to a phylogenetic tree. After that, we can use [exporter](Exporter.html) to combine the tree and the data to a single tree file. The data that mapped on the phylogenetic tree can also be used to visualize or annotate the tree using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]. ```{r} x <- data_frame(label = as.phylo(beast)$tip.label, trait = rnorm(Ntip(beast))) full_join(beast, x, by="label") N <- Nnode2(beast) y <- data_frame(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N)) full_join(beast, y, by="node") ``` # Combining tree data The [treeio](https://bioconductor.org/packages/treeio/) package serves as an infrastructure that enables various types of phylogenetic data inferred from common analysis programs to be imported and used in R. For instance *d~N~/d~S~* or ancestral sequences estimated by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html), and clade support values (posterior) inferred by [BEAST](http://beast2.org/)/[MrBayes](http://nbisweden.github.io/MrBayes/). In addition, [treeio](https://bioconductor.org/packages/treeio/) package supports linking external data to phylogeny. It brings these external phylogenetic data (either from software output or exteranl sources) to the R community and make it available for further analysis in R. Furthermore, [treeio](https://bioconductor.org/packages/treeio/) can combine multiple phylogenetic trees together into one with their node/branch-specific attribute data. Essentially, as a result, one such attribute (*e.g.*, substitution rate) can be mapped to another attribute (*e.g.*, *d~N~/d~S~*) of the same node/branch for comparison and further computations. A previously published data set, seventy-six H3 hemagglutinin gene sequences of a lineage containing swine and human influenza A viruses [@liang_expansion_2014], was here to demonstrate the utilities of comparing evolutionary statistics inferred by different software. The dataset was re-analyzed by [BEAST](http://beast2.org/) for timescale estimation and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) for synonymous and non-synonymous substitution estimation. In this example, we first parsed the outputs from [BEAST](http://beast2.org/) using `read.beast` and from [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) using `read.codeml` into two `treedata` objects. Then the two objects containing separate sets of node/branch-specific data were merged via the `merge_tree` function. ```{r} beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree") rst_file <- system.file("examples/rst", package="ggtree") mlc_file <- system.file("examples/mlc", package="ggtree") beast_tree <- read.beast(beast_file) codeml_tree <- read.codeml(rst_file, mlc_file) merged_tree <- merge_tree(beast_tree, codeml_tree) merged_tree ``` After merging the `beast_tree` and `codeml_tree` objects, all node/branch-specific data imported from [BEAST](http://beast2.org/) and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) output files are all available in the `merged_tree` object. The tree object was converted to tidy data frame using [tidytree](https://cran.r-project.org/package=tidytree) package and visualized as hexbin scatterplot of *d~N~/d~S~*, *d~N~* and *d~S~* inferred by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) versus *rate* (substitution rate in unit of substitutions/site/year) inferred by [BEAST](http://beast2.org/) on the same branches. ```{r warning=FALSE, fig.width=9, fig.height=3} library(tidytree) library(ggplot2) as_data_frame(merged_tree) %>% dplyr::select(dN_vs_dS, dN, dS, rate) %>% subset(dN_vs_dS >=0 & dN_vs_dS <= 1.5) %>% tidyr::gather(type, value, dN_vs_dS:dS) %>% ggplot(aes(rate, value)) + geom_hex() + facet_wrap(~factor(type, levels = c('dN_vs_dS', 'dN', 'dS')), scale='free_y') + ylab(NULL) ``` Using `merge_tree`, we are able to compare analysis results using identical model from different software packages or different models using different or identical software. It also allows users to integrate different analysis finding from different software packages. Merging tree data is not restricted to software findings, associating external data to analysis findings is also granted. The `merge_tree` function is chainable and allows several tree objects to be merged into one. ```{r} phylo <- as.phylo(beast_tree) N <- Nnode2(phylo) d <- data_frame(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N)) fake_tree <- treedata(phylo = phylo, data = d) triple_tree <- merge_tree(merged_tree, fake_tree) triple_tree ``` The `triple_tree` object showed above contains analysis results obtained from [BEAST](http://beast2.org/) and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html), and evolutionary trait from external sources. All these information can be used to annotate the tree using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]. # Getting information from *treedata* object After the tree was imported, users may want to extract information that stored in the `treedata` object. `r Biocpkg("treeio")` provides several accessor methods to extract tree structure, features/attributes that stored in the object and their corresponding values. The `get.tree` or `as.phylo` methods can convert the `treedata` object to `phylo` object which is the fundamental tree object in the R community and many packages work with `phylo` object. ```{r} # or get.tree as.phylo(beast_tree) ``` The `get.fields` method return a vector of features/attributes that stored in the object and associated with the phylogeny. ```{r} get.fields(beast_tree) ``` The `get.data` method return a tibble of all the associated data. ```{r} get.data(beast_tree) ``` If users are only interesting a subset of the features/attributes return by `get.fields`, they can extract the information from the output of `get.data` or directly subset the data by `[` or `[[`. ```{r} beast_tree[, c("node", "height")] head(beast_tree[["height_median"]]) ``` # Manipulating tree data using *tidytree* All the tree data parsed/merged by [treeio](https://bioconductor.org/packages/treeio/) can be converted to tidy data frame using the [tidytree](https://cran.r-project.org/package=tidytree) package. The [tidytree](https://cran.r-project.org/package=tidytree) package provides tidy interfaces to manipulate tree with associated data. For instances, external data can be linked to phylogeny or evolutionary data obtained from different sources can be merged using tidyverse verbs. After the tree data was manipulated, it can be converted back to `treedata` object and [exported to a single tree file](Exporter.html), further analyzed in R or visualized using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]. For more details, please refer to the [tidytree package vignette](https://cran.r-project.org/web/packages/tidytree/vignettes/tidytree.html). # Visualizing tree data with *ggtree* [treeio](https://bioconductor.org/packages/treeio/) is seamlessly integrated into the [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017] package and all the information either directly imported or linking from external sources can be used to visualize and annotate the tree. See the `r Biocpkg("ggtree")` package vignettes for more details: + [Tree Visualization](treeVisualization.html) + [Tree Manipulation](treeManipulation.html) + [Tree Annotation](treeAnnotation.html) + [Phylomoji](https://cran.r-project.org/web/packages/emojifont/vignettes/phylomoji.html) + [Annotating phylogenetic tree with images](https://guangchuangyu.github.io/software/ggtree/vignettes/ggtree-ggimage.html) + [Annotate a phylogenetic tree with insets](https://guangchuangyu.github.io/software/ggtree/vignettes/ggtree-inset.html) # Need helps? If you have questions/issues, please visit [treeio homepage](https://guangchuangyu.github.io/software/treeio/) first. Your problems are mostly documented. If you think you found a bug, please follow [the guide](https://guangchuangyu.github.io/2016/07/how-to-bug-author/) and provide a reproducible example to be posted on [github issue tracker](https://github.com/GuangchuangYu/treeio/issues). For questions, please post to [Bioconductor support site](https://support.bioconductor.org/) and tag your post with *treeio*. For Chinese user, you can follow me on [WeChat (微信)](https://guangchuangyu.github.io/blog_images/biobabble.jpg). # References