---
title: 'TreeAndLeaf: an alternative to dendrogram visualization.'
author: 'Leonardo Kume, Luis E A Rizzardi, Milena A Cardoso, Sheyla Trefflich, Mauro A A Castro'
date: "`r BiocStyle::doc_date()`"
abstract: "**TreeAndLeaf** is a R-based package combined with **RedeR** and the **igraph** format
to enhance the visualization of dendrograms."
package: "`r BiocStyle::pkg_ver('TreeAndLeaf')`"
output:
BiocStyle::html_document:
css: custom.css
vignette: >
%\VignetteIndexEntry{TreeAndLeaf: an alternative to dendrogram visualization.}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(TreeAndLeaf)
library(RedeR)
library(RColorBrewer)
library(igraph)
library(ape)
```
# Overview
**TreeAndLeaf** is an R-based package for better visualization of dendrograms
and phylogenetic trees. The package changes the way a dendrogram is viewed.
Through the use of the **igraph** format and the package **RedeR**, the
nodes are rearranged and the hierarchical relations are kept intact, resulting
in an image that is easier to read and can be enhanced with additional
layers of information.
The classical dendrogram is a limited format in two ways. Firstly, it only
displays one type of information, which is the hierarchical relation between
the data. Secondly, it is limited by its size, the larger the database, the
less readable it becomes. The **TreeAndLeaf** enhances space distribution
because it uses all directions, allowing for an improved visualization and
a better image for publications. The package **RedeR**, used for plotting
in this package, uses a force-based relaxation algorithm that helps nodes
in avoiding overlaps. By implementing **RedeR** and the **igraph** format,
the package allows for customization of the dendrogram inserting multiple layers
of information to be represented by edge widths and colors, nodes colors, nodes
sizes, line color, etc. The package also includes a fast formatting option for
quick and exploratory analysis usage. Therefore, the package is designed to make
plotting dendrograms more useful, less confusing and more productive.
The workflow while using this package is depicted from **Figure 1**.
**Figure 1**. A brief representation of what **TreeAndLeaf** functions are
capable of. **(A,B)** The dendrogram in A was used to construct the graph
representation shown in B. **(C)** Workflow summary. The main input data
consists of a distance matrix, which is used to generate a dendrogram.
The **TreeAndLeaf** package transforms the dendrogram into a graph
representation.
This document intends to guide you through the basics and give you ideas of how
to use the functions to their full potential. Although **TreeAndLeaf** was
created for systems biology application, it is not at all limited to this use.
# USArrests - a small dendrogram example
This section provides a quick and basic example using the R built-in dataframe
`USArrests`, shown below. To know more about the info shown in this dataframe,
use `?USArrests`. To use **TreeAndLeaf** functions to their full potential, it
is recommended that your dataframe has rownames set before making the dendrogram,
like this one has.
```{r echo=TRUE}
dim(USArrests)
head(USArrests)
```
## Building a dendrogram using R `hclust()`
In order to build a dendrogram, you need to have a distance matrix of the
observations. For example, the default “euclidean distance” method of
`dist()` can be used to generate one, and then use the “average” method of
`hclust()` to create a dendrogram.
```{r echo=TRUE}
hc <- hclust(dist(USArrests), "ave")
plot(hc)
```
## Converting your hclust object to an igraph object
This is a rather simple but important step. Since **TreeAndLeaf**
and **RedeR** work with **igraph** objects, a function is provided to
convert an **hclust** dendrogram into an **igraph**. For that, simply
follow use `hclust2igraph()`.
```{r}
gg <- hclust2igraph(hc)
```
## Formating the igraph for better visualization in RedeR
There is a quick formatting option in **TreeAndLeaf** package by using
the function `formatTree()`, which is a theme function used to standardize
node sizes and colors. This is an important step because the tree will
have leaf nodes (the ones representing your observations) and non-leaf nodes
(the ones representing bifurcations of the dendrogram), and this function
makes the last ones invisible to achieve the desired appearance and proper
relaxation. A description of available themes can be consulted at `?formatTree`.
```{r}
gg <- formatTree(gg = gg, theme = 5)
```
Now, the tree-and-leaf diagram is ready to be shown in **RedeR** with `treeAndLeaf()`,
or you can have layers of information added to it, as shown below.
## Inserting additional layers of information
**RedeR** offers a set of functions to manipulate **igraph** attributes
according to the parameters the application reads.
First, `att.mapv()` is used to insert the dataframe inside the **igraph** object
and make it available for setting node attributes. In this step, it is crucial
that the `refcol` points to a column with the same content as `hc$labels`.
In this case, `refcol = 0` indicates the rownames of the dataframe.
```{r}
gg <- att.mapv(g = gg, dat = USArrests, refcol = 0)
```
Now that the info is available, `att.setv()` changes the **igraph** attributes.
The package RColorBrewer can be used to generate a palette for reference.
Try `?addGraph` to see the options of **igraph** attributes **RedeR** can read.
```{r}
pal <- brewer.pal(9, "Reds")
gg <- att.setv(g = gg, from = "Murder", to = "nodeColor",
cols = pal, nquant = 5)
gg <- att.setv(g = gg, from = "UrbanPop", to = "nodeSize",
xlim = c(50, 150, 1), nquant = 5)
```
## Calling the RedeR interface
With the **igraph** ready to be visualized, you need to invoke **RedeR**
interface. This might take some seconds.
```{r, eval = FALSE}
rdp <- RedPort()
calld(rdp)
resetd(rdp)
```
## Calling `treeAndLeaf()` and adding legends
This is **TreeAndLeaf**'s main function. It will read your **igraph** object,
generate the tree layout, plot it in **RedeR** interface and use functions
to enhance appeal and distribution.
```{r, eval = FALSE}
treeAndLeaf(obj = rdp,
gg = gg)
```
Adding legends is optional. When you call for `att.setv()` and inform column
names for `nodeColor` and `nodeSize`, it will automatically generate a **RedeR**
readable legend, which can be plotted using the code below.
```{r, eval = FALSE}
addLegend.color(obj = rdp,
gg,
title = "Murder Rate",
position = "right")
addLegend.size(obj = rdp,
gg,
title = "Urban Population Size",
position = "bottomright")
```
## Making manual adjustments
At this stage the image produced needs small adjustments to solve the
residual edge crossings. It is possible to just click and drag a node
to adjust it while the relaxation algorithm is still running.
All the different parameters can be changed and personalized throughout the
steps to achieve the desired image.
# A large dendrogram example - quakes
The **TreeAndLeaf** package is particularly useful when dealing with large
dendrograms. This section uses the `quakes` built-in dataframe as an
example. To know more about this data, check `?quakes`. Since each step was
detailed in the first example, this one will focus on describing only features
we were not able to see with `USArrests`.
```{r echo=TRUE}
dim(quakes)
head(quakes)
```
## Building the dendrogram
Clearly, when it comes to big dendrograms, it gets harder to show
clusterization and any other information by conventional plotting.
This is where **TreeAndLeaf** really makes a difference.
```{r echo=TRUE}
hc <- hclust(dist(quakes))
plot(hc)
```
## Converting and formating the igraph object
As described before, the package functions `hclust2igraph()` is used for
converting, `formatTree()` for initial attribute setting, `att.mapv()` for
inserting the dataframe inside the **igraph** object and `att.setv()` to
change graph characteristics.
```{r}
gg <- hclust2igraph(hc)
gg <- formatTree(gg, theme = 1, cleanalias = TRUE)
gg <- att.mapv(gg, quakes, refcol = 0)
pal <- brewer.pal(9, "Greens")
gg <- att.setv(gg, from = "mag", to = "nodeColor",
cols = pal, nquant = 10)
gg <- att.setv(gg, from = "depth", to = "nodeSize",
xlim = c(240, 880, 1), nquant = 5)
```
As stated above, **RedeR** uses a relaxation force-based algorithm to achieve a
stable distribution of nodes. One of the parameters used to calculate
attraction and repulsion forces is `nodeSize`. On the first example, the node
sizes ranged from 50 to 150 and on this one, it ranged from 240 to 880. The
`treeAndLeaf()` function uses less zoom to plot if the dendrogram has a great
number of nodes, so it is necessary to use bigger sizes for bigger trees.
Therefore, the `nodeSize` is a vital attribute for the tree-and-leaf structure
formation. If sizes are too small, the nodes will barely move during the
relaxation process. If sizes are too big, overlaps will be difficult to solve
and unwanted behaviors can arise. If the sizes are too different (i.e. 10 and
1000), you probably won’t be able to see the smaller ones. That being said, if
the tree is not clear, try changing parameters such as `nodeSize` to achieve the
desired image.
## Calling RedeR interface and plotting
Repeat the step described in the first example.
```{r, eval = FALSE}
rdp <- RedPort()
calld(rdp)
resetd(rdp)
```
```{r, eval = FALSE}
treeAndLeaf(rdp, gg)
addLegend.color(obj = rdp,
gg,
title = "Richter Magnitude")
addLegend.size(obj = rdp,
gg,
title = "Depth (km)")
```
## Making manual adjustments
After manually solving some overlaps, you should be able to achieve the result
shown below.
# Phylogenetic tree
The **TreeAndLeaf** package is also able to work with phylogenetic trees.
To show how it works, we will apply these steps to plot a tree from
**geneplast** package. It is a tree with 121 tips listing the eukaryotes
in STRING-db, release 9.1.
## Loading data
As mentioned, the tree can be loaded from **geneplast** package by running
the code below.
```{r}
library(geneplast)
data("gpdata.gs")
plot(phyloTree)
```
Aside from exhibiting the phylogenetic tree as a tree-and-leaf diagram, extra
layers of data to each species can also be added. **TreeAndLeaf** package offers
a dataframe containing statistical data of eukaryotes complete genomes, downloaded
from NCBI Genomes database. For more information, type `?spdata`.
```{r}
data("spdata")
```
## Matching data from both sources
The `spdata` object only shows data for eukaryotes with complete genomes
available, an inner join has to be made to select only the species
available in both datasets used. Therefore, it is necessary to check
]which tips of the `phylo` object has a match with a row in `spdata`. Then,
the tree is plotted again only with the selected tips.
```{r}
#Accessory indexing
idx <- match(as.numeric(spdata$tax_id), as.numeric(phyloTree$tip.label))
idx <- idx[!is.na(idx)]
tokeep <- phyloTree$tip.label[idx]
phyloTree$tip.label <- as.character(phyloTree$tip.label)
#Remaking the tree
pruned.tree <-
drop.tip(phyloTree,phyloTree$tip.label[-match(tokeep,
phyloTree$tip.label)])
```
## Converting the phylo object to igraph
For converting a phylogenetic tree to an **igraph** object, the package provides
another function: `phylo2igraph()`.
```{r}
tal.phylo <- phylo2igraph(pruned.tree)
```
## Formatting and adding extra layers of information
The following steps are the same as described before, so they won’t
be explained here.
```{r}
#Formatting of the graph
tal.phylo <- formatTree(tal.phylo, theme = 4)
tal.phylo <- att.mapv(g = tal.phylo, dat = spdata, refcol = 1)
tal.phylo <- att.setv(g = tal.phylo, from = "genome_size_Mb",
to = "nodeSize",
xlim = c(120, 250, 1), nquant = 5)
pal <- brewer.pal(9, "Purples")
tal.phylo <- att.setv (g = tal.phylo, from = "proteins",
to = "nodeColor",
nquant = 5, cols = pal, na.col = "black")
```
## Selecting names to be shown when plotting
If `treeAndLeaf()` is called now the NCBI TaxIDs will be shown above each node,
which is not desired. So the **igraph** object needs to be modified to show
species names, but not all of them, to prevent unreadability. For that, general
**igraph** manipulation functions can be used.
```{r}
#Changing the alias to show the names and making them invisible
idx <- match(V(tal.phylo)$nodeAlias, spdata$tax_id)
V(tal.phylo)$nodeAlias <- spdata$sp_name[idx]
V(tal.phylo)$nodeAlias[is.na(V(tal.phylo)$nodeAlias)] <- ""
V(tal.phylo)$nodeFontSize <- 1
#Randomly selecting some names to be shown
set.seed(9)
V(tal.phylo)$nodeFontSize[sample(
1:length(V(tal.phylo)$nodeFontSize), 50)] <- 100
V(tal.phylo)$nodeFontSize[V(
tal.phylo)$name == "9606"] <- 100 #Homo sapiens
```
## Plotting and making manual adjustments
```{r eval = FALSE}
#Calling RedeR and plotting
rdp <- RedPort()
calld(rdp)
resetd(rdp)
treeAndLeaf(rdp, tal.phylo)
addLegend.size(rdp, tal.phylo, title = "Genome Size (Mb)")
addLegend.color(rdp, tal.phylo, title = "Protein Count")
```
# Nonbinary species Tree from STRING-db v11.0
Although **TreeAndLeaf** was written to work with binary trees, the package
also works for some non binary diagrams such as the STRING-db species tree,
release v11.0. Since all features were detailed on previous sections, this
is just a demonstration and there will be no code explanation other than
comments. This example uses the same dataframe downloaded from NCBI Genomes,
applied on the previous example.
```{r}
#Loading data
data("spdata") #NCBI Genomes scraped info
data("phylo_species") #STRING-db tree metadata
data("phylo_tree") #STRING-db phylo object
#Remaking the tree with species inside spdata
idx <- match(as.numeric(spdata$tax_id), as.numeric(phylo_species$X...taxon_id))
idx <- idx[!is.na(idx)]
tokeep <- phylo_species$X...taxon_id[idx]
pruned.tree <-
drop.tip(phylo_tree,phylo_tree$tip.label[-match(tokeep,
phylo_tree$tip.label)])
#Converting phylo to igraph
tal.phy <- phylo2igraph(pruned.tree)
#Formatting the tree
tal.phy <- formatTree(gg = tal.phy, theme = 3)
tal.phy <- att.mapv(g = tal.phy, dat = spdata, refcol = 1)
tal.phy <- att.setv(g = tal.phy, from = "genome_size_Mb",
to = "nodeSize",
nquant = 5, xlim = c(200, 600, 1))
pal <- brewer.pal(9, "Blues")
tal.phy <- att.setv(g = tal.phy, from = "proteins", to = "nodeColor",
nquant = 5, cols = pal, na.col = "black")
#Randomly selecting names to be shown
set.seed(9)
V(tal.phy)$nodeFontSize <- 1
V(tal.phy)$nodeFontSize[sample(
1:length(V(tal.phy)$nodeFontSize), 80)] <- 300
V(tal.phy)$nodeFontSize[V(tal.phy)$name == 9606] <- 300
idx <- match(V(tal.phy)$nodeAlias, spdata$tax_id)
V(tal.phy)$nodeAlias <- spdata$sp_name[idx]
V(tal.phy)$nodeAlias[is.na(V(tal.phy)$nodeAlias)] <- ""
```
```{r eval = FALSE}
#Calling RedeR and plotting
rdp <- RedPort()
calld(rdp)
resetd(rdp)
treeAndLeaf(rdp, tal.phy)
addLegend.color(rdp, tal.phy)
addLegend.size(rdp, tal.phy)
```
# Installation
The package is freely available from the Bioconductor at
https://bioconductor.org/packages/TreeAndLeaf.
# Session information
```{r label='Session information', eval=TRUE, echo=FALSE}
sessionInfo()
```