The package geneplast.data provides datasets from different sources via AnnotationHub to use in geneplast pipelines. The datasets have species, phylogenetic trees, and orthology relationships among eukaryotes from different orthologs databases.
geneplast.data 0.99.9
Geneplast is designed for large-scale evolutionary plasticity and rooting analysis based on orthologs groups (OG) distribution in a given species tree. This supporting package provides datasets obtained and processed from different orthologs databases for geneplast evolutionary analyses.
Currently, data from the following sources are available:
Each dataset consists of four objects used as input for geneplast:
data.frame mapping identifiers from OG and species to proteins/genes.phylo object representing a phylogenetic tree where the tips’ labels correspond to each species identifiers from cogdata.data.frame containing OG identifiers from cogdata.data.frame with species identifiers from cogdata.Create a new AnnotationHub connection and query for all geneplast resources.
library('AnnotationHub')
ah <- AnnotationHub()
meta <- query(ah, "geneplast")
head(meta)Load the objects into the session using the ID of the chosen dataset.
# Dataset derived from STRING database v11.0
load(meta[["AH83116"]])The general procedure for creating previously described objects starts by selecting only eukaryotes from the orthologs database with NCBI taxonomy classification.
We build a graph from taxonomy nodes and locate the root of eukaryotes. Then, we traverse this sub-graph from root to leaves corresponding to the taxonomy identifiers of the species in the database. By selecting the leaves of the resulting sub-graph, we obtain the sspids object.
Once the species of interest are selected, the orthology information of corresponding proteins is filtered to obtain the cogdata object.
The cogids object consists of unique orthologs identifiers from cogdata.
Finally, the phyloTree object is built from TimeTree full eukaryotes phylogenetic tree, which is pruned to show only our species of interest. The missing species are filled using strategies of matching genera and the closest species inferred from NCBI’s tree previously built.
Users are encouraged to use preprocessed datasets from AnnotationHUB since they’ve already been validated for geneplast analysis. However, it is possible to build customized datasets from any source of orthology information, provided they follow some requirements. This section presents an example with mock data on how to build the input data objects for geneplast package.
The minimal input set for running geneplast are: 1) a data.frame object cogdata, and 2) a phylo object phyloTree.
The cogdata table maps proteins to OGs and species, and must have at least three columns: 1) a protein identifier column protein_id, 2) an OG identifier column cog_id, and a species identifier column ssp_id.
Suppose the user have collected orthology information between proteins from the following species:
species <- data.frame (
    label = c("sp01", "sp02", "sp03", "sp04", "sp05", "sp06", "sp07", "sp08", "sp09", "sp10", "sp11", "sp12", "sp13", "sp14", "sp15"),
                scientific_name = c("Homo sapiens", "Pan troglodytes", "Gorilla gorilla gorilla", "Macaca mulatta", "Papio anubis", "Rattus norvegicus", "Mus musculus", "Canis lupus familiaris", "Marmosa mexicana", "Monodelphis domestica", "Gallus gallus", "Meleagris gallopavo", "Xenopus tropicalis", "Latimeria chalumnae", "Danio rerio"),
                taxon_id = c("9606", "9598", "9595", "9544", "9555", "10116", "10090", "9615", "225402", "13616", "9031", "9103", "8364", "7897", "7955")
)
species
#>    label         scientific_name taxon_id
#> 1   sp01            Homo sapiens     9606
#> 2   sp02         Pan troglodytes     9598
#> 3   sp03 Gorilla gorilla gorilla     9595
#> 4   sp04          Macaca mulatta     9544
#> 5   sp05            Papio anubis     9555
#> 6   sp06       Rattus norvegicus    10116
#> 7   sp07            Mus musculus    10090
#> 8   sp08  Canis lupus familiaris     9615
#> 9   sp09        Marmosa mexicana   225402
#> 10  sp10   Monodelphis domestica    13616
#> 11  sp11           Gallus gallus     9031
#> 12  sp12     Meleagris gallopavo     9103
#> 13  sp13      Xenopus tropicalis     8364
#> 14  sp14     Latimeria chalumnae     7897
#> 15  sp15             Danio rerio     7955The user’s dataset have 3 orthologous groups identified as: OG1, OG2, and OG3. The proteins whithin each OG are identified by a combination of a species label plus a protein ID (ex.: “spXX.protXx”).
og1 <- expand.grid(cog_id = "OG1", 
                   protein_id = c(
    "sp01.prot1a", "sp01.prot1b", "sp01.prot1c", "sp01.prot1d", "sp01.prot1e", "sp01.prot1f", "sp01.prot1g", 
    "sp02.prot1a", "sp02.prot1b", "sp02.prot1c", "sp02.prot1d", "sp02.prot1e", "sp02.prot1f", 
    "sp03.prot1a", "sp03.prot1b", "sp03.prot1c", 
    "sp04.prot1a", "sp04.prot1b", "sp04.prot1c", "sp04.prot1d", 
    "sp05.prot1a", "sp05.prot1b", "sp05.prot1c", "sp05.prot1d", "sp05.prot1e",
    "sp06.prot1a", "sp06.prot1b",
    "sp07.prot1a", "sp07.prot1b", "sp07.prot1c",
    "sp08.prot1a", "sp08.prot1b",
    "sp11.prot1a", "sp11.prot1b", "sp11.prot1c",
    "sp12.prot1a", "sp12.prot1b", "sp12.prot1c",
    "sp13.prot1a", "sp13.prot1b",
    "sp14.prot1",
    "sp15.prot1a", "sp15.prot1b"
    )
)
og2 <- expand.grid(cog_id = "OG2", 
                   protein_id = c(
    "sp01.prot2a", "sp01.prot2b", "sp01.prot2c", "sp01.prot2d", "sp01.prot2e", "sp01.prot2f",
    "sp02.prot2a", "sp02.prot2b", "sp02.prot2c",
    "sp03.prot2a", "sp03.prot2b", "sp03.prot2c", "sp03.prot2d",
    "sp04.prot2a", "sp04.prot2b", "sp04.prot2c",
    "sp05.prot2a", "sp05.prot2b", "sp05.prot2c", "sp05.prot2d",
    "sp06.prot2",
    "sp08.prot2a", "sp08.prot2b"
    )
)
og3 <- expand.grid(cog_id = "OG3", 
                   protein_id = c(
    "sp01.prot3a", "sp01.prot3b", "sp01.prot3c", "sp01.prot3d", 
    "sp02.prot3a", "sp02.prot3b", "sp02.prot3c",
    "sp03.prot3",
    "sp10.prot3"
    )
)To build the cogdata object, the user should bind the OGs in a single data.frame. Furthermore, the required column ssp_id must be added to the data frame. In the current example, this is done by splitting the species label from the protein_id column and replacing it with the corresponding taxon_id value.
library(tibble)
library(stringi)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# Bind OGs into a single object
cogdata <- rbind(og1, og2, og3)
# Create a dictionary to map species labels to taxon IDs
species_taxid_lookup <- species |> dplyr::select(label, taxon_id) |> tibble::deframe()
species_name_lookup <- species |> dplyr::select(taxon_id, scientific_name) |> tibble::deframe()
# Add the required ssp_id column
cogdata[["ssp_id"]] <- species_taxid_lookup[stri_split_fixed(cogdata[["protein_id"]], pattern = ".", n = 2, simplify = T)[,1]]
head(cogdata)
#>   cog_id  protein_id ssp_id
#> 1    OG1 sp01.prot1a   9606
#> 2    OG1 sp01.prot1b   9606
#> 3    OG1 sp01.prot1c   9606
#> 4    OG1 sp01.prot1d   9606
#> 5    OG1 sp01.prot1e   9606
#> 6    OG1 sp01.prot1f   9606For users having OGs from standard outputs from orthology inference methods like OrthoFinder, we provide the make.cogdata() function to parse a tabular-separated (.tsv) file directly into the cogdata object.
library(geneplast.data)
cogdata <- geneplast.data::make.cogdata(file = "path/to/orthogroups.tsv")To create the phyloTree object, the user can use the make.phyloTree() function provided by geneplast.data package. This function has two optional arguments (sspids and newick) that define its behavior depending on which one is provided.
Given a list of species’ NCBI Taxonomy IDs sspids, it builds a phylogenetic tree by merging the TimeTree and NCBI Taxonomy databases:
library(geneplast.data)
phyloTree <- geneplast.data::make.phyloTree(sspids = species$taxon_id)
#> -Building eukaryotes tree from sspids...
#> Joining with `by = join_by(taxid)`
#> -Loading full TimeTree species tree...
#> -Searching for missing taxa...
#> -Mapping taxon IDs and scientific names...
#> -NCBI tree created.
#> 1 missing taxa. Looking for the closest available taxa up in the hierarchy. This might take a while...-Grafting missing vertices...
#> -Grafted tree created.
#> -Eukaryotes tree created.
phyloTree
#> 
#> Phylogenetic tree with 15 tips and 241 internal nodes.
#> 
#> Tip labels:
#>   7955, 8364, 225402, 13616, 9615, 10116, ...
#> Node labels:
#>   2759, 33154, fix_15415, fix_15418, 33208, 6072, ...
#> 
#> Rooted; no branch lengths.Alternatively, the user could provide a newick file describing the species phylogenetic tree.
Note: the tips’ labels from the custom newick tree should be listed in ‘cogdata’ object to properly perform the analysis.
# Create from a user's predefined newick file
phyloTree <- geneplast.data::make.phyloTree(newick = "path/to/newick_tree.nwk")Plot the resulting phylogenetic tree:
library(geneplast)
phyloTree <- geneplast:::rotatePhyloTree(phyloTree, "9606")
phyloTree$edge.length <- NULL
phyloTree$tip.label <- species_name_lookup[phyloTree$tip.label]
plot(phyloTree, type = "cladogram")Figure 1. Sample phylogenetic tree built with make.phyloTree() function.
This section reproduces a case study using STRING, OMA, and OrthoDB annotated datasets.
The following scripts run geneplast rooting analysis and transfer its results to a graph model. For detailed step-by-step instructions, please check the geneplast vignette.
library(geneplast)
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606")
ogr <- groot(ogr, nPermutations=1, verbose=TRUE)
library(RedeR)
library(igraph)
library(RColorBrewer)
data(ppi.gs)
g <- ogr2igraph(ogr, cogdata, ppi.gs, idkey = "ENTREZ")
pal <- brewer.pal(9, "RdYlBu")
color_col <- colorRampPalette(pal)(37) #set a color for each root!
g <- att.setv(g=g, from="Root", to="nodeColor", cols=color_col, na.col = "grey80", breaks = seq(1,37))
g <- att.setv(g = g, from = "SYMBOL", to = "nodeAlias")
E(g)$edgeColor <- "grey80"
V(g)$nodeLineColor <- "grey80"
rdp <- RedPort()
calld(rdp)
resetd(rdp)
addGraph(rdp, g)
addLegend.color(rdp, colvec=g$legNodeColor$scale, size=15, labvec=g$legNodeColor$legend, title="Roots inferred from geneplast")
g1  <- induced_subgraph(g=g, V(g)$name[V(g)$Apoptosis==1])
g2  <- induced_subgraph(g=g, V(g)$name[V(g)$GenomeStability==1])
myTheme <- list(nestFontSize=25, zoom=80, isNest=TRUE, gscale=65, theme=2)
addGraph(rdp, g1, gcoord=c(25, 50), theme = c(myTheme, nestAlias="Apoptosis"))
addGraph(rdp, g2, gcoord=c(75, 50), theme = c(myTheme, nestAlias="Genome Stability"))
relax(rdp, p1=50, p2=50, p3=50, p4=50, p5= 50, ps = TRUE)
Figure 2. Inferred evolutionary roots of a protein-protein interaction network.
load(meta[["AH83117"]])
cogdata$cog_id <- paste0("OMA", cogdata$cog_id)
cogids$cog_id <- paste0("OMA", cogids$cog_id)
human_entrez_2_oma_Aug2020 <- read_delim("processed_human.entrez_2_OMA.Aug2020.tsv", 
    delim = "\t", escape_double = FALSE, 
    col_names = FALSE, trim_ws = TRUE)
names(human_entrez_2_oma_Aug2020) <- c("protein_id", "gene_id")
cogdata <- cogdata %>% left_join(human_entrez_2_oma_Aug2020)
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606")
ogr <- groot(ogr, nPermutations=1, verbose=TRUE)
g <- ogr2igraph(ogr, cogdata, ppi.gs, idkey = "ENTREZ")
pal <- brewer.pal(9, "RdYlBu")
color_col <- colorRampPalette(pal)(37) #set a color for each root!
g <- att.setv(g=g, from="Root", to="nodeColor", cols=color_col, na.col = "grey80", breaks = seq(1,37))
g <- att.setv(g = g, from = "SYMBOL", to = "nodeAlias")
E(g)$edgeColor <- "grey80"
V(g)$nodeLineColor <- "grey80"
# rdp <- RedPort()
# calld(rdp)
resetd(rdp)
addGraph(rdp, g)
addLegend.color(rdp, colvec=g$legNodeColor$scale, size=15, labvec=g$legNodeColor$legend, title="Roots inferred from geneplast")
g1  <- induced_subgraph(g=g, V(g)$name[V(g)$Apoptosis==1])
g2  <- induced_subgraph(g=g, V(g)$name[V(g)$GenomeStability==1])
myTheme <- list(nestFontSize=25, zoom=80, isNest=TRUE, gscale=65, theme=2)
addGraph(rdp, g1, gcoord=c(25, 50), theme = c(myTheme, nestAlias="Apoptosis"))
addGraph(rdp, g2, gcoord=c(75, 50), theme = c(myTheme, nestAlias="Genome Stability"))
relax(rdp, p1=50, p2=50, p3=50, p4=50, p5= 50, ps = TRUE)title
load(meta[["AH83118"]])
cogdata$cog_id <- paste0("ODB", cogdata$cog_id)
cogids$cog_id <- paste0("ODB", cogids$cog_id)
human_entrez_2_odb <- read_delim("odb10v1_genes-human-entrez.tsv", 
    delim = "\t", escape_double = FALSE, 
    col_names = FALSE, trim_ws = TRUE)
names(human_entrez_2_odb) <- c("protein_id", "gene_id")
cogdata <- cogdata %>% left_join(human_entrez_2_odb)
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606")
ogr <- groot(ogr, nPermutations=1, verbose=TRUE)
g <- ogr2igraph(ogr, cogdata, ppi.gs, idkey = "ENTREZ")
pal <- brewer.pal(9, "RdYlBu")
color_col <- colorRampPalette(pal)(37) #set a color for each root!
g <- att.setv(g=g, from="Root", to="nodeColor", cols=color_col, na.col = "grey80", breaks = seq(1,37))
g <- att.setv(g = g, from = "SYMBOL", to = "nodeAlias")
E(g)$edgeColor <- "grey80"
V(g)$nodeLineColor <- "grey80"
rdp <- RedPort()
calld(rdp)
resetd(rdp)
addGraph(rdp, g)
addLegend.color(rdp, colvec=g$legNodeColor$scale, size=15, labvec=g$legNodeColor$legend, title="Roots inferred from geneplast")
g1  <- induced_subgraph(g=g, V(g)$name[V(g)$Apoptosis==1])
g2  <- induced_subgraph(g=g, V(g)$name[V(g)$GenomeStability==1])
myTheme <- list(nestFontSize=25, zoom=80, isNest=TRUE, gscale=65, theme=2)
addGraph(rdp, g1, gcoord=c(25, 50), theme = c(myTheme, nestAlias="Apoptosis"))
addGraph(rdp, g2, gcoord=c(75, 50), theme = c(myTheme, nestAlias="Genome Stability"))
relax(rdp, p1=50, p2=50, p3=50, p4=50, p5= 50, ps = TRUE)title
This section presents elapsed times (ET) for geneplast rooting analysis of all OG from H. sapiens within different datasets (STRING, OMA, and OrthoDB). We used default values when calling the functions groot.preprocess and groot. The elapsed times were calculated by the difference of timestamps using tictoc library.
library(AnnotationHub)
library(dplyr)
library(geneplast)
library(tictoc)
ah <- AnnotationHub()
meta <- query(ah, "geneplast")
load(meta[["AH83116"]])
writeLines(paste("geneplast input data loaded from", meta["AH83116"]$title))
tic("total time")
tic("preprocessing")
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606")
toc()
tic("rooting")
ogr <- groot(ogr)
toc()
toc()We performed all runs on a desktop computer with the following specification:
Table 1. Elapsed times for preprocessing and rooting all human orthologs within different datasets.
| Dataset ID | Source | Species count | OG count | ET (prep.) | ET (rooting) | ET (total) | 
|---|---|---|---|---|---|---|
| AH83116 | STRING | 476 | 7702 | 6 min 22 sec | 10 min 50 sec | 17 min 12 sec | 
| AH83117 | OMA | 485 | 19614 | 7 min 46 sec | 26 min 58 sec | 34 min 45 sec | 
| AH83118 | OrthoDB | 1271 | 10557 | 19 min 3 sec | 13 min 41 sec | 32 min 44 sec | 
sessionInfo()
#> R Under development (unstable) (2024-01-16 r85808)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] geneplast_1.29.0      geneplast.data_0.99.9 dplyr_1.1.4          
#> [4] stringi_1.8.3         tibble_3.2.1          BiocStyle_2.31.0     
#> 
#> loaded via a namespace (and not attached):
#>  [1] xfun_0.42            bslib_0.6.1          lattice_0.22-5      
#>  [4] tzdb_0.4.0           vctrs_0.6.5          tools_4.4.0         
#>  [7] generics_0.1.3       yulab.utils_0.1.4    curl_5.2.0          
#> [10] parallel_4.4.0       fansi_1.0.6          RSQLite_2.3.5       
#> [13] highr_0.10           blob_1.2.4           pkgconfig_2.0.3     
#> [16] data.table_1.15.0    dbplyr_2.4.0         lifecycle_1.0.4     
#> [19] compiler_4.4.0       treeio_1.27.0        htmltools_0.5.7     
#> [22] snow_0.4-4           sass_0.4.8           yaml_2.3.8          
#> [25] lazyeval_0.2.2       pillar_1.9.0         crayon_1.5.2        
#> [28] jquerylib_0.1.4      tidyr_1.3.1          cachem_1.0.8        
#> [31] magick_2.8.3         nlme_3.1-164         tidyselect_1.2.0    
#> [34] digest_0.6.34        purrr_1.0.2          bookdown_0.37       
#> [37] fastmap_1.1.1        grid_4.4.0           archive_1.1.7       
#> [40] cli_3.6.2            magrittr_2.0.3       utf8_1.2.4          
#> [43] ape_5.7-1            readr_2.1.5          withr_3.0.0         
#> [46] filelock_1.0.3       bit64_4.0.5          rmarkdown_2.25      
#> [49] httr_1.4.7           igraph_2.0.2         bit_4.0.5           
#> [52] hms_1.1.3            memoise_2.0.1        evaluate_0.23       
#> [55] knitr_1.45           BiocFileCache_2.11.1 rlang_1.1.3         
#> [58] Rcpp_1.0.12          glue_1.7.0           tidytree_0.4.6      
#> [61] DBI_1.2.2            BiocManager_1.30.22  vroom_1.6.5         
#> [64] jsonlite_1.8.8       R6_2.5.1             fs_1.6.3