TaxSEA is an R package designed to enable rapid interpretation of differential abundance analysis or correlation analysis output for microbiota data. TaxSEA takes as input a vector of genus or species names, each with an associated rank statistic such as a log2 fold change or Spearman’s rho. TaxSEA then uses a Kolmogorov-Smirnov test to identify whether a particular group of species or genera (i.e. a set of taxa, such as butyrate producers) is skewed towards one end of the distribution.
Simply put, TaxSEA allows users to go rapidly from a list of species to knowing which groups of metabolite producers are altered and whether the findings resemble those of a previously published study.
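As a rough illustration of the idea (not TaxSEA's internal code), the sketch below simulates log2 fold changes with base R and uses `ks.test()` to ask whether a made-up taxon set sits towards one end of the distribution. The taxon names, the set membership and the size of the shift are all invented for the example.

```r
set.seed(1)

# Hypothetical input: log2 fold changes for 200 made-up taxa
taxon_ranks <- rnorm(200)
names(taxon_ranks) <- paste0("taxon_", seq_along(taxon_ranks))

# Hypothetical taxon set: 15 taxa drawn from the 40 most increased taxa
top_taxa <- names(sort(taxon_ranks, decreasing = TRUE))[1:40]
set_members <- sample(top_taxa, 15)
taxon_set_ranks <- taxon_ranks[set_members]

# Two-sample Kolmogorov-Smirnov test: set members versus all taxa.
# Set members also appear in the full vector, so ties are expected and
# any warning about them is suppressed here.
ks_res <- suppressWarnings(ks.test(taxon_set_ranks, taxon_ranks))

ks_res$p.value            # small when the set is skewed
median(taxon_set_ranks)   # positive: the set is shifted towards increased taxa
```

The sign of the median tells you which end of the distribution the set leans towards, while the P value indicates how unlikely such a shift would be for an unremarkable set.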
To install the latest version of TaxSEA from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TaxSEA")
TaxSEA utilizes taxon sets generated from five reference databases (gutMGene, GMrepo v2, MiMeDB, mBodyMap, BugSigDB).
Please cite the appropriate database(s) when using these taxon sets:
Cheng et al. gutMGene: a comprehensive database for target genes of gut microbes and microbial metabolites. Nucleic Acids Res. 2022.
Dai et al. GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison. Nucleic Acids Res. 2022.
Wishart et al. MiMeDB: the Human Microbial Metabolome Database. Nucleic Acids Res. 2023.
Jin et al. mBodyMap: a curated database for microbes across human body and their associations with health and diseases. Nucleic Acids Res. 2022.
Geistlinger et al. BugSigDB captures patterns of differential abundance across a broad range of host-associated microbial signatures. Nature Biotech. 2023.
The test data provided with TaxSEA consist of log2 fold changes comparing healthy controls and individuals with IBD. The count data were downloaded from curatedMetagenomicData, and the fold changes were generated with LinDA.
- get_taxon_sets(taxon): Retrieves taxon sets which contain a particular taxon for a list of taxon names.
- get_ncbi_taxon_ids(taxon_names): Retrieves NCBI Taxonomy IDs for a list of taxon names.
- TaxSEA(taxon_ranks, database = "All"): Taxon set enrichment analysis.
library(TaxSEA)
# Retrieve taxon sets containing Bifidobacterium longum.
blong.sets <- get_taxon_sets(taxon="Bifidobacterium_longum")
All that is required for TaxSEA is a named vector of log2 fold changes between groups for species or genera. TaxSEA will not work for taxonomic ranks above genus. The input should include all taxa tested, not only a pre-defined subset (e.g. do not apply a significance threshold or remove any taxa). See the example below for the format. TaxSEA will look up taxon names and convert them to NCBI taxonomic identifiers; commonly observed identifiers are stored internally to save time.
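If your differential abundance results live in a table, the named vector can be built along these lines. This is only a sketch: the data frame, its column names (`taxon`, `log2FoldChange`, `padj`) and all values are invented, so adapt them to whatever your own tool returns.

```r
# Hypothetical differential abundance results (all values invented)
da_results <- data.frame(
  taxon = c("Bifidobacterium_longum", "Prevotella_copri",
            "Blautia_wexlerae", "Hungatella_hathewayi"),
  log2FoldChange = c(-0.8, -0.5, -1.4, 1.5),
  padj = c(0.20, 0.61, 0.01, 0.03),
  stringsAsFactors = FALSE
)

# Keep every tested taxon: no filtering on the adjusted P value
taxon_ranks <- setNames(da_results$log2FoldChange, da_results$taxon)

# Basic checks on the input format: numeric, named, no repeated names
stopifnot(is.numeric(taxon_ranks),
          !is.null(names(taxon_ranks)),
          anyDuplicated(names(taxon_ranks)) == 0)

taxon_ranks
```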
TaxSEA can also use custom databases, which should be supplied as a named list of taxon sets. In this case ID conversion is disabled, and the input names and database names are expected to be in the same format.
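A custom database of this kind might look like the sketch below. The set names, members and fold changes are invented, and the way the list is handed to `TaxSEA()` is deliberately not shown; check `?TaxSEA` for the relevant argument. The check at the end only confirms that the input names and the set members share a format.

```r
# Hypothetical custom database: a named list of character vectors
custom_db <- list(
  example_butyrate_producers = c("Faecalibacterium_prausnitzii",
                                 "Roseburia_intestinalis",
                                 "Eubacterium_rectale"),
  example_bifidobacteria = c("Bifidobacterium_longum",
                             "Bifidobacterium_breve")
)

# Hypothetical input vector using the same naming format
taxon_ranks <- c(Faecalibacterium_prausnitzii = -1.2,
                 Roseburia_intestinalis = -0.7,
                 Bifidobacterium_longum = 0.4,
                 Escherichia_coli = 2.1)

# With ID conversion disabled, names must match exactly.
# Count how many members of each set are present in the input.
overlap <- vapply(custom_db,
                  function(members) sum(members %in% names(taxon_ranks)),
                  integer(1))
overlap
```

A set with zero overlap usually points to a naming mismatch (for example spaces versus underscores) rather than a genuine absence.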
Input IDs should be in one of the following formats:
- Species name, e.g. “Bifidobacterium longum” or “Bifidobacterium_longum”
- Genus name, e.g. “Bifidobacterium”
- NCBI ID, e.g. 216816
# Input IDs with the full taxonomic lineage should be split up, e.g.
x <- paste0(
    "d__Bacteria.p__Actinobacteriota.c__Actinomycetes.",
    "o__Bifidobacteriales.f__Bifidobacteriaceae.g__Bifidobacterium")
x <- strsplit(x, split = "\\.")[[1]][6]  # keep the sixth (genus) field
x <- gsub("g__", "", x)                  # drop the rank prefix
print(x)
## [1] "Bifidobacterium"
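The same clean-up can be applied to a whole vector of lineage strings. The helper below is a hypothetical sketch in base R, not a TaxSEA function: it keeps the last field of each lineage and strips a single-letter rank prefix such as `g__` or `s__`. It assumes non-empty strings with fields separated by `.`, `|` or `;`.

```r
# Hypothetical helper: keep the last rank of a lineage string and drop
# its rank prefix (e.g. "g__" or "s__")
last_rank_name <- function(lineage, sep = "[.|;]") {
  parts <- strsplit(lineage, split = sep)
  last <- vapply(parts, function(p) p[length(p)], character(1))
  sub("^[a-z]__", "", last)
}

lineages <- c(
  paste0("d__Bacteria.p__Actinobacteriota.c__Actinomycetes.",
         "o__Bifidobacteriales.f__Bifidobacteriaceae.g__Bifidobacterium"),
  paste0("k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|",
         "f__Bacteroidaceae|g__Bacteroides|s__Bacteroides fragilis")
)

last_rank_name(lineages)
```

Taxon names that themselves contain a separator character (for example a `.` in "sp.") would be split incorrectly, so pass a narrower `sep` if your names include one.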
## Example test data
library(TaxSEA)
data(TaxSEA_test_data)
head(sample(TaxSEA_test_data),4)
## Haemophilus_sp_HMSC71H05 Hungatella_hathewayi Blautia_wexlerae
## 2.179 1.465 -1.446
## Prevotella_copri
## -0.514
data("TaxSEA_test_data")
taxsea_results <- TaxSEA(taxon_ranks=TaxSEA_test_data)
## Warning in ks.test.default(taxon_set_ranks, taxon_ranks): p-value will be
## approximate in the presence of ties
## Warning in ks.test.default(taxon_set_ranks, taxon_ranks): p-value will be
## approximate in the presence of ties
## Warning in ks.test.default(taxon_set_ranks, taxon_ranks): p-value will be
## approximate in the presence of ties
# Enrichments among metabolite producers from gutMGene and MiMeDB
metabolites.df = taxsea_results$Metabolite_producers
# Enrichments among health and disease signatures from GMrepo v2 and mBodyMap
disease.df = taxsea_results$Health_associations
# Enrichments among published associations from BugSigDB
bsdb.df = taxsea_results$BugSigdB
The output is a list of three data frames providing enrichment results for metabolite producers, health/disease associations, and published signatures from BugSigDB. Each data frame has 5 columns:
- taxonSetName - The name of the taxon set tested
- median_rank - The median rank of set members
- P value - Kolmogorov-Smirnov test P value
- FDR - P value adjusted for multiple testing
- TaxonSet - The list of taxa in the set, to show what is driving the signal
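A results table with these columns can be narrowed down with ordinary data frame operations. The sketch below works on a made-up table, so the column names (`P_value` in particular) and values are stand-ins; run `colnames()` on real output before adapting it. The Benjamini-Hochberg adjustment is used here as one common choice and is an assumption about how an FDR column could be derived.

```r
# Hypothetical results table mimicking the columns described above
# (names and values are invented)
res <- data.frame(
  taxonSetName = c("set_A", "set_B", "set_C", "set_D"),
  median_rank = c(1.2, -0.9, 0.1, -1.5),
  P_value = c(0.001, 0.004, 0.40, 0.03),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment for multiple testing
res$FDR <- p.adjust(res$P_value, method = "BH")

# Keep sets below an FDR threshold and label the direction of the shift
sig <- res[res$FDR < 0.05, ]
sig$direction <- ifelse(sig$median_rank > 0, "increased", "decreased")

sig[order(sig$FDR), c("taxonSetName", "median_rank", "FDR", "direction")]
```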
In BugSigDB, each publication is entered as a “Study”, and each study contains one or more experiments and signatures. Users who want more information about a signature can query the database directly.
library(bugsigdbr) #This package is installable via Bioconductor
bsdb <- importBugSigDB() #Import database
## Using cached version from 2025-01-22 23:29:25
#E.g. if the BugSigDB identifier you found enriched was
#bsdb:74/1/2_obesity:obese_vs_non-obese_DOWN
#This is Study 74, Experiment 1, Signature 2
bsdb[bsdb$Study=="Study 74" &
bsdb$Experiment=="Experiment 1" &
bsdb$Signature=="Signature 2",]
## BSDB ID Study Study design PMID DOI URL
## 286 bsdb:74/1/2 Study 74 case-control 23526699 10.1002/oby.20466 <NA>
## Authors list
## 286 Verdam FJ, Fuentes S, de Jonge C, Zoetendal EG, Erbil R, Greve JW, Buurman WA, de Vos WM , Rensen SS
## Title
## 286 Human intestinal microbiota composition is associated with local and systemic inflammation in obesity
## Journal Year Keywords Experiment
## 286 Obesity (Silver Spring, Md.) 2013 <NA> Experiment 1
## Location of subjects Host species Body site UBERON ID Condition
## 286 Netherlands Homo sapiens Feces UBERON:0001988 Obesity
## EFO ID Group 0 name Group 1 name Group 1 definition
## 286 EFO:0001073 non-obese obese obesity: BMI 30.5-60.3 kg/m2
## Group 0 sample size Group 1 sample size Antibiotics exclusion
## 286 13 15 6 months
## Sequencing type 16S variable region Sequencing platform
## 286 16S NA Human Intestinal Tract Chip
## Statistical test Significance threshold MHT correction
## 286 Mann-Whitney (Wilcoxon) 0.05 TRUE
## LDA Score above Matched on Confounders controlled for Pielou Shannon Chao1
## 286 NA <NA> <NA> <NA> <NA> <NA>
## Simpson Inverse Simpson Richness Signature page name Source
## 286 <NA> <NA> <NA> Signature 2 Table 2
## Curated date Curator Revision editor
## 286 10 January 2021 Marianthi Thomatos WikiWorks,ChiomaBlessing
## Description
## 286 Differential microbial abundance between obese and non-obese individuals
## Abundance in Group 1
## 286 decreased
## MetaPhlAn taxon names
## 286 k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Rikenellaceae|g__Alistipes, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Alloprevotella|s__Alloprevotella tannerae, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides fragilis, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides intestinalis, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides stercoris, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides uniformis, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Hoylesella|s__Hoylesella oralis, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Odoribacteraceae|g__Odoribacter|s__Odoribacter splanchnicus, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Tannerellaceae|g__Parabacteroides|s__Parabacteroides distasonis, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Phocaeicola|s__Phocaeicola plebeius, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Tannerellaceae|g__Tannerella, k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Xylanibacter|s__Xylanibacter ruminicola
## NCBI Taxonomy IDs
## 286 2|976|200643|171549|171550|239759, 2|976|200643|171549|171552|1283313|76122, 2|976|200643|171549|815|816|817, 2|976|200643|171549|815|816|329854, 2|976|200643|171549|815|816|28116, 2|976|200643|171549|815|816|46506, 2|976|200643|171549|815|816|820, 2|976|200643|171549|171552|2974257|28134, 2|976|200643|171549|1853231|283168|28118, 2|976|200643|171549|2005525|375288|823, 2|976|200643|171549|815|909656|310297, 2|976|200643|171549|2005525|195950, 2|976|200643|171549|171552|558436|839
## State Reviewer
## 286 Complete Shaimaa Elsafoury
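To avoid typing the study, experiment and signature by hand, the identifier can be parsed first. The helper below is a hypothetical sketch in base R (it is not part of TaxSEA or bugsigdbr) and assumes identifiers always start with `bsdb:<study>/<experiment>/<signature>`, as in the example above.

```r
# Hypothetical helper: split a taxon set name such as
# "bsdb:74/1/2_obesity:obese_vs_non-obese_DOWN" into its numeric parts
parse_bsdb_id <- function(set_name) {
  pattern <- "^bsdb:([0-9]+)/([0-9]+)/([0-9]+)"
  m <- regmatches(set_name, regexec(pattern, set_name))[[1]]
  if (length(m) != 4) {
    stop("Not a recognised BugSigDB identifier: ", set_name)
  }
  list(Study = paste("Study", m[2]),
       Experiment = paste("Experiment", m[3]),
       Signature = paste("Signature", m[4]))
}

id <- parse_bsdb_id("bsdb:74/1/2_obesity:obese_vs_non-obese_DOWN")
id$Study

# With `bsdb` imported as above, the matching row could then be selected:
# bsdb[bsdb$Study == id$Study &
#      bsdb$Experiment == id$Experiment &
#      bsdb$Signature == id$Signature, ]
```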
By default, the TaxSEA function uses the Kolmogorov-Smirnov test; the original idea was inspired by gene set enrichment analysis of RNA-seq data. The database is formatted so that users can apply an alternative gene set enrichment analysis tool if they wish. See below for an example with fast gene set enrichment analysis (fgsea).
library(fgsea) #This package is installable via Bioconductor
data("TaxSEA_test_data")
data("TaxSEA_db")
#Convert input names to NCBI taxon ids
names(TaxSEA_test_data) = get_ncbi_taxon_ids(names(TaxSEA_test_data))
TaxSEA_test_data = TaxSEA_test_data[!is.na(names(TaxSEA_test_data))]
#Run fgsea
fgsea_results <- fgsea(TaxSEA_db, TaxSEA_test_data, minSize=5, maxSize=500)
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.83% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
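Warnings about ties arise when several taxa share exactly the same statistic. A quick way to see how many values are affected before running a rank-based test is sketched below in base R. The vector is invented, and the fraction computed here is a simple share of entries involved in any tie, which may not match the percentage a given tool reports.

```r
# Hypothetical ranked statistics with some repeated values
stats <- c(taxon_a = 1.5, taxon_b = -0.3, taxon_c = 1.5,
           taxon_d = 0.0, taxon_e = -0.3, taxon_f = 2.2)

# Entries whose value occurs more than once
tied <- stats[stats %in% stats[duplicated(stats)]]
tied

# Share of entries involved in a tie
tie_fraction <- length(tied) / length(stats)
tie_fraction
```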
sessionInfo()
## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] fgsea_1.34.0 bugsigdbr_1.14.0 TaxSEA_1.0.0 BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.3 lattice_0.22-7
## [4] RSQLite_2.3.9 digest_0.6.37 magrittr_2.0.3
## [7] evaluate_1.0.3 grid_4.5.0 bookdown_0.43
## [10] fastmap_1.2.0 blob_1.2.4 Matrix_1.7-3
## [13] jsonlite_2.0.0 DBI_1.2.3 BiocManager_1.30.25
## [16] httr_1.4.7 purrr_1.0.4 scales_1.3.0
## [19] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.4
## [22] rlang_1.1.6 dbplyr_2.5.0 munsell_0.5.1
## [25] cowplot_1.1.3 bit64_4.6.0-1 withr_3.0.2
## [28] cachem_1.1.0 yaml_2.3.10 tools_4.5.0
## [31] parallel_4.5.0 BiocParallel_1.42.0 memoise_2.0.1
## [34] dplyr_1.1.4 colorspace_2.1-1 ggplot2_3.5.2
## [37] fastmatch_1.1-6 filelock_1.0.3 curl_6.2.2
## [40] vctrs_0.6.5 R6_2.6.1 BiocFileCache_2.16.0
## [43] lifecycle_1.0.4 bit_4.6.0 pkgconfig_2.0.3
## [46] pillar_1.10.2 bslib_0.9.0 gtable_0.3.6
## [49] data.table_1.17.0 glue_1.8.0 Rcpp_1.0.14
## [52] xfun_0.52 tibble_3.2.1 tidyselect_1.2.1
## [55] knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [58] compiler_4.5.0