---
title: "bugphyzz"
subtitle:
"A harmonized data resource and software for enrichment analysis of
microbial physiologies"
author:
- name: "Samuel Gamboa"
email: "Samuel.Gamboa.Tuz@gmail.com"
- name: "Levi Waldron"
email: "levi.waldron@sph.cuny.edu"
data: "`r doc_date()`"
package: "`r pkg_ver('bugphyzz')`"
abstract: >
The **bugphyzz** package simplifies the import of harmonized microbial
annotations into R from diverse sources. These annotations, including
extended taxa via ancestral state reconstruction, are organized into tidy
`data.frame` objects, facilitating the creation of microbial signatures.
These signatures can be used for enrichment analysis of microbiome omic
data, akin to gene set enrichment analysis using tools like
[EnrichmentBrowser](https://bioconductor.org/packages/release/bioc/html/EnrichmentBrowser.html)
and [bugsigdbr](https://bioconductor.org/packages/release/bioc/html/bugsigdbr.html),
which are also available on Bioconductor. Annotaions are imported from [Zenodo](https://zenodo.org/doi/10.5281/zenodo.10980653).
output:
BiocStyle::html_document:
fig_caption: true
toc: true
vignette: >
%\VignetteIndexEntry{bugphyzz}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
The bugphyzz package offers a convenient way to import a collection of
harmonized microbial annotations from various sources into R.
These annotations are available on [Zenodo](https://zenodo.org/doi/10.5281/zenodo.10980653).
In addition to being harmonized, some annotations have been extended to
other taxa based on the phylogeny from
[‘The All-Species Living Tree Project’](https://imedea.uib-csic.es/mmg/ltp/)
using ancestral state reconstruction (ASR) methods.
The annotations are provided in tabular format and organized into distinct
tidy `data.frame` objects
(for details, see the “Data schema” section below).
Once imported, these `data.frame` objects can be used to create microbial
signatures, which are lists of taxa with shared characteristics.
We anticipate these signatures being utilized for enrichment analysis of
microbiome omic data by implementing workflows similar to those used in
gene set enrichment analysis; for example, using the
[EnrichmentBrowser](https://bioconductor.org/packages/release/bioc/html/EnrichmentBrowser.html)
package (a detailed example is provided in a section below).
A similar package in Bioconductor is the [bugsigdbr](https://bioconductor.org/packages/release/bioc/html/bugsigdbr.html)
package, which imports literature-published microbial signatures from the
[BugSigDB](https://bugsigdb.org/Main_Page) database and has been used for
bug set enrichment analysis (BSEA). Moreover, the [writeGMT function](https://bioconductor.org/packages/release/bioc/vignettes/bugsigdbr/inst/doc/bugsigdbr.html#13_Writing_microbe_signatures_to_file_in_GMT_format)
from the bugsigdbr package can export bugphyzz signatures as GMT text
files.
# Data schema
Annotations in bugphyzz represent the link between a taxon (Bacteria/Archaea)
and an attribute, as outlined in the data schema provided below.
![**Data schema**](bugphyzz_data_schema.png){width="100%"}
**Taxon-related**
Taxonomic data in bugphyzz is standardized according to the NCBI taxonomy:
1. _NCBI ID_: An integer representing the NCBI taxonomy ID (taxid) associated
with a taxon.
2. _Rank_: A character string indicating the taxonomy rank,
including superkingdom, kingdom, phylum, class, order, family, genus, species,
or strain.
3. _Taxon name_: A character string denoting the scientific name of the taxon.
**Attribute-related**
Attribute data is harmonized using
ontology terms. Details of attributes, ontology terms, and ontology libraries
can be found in the [Attribute and sources](articles/attributes.html) article.
4. _Attribute_: A character string describing the name of a trait that can be
observed or measured.
5. _Attribute type_: A character string indicating the data type:
* numeric: Attributes with numeric values (e.g., growth temperature: 25°C).
* binary: Attributes with boolean values (e.g., butyrate-producing: TRUE).
* multistate-intersection: A set of related binary attributes
(e.g., habitat).
* multistate-union: Attributes with three or more values represented as
character strings (e.g., aerophilicity: aerobic, anaerobic,
or facultatively anaerobic).
6. _Attribute value_: The possible values that an attribute could take,
represented as character strings, booleans, or numbers.
**Attribute value-related**
Metadata associated with attribute values:
7. _Attribute source_: The source of the information.
8. _Evidence_: The type of evidence supporting an annotation, including:
* EXP: Experiment
* IGC: Inferred from genomic context
* TAS: Traceable author statement
* NAS: Non-traceable author statement
* IBD: Inferred from biological aspect of descendant
* ASR: Ancestral state reconstruction
9. _Support values_:
* Frequency and Score: Confidence that a given taxon exhibits a trait
based on the curator’s knowledge or results of ASR or IBD.
* Validation: Score from the
[10-fold cross-validation analysis](https://github.com/waldronlab/taxPProValidation).
Matthews correlation coefficient (MCC) for discrete attributes and
R-squared for numeric attributes. Default threshold value is 0.5 and above.
* NSTI: Nearest sequence taxon index from [PICRUSt2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7365738/)
or the [castor package](https://cran.r-project.org/web/packages/castor/index.html).
Relevant for numeric values only.
**Attribute source-related**
10. _Confidence in curation_: A character string indicating the confidence
value of a source based on three criteria: 1) presence of a source, 2) valid
references, and 3) peer-reviewed curation.
Valid options include high, medium, or low, corresponding to satisfaction of
three, two, or one of these criteria.
**Additional information**
+ Description of **sources** and **attributes**:
https://waldronlab.io/bugphyzz/articles/attributes.html
+ Description of ontology **evidence** codes:
https://geneontology.org/docs/guide-go-evidence-codes/
+ Description of **frequency** keywords and scores were based on:
https://grammarist.com/grammar/adverbs-of-frequency/
+ IBD and ASR were performed with taxPPro:
https://github.com/waldronlab/taxPPro
# Analysis and Stats
This vignette serves as an introduction to the basic functionalities of
bugphyzz. For a more in-depth analysis and detailed statistics
utilizing bugphyzz annotations, please visit:
https://github.com/waldronlab/bugphyzzAnalyses
# Installation
```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("bugphyzz")
```
# Import bugphyzz data
Load the bugphyzz package and additional packages for data manipulation:
```{r load package, message=FALSE}
library(bugphyzz)
library(dplyr)
library(purrr)
```
bugphyzz data is imported using the `importBugphyzz` function,
resulting in a list of tidy data frames. Each data frame corresponds to an
attribute or a group of related attributes. This is particularly evident in
the case of the multistate-union type described in the data schema above,
where related attributes are grouped together in a single data frame. Available
attribute names can be inspected with the `names` function:
```{r import data, message=FALSE}
bp <- importBugphyzz()
names(bp)
```
Let's take a glimpse at one of the data frames:
```{r a glimpse}
glimpse(bp$aerophilicity, width = 50)
```
Compare the column names with the data schema described above.
# Create microbial signatures
bugphyzz's primary function is to facilitate the creation of microbial
signatures, which are essentially lists of microbes sharing specific taxonomy
ranks and attribute values. Once the data frames containing attribute
information are imported, the `makeSignatures` function can be employed to
generate these signatures. `makeSignatures` offers various filtering options,
including evidence, frequency, and minimum and maximum values for numeric
attributes. For more precise filtering requirements, users can leverage
standard data manipulation functions on the relevant data frame,
such as `dplyr::filter`.
Examples:
+ Create signatures of taxon names at the genus level for the aerophilicity
attribute (discrete):
```{r}
aer_sigs_g <- makeSignatures(
dat = bp[["aerophilicity"]], taxIdType = "Taxon_name", taxLevel = "genus"
)
map(aer_sigs_g, head)
```
+ Create signatures of taxon names at the species level for the growth
temperature attribute (numeric):
```{r}
gt_sigs_sp <- makeSignatures(
dat = bp[["growth temperature"]], taxIdType = "Taxon_name",
taxLevel = 'species'
)
map(gt_sigs_sp, head)
```
+ Create signatures with a custom threshold for the growth temperature
attribute (numeric):
```{r}
gt_sigs_mix <- makeSignatures(
dat = bp[["growth temperature"]], taxIdType = "Taxon_name",
taxLevel = "mixed", min = 0, max = 25
)
map(gt_sigs_mix, head)
```
+ Create signatures for the animal pathogen attribute (boolean):
```{r}
ap_sigs_mix <- makeSignatures(
dat = bp[["animal pathogen"]], taxIdType = "NCBI_ID",
taxLevel = "mixed", evidence = c("exp", "igc", "nas", "tas")
)
map(ap_sigs_mix, head)
```
+ Create signatures for all of the bugphyzz data frames:
```{r}
sigs <- map(bp, makeSignatures) |>
list_flatten(name_spec = "{inner}")
length(sigs)
```
```{r}
head(map(sigs, head))
```
# Run a bug set enrichment analysis
Bugphyzz signatures are suitable for conducting bug set enrichment analysis
using existing tools available in R. In this example, we will perform a set
enrichment analysis using a dataset
with a known biological ground truth.
The dataset originates from the Human Microbiome Project (2012) and compares
subgingival and supragingival plaque.
This data will be imported using the
[MicrobiomeBenchmarkData package](https://bioconductor.org/packages/release/data/experiment/html/MicrobiomeBenchmarkData.html).
For the implementation of the enrichment analysis, we will utilize the
Gene Set Enrichment Analysis (GSEA) method available in the
[EnrichmentBrowser package](https://bioconductor.org/packages/release/bioc/html/EnrichmentBrowser.html).
The expected outcome is an enrichment of aerobic taxa in the supragingival
plaque (positive enrichment score) and anaerobic taxa in the subgingival plaque
(negative enrichment score).
Load necessary packages:
```{r, message=FALSE}
library(EnrichmentBrowser)
library(MicrobiomeBenchmarkData)
library(mia)
```
Import benchmark data:
```{r, warning=FALSE}
dat_name <- 'HMP_2012_16S_gingival_V35'
tse <- MicrobiomeBenchmarkData::getBenchmarkData(dat_name, dryrun = FALSE)[[1]]
tse_genus <- mia::splitByRanks(tse)$genus
min_n_samples <- round(ncol(tse_genus) * 0.2)
tse_subset <- tse_genus[rowSums(assay(tse_genus) >= 1) >= min_n_samples,]
tse_subset
```
Let's use the edgeR method for differential abundance analysis and
obtain sets of microbes.
Subgingival plaque will be used as reference
or "control", so negative values will mean enrichment in the subgingival plaque
and positive values will mean enrichment in the supragingival plaque.
Perform differential abundance (DA) analysis:
```{r}
tse_subset$GROUP <- ifelse(
tse_subset$body_subsite == 'subgingival_plaque', 0, 1
)
se <- EnrichmentBrowser::deAna(
expr = tse_subset, de.method = 'edgeR', padj.method = 'fdr',
filter.by.expr = FALSE,
)
```
It's recommended to perform a normalization step of the counts before
running GSEA. From the original [GSEA user guide](https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm):
"GSEA does not normalize RNA-seq data.
RNA-seq data must be normalized for between-sample comparisons using an
external normalization procedure (e.g. those in DESeq2 or Voom)."
In this example, we are treating the microbiome
data as RNA-seq (see: https://link.springer.com/article/10.1186/s13059-020-02104-1).
Let's use the `limma::voom` function.
A glimpse to the assay stored in the SE:
```{r}
assay(se)[1:5, 1:5] # counts
```
From the `?limma::voom` documentation, input should be "a numeric matrix
containing raw counts...". Note that the assay in the SummarizedExperiment
will be replaced with normalized counts.
Perform normalization step:
```{r}
dat <- data.frame(colData(se))
design <- stats::model.matrix(~ GROUP, data = dat)
assay(se) <- limma::voom(
counts = assay(se), design = design, plot = FALSE
)$E
```
The output is a "numeric matrix of normalized expression values on the
log2 scale" as described in the `?lima::voom` documentation. This output
is ready for GSEA.
```{r}
assay(se)[1:5, 1:5] # normalized counts
```
Perform GSEA and display the results:
```{r, message=FALSE}
gsea <- EnrichmentBrowser::sbea(
method = 'gsea', se = se, gs = aer_sigs_g, perm = 1000,
# Alpha is the FDR threshold (calculated above) to consider a feature as
# significant.
alpha = 0.1
)
gsea_tbl <- as.data.frame(gsea$res.tbl) |>
mutate(
GENE.SET = ifelse(PVAL < 0.05, paste0(GENE.SET, ' *'), GENE.SET),
PVAL = round(PVAL, 3),
) |>
dplyr::rename(BUG.SET = GENE.SET)
knitr::kable(gsea_tbl)
```
# Get signatures associated with a specific microbe
To retrieve all signature names associated with a specific taxon,
users can utilize the `getTaxonSignatures` function.
Let's see an example using _Escherichia coli_ (taxid: 562).
Get all signature names associated to _E. coli_:
```{r, message=FALSE}
getTaxonSignatures(tax = "Escherichia coli", bp = bp)
```
Get all signature names associated to the _E. coli_ taxid:
```{r}
getTaxonSignatures(tax = "562", bp = bp)
```
# Session information
```{r}
sessioninfo::session_info()
```