---
title: "HoloFoodR: interface to HoloFood database"
date: "`r Sys.Date()`"
package: HoloFoodR
output:
    BiocStyle::html_document:
        fig_height: 7
        fig_width: 10
        toc: yes
        toc_depth: 2
        number_sections: true
vignette: >
    %\VignetteIndexEntry{HoloFoodR}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r}
#| label: setup
#| include: false
library(knitr)
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    cache = TRUE
)
```

## Introduction

`HoloFoodR` is a package designed to ease access to the EBI's
[HoloFood](https://www.holofooddata.org/) resource, allowing searching and
retrieval of multiple datasets for downstream analysis.

The HoloFood database does not include metagenomics data; such data is stored
in the [MGnify](https://www.ebi.ac.uk/metagenomics) database, which can be
accessed with the `MGnifyR` package. Both packages offer analogous
functionality, which streamlines data integration and makes the data more
accessible. Retrieved omics data is stored as
`r BiocStyle::Biocpkg("TreeSummarizedExperiment")` objects.

## Installation

`HoloFoodR` is hosted on Bioconductor and can be installed via `BiocManager`.

```{r}
#| label: install
#| eval: false
# Install BiocManager first if it is not yet available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("HoloFoodR")
```

## Load the package

Once installed, `HoloFoodR` is made available in the usual way.

```{r}
#| label: load_package
library(HoloFoodR)
```

## Functionalities

`HoloFoodR` offers three functions, `doQuery()`, `getResult()` and `getData()`,
which can be used to search and fetch data from the HoloFood database.

In this tutorial, we demonstrate how to search animals, subset them based on
whether they have a specific sample type, and finally fetch the data on their
samples. The same result can be achieved with only `doQuery()` and
`getResult()` (or `getData()` and `getResult()`) by using query filters; the
longer route is taken here to demonstrate the functionality of the package.

Additionally, the package includes the `getMetaboLights()` function, which can
be used to retrieve metabolomic data from the MetaboLights database.

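The subsetting step of this workflow is plain logical indexing on a
presence/absence table. The following offline sketch illustrates the idea on a
made-up table; the accessions and the exact column layout are hypothetical and
only assume that each sample type is stored as a logical column.

```r
# Made-up search result: one row per animal, one logical column per sample type
# (hypothetical values, not real HoloFood records)
animals <- data.frame(
    accession = c("ANIMAL_1", "ANIMAL_2", "ANIMAL_3"),
    histological = c(TRUE, FALSE, TRUE),
    metabolomic = c(FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
)

# Keep animals that have histological samples
with_histology <- animals[animals[["histological"]], ]

# Combine conditions with & to require several sample types at once
with_both <- animals[
    animals[["histological"]] & animals[["metabolomic"]], ]

with_histology[["accession"]]
with_both[["accession"]]
```

Because the columns are logical, they can be used directly as row indices
without any comparison operator.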
### Search data

To search animals, genome catalogues, samples or viral catalogues, you can use
the `doQuery()` function. You can also use `getData()`, but `doQuery()` is
optimized for searching these data types. For example, instead of a nested list
of sample types, `doQuery()` returns sample types as a presence/absence table,
which is more convenient.

Here we search animals and subset them based on whether they include
histological samples. The same subset can also be obtained with query filters.

```{r}
#| label: search_animals
animals <- doQuery("animals", max.hits = 100)
# Keep only animals that have histological samples
animals <- animals[ animals[["histological"]], ]
colnames(animals) |> head()
```

`doQuery()` returns a `data.frame` that includes information on the type of
data being searched.

### Get data

Now we know which animals have histological samples. Let's get data on those
animals to find out which sample IDs to fetch.

```{r}
#| label: get_animal_data
animal_data <- getData(
    accession.type = "animals", accession = animals[["accession"]])
```

The value returned by `getData()` is a `list`. The data can also be returned as
a `data.frame` by specifying `flatten = TRUE`. The data contains information on
the animals, including the samples that have been drawn from them.

```{r}
#| label: get_samples1
samples <- animal_data[["samples"]]
colnames(samples) |> head()
```

The elements of the `list` are `data.frame`s. For example, the "samples" table
contains information on samples drawn from the animals that were specified in
the input.

Now we can collect the sample IDs.

```{r}
#| label: get_samples2
sample_ids <- unique(samples[["accession"]])
```

### Get data on samples

To get data on samples, we can use the `getResult()` function. It returns the
data in `r BiocStyle::Biocpkg("MultiAssayExperiment")` (`MAE`) format.

```{r}
#| label: get_mae
mae <- getResult(sample_ids)
mae
```

The `MAE` object stores individual omics as
`r BiocStyle::Biocpkg("TreeSummarizedExperiment")` (`TreeSE`) objects.

```{r}
#| label: show_mae
mae[[1]]
```

In `TreeSE`, each column represents a sample and each row represents a feature.

### Incorporate with MGnify data

`MGnifyR` is a package that can be used to fetch metagenomics data from the
MGnify database. Its `MGnifyR::searchAnalysis()` function searches for analyses
based on the sample IDs that we have.

```{r}
#| label: search_analyses1
#| eval: false
library(MGnifyR)

mg <- MgnifyClient(useCache = TRUE)

# Get those samples that are metagenomic samples
metagenomic_samples <- samples[
    samples[["sample_type"]] == "metagenomic_assembly", ]

# Get analysis IDs based on sample IDs
analysis_ids <- searchAnalysis(
    mg, type = "samples", metagenomic_samples[["accession"]])
```

```{r}
#| label: search_analyses2
#| include: false
path <- system.file("extdata", "analysis_ids.rds", package = "HoloFoodR")
analysis_ids <- readRDS(path)
```

```{r}
#| label: search_analyses3
head(analysis_ids)
```

Then we can fetch data based on the analysis IDs.

```{r}
#| label: get_metageomic_data1
#| eval: false
mae_metagenomic <- MGnifyR::getResult(mg, analysis_ids)
```

```{r}
#| label: get_metageomic_data2
#| include: false
path <- system.file("extdata", "mae_metagenomic.rds", package = "HoloFoodR")
mae_metagenomic <- readRDS(path)
```

```{r}
#| label: get_metageomic_data3
mae_metagenomic
```

`MGnifyR::getResult()` returns an `MAE` object, just like
`HoloFoodR::getResult()`. However, the metagenomic data points to individual
analyses instead of samples. We can harmonize the data by replacing analysis
IDs with sample IDs, and then combine the data into a single `MAE`.

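The renaming step relies on `match()` against a named vector. The toy sketch
below uses made-up IDs and assumes, as the code in this vignette does, that
`analysis_ids` holds analysis IDs whose names are the corresponding sample IDs.

```r
# Hypothetical IDs for illustration only; real values come from searchAnalysis()
analysis_ids <- c(
    SAMPLE_A = "ANALYSIS_1", SAMPLE_B = "ANALYSIS_2", SAMPLE_C = "ANALYSIS_3")

# Toy assay whose columns are named by analysis ID, in a different order
assay <- matrix(
    1:6, nrow = 2,
    dimnames = list(
        c("feature1", "feature2"),
        c("ANALYSIS_3", "ANALYSIS_1", "ANALYSIS_2")))

# For each column, find the position of its analysis ID in analysis_ids
idx <- match(colnames(assay), analysis_ids)

# Use those positions to look up the sample IDs stored as names
colnames(assay) <- names(analysis_ids)[idx]
colnames(assay)
```

`match()` returns `NA` for a column that has no entry in `analysis_ids`, so
checking `anyNA(colnames(assay))` after renaming is a cheap safeguard. Applied
to the actual experiments, the same step looks as follows.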
```{r}
#| label: combine_data
#| eval: false
# Get experiments from metagenomic data
exps <- experiments(mae_metagenomic)
# Convert analysis names to sample names
exps <- lapply(exps, function(x){
    # Get corresponding sample ID
    sample_id <- names(analysis_ids)[ match(colnames(x), analysis_ids) ]
    # Replace analysis ID with sample ID
    colnames(x) <- sample_id
    return(x)
})

# Combine with the experiments of the original MultiAssayExperiment
mae <- c(experiments(mae), exps)
mae
```

With the data combined, samples from the various omics are linked together.
Because each single-omics dataset is stored in `(Tree)SummarizedExperiment`
format, the data can be passed directly to downstream analysis tools that
support these containers, such as the
[miaverse framework](https://microbiome.github.io/).

### Extra: Get data from MetaboLights database

The HoloFood database exclusively contains targeted metabolomic data. However,
it provides URLs that link to the MetaboLights database, where untargeted
metabolomics data can be accessed. You can use the `getMetaboLights()` function
to retrieve information on the available data. It also returns processed
metabolomic data (for processed data, you can also use
`getResult(x, get.metabolomic=TRUE)`). Below, we retrieve all the processed
(mapped) metabolomic data associated with HoloFood.

```{r}
#| label: get_metabolomics
#| eval: false
# Get untargeted metabolomic samples
samples <- doQuery("samples", sample_type = "metabolomic")
# Get the data
metabolomic <- getMetaboLights(samples[["metabolomics_url"]])
# Show names of data.frames
names(metabolomic)
```

The result is a list that includes three `data.frame`s:

- study metadata
- assay metadata
- assay that includes abundance table and feature metadata

For spectra data, you can either fetch files using `getMetaboLightsFile()`, or
follow this
[vignette](https://rformassspectrometry.github.io/MsIO/articles/MsIO.html#loading-data-from-metabolights)
for guidance on loading data directly into an `MsExperiment` object, which is
tailored for metabolomics spectra data.

## Session info

```{r}
#| label: session_info
sessionInfo()
```