---
title: "Explore Human BioMolecular Atlas Program Data Portal"
author:
- name: "Christine Hou"
  affiliation: Department of Biostatistics, Johns Hopkins University
  email: chris2018hou@gmail.com
output:
  BiocStyle::html_document
package: "HuBMAPR"
vignette: |
  %\VignetteIndexEntry{Accessing Human Cell Atlas Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{Software, SingleCell, DataImport, ThirdPartyClient, Spatial, Infrastructure}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Overview

The 'HuBMAP' data portal provides an open, global bio-molecular atlas of the human body at the cellular level. The `HuBMAPR` package provides an alternative interface to explore the data via R.

The HuBMAP Consortium offers several [APIs](https://docs.hubmapconsortium.org/apis.html). To achieve its main objectives, the `HuBMAPR` package integrates three of them:

- [Search API](https://smart-api.info/ui/7aaf02b838022d564da776b03f357158): The **Search API** is used primarily to search for relevant data information and is based on the [Elasticsearch API](https://www.elastic.co/guide/en/elasticsearch/).

- [Entity API](https://smart-api.info/ui/0065e419668f3336a40d1f5ab89c6ba3): The **Entity API** is used in the `bulk_data_transfer()` function to retrieve Globus URLs.

- [Ontology API](https://smart-api.info/ui/d10ff85265d8b749fbe3ad7b51d0bf0a): The **Ontology API** is used in the `organ()` function to provide the abbreviation and corresponding full name of each organ.

Each API serves a distinct purpose with its own query capabilities. Using the `httr2` and `rjsoncons` packages, `HuBMAPR` manages, modifies, and executes multiple requests via these APIs, presenting responses in formats such as tibble or character.
These outputs are further tidied for clarity in the results returned by the `HuBMAPR` functions, which aim to reflect the information on the HuBMAP Data Portal as closely as possible.

Using temporary storage to cache API responses makes data retrieval more efficient by reducing redundant requests to the HuBMAP Data Portal. This approach minimizes server load, improves response times (e.g. `datasets()` takes less than 4 seconds to retrieve information on more than 3500 records, shown below), and enhances overall query efficiency. Periodically clearing cached data, or directing it to a temporary directory, keeps the retrieved information relevant while managing storage effectively.

HuBMAP Data incorporates three different [identifiers](https://docs.hubmapconsortium.org/apis):

- HuBMAP ID, e.g. HBM399.VCTL.353

- Universally Unique Identifier (UUID), e.g. 7036a70229eff1a51af965454dddbe7d

- Digital Object Identifier (DOI), e.g. 10.35079/HBM399.VCTL.353.

The `HuBMAPR` package uses the UUID (a 32-digit hexadecimal number) and the more human-readable HuBMAP ID as the two common identifiers in the retrieved results. For precision and for compatibility with software implementation and data storage, the UUID serves as the primary identifier to retrieve data across functions, and each UUID maps uniquely to its corresponding HuBMAP ID.

Function names follow a systematic nomenclature: the entity category is used as a prefix, followed by a concise description of the specific functionality. Most functions are grouped by entity category, which simplifies selecting the appropriate function to retrieve the desired information associated with a given UUID from a specific entity category.
The structure of these functions is largely consistent across all entity categories, with some exceptions for collection and publication.

# Installation

`HuBMAPR` is an R package. The package can be installed with:

```{r 'install bioc', eval=FALSE}
if (!requireNamespace("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("HuBMAPR")
```

Install the development version from [GitHub](https://christinehou11.github.io/HuBMAPR):

```{r 'github', eval = FALSE}
remotes::install_github("christinehou11/HuBMAPR")
```

# Basic User Guide

## Implementation Notes

This section provides guidance on extending or customizing the `HuBMAPR` package to accommodate potential future changes in data structure, enhancing the package's long-term utility. The brief outline below illustrates the basic principles and approach behind the package design.

- Identify an API end point

- Provide an R client to translate R data structures to the arguments and parameters required by the API

- Handle the response in a consistent way with respect to argument and response validation

- Format the return value as a 'tibble' or 'character' to minimize cognitive demands on the user for interpreting the result, and to facilitate incorporation into general R workflows

## Load Necessary Packages

Load additional packages. The `dplyr` package is widely used in this vignette for data wrangling and for extracting specific information.

```{r 'library', message=FALSE, warning=FALSE}
library("dplyr")
library("tidyr")
library("ggplot2")
library("HuBMAPR")
library("pryr")
```

## Data Discovery

The `HuBMAP` data portal page displays five categories of entity data chronologically (by last modified date and time):

- **Dataset**

- **Sample**

- **Donor**

- **Publication**

- **Collection**.

Use the corresponding functions to explore entity data.

```{r 'datasets'}
system.time({
    datasets_df <- datasets()
})

object_size(datasets_df)

datasets_df
```

`samples()`, `donors()`, `collections()`, and `publications()` work the same way.

```{r 'plot', echo=FALSE, warning=FALSE, message=FALSE}
datasets_sub <- datasets_df |>
    select(organ, dataset_type) |>
    group_by(organ) |>
    mutate(count = n()) |>
    filter(!is.na(organ))

colorblind_palette <- c(
    "10X Multiome" = "#E69F00",
    "2D Imaging Mass Cytometry" = "#56B4E9",
    "3D Imaging Mass Cytometry" = "#009E73",
    "ATACseq" = "#F0E442",
    "Auto-fluorescence" = "#0072B2",
    "Cell DIVE" = "#D55E00",
    "CODEX" = "#CC79A7",
    "DESI" = "#999999",
    "Histology" = "#E69F00",
    "LC-MS" = "#56B4E9",
    "Light Sheet" = "#009E73",
    "MALDI" = "#F0E442",
    "MIBI" = "#0072B2",
    "MUSIC" = "#D55E00",
    "Publication" = "#CC79A7",
    "RNAseq" = "#999999",
    "seqFish" = "#E69F00",
    "Slide-seq" = "#56B4E9",
    "Visium (no probes)" = "#009E73",
    "WGS" = "#F0E442")

plot1 <- ggplot(datasets_sub,
        aes(y = reorder(organ, count), fill = dataset_type)) +
    geom_histogram(stat = "count") +
    scale_fill_manual(values = colorblind_palette) +
    labs(x = NULL, y = NULL, fill = "Assay Type") +
    theme_minimal() +
    theme(
        panel.grid.major.y = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 9),
        legend.position = "bottom",
        legend.title = element_text(size = 9),
        legend.text = element_text(size = 7),
        panel.background = element_rect(fill = "white", color = NA),
        plot.background = element_rect(fill = "white", color = NA)) +
    guides(fill = guide_legend(nrow = 4))

plot1
```

The default tibble produced by each entity function shows only a selection of the available columns. To see the names of the selected columns, use the following commands for each entity category. Set the `as` parameter to `"character"` or `"tibble"` to choose the output format.

```{r 'cols'}
# as = "tibble" (default)
datasets_col_tbl <- datasets_default_columns(as = "tibble")
datasets_col_tbl

# as = "character"
datasets_col_char <- datasets_default_columns(as = "character")
datasets_col_char
```

`samples_default_columns()`, `donors_default_columns()`, `collections_default_columns()`, and `publications_default_columns()` work the same way.

A brief overview of the selected columns for the five entity categories is:

```{r 'summary cols'}
tbl <- bind_cols(
    dataset = datasets_default_columns(as = "character"),
    sample = c(samples_default_columns(as = "character"), rep(NA, 7)),
    donor = c(donors_default_columns(as = "character"), rep(NA, 6)),
    collection = c(collections_default_columns(as = "character"),
                    rep(NA, 10)),
    publication = c(publications_default_columns(as = "character"),
                    rep(NA, 7))
)

tbl
```

Use `organ()` to list the organs available in `HuBMAP`. This can help when filtering retrieved data by organ.

```{r 'organs'}
organs <- organ()
organs
```

### Data Wrangling Examples

Standard data wrangling and filtering can be applied to retrieve the data of interest.

```{r 'datasets filter'}
# Example from datasets()
datasets_df |>
    filter(organ == 'Small Intestine') |>
    count()
```

Every dataset, sample, donor, collection, and publication has a unique **HuBMAP ID** and **UUID**; the **UUID** is the main identifier used by most functions to retrieve specific details.

The **donor_hubmap_id** column is included in the tibbles retrieved by `samples()` and `datasets()`, which makes it possible to join them with the donor tibble.

```{r 'derived using left_join'}
donors_df <- donors()
donor_sub <- donors_df |>
    filter(Sex == "Female",
        Age <= 76 & Age >= 55,
        Race == "White",
        `Body Mass Index` <= 25,
        last_modified_timestamp >= "2020-01-08" &
        last_modified_timestamp <= "2020-06-30") |>
    head(1)

# Datasets
donor_sub_dataset <- donor_sub |>
    left_join(datasets_df |>
                select(-c(group_name, last_modified_timestamp)) |>
                rename("dataset_uuid" = "uuid",
                        "dataset_hubmap_id" = "hubmap_id"),
            by = c("hubmap_id" = "donor_hubmap_id"))

donor_sub_dataset

# Samples
samples_df <- samples()
donor_sub_sample <- donor_sub |>
    left_join(samples_df |>
                select(-c(group_name, last_modified_timestamp)) |>
                rename("sample_uuid" = "uuid",
                        "sample_hubmap_id" = "hubmap_id"),
            by = c("hubmap_id" = "donor_hubmap_id"))

donor_sub_sample
```

Use `*_detail(uuid)` to retrieve all available information for any entry of any entity category, given its **UUID**. Use `select()` and the `unnest_*()` functions to expand list-columns. Tables with many columns but a single row are convenient to view with `glimpse()`.

```{r '*_detail()'}
dataset_uuid <- datasets_df |>
    filter(dataset_type == "Auto-fluorescence",
        organ == "Kidney (Right)") |>
    head(1) |>
    pull(uuid)

# Full Information
dataset_detail(dataset_uuid) |> glimpse()

# Specific Information
dataset_detail(uuid = dataset_uuid) |>
    select(contributors) |>
    unnest_longer(contributors) |>
    unnest_wider(everything())
```

`sample_detail()`, `donor_detail()`, `collection_detail()`, and `publication_detail()` work the same way.

## Metadata

To retrieve the metadata for a **Dataset**, **Sample**, or **Donor**, use `dataset_metadata()`, `sample_metadata()`, and `donor_metadata()`.

```{r 'metadata'}
dataset_metadata("993bb1d6fa02e2755fd69613bb9d6e08")

sample_metadata("8ecdbdc3e2d04898e2563d666658b6a9")

donor_metadata("b2c75c96558c18c9e13ba31629f541b6")
```

## Derived Data

Some datasets in the **Dataset** entity have derived (support) dataset(s). Use `dataset_derived()` to retrieve them.
A tibble with selected details is returned if the given dataset has a support dataset; otherwise, nothing is returned.

```{r 'dataset derived'}
# no derived/support dataset
dataset_uuid_1 <- "3acdb3ed962b2087fbe325514b098101"

dataset_derived(uuid = dataset_uuid_1)

# has derived/support dataset
dataset_uuid_2 <- "baf976734dd652208d13134bc5c4594b"

dataset_derived(uuid = dataset_uuid_2) |> glimpse()
```

A **Sample** or **Donor** can have derived samples and datasets. In the `HuBMAPR` package, `sample_derived()` and `donor_derived()` return the datasets and samples derived from one sample (given its sample UUID) or one donor (given its donor UUID). Set the `entity_type` parameter to retrieve derived `Dataset` or `Sample` entries.

```{r 'derived using sample_derived'}
sample_uuid <- samples_df |>
    filter(last_modified_timestamp >= "2023-01-01" &
            last_modified_timestamp <= "2023-10-01",
        organ == "Kidney (Left)") |>
    head(1) |>
    pull(uuid)

sample_uuid

# Derived Datasets
sample_derived(uuid = sample_uuid, entity_type = "Dataset")

# Derived Samples
sample_derived(uuid = sample_uuid, entity_type = "Sample")
```

`donor_derived()` works the same way.

## Provenance Data

For individual entries from the **Dataset** and **Sample** entities, `uuid_provenance()` retrieves the provenance of the entry as a list of character strings (UUID, HuBMAP ID, and entity type), ordered from the most recent ancestor to the furthest ancestor. A Donor UUID has no ancestor, so an empty list is returned.

```{r 'provenance'}
# dataset provenance
dataset_uuid <- "3e4c568d9ce8df9d73b8cddcf8d0fec3"
uuid_provenance(dataset_uuid)

# sample provenance
sample_uuid <- "35e16f13caab262f446836f63cf4ad42"
uuid_provenance(sample_uuid)

# donor provenance
donor_uuid <- "0abacde2443881351ff6e9930a706c83"
uuid_provenance(donor_uuid)
```

## Related Data

Each **Collection** has related datasets; use `collection_data()` to retrieve them.

```{r 'collection datasets'}
collections_df <- collections()
collection_uuid <- collections_df |>
    filter(last_modified_timestamp >= "2023-01-01") |>
    head(1) |>
    pull(uuid)

collection_data(collection_uuid)
```

Each **Publication** has related datasets, samples, and donors. Use `publication_data()` to see them, setting the `entity_type` parameter to retrieve related `Dataset` or `Sample` entries.

```{r 'publication data'}
publications_df <- publications()
publication_uuid <- publications_df |>
    filter(publication_venue == "Nature") |>
    head(1) |>
    pull(uuid)

publication_data(publication_uuid, entity_type = "Dataset")

publication_data(publication_uuid, entity_type = "Sample")
```

## Additional Information

To read the textual description of one **Collection** or **Publication**, use `collection_information()` or `publication_information()` respectively.

```{r 'information'}
collection_information(uuid = collection_uuid)

publication_information(uuid = publication_uuid)
```

Additional contact, author, and contributor information can be retrieved using `dataset_contributors()` for the **Dataset** entity, `collection_contacts()` and `collection_contributors()` for the **Collection** entity, or `publication_authors()` for the **Publication** entity.

```{r 'author'}
# Dataset
dataset_contributors(uuid = dataset_uuid)

# Collection
collection_contacts(uuid = collection_uuid)

collection_contributors(uuid = collection_uuid)

# Publication
publication_authors(uuid = publication_uuid)
```

# File Transfer

Each dataset has corresponding data files. Most datasets' files are available on HuBMAP Globus at a corresponding URL. Some datasets' files are not available via Globus, but can be accessed via dbGaP (database of Genotypes and Phenotypes) and/or SRA (Sequence Read Archive). The files of some other datasets are not available on any authorized platform.
Each dataset available on Globus has different data-related files to preview and download, including but not limited to images, metadata files, downstream analysis reports, and raw data products.

Use `bulk_data_transfer()` to find out whether data files are open-access or restricted. Only open-access files can be downloaded for downstream analysis.

#### Files are publicly accessible

HuBMAP stores all public data files on Globus, which is an open-source and safe platform for storing large data. For every dataset whose data files are publicly accessible, the `bulk_data_transfer()` function opens the corresponding Globus webpage in Chrome.

```{r 'bulk data transfer successful', eval=FALSE}
uuid_globus <- "d1dcab2df80590d8cd8770948abaf976"

bulk_data_transfer(uuid_globus)
```
After selecting a data file and clicking the "Download" button, the file is downloaded to the chosen directory.

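Because `bulk_data_transfer()` takes a dataset UUID, it can help to confirm that a string has the UUID shape described in the Overview (32 hexadecimal characters), rather than being a HuBMAP ID or DOI, before making a request. The following is a minimal base R sketch. The helper name `looks_like_uuid()` is hypothetical and not part of `HuBMAPR`, and it checks only the format, not whether the UUID exists on the portal.

```r
# Hypothetical helper (not part of HuBMAPR): TRUE when `x` is a single,
# non-missing string made of exactly 32 hexadecimal characters.
looks_like_uuid <- function(x) {
    is.character(x) && length(x) == 1L && !is.na(x) &&
        grepl("^[0-9a-fA-F]{32}$", x)
}

# The Globus example UUID from this section has the expected shape
looks_like_uuid("d1dcab2df80590d8cd8770948abaf976")

# A HuBMAP ID does not
looks_like_uuid("HBM399.VCTL.353")
```

A check like this only guards against passing the wrong kind of identifier; whether the files are public, restricted, or unavailable is still determined by `bulk_data_transfer()` itself.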
##### Alternative data transfer method using rglobus package

Martin Morgan, one of the `HuBMAPR` package creators, created an experimental package called [rglobus](https://github.com/mtmorgan/rglobus/). Globus is in part a cloud-based file transfer service, available at https://www.globus.org/. This package provides an *R* client with the ability to discover and navigate collections, and to transfer files and directories between collections. `rglobus` is therefore an alternative way to transfer HuBMAP data files to the local computer using a HuBMAP dataset UUID.

The `rglobus` vignette is available [here](https://mtmorgan.github.io/rglobus/articles/a_get_started.html). It uses the HuBMAP collection as its main example to illustrate how to discover and navigate to the correct collection and transfer files.

Since `rglobus` is an experimental package, its functionality may not be complete, and transfer issues are possible when using its functions. More information will be added in the future. You are welcome to report issues or provide comments [here](https://github.com/mtmorgan/rglobus/issues) to help with development.

#### Files are restricted

For every dataset whose data files are restricted under dbGaP or SRA, the `bulk_data_transfer()` function prints instructions. The dbGaP and/or SRA link(s) allow users to request the protected-access sequence data from the authenticated platform.

```{r 'bulk data transfer urls', eval=FALSE}
uuid_dbGAP_SRA <- "d926c41ac08f3c2ba5e61eec83e90b0c"

bulk_data_transfer(uuid_dbGAP_SRA)
```

```{r,comment=NA, echo=FALSE}
result1 <- paste("Pruning cache",
    "Error in bulk_data_transfer(uuid_dbGAP_SRA) :",
    "This dataset contains protected-access human sequence data.",
    "If you are not a Consortium member,",
    "you must access these data through dbGaP if available.",
    "dbGaP authentication is required for downloading.",
    "View documentation on how to attain dbGaP access.",
    "Additional Help: 'https://hubmapconsortium.org/contact-form/'",
    "Navigate to the 'Bioproject' or 'Sequencing Read Archive' links.",
    "dbGaP URL: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002267",
    "Select the 'Run' link on the page to download the dataset.",
    "Additional documentation: https://www.ncbi.nlm.nih.gov/sra/docs/.",
    "SRA URL: https://www.ncbi.nlm.nih.gov/sra/SRX13283313.)",
    sep = "\n")

cat(result1)
```

#### Files are unavailable

For every dataset whose data files are not available, the `bulk_data_transfer()` function prints the following message.

```{r 'bulk data transfer not avail', eval=FALSE}
uuid_not_avail <- "0eb5e457b4855ce28531bc97147196b6"

bulk_data_transfer(uuid_not_avail)
```

```{r,comment=NA, echo=FALSE}
result2 <- paste("Pruning cache",
    "Error in bulk_data_transfer(uuid_not_avail) :",
    "This dataset contains protected-access human sequence data.",
    "Data isn't yet available through dbGaP,",
    "but will be available soon.",
    "Please contact us via 'https://hubmapconsortium.org/contact-form/'",
    "with any questions regarding this data.",
    sep = "\n")

cat(result2)
```

# `R` session information {.unnumbered}

```{r 'sessionInfo', echo=FALSE}
## Session info
options(width = 120)
sessionInfo()
```