---
title: Pokédex for Cell Types
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
- name: Jared M. Andrews
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
- name: Friederike Dündar
  affiliation: Applied Bioinformatics Core, Weill Cornell Medicine
- name: Daniel Bunis
  affiliation: Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
date: "Revised: February 29, 2024"
output:
  BiocStyle::html_document:
    toc_float: true
package: celldex
bibliography: ref.bib
vignette: >
  %\VignetteIndexEntry{Cell type references}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
```
# Overview
The `r Biocpkg("celldex")` package provides convenient access to several cell type reference datasets.
Most of these references are derived from bulk RNA-seq or microarray data of cell populations
that (hopefully) consist of a pure cell type after sorting and/or culturing.
The aim is to provide a common resource for further analysis like cell type annotation of single cell expression data
or deconvolution of cell type proportions in bulk expression datasets.
Each dataset contains a log-normalized expression matrix that is intended to be comparable
to log-UMI counts from common single-cell protocols [@aran2019reference]
or gene length-adjusted values from bulk datasets.
By default, gene annotation is returned in terms of gene symbols,
but they can be coerced to Ensembl annotation with `ensembl=TRUE` for more robust cross-referencing across studies.
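For example, a minimal sketch of fetching a reference with Ensembl identifiers, using the HPCA dataset that is described later in this vignette (not evaluated here to avoid a redundant download):
```{r, eval=FALSE}
library(celldex)
# Fetch the HPCA reference with rows named by Ensembl gene IDs
# instead of the default gene symbols.
ref <- fetchReference("hpca", "2024-02-26", ensembl=TRUE)
head(rownames(ref))
```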
Typically, each reference provides three levels of cell type annotation in its column metadata:
- `label.main`, broad annotation that defines the major cell types.
This has few unique levels, which allows for fast annotation, albeit at low resolution.
- `label.fine`, fine-grained annotation that defines subtypes or states.
This has more unique levels, which results in slower annotation but at much higher resolution.
- `label.ont`, fine-grained annotation mapped to the standard vocabulary in the [Cell Ontology](https://www.ebi.ac.uk/ols/ontologies/cl).
This enables synchronization of labels across references as well as dynamic adjustment of the resolution.
# Finding references
We can examine the available references using the `surveyReferences()` function.
This returns a `DataFrame` with one row per reference, reporting each dataset's name and version
along with additional information like the title, description, species, number of samples, available labels, and so on.
```{r}
library(celldex)
surveyReferences()
```
Alternatively, users can search the text of each reference's metadata to identify relevant datasets.
This may require some experimentation as it depends on the level of detail in the metadata supplied by the uploader.
```{r}
searchReferences("B cell")
searchReferences(
    defineTextQuery("immun%", partial=TRUE) &
    defineTextQuery("10090", field="taxonomy_id")
)
```
Keep in mind that the search results are not guaranteed to be reproducible -
more datasets may be added over time, and existing datasets may be updated with new versions.
Once a dataset of interest is identified, users should explicitly list the name and version of the dataset in their scripts to ensure reproducibility.
# General-purpose references
## Human primary cell atlas (HPCA)
The HPCA reference consists of publicly available microarray datasets derived from human primary cells [@hpcaRef].
Most of the labels refer to blood subpopulations but cell types from other tissues are also available.
```{r}
library(celldex)
ref <- fetchReference("hpca", "2024-02-26")
```
```{r tabulate, echo=FALSE}
samples <- colData(ref)[,c("label.main", "label.fine","label.ont")]
samples <- as.data.frame(samples)
DT::datatable(unique(samples))
```
This reference also contains many cells and cell lines that have been treated or collected from pathogenic conditions.
## Blueprint/ENCODE
The Blueprint/ENCODE reference consists of bulk RNA-seq data for pure stroma and immune cells
generated by Blueprint [@blueprintRef] and ENCODE projects [@encodeRef].
```{r}
ref <- fetchReference("blueprint_encode", "2024-02-26")
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference is best suited to mixed samples that do not require fine resolution,
particularly when easily interpretable labels are needed quickly.
It provides decent immune cell granularity, though it does not contain finer monocyte and dendritic cell subtypes.
## Mouse RNA-seq
This reference consists of a collection of mouse bulk RNA-seq data sets downloaded from the Gene Expression Omnibus [@Benayoun2019].
A variety of cell types are available, again mostly from blood but also covering several other tissues.
```{r}
ref <- fetchReference("mouse_rnaseq", "2024-02-26")
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference is best suited to bulk tissue samples from brain, blood, or heart where low-resolution labels are adequate.
# Immune references
## Immunological Genome Project (ImmGen)
The ImmGen reference consists of microarray profiles of pure mouse immune cells from
the [project of the same name](http://www.immgen.org/) [@ImmGenRef].
This is currently the most highly resolved immune reference -
possibly overwhelmingly so, given the granularity of the fine labels.
```{r}
ref <- fetchReference("immgen", "2024-02-26")
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference provides exhaustive coverage of a dizzying number of cell subtypes.
However, this can be a double-edged sword as the high resolution can be difficult to interpret,
especially for samples derived from experimental conditions that are not of interest.
Users may want to remove certain samples themselves depending on the use case.
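For instance, one might subset the `SummarizedExperiment` by column to drop samples whose fine labels match an unwanted condition; the pattern below is purely hypothetical and should be adapted to the labels actually present:
```{r, eval=FALSE}
# Hypothetical filter: keep only samples whose fine label does not
# mention an unwanted experimental condition.
keep <- !grepl("LPS", ref$label.fine)
ref <- ref[, keep]
```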
## Database of Immune Cell Expression/eQTLs/Epigenomics (DICE)
The DICE reference consists of bulk RNA-seq samples of sorted cell populations
from the [project of the same name](https://dice-database.org) [@diceRef].
```{r}
ref <- fetchReference("dice", "2024-02-26")
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference is particularly useful to those interested in CD4^+^ T cell subsets,
though the lack of CD4^+^ central memory and effector memory samples may decrease accuracy in some cases.
In addition, the absence of dendritic cells and the presence of only a single B cell subset may result in those populations being improperly labeled or having their labels pruned in a typical PBMC sample.
## Novershtern hematopoietic data
The Novershtern reference (previously known as Differentiation Map)
consists of microarray datasets for sorted hematopoietic cell populations
from [GSE24759](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24759) [@dmapRef].
```{r}
ref <- fetchReference("novershtern_hematopoietic", "2024-02-26")
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference provides the greatest resolution for myeloid and progenitor cells among the human immune references.
It has fewer T cell subsets than the other immune references but contains many more NK, erythroid, and granulocytic subsets.
It is likely the best option for bone marrow samples.
## Monaco immune data
The Monaco reference consists of bulk RNA-seq samples of sorted immune cell populations
from [GSE107011](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107011) [@monaco_immuneRef].
```{r}
ref <- fetchReference("monaco_immune", "2024-02-26")
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This is the human immune reference that best covers all of the bases for a typical PBMC sample.
It provides expansive B and T cell subsets, differentiates between classical and non-classical monocytes, and includes basic dendritic cell subsets.
It also contains neutrophil and basophil samples to help identify small contaminating populations that may have slipped into a PBMC preparation.
# Adding new references
Want to contribute your own reference dataset to this package?
It's easy!
Just follow these simple steps for instant fame and prestige.
1. Obtain log-normalized expression values and cell type labels.
While not strictly required, we recommend trying to organize the labels into `label.fine`, `label.main` and `label.ont`,
so that users have a consistent experience with your reference.
Let's just make up something here.
```{r}
norm <- matrix(runif(1000), ncol=20)
rownames(norm) <- sprintf("GENE_%i", seq_len(nrow(norm)))
labels <- DataFrame(label.main=rep(LETTERS[1:5], each=4))
labels$label.fine <- sprintf("%s%i", labels$label.main, rep(c(1, 1, 2, 2), 5))
labels$label.ont <- sprintf("CL:000%i", rep(1:5, each=4))
```
2. Assemble the metadata for your dataset.
This should be a list structured as specified in the [Bioconductor metadata schema](https://artifactdb.github.io/bioconductor-metadata-index/bioconductor/v1.json).
Check out some examples from `fetchMetadata()` - note that the `application.takane` property will be automatically added later, and so can be omitted from the list that you create.
```{r}
meta <- list(
    title="My reference",
    description="This is my reference dataset",
    taxonomy_id="10090",
    genome="GRCm38",
    sources=list(
        list(provider="GEO", id="GSE12345"),
        list(provider="PubMed", id="1234567")
    ),
    maintainer_name="Chihaya Kisaragi",
    maintainer_email="kisaragi.chihaya@765pro.com"
)
```
3. Save your normalized expression matrix and labels to disk with `saveReference()`.
This saves the dataset into a "staging directory" using language-agnostic file formats - check out the [**alabaster**](https://github.com/ArtifactDB/alabaster.base) framework for more details.
In more complex cases involving multiple datasets, users may save each dataset into a subdirectory of the staging directory.
```{r}
# Simple case: you only have one dataset to upload.
staging <- tempfile()
saveReference(norm, labels, staging, meta)
list.files(staging, recursive=TRUE)
# Complex case: you have multiple subdatasets to upload.
staging <- tempfile()
dir.create(staging)
saveReference(norm, labels, file.path(staging, "foo"), meta)
saveReference(norm, labels, file.path(staging, "bar"), meta) # and so on.
```
You can check that everything was correctly saved by reloading the on-disk data into the R session for inspection.
This yields a `SummarizedExperiment` with the log-expression matrix in the assay named `"logcounts"`.
```{r}
alabaster.base::readObject(file.path(staging, "foo"))
```
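Assuming the reload succeeded, the usual `SummarizedExperiment` accessors can be used for a quick sanity check that the expression values and labels round-tripped correctly:
```{r, eval=FALSE}
se <- alabaster.base::readObject(file.path(staging, "foo"))
# Dimensions of the log-expression matrix and a tally of the main labels.
dim(SummarizedExperiment::assay(se, "logcounts"))
table(se$label.main)
```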
4. Open a [pull request (PR)](https://github.com/LTLA/celldex/pulls) for the addition of a new reference.
You will need to provide a few things here:
- The name of your dataset.
This should be short enough to type yet distinct from the existing references.
- The version of your dataset.
This is usually just the current date... or whenever you started putting together the dataset for upload.
The exact date doesn't really matter as long as we can establish a timeline for later versions.
- The code used to assemble the reference dataset as an Rmarkdown file.
This should be added to the [`scripts/`](https://github.com/LTLA/celldex/tree/master/scripts) directory of this package,
in order to provide some record of how the dataset was created.
5. Wait for us to grant temporary upload permissions to your GitHub account.
6. Upload your staging directory to the [**gypsum** backend](https://github.com/ArtifactDB/gypsum-worker) with `gypsum::uploadDirectory()`.
On the first call to this function, it will automatically prompt you to log into GitHub so that the backend can authenticate you.
If you are on a system without browser access (e.g., most computing clusters), a [token](https://github.com/settings/tokens) can be manually supplied via `gypsum::setAccessToken()`.
```{r, eval=FALSE}
gypsum::uploadDirectory(staging, "celldex", "my_dataset_name", "my_version")
```
You can check that everything was successfully uploaded by just calling `fetchReference()`:
```{r, eval=FALSE}
fetchReference("my_dataset_name", "my_version")
```
If you realized you made a mistake, no worries.
Use the following call to clear the erroneous dataset, and try again:
```{r, eval=FALSE}
gypsum::rejectProbation("celldex", "my_dataset_name", "my_version")
```
7. Comment on the PR to notify us that the dataset has finished uploading and you're happy with it.
We'll review it and make sure everything's in order.
If some fixes are required, we'll just clear the dataset so that you can upload a new version with the necessary changes.
Otherwise, we'll approve the dataset.
Note that once a version of a dataset is approved, no further changes can be made to that version;
you'll have to upload a new version if you want to modify something.
# Session information {-}
```{r}
sessionInfo()
```
# References {-}