---
title: Pokédex for Cell Types
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
- name: Jared M. Andrews
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
- name: Friederike Dündar
  affiliation: Applied Bioinformatics Core, Weill Cornell Medicine
- name: Daniel Bunis
  affiliation: Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
date: "Revised: June 13th, 2020"
output:
  BiocStyle::html_document:
    toc_float: true
package: celldex
bibliography: ref.bib
vignette: >
  %\VignetteIndexEntry{Cell type references}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
```
# Overview
The `r Biocpkg("celldex")` package provides convenient access to several cell type reference datasets.
Most of these references are derived from bulk RNA-seq or microarray data of cell populations
that (hopefully) consist of a pure cell type after sorting and/or culturing.
The aim is to provide a common resource for downstream analyses such as cell type annotation of single-cell expression data
or deconvolution of cell type proportions in bulk expression datasets.
Each dataset contains a log-normalized expression matrix that is intended to be comparable
to log-UMI counts from common single-cell protocols [@aran2019reference]
or gene length-adjusted values from bulk datasets.
By default, genes are identified by their symbols,
but setting `ensembl=TRUE` converts them to Ensembl identifiers for more robust cross-referencing across studies.
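
As a minimal sketch of the `ensembl=TRUE` option, the chunk below retrieves one reference with Ensembl identifiers and inspects its expression matrix.
The object name `hpca.ens` is only illustrative, and the sketch assumes that the log-normalized values are stored in an assay named `"logcounts"`.

```{r}
library(celldex)
library(SummarizedExperiment)

# Retrieve the HPCA reference with Ensembl identifiers as row names
# ('hpca.ens' is an illustrative name, not part of the package).
hpca.ens <- HumanPrimaryCellAtlasData(ensembl=TRUE)
head(rownames(hpca.ens))

# Assumption: the log-normalized values live in the "logcounts" assay.
assayNames(hpca.ens)
assay(hpca.ens, "logcounts")[1:3, 1:3]
```
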

In general, each reference provides three levels of cell type annotation in its column metadata:

- `label.main`, broad annotation that defines the major cell types.
  This has few unique levels, which allows for fast annotation but at low resolution.
- `label.fine`, fine-grained annotation that defines subtypes or states.
  This has more unique levels, which results in slower annotation but at much higher resolution.
- `label.ont`, fine-grained annotation mapped to the standard vocabulary in the [Cell Ontology](https://www.ebi.ac.uk/ols/ontologies/cl).
  This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

More details for each dataset can be viewed on the corresponding help page for its retrieval function (e.g., `?ImmGenData`).

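
The three annotation levels can be inspected directly from the column metadata.
The following is a rough sketch using the HPCA reference; the object names `hpca`, `label.cols` and `n.levels` are illustrative only.

```{r}
library(celldex)
library(SummarizedExperiment)

hpca <- HumanPrimaryCellAtlasData()

# Label columns available in the column metadata.
label.cols <- grep("^label\\.", colnames(colData(hpca)), value=TRUE)
label.cols

# Number of unique levels at each annotation level.
n.levels <- vapply(label.cols,
    function(x) length(unique(colData(hpca)[[x]])),
    integer(1))
n.levels

# Number of samples for the most common broad labels.
head(sort(table(hpca$label.main), decreasing=TRUE))
```

Comparing the number of unique levels gives a quick sense of how much slower, and how much more detailed, annotation with `label.fine` is likely to be relative to `label.main`.
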
# General-purpose references
## Human primary cell atlas (HPCA)
The HPCA reference consists of publicly available microarray datasets derived from human primary cells [@hpcaRef].
Most of the labels refer to blood subpopulations but cell types from other tissues are also available.
```{r}
library(celldex)
ref <- HumanPrimaryCellAtlasData()
```
```{r tabulate, echo=FALSE}
# Tabulate the unique label combinations in the current reference.
samples <- colData(ref)[, c("label.main", "label.fine", "label.ont")]
samples <- as.data.frame(samples)
DT::datatable(unique(samples))
```
This reference also contains many cells and cell lines that have been experimentally treated or collected from pathological conditions.

## Blueprint/ENCODE
The Blueprint/ENCODE reference consists of bulk RNA-seq data for pure stromal and immune cells
generated by the Blueprint [@blueprintRef] and ENCODE [@encodeRef] projects.
```{r}
ref <- BlueprintEncodeData()
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference is best suited to mixed samples that do not require fine resolution,
particularly when easily interpretable labels are needed quickly.
It provides decent immune cell granularity, though it does not contain finer monocyte and dendritic cell subtypes.
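
Because both of the human references above carry a `label.ont` column, their labels can be compared on the common Cell Ontology vocabulary.
The chunk below is one possible sketch of such a comparison; the object names are illustrative, and samples without a mapped term are assumed to be stored as `NA`.

```{r}
library(celldex)

hpca <- HumanPrimaryCellAtlasData()
bpe <- BlueprintEncodeData()

# Cell Ontology terms used by each reference, ignoring unmapped samples
# (assumed to be NA).
hpca.terms <- unique(hpca$label.ont[!is.na(hpca$label.ont)])
bpe.terms <- unique(bpe$label.ont[!is.na(bpe$label.ont)])

# Terms that are present in both references.
shared <- intersect(hpca.terms, bpe.terms)
length(shared)

# Fine labels in each reference that map to the first few shared terms.
unique(hpca$label.fine[hpca$label.ont %in% head(shared, 3)])
unique(bpe$label.fine[bpe$label.ont %in% head(shared, 3)])
```
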
## Mouse RNA-seq
This reference consists of a collection of mouse bulk RNA-seq datasets downloaded from the Gene Expression Omnibus (GEO) [@Benayoun2019].
A variety of cell types are available, again mostly from blood but also covering several other tissues.
```{r}
ref <- MouseRNAseqData()
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference is best suited to bulk tissue samples from brain, blood, or heart where low-resolution labels are adequate.
# Immune references
## Immunological Genome Project (ImmGen)
The ImmGen reference consists of microarray profiles of pure mouse immune cells from
the [project of the same name](http://www.immgen.org/) [@ImmGenRef].
This is currently the most highly resolved immune reference,
possibly overwhelmingly so, given the granularity of the fine labels.
```{r}
ref <- ImmGenData()
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference provides exhaustive coverage of a dizzying number of cell subtypes.
However, this can be a double-edged sword as the high resolution can be difficult to interpret,
especially for samples derived from experimental conditions that are not of interest.
Users may want to remove certain samples themselves depending on the use case.
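
Removing unwanted samples amounts to subsetting the columns of the reference.
The sketch below keeps only a few broad populations; the choice of labels in `wanted` is purely hypothetical and should be checked against the labels actually present in the reference.

```{r}
library(celldex)

immgen <- ImmGenData()

# Inspect the available broad labels before deciding what to keep.
sort(unique(immgen$label.main))

# Hypothetical choice of populations to retain; adjust to the use case.
wanted <- c("T cells", "B cells", "NK cells")
immgen.sub <- immgen[, immgen$label.main %in% wanted]

# Remaining samples per broad label.
table(immgen.sub$label.main)
```

The same approach works with `label.fine` or `label.ont` when individual subtypes or conditions need to be excluded.
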
## Database of Immune Cell Expression/eQTLs/Epigenomics (DICE)
The DICE reference consists of bulk RNA-seq samples of sorted cell populations
from the [project of the same name](https://dice-database.org) [@diceRef].
```{r}
ref <- DatabaseImmuneCellExpressionData()
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference is particularly useful to those interested in CD4^+^ T cell subsets,
though the lack of CD4^+^ central memory and effector memory samples may decrease accuracy in some cases.
In addition, the absence of dendritic cells and the presence of only a single B cell subset may result in those populations being improperly labeled or having their labels pruned in a typical PBMC sample.

## Novershtern hematopoietic data
The Novershtern reference (previously known as Differentiation Map)
consists of microarray datasets for sorted hematopoietic cell populations
from [GSE24759](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24759) [@dmapRef].
```{r}
ref <- NovershternHematopoieticData()
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This reference provides the greatest resolution for myeloid and progenitor cells among the human immune references.
It has fewer T cell subsets than the other immune references but contains many more NK, erythroid, and granulocytic subsets.
It is likely the best option for bone marrow samples.
## Monaco immune data
The Monaco reference consists of bulk RNA-seq samples of sorted immune cell populations
from [GSE107011](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107011) [@monaco_immuneRef].
```{r}
ref <- MonacoImmuneData()
```
```{r, echo=FALSE, ref.label="tabulate"}
```
This is the human immune reference that best covers all of the bases for a typical PBMC sample.
It provides expansive B and T cell subsets, differentiates between classical and non-classical monocytes, and includes basic dendritic cell subsets.
It also includes neutrophil and basophil samples to help identify small contaminating populations that may have slipped into a PBMC preparation.

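
To illustrate how a reference like this might be used for annotation, the sketch below passes it to `SingleR()` from the `r Biocpkg("SingleR")` package, which is assumed to be installed.
No real test dataset is used here: a random subset of the reference's own samples stands in for the data to be annotated, purely to show the shape of the call and its output.

```{r}
library(celldex)
library(SingleR)

monaco <- MonacoImmuneData()

# Stand-in "test" data: ten reference samples chosen at random.
# In practice this would be a log-normalized single-cell or bulk dataset.
set.seed(100)
idx <- sample(ncol(monaco), 10)
mock.test <- monaco[, idx]

# Annotate the stand-in data with the broad labels.
pred <- SingleR(test=mock.test, ref=monaco, labels=monaco$label.main)
table(assigned=pred$labels, truth=mock.test$label.main)

# Number of assignments discarded by pruning.
sum(is.na(pred$pruned.labels))
```

Swapping `monaco$label.main` for `monaco$label.fine` would request the higher-resolution annotation at the cost of a slower run.
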
# Session information {-}
```{r}
sessionInfo()
```
# References {-}