1 Overview

The celldex package provides convenient access to several cell type reference datasets. Most of these references are derived from bulk RNA-seq or microarray data of cell populations that (hopefully) consist of a pure cell type after sorting and/or culturing. The aim is to provide a common resource for further analyses such as cell type annotation of single-cell expression data or deconvolution of cell type proportions in bulk expression datasets.

Each dataset contains a log-normalized expression matrix that is intended to be comparable to log-UMI counts from common single-cell protocols (Aran et al. 2019) or gene length-adjusted values from bulk datasets. By default, genes are identified by gene symbol, but the identifiers can be converted to Ensembl IDs by setting ensembl=TRUE for more robust cross-referencing across studies.
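
As a minimal sketch of the ensembl=TRUE option described above, the following retrieves the same reference twice and compares the row names. The variable names ref.symbol and ref.ensembl are illustrative only, and both calls assume that the data can be downloaded or are already cached locally.

```r
library(celldex)

# Default behaviour: rows are named by gene symbol.
ref.symbol <- HumanPrimaryCellAtlasData()
head(rownames(ref.symbol))

# With ensembl=TRUE, rows are named by Ensembl gene ID instead.
ref.ensembl <- HumanPrimaryCellAtlasData(ensembl=TRUE)
head(rownames(ref.ensembl))

# The two objects may differ in their number of rows, since the
# conversion is not guaranteed to map every symbol to a unique ID.
c(symbol=nrow(ref.symbol), ensembl=nrow(ref.ensembl))
```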

In general, each reference provides three levels of cell type annotation in its column metadata:

  • label.main, a broad annotation that defines the major cell types. It has few unique levels, which allows for fast annotation but at low resolution.
  • label.fine, a fine-grained annotation that defines subtypes or states. It has more unique levels, which results in slower annotation but at much higher resolution.
  • label.ont, a fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.
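
The annotation levels listed above can be inspected directly from the column metadata. The sketch below uses the HPCA reference as an example and assumes that the reference is returned as a SummarizedExperiment, so that the standard colData() and assayNames() accessors apply.

```r
library(celldex)
library(SummarizedExperiment)

ref <- HumanPrimaryCellAtlasData()

# Names of the annotation columns stored in the column metadata.
colnames(colData(ref))

# Compare the number of unique labels at each resolution.
n.main <- length(unique(ref$label.main))
n.fine <- length(unique(ref$label.fine))
c(main=n.main, fine=n.fine)

# Cell Ontology terms for the first few samples.
head(ref$label.ont)

# Name(s) of the assay holding the log-normalized expression matrix.
assayNames(ref)
```

Tabulating a column, for example with table(ref$label.main), gives a quick view of how many samples support each label.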

More details for each dataset can be viewed on the corresponding help page for its retrieval function (e.g., ?ImmGenData).

2 General-purpose references

2.1 Human primary cell atlas (HPCA)

The HPCA reference consists of publicly available microarray datasets derived from human primary cells (Mabbott et al. 2013). Most of the labels refer to blood subpopulations but cell types from other tissues are also available.

library(celldex)
ref <- HumanPrimaryCellAtlasData()

This reference also contains many cells and cell lines that have been treated or collected from pathogenic conditions.

2.2 Blueprint/ENCODE

The Blueprint/ENCODE reference consists of bulk RNA-seq data for pure stromal and immune cells generated by the Blueprint (Martens and Stunnenberg 2013) and ENCODE (The ENCODE Project Consortium 2012) projects.

ref <- BlueprintEncodeData()

This reference is best suited to mixed samples that do not require fine resolution, and is particularly useful when easily interpretable labels are needed quickly. It provides decent immune cell granularity, though it does not contain finer monocyte and dendritic cell subtypes.

2.3 Mouse RNA-seq

This reference consists of a collection of mouse bulk RNA-seq datasets downloaded from the Gene Expression Omnibus (Benayoun et al. 2019). A variety of cell types are available, again mostly from blood but also covering several other tissues.

ref <- MouseRNAseqData()

This reference is best suited to bulk tissue samples from brain, blood, or heart where low-resolution labels are adequate.

3 Immune references

3.1 Immunological Genome Project (ImmGen)

The ImmGen reference consists of microarray profiles of pure mouse immune cells from the project of the same name (Heng et al. 2008). This is currently the most highly resolved immune reference, possibly overwhelmingly so, given the granularity of the fine labels.

ref <- ImmGenData()

This reference provides exhaustive coverage of a dizzying number of cell subtypes. However, this can be a double-edged sword as the high resolution can be difficult to interpret, especially for samples derived from experimental conditions that are not of interest. Users may want to remove certain samples themselves depending on the use case.
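
One way to remove unwanted samples is to subset the columns of the reference by label. The sketch below is illustrative only: the labels in keep.types are an assumed selection and should be checked against the output of table() before use.

```r
library(celldex)

ref <- ImmGenData()

# Inspect the available broad labels before deciding what to keep.
table(ref$label.main)

# Hypothetical selection: these label names are an assumption and
# should be verified against the table printed above.
keep.types <- c("T cells", "B cells")

# Keep only the samples whose broad label is in the selection.
ref.sub <- ref[, ref$label.main %in% keep.types]
dim(ref.sub)
```

The same pattern works with label.fine when individual subtypes or experimental conditions need to be excluded.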

3.2 Database of Immune Cell Expression/eQTLs/Epigenomics (DICE)

The DICE reference consists of bulk RNA-seq samples of sorted cell populations from the project of the same name (Schmiedel et al. 2018).

ref <- DatabaseImmuneCellExpressionData()

This reference is particularly useful to those interested in CD4+ T cell subsets, though the lack of CD4+ central memory and effector memory samples may decrease accuracy in some cases. In addition, the absence of dendritic cells and the presence of only a single B cell subset may result in those populations being improperly labeled or having their labels pruned in a typical PBMC sample.

3.3 Novershtern hematopoietic data

The Novershtern reference (previously known as Differentiation Map) consists of microarray datasets for sorted hematopoietic cell populations from GSE24759 (Novershtern et al. 2011).

ref <- NovershternHematopoieticData()

This reference provides the greatest resolution for myeloid and progenitor cells among the human immune references. It has fewer T cell subsets than the other immune references but contains many more NK, erythroid, and granulocytic subsets. It is likely the best option for bone marrow samples.

3.4 Monaco immune data

The Monaco reference consists of bulk RNA-seq samples of sorted immune cell populations from GSE107011 (Monaco et al. 2019).

ref <- MonacoImmuneData()

This is the human immune reference that best covers all of the bases for a typical PBMC sample. It provides expansive B and T cell subsets, differentiates between classical and non-classical monocytes, includes basic dendritic cell subsets, and also includes neutrophil and basophil samples to help identify small contaminating populations that may have slipped into a PBMC preparation.

Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] celldex_1.2.0               SummarizedExperiment_1.22.0
##  [3] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [5] GenomeInfoDb_1.28.0         IRanges_2.26.0             
##  [7] S4Vectors_0.30.0            BiocGenerics_0.38.0        
##  [9] MatrixGenerics_1.4.0        matrixStats_0.58.0         
## [11] BiocStyle_2.20.0           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2                    sass_0.4.0                   
##  [3] bit64_4.0.5                   jsonlite_1.7.2               
##  [5] AnnotationHub_3.0.0           DelayedMatrixStats_1.14.0    
##  [7] bslib_0.2.5.1                 shiny_1.6.0                  
##  [9] assertthat_0.2.1              interactiveDisplayBase_1.30.0
## [11] BiocManager_1.30.15           BiocFileCache_2.0.0          
## [13] blob_1.2.1                    GenomeInfoDbData_1.2.6       
## [15] yaml_2.2.1                    BiocVersion_3.13.1           
## [17] pillar_1.6.1                  RSQLite_2.2.7                
## [19] lattice_0.20-44               glue_1.4.2                   
## [21] digest_0.6.27                 promises_1.2.0.1             
## [23] XVector_0.32.0                httpuv_1.6.1                 
## [25] htmltools_0.5.1.1             Matrix_1.3-3                 
## [27] pkgconfig_2.0.3               bookdown_0.22                
## [29] zlibbioc_1.38.0               xtable_1.8-4                 
## [31] purrr_0.3.4                   later_1.2.0                  
## [33] tibble_3.1.2                  KEGGREST_1.32.0              
## [35] generics_0.1.0                DT_0.18                      
## [37] ellipsis_0.3.2                withr_2.4.2                  
## [39] cachem_1.0.5                  mime_0.10                    
## [41] magrittr_2.0.1                crayon_1.4.1                 
## [43] memoise_2.0.0                 evaluate_0.14                
## [45] fansi_0.4.2                   tools_4.1.0                  
## [47] lifecycle_1.0.0               stringr_1.4.0                
## [49] DelayedArray_0.18.0           AnnotationDbi_1.54.0         
## [51] Biostrings_2.60.0             compiler_4.1.0               
## [53] jquerylib_0.1.4               rlang_0.4.11                 
## [55] grid_4.1.0                    RCurl_1.98-1.3               
## [57] htmlwidgets_1.5.3             rappdirs_0.3.3               
## [59] crosstalk_1.1.1               bitops_1.0-7                 
## [61] rmarkdown_2.8                 ExperimentHub_2.0.0          
## [63] DBI_1.1.1                     curl_4.3.1                   
## [65] R6_2.5.0                      knitr_1.33                   
## [67] dplyr_1.0.6                   fastmap_1.1.0                
## [69] bit_4.0.4                     utf8_1.2.1                   
## [71] filelock_1.0.2                stringi_1.6.2                
## [73] Rcpp_1.0.6                    vctrs_0.3.8                  
## [75] png_0.1-7                     sparseMatrixStats_1.4.0      
## [77] dbplyr_2.1.1                  tidyselect_1.1.1             
## [79] xfun_0.23

References

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nature Immunology 20 (2): 163–72.

Benayoun, Bérénice A., Elizabeth A. Pollina, Param Priya Singh, Salah Mahmoudi, Itamar Harel, Kerriann M. Casey, Ben W. Dulken, Anshul Kundaje, and Anne Brunet. 2019. “Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses.” Genome Research 29: 697–709. https://doi.org/10.1101/gr.240093.118.

Heng, Tracy S.P., Michio W. Painter, Kutlu Elpek, Veronika Lukacs-Kornek, Nora Mauermann, Shannon J. Turley, Daphne Koller, et al. 2008. “The immunological genome project: Networks of gene expression in immune cells.” Nature Immunology 9 (10): 1091–4. https://doi.org/10.1038/ni1008-1091.

Mabbott, Neil A., J. K. Baillie, Helen Brown, Tom C. Freeman, and David A. Hume. 2013. “An expression atlas of human primary cells: Inference of gene function from coexpression networks.” BMC Genomics 14. https://doi.org/10.1186/1471-2164-14-632.

Martens, Joost H A, and Hendrik G. Stunnenberg. 2013. “BLUEPRINT: Mapping human blood cell epigenomes.” Haematologica 98: 1487–9. https://doi.org/10.3324/haematol.2013.094243.

Monaco, Gianni, Bernett Lee, Weili Xu, Seri Mustafah, You Yi Hwang, Christophe Carré, Nicolas Burdin, et al. 2019. “RNA-Seq Signatures Normalized by mRNA Abundance Allow Absolute Deconvolution of Human Immune Cell Types.” Cell Reports 26 (6): 1627–1640.e7. https://doi.org/10.1016/j.celrep.2019.01.041.

Novershtern, Noa, Aravind Subramanian, Lee N. Lawton, Raymond H. Mak, W. Nicholas Haining, Marie E. McConkey, Naomi Habib, et al. 2011. “Densely Interconnected Transcriptional Circuits Control Cell States in Human Hematopoiesis.” Cell 144 (2): 296–309. https://doi.org/10.1016/j.cell.2011.01.004.

Schmiedel, Benjamin J., Divya Singh, Ariel Madrigal, Alan G. Valdovino-Gonzalez, Brandie M. White, Jose Zapardiel-Gonzalo, Brendan Ha, et al. 2018. “Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression.” Cell 175 (6): 1701–1715.e16. https://doi.org/10.1016/j.cell.2018.10.022.

The ENCODE Project Consortium. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature. https://doi.org/10.1038/nature11247.