Chapter 3 Controlling marker detection
3.1 Overview
One of the most important steps in SingleR (beyond the choice of reference, of course) is the derivation of the marker genes used in the score calculation. We have already introduced the classic approach in the previous chapter, but it is similarly straightforward to perform marker detection with conventional statistical tests. In particular, we identify top-ranked markers based on pairwise Wilcoxon rank sum tests or \(t\)-tests between labels; this allows us to account for the variability across cells to choose genes that are robustly upregulated in each label.
The availability of variance-aware marker detection methods is most relevant for reference datasets
that contain a reasonable number (i.e., at least two) of replicate samples for each label.
An obvious use case is that of single-cell datasets that are used as a reference to annotate other single-cell datasets.
It is also possible for users to supply their own custom marker lists to SingleR()
,
facilitating incorporation of prior biological knowledge into the annotation process.
We will demonstrate these capabilities below in this chapter.
3.2 Annotation with test-based marker detection
To demonstrate, we will use two human pancreas scRNA-seq datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference, computing log-normalized expression values as discussed in Section 2.4.
library(scRNAseq)
sceM <- MuraroPancreasData()
# Removing unlabelled cells or cells without a clear label.
sceM <- sceM[,!is.na(sceM$label) & sceM$label!="unclear"]
library(scater)
sceM <- logNormCounts(sceM)
sceM
## class: SingleCellExperiment
## dim: 19059 2122
## metadata(0):
## assays(2): counts logcounts
## rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
## ZZZ3__chr1
## rowData names(2): symbol chr
## colnames(2122): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94
## colData names(4): label donor plate sizeFactor
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC
##
## acinar alpha beta delta duct endothelial
## 219 812 448 193 245 21
## epsilon mesenchymal pp
## 3 80 101
We then set up our test dataset from Grun et al. (2016), applying some basic quality control as discusssed
here
and in Section 2.3.
We also compute the log-transformed values here, not because it is strictly necessary
but so that we don’t have to keep on typing assay.type.test=1
in later calls to SingleR()
.
sceG <- GrunPancreasData()
sceG <- addPerCellQC(sceG)
qc <- quickPerCellQC(colData(sceG),
percent_subsets="altexps_ERCC_percent",
batch=sceG$donor,
subset=sceG$donor %in% c("D17", "D7", "D2"))
sceG <- sceG[,!qc$discard]
sceG <- logNormCounts(sceG)
sceG
## class: SingleCellExperiment
## dim: 20064 1064
## metadata(0):
## assays(2): counts logcounts
## rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
## ZZZ3__chr1
## rowData names(2): symbol chr
## colnames(1064): D2ex_1 D2ex_2 ... D17TGFB_94 D17TGFB_95
## colData names(9): donor sample ... total sizeFactor
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC
We run SingleR()
as described previously but with a marker detection mode that considers the variance of expression across cells.
Here, we will use the Wilcoxon ranked sum test to identify the top markers for each pairwise comparison between labels.
This is slower but more appropriate for single-cell data compared to the default marker detection algorithm,
as the latter may fail for low-coverage data where the median for each label is often zero.
library(SingleR)
pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)
##
## acinar alpha beta delta duct endothelial
## 277 203 181 50 306 5
## epsilon mesenchymal pp
## 1 22 19
By default, the function will take the top de.n
(default: 10) genes from each pairwise comparison between labels.
A larger number of markers increases the robustness of the annotation by ensuring that relevant genes are not omitted,
especially if the reference dataset has study-specific effects that cause uninteresting genes to dominate the top set.
However, this comes at the cost of increasing noise and computational time.
library(SingleR)
pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label,
de.method="wilcox", de.n=50)
table(pred.grun$labels)
##
## acinar alpha beta delta duct endothelial
## 275 203 177 55 307 5
## epsilon mesenchymal pp
## 1 23 18
3.3 Defining custom markers
The marker detection in SingleR()
is built on top of the testing framework in scran,
so most options in ?pairwiseWilcox
and friends can be applied via the de.args=
option.
For example, we could use the \(t\)-test and test against a log-fold change threshold with de.args=list(lfc=1)
.
library(SingleR)
pred.grun2 <- SingleR(test=sceG, ref=sceM, labels=sceM$label,
de.method="t", de.args=list(lfc=1))
table(pred.grun2$labels)
##
## acinar alpha beta delta duct endothelial
## 285 200 177 54 296 5
## epsilon mesenchymal pp
## 5 24 18
However, users can also construct their own marker lists with any DE testing machinery. For example, we can perform pairwise binomial tests to identify genes that are differentially detected (i.e., have differences in the proportion of cells with non-zero counts) between labels in the reference Muraro dataset. We then take the top 10 marker genes from each pairwise comparison, obtaining a list of lists of character vectors containing the identities of the markers for that comparison.
library(scran)
out <- pairwiseBinom(counts(sceM), sceM$label, direction="up")
markers <- getTopMarkers(out$statistics, out$pairs, n=10)
# Upregulated in acinar compared to alpha:
markers$acinar$alpha
## [1] "KCNQ1__chr11" "FAM129A__chr1" "KLK1__chr19" "NTN4__chr12"
## [5] "RASEF__chr9" "CTRL__chr16" "LGALS2__chr22" "NUPR1__chr16"
## [9] "LGALS3__chr14" "NR5A2__chr1"
## [1] "SLC38A4__chr12" "ARX__chrX" "CRYBA2__chr2" "FSTL5__chr4"
## [5] "GNG2__chr14" "NOL4__chr18" "IRX2__chr5" "KCNMB2__chr3"
## [9] "CFC1__chr2" "KCNJ6__chr21"
Once we have this list of lists, we supply it to SingleR()
via the genes=
argument,
which causes the function to bypass the internal marker detection to use the supplied gene sets instead.
The most obvious benefit of this approach is that the user can achieve greater control of the markers,
allowing integration of prior biological knowledge to obtain more relevant genes and a more robust annotation.
pred.grun2b <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=markers)
table(pred.grun2b$labels)
##
## acinar alpha beta delta duct endothelial
## 276 202 175 54 302 6
## epsilon mesenchymal pp
## 2 24 23
In some cases, markers may only be available for specific labels rather than for pairwise comparisons between labels.
This is accommodated by supplying a named list of character vectors to genes
.
Note that this is likely to be less powerful than the list-of-lists approach as information about pairwise differences is discarded.
# Creating label-specific markers.
label.markers <- lapply(markers, unlist)
label.markers <- lapply(label.markers, unique)
str(label.markers)
## List of 9
## $ acinar : chr [1:40] "KCNQ1__chr11" "FAM129A__chr1" "KLK1__chr19" "NTN4__chr12" ...
## $ alpha : chr [1:41] "SLC38A4__chr12" "ARX__chrX" "CRYBA2__chr2" "FSTL5__chr4" ...
## $ beta : chr [1:47] "ELAVL4__chr1" "PRUNE2__chr9" "NMNAT2__chr1" "PLCB4__chr20" ...
## $ delta : chr [1:44] "NOL4__chr18" "CABP7__chr22" "UNC80__chr2" "HEPACAM2__chr7" ...
## $ duct : chr [1:50] "ADCY5__chr3" "PDE3A__chr12" "SLC3A1__chr2" "BICC1__chr10" ...
## $ endothelial: chr [1:26] "GPR4__chr19" "TMEM204__chr16" "GPR116__chr6" "CYYR1__chr21" ...
## $ epsilon : chr [1:14] "BHMT__chr5" "JPH3__chr16" "SERPINA10__chr14" "UGT2B4__chr4" ...
## $ mesenchymal: chr [1:34] "TNFAIP6__chr2" "THBS2__chr6" "CDH11__chr16" "SRPX2__chrX" ...
## $ pp : chr [1:44] "SERTM1__chr13" "ETV1__chr7" "ARX__chrX" "ELAVL4__chr1" ...
pred.grun2c <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=label.markers)
table(pred.grun2c$labels)
##
## acinar alpha beta delta duct endothelial
## 262 204 169 59 317 6
## epsilon mesenchymal pp
## 2 24 21
3.4 Pseudo-bulk aggregation
Single-cell reference datasets provide a like-for-like comparison to our test single-cell datasets, yielding a more accurate classification of the cells in the latter (hopefully). However, there are frequently many more samples in single-cell references compared to bulk references, increasing the computational work involved in classification. We overcome this by aggregating cells into one “pseudo-bulk” sample per label (e.g., by averaging across log-expression values) and using that as the reference profile, which allows us to achieve the same efficiency as the use of bulk references.
The obvious cost of this approach is that we discard potentially useful information about the distribution of cells within each label. Cells that belong to a heterogeneous population may not be correctly assigned if they are far from the population center. To preserve some of this information, we perform \(k\)-means clustering within each label to create pseudo-bulk samples that are representative of a particular region of the expression space (i.e., vector quantization). We create \(\sqrt{N}\) clusters given a label with \(N\) cells, which provides a reasonable compromise between reducing computational work and preserving the label’s internal distribution.
To enable this aggregation, we simply set aggr.ref=TRUE
in the SingleR()
call.
This uses the aggregateReference()
function to perform \(k\)-means clustering within each label
(typically after principal components analysis on the log-expression matrix, for greater speed)
and average expression values for each within-label cluster.
Note that marker detection is still performed on the unaggregated data
so as to make full use of the distribution of expression values across cells.
set.seed(100) # for the k-means step.
pred.grun3 <- SingleR(test=sceG, ref=sceM, labels=sceM$label,
de.method="wilcox", aggr.ref=TRUE)
table(pred.grun3$labels)
##
## acinar alpha beta delta duct endothelial
## 271 202 181 51 311 5
## epsilon mesenchymal pp
## 1 22 20
Obviously, the aggregation itself requires computational work so setting aggr.ref=TRUE
in SingleR()
itself may not improve speed.
Rather, the real power of this approach lies in pre-aggregating the reference dataset
so that it can be repeatedly applied to quickly annotate multiple test datasets.
This approach is discussed in more detail in Chapter 7.
Session information
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scran_1.26.0 SingleR_2.0.0
[3] scater_1.26.0 ggplot2_3.3.6
[5] scuttle_1.8.0 scRNAseq_2.11.0
[7] SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
[9] Biobase_2.58.0 GenomicRanges_1.50.0
[11] GenomeInfoDb_1.34.0 IRanges_2.32.0
[13] S4Vectors_0.36.0 BiocGenerics_0.44.0
[15] MatrixGenerics_1.10.0 matrixStats_0.62.0
[17] BiocStyle_2.26.0 rebook_1.8.0
loaded via a namespace (and not attached):
[1] AnnotationHub_3.6.0 BiocFileCache_2.6.0
[3] igraph_1.3.5 lazyeval_0.2.2
[5] BiocParallel_1.32.0 digest_0.6.30
[7] ensembldb_2.22.0 htmltools_0.5.3
[9] viridis_0.6.2 fansi_1.0.3
[11] magrittr_2.0.3 memoise_2.0.1
[13] ScaledMatrix_1.6.0 cluster_2.1.4
[15] limma_3.54.0 Biostrings_2.66.0
[17] prettyunits_1.1.1 colorspace_2.0-3
[19] blob_1.2.3 rappdirs_0.3.3
[21] ggrepel_0.9.1 xfun_0.34
[23] dplyr_1.0.10 crayon_1.5.2
[25] RCurl_1.98-1.9 jsonlite_1.8.3
[27] graph_1.76.0 glue_1.6.2
[29] gtable_0.3.1 zlibbioc_1.44.0
[31] XVector_0.38.0 DelayedArray_0.24.0
[33] BiocSingular_1.14.0 scales_1.2.1
[35] edgeR_3.40.0 DBI_1.1.3
[37] Rcpp_1.0.9 viridisLite_0.4.1
[39] xtable_1.8-4 progress_1.2.2
[41] dqrng_0.3.0 bit_4.0.4
[43] rsvd_1.0.5 metapod_1.6.0
[45] httr_1.4.4 dir.expiry_1.6.0
[47] ellipsis_0.3.2 pkgconfig_2.0.3
[49] XML_3.99-0.12 CodeDepends_0.6.5
[51] sass_0.4.2 dbplyr_2.2.1
[53] locfit_1.5-9.6 utf8_1.2.2
[55] tidyselect_1.2.0 rlang_1.0.6
[57] later_1.3.0 AnnotationDbi_1.60.0
[59] munsell_0.5.0 BiocVersion_3.16.0
[61] tools_4.2.1 cachem_1.0.6
[63] cli_3.4.1 generics_0.1.3
[65] RSQLite_2.2.18 ExperimentHub_2.6.0
[67] evaluate_0.17 stringr_1.4.1
[69] fastmap_1.1.0 yaml_2.3.6
[71] knitr_1.40 bit64_4.0.5
[73] purrr_0.3.5 KEGGREST_1.38.0
[75] AnnotationFilter_1.22.0 sparseMatrixStats_1.10.0
[77] mime_0.12 xml2_1.3.3
[79] biomaRt_2.54.0 compiler_4.2.1
[81] beeswarm_0.4.0 filelock_1.0.2
[83] curl_4.3.3 png_0.1-7
[85] interactiveDisplayBase_1.36.0 statmod_1.4.37
[87] tibble_3.1.8 bslib_0.4.0
[89] stringi_1.7.8 GenomicFeatures_1.50.0
[91] bluster_1.8.0 lattice_0.20-45
[93] ProtGenerics_1.30.0 Matrix_1.5-1
[95] vctrs_0.5.0 pillar_1.8.1
[97] lifecycle_1.0.3 BiocManager_1.30.19
[99] jquerylib_0.1.4 BiocNeighbors_1.16.0
[101] bitops_1.0-7 irlba_2.3.5.1
[103] httpuv_1.6.6 rtracklayer_1.58.0
[105] R6_2.5.1 BiocIO_1.8.0
[107] bookdown_0.29 promises_1.2.0.1
[109] gridExtra_2.3 vipor_0.4.5
[111] codetools_0.2-18 assertthat_0.2.1
[113] rjson_0.2.21 withr_2.5.0
[115] GenomicAlignments_1.34.0 Rsamtools_2.14.0
[117] GenomeInfoDbData_1.2.9 parallel_4.2.1
[119] hms_1.1.2 grid_4.2.1
[121] beachmat_2.14.0 rmarkdown_2.17
[123] DelayedMatrixStats_1.20.0 shiny_1.7.3
[125] ggbeeswarm_0.6.0 restfulr_0.0.15
Bibliography
Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2): 266–77.
Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4): 385–94.