---
bibliography: ref.bib
---
# Annotation diagnostics
## Overview
In addition to the labels, `SingleR()` returns a number of helpful diagnostics about the annotation process
that can be used to determine whether the assignments are appropriate.
Unambiguous assignments corroborated by expression of canonical markers add confidence to the results;
conversely, low-confidence assignments can be pruned out to avoid adding noise to downstream analyses.
This chapter will demonstrate some of these common sanity checks on the pancreas datasets
from Chapter \@ref(sc-mode) [@muraro2016singlecell;@grun2016denovo].
## Based on the scores within cells
The most obvious diagnostic reported by `SingleR()` is the nested matrix of per-cell scores in the `scores` field.
This contains the correlation-based scores prior to any fine-tuning for each cell (row) and reference label (column).
Ideally, we would see unambiguous assignments where, for any given cell, one label's score is clearly larger than the others.
```r
pred.grun$scores[1:10,]
```
```
## acinar alpha beta delta duct endothelial epsilon mesenchymal
## [1,] 0.6797 0.1463 0.1554 0.1029 0.4971 0.2030 0.04434 0.1848
## [2,] 0.6832 0.1849 0.1611 0.1250 0.5133 0.2231 0.08909 0.2140
## [3,] 0.6845 0.2032 0.2141 0.1875 0.5060 0.2286 0.08960 0.1961
## [4,] 0.6472 0.2050 0.2115 0.1879 0.7300 0.2285 0.12107 0.2708
## [5,] 0.6079 0.2285 0.2419 0.1938 0.6616 0.2659 0.16007 0.3228
## [6,] 0.5816 0.2789 0.2611 0.2661 0.6827 0.2857 0.21781 0.3151
## [7,] 0.5704 0.3268 0.2990 0.2618 0.6293 0.2947 0.20834 0.3263
## [8,] 0.5692 0.2455 0.2152 0.2033 0.6495 0.2652 0.18469 0.2943
## [9,] 0.6421 0.2101 0.1929 0.1894 0.7201 0.2795 0.14615 0.3142
## [10,] 0.7153 0.2109 0.1747 0.1606 0.5896 0.1687 0.15639 0.1778
## pp
## [1,] 0.08954
## [2,] 0.11207
## [3,] 0.15207
## [4,] 0.14351
## [5,] 0.18475
## [6,] 0.21898
## [7,] 0.25871
## [8,] 0.19160
## [9,] 0.17281
## [10,] 0.16314
```
To check whether this is indeed the case,
we use the `plotScoreHeatmap()` function to visualize the score matrix (Figure \@ref(fig:score-heatmap-grun)).
Here, the key is to examine the spread of scores within each cell, i.e., down the columns of the heatmap.
Similar scores for a group of labels indicates that the assignment is uncertain for those columns,
though this may be acceptable if the uncertainty is distributed across closely related cell types.
(Note that the assigned label for a cell may not be the visually top-scoring label if fine-tuning is applied,
as the only the pre-tuned scores are directly comparable across all labels.)
```r
library(SingleR)
plotScoreHeatmap(pred.grun)
```
(\#fig:score-heatmap-grun)Heatmap of normalized scores for the Grun dataset. Each cell is a column while each row is a label in the reference Muraro dataset. The final label (after fine-tuning) for each cell is shown in the top color bar.
We can also display other metadata information for each cell by setting `clusters=` or `annotation_col=`.
This is occasionally useful for examining potential batch effects,
differences in cell type composition between conditions,
relationship to clusters from an unsupervised analysis and so on,.
For example, Figure \@ref(fig:score-heatmap-grun-donor) displays the donor of origin for each cell;
we can see that each cell type has contributions from multiple donors,
which is reassuring as it indicates that our assignments are not (purely) driven by donor effects.
```r
plotScoreHeatmap(pred.grun,
annotation_col=as.data.frame(colData(sceG)[,"donor",drop=FALSE]))
```
(\#fig:score-heatmap-grun-donor)Heatmap of normalized scores for the Grun dataset, including the donor of origin for each cell.
The `scores` matrix has several caveats associated with its interpretation.
Only the pre-tuned scores are stored in this matrix, as scores after fine-tuning are not comparable across all labels.
This means that the label with the highest score for a cell may not be the cell's final label if fine-tuning is applied.
Moreover, the magnitude of differences in the scores has no clear interpretation;
indeed, `plotScoreHeatmap()` dispenses with any faithful representation of the scores
and instead adjusts the values to highlight any differences between labels within each cell.
## Based on the deltas across cells
We identify poor-quality or ambiguous assignments based on the per-cell "delta",
i.e., the difference between the score for the assigned label and the median across all labels for each cell.
Our assumption is that most of the labels in the reference are not relevant to any given cell.
Thus, the median across all labels can be used as a measure of the baseline correlation,
while the gap from the assigned label to this baseline can be used as a measure of the assignment confidence.
Low deltas indicate that the assignment is uncertain, possibly because the cell's true label does not exist in the reference.
An obvious next step is to apply a threshold on the delta to filter out these low-confidence assignments.
We use the delta rather than the assignment score as the latter is more sensitive to technical effects.
For example, changes in library size affect the technical noise and can increase/decrease all scores for a given cell,
while the delta is somewhat more robust as it focuses on the differences between scores within each cell.
`SingleR()` will set a threshold on the delta for each label using an outlier-based strategy.
Specifically, we identify cells with deltas that are small outliers relative to the deltas of other cells with the same label.
This assumes that, for any given label, most cells assigned to that label are correct.
We focus on outliers to avoid difficulties with setting a fixed threshold,
especially given that the magnitudes of the deltas are about as uninterpretable as the scores themselves.
Pruned labels are reported in the `pruned.labels` field where low-quality assignments are replaced with `NA`.
```r
to.remove <- is.na(pred.grun$pruned.labels)
table(Label=pred.grun$labels, Removed=to.remove)
```
```
## Removed
## Label FALSE TRUE
## acinar 251 26
## alpha 198 5
## beta 180 1
## delta 49 1
## duct 301 5
## endothelial 4 1
## epsilon 1 0
## mesenchymal 22 0
## pp 17 2
```
However, the default pruning parameters may not be appropriate for every dataset.
For example, if one label is consistently misassigned, the assumption that most cells are correctly assigned will not be appropriate.
In such cases, we can revert to a fixed threshold by manually calling the underlying `pruneScores()` function with `min.diff.med=`.
The example below discards cells with deltas below an arbitrary threshold of 0.2,
where higher thresholds correspond to greater assignment certainty.
```r
to.remove <- pruneScores(pred.grun, min.diff.med=0.2)
table(Label=pred.grun$labels, Removed=to.remove)
```
```
## Removed
## Label FALSE TRUE
## acinar 250 27
## alpha 155 48
## beta 148 33
## delta 33 17
## duct 301 5
## endothelial 4 1
## epsilon 0 1
## mesenchymal 22 0
## pp 4 15
```
This entire process can be visualized using the `plotScoreDistribution()` function,
which displays the per-label distribution of the deltas across cells (Figure \@ref(fig:score-dist-grun)).
We can use this plot to check that outlier detection in `pruneScores()` behaved sensibly.
Labels with especially low deltas may warrant some additional caution in their interpretation.
```r
plotDeltaDistribution(pred.grun)
```
(\#fig:score-dist-grun)Distribution of deltas for the Grun dataset. Each facet represents a label in the Muraro dataset, and each point represents a cell assigned to that label (colored by whether it was pruned).
If fine-tuning was performed, we can apply an even more stringent filter
based on the difference between the highest and second-highest scores after fine-tuning.
Cells will only pass the filter if they are assigned to a label that is clearly distinguishable from any other label.
In practice, this approach tends to be too conservative as assignments involving closely related labels are heavily penalized.
```r
to.remove2 <- pruneScores(pred.grun, min.diff.next=0.1)
table(Label=pred.grun$labels, Removed=to.remove2)
```
```
## Removed
## Label FALSE TRUE
## acinar 235 42
## alpha 166 37
## beta 117 64
## delta 25 25
## duct 157 149
## endothelial 4 1
## epsilon 0 1
## mesenchymal 22 0
## pp 9 10
```
## Based on marker gene expression
Another simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset.
The marker genes used for each label are reported in the `metadata()` of the `SingleR()` output, so we can simply retrieve them to visualize their (usually log-transformed) expression values across the test dataset.
In Figure \@ref(fig:grun-beta-heat), we use the `plotHeatmap()` function from *[scater](https://bioconductor.org/packages/3.18/scater)* to examine the expression of markers used to identify beta cells.
```r
all.markers <- metadata(pred.grun)$de.genes
beta.markers <- unique(unlist(all.markers$beta))
sceG$labels <- pred.grun$labels
library(scater)
plotHeatmap(sceG, order_columns_by="labels", features=beta.markers)
```
(\#fig:grun-beta-heat)Heatmap of log-expression values in the Grun dataset for all marker genes upregulated in beta cells in the Muraro reference dataset. Assigned labels for each cell are shown at the top of the plot.
If a cell in the test dataset is confidently assigned to a particular label,
we would expect it to have strong expression of that label's markers.
We would also hope that those label's markers are biologically meaningful;
in this case, we do observe strong upregulation of insulin (_INS_) in the beta cells,
which is reassuring and gives greater confidence to the correctness of the assignment.
If the identified markers are not meaningful or not consistently upregulated,
some skepticism towards the quality of the assignments is warranted.
In practice, the heatmap may be overwhelmingly large if there too many reference-derived markers.
To resolve this, we can prune the set of markers to focus on the most interesting genes based on their test expression profiles.
Figure \@ref(fig:grun-beta-heat2) is limited to the top genes with the strongest evidence for upregulation in our test dataset using the assigned labels; such genes are effectively markers for beta cells in both the reference _and_ test datasets.
As a diagnostic plot, this is much more amenable to quick inspection to check that the expected genes are present.
```r
# Taking the first 20 reference markers that are the top empirical markers.
library(scran)
empirical.markers <- findMarkers(sceG, sceG$labels, direction="up")
m <- match(beta.markers, rownames(empirical.markers$beta))
m <- beta.markers[rank(m) <= 20]
library(scater)
plotHeatmap(sceG, order_columns_by="labels", features=m)
```
(\#fig:grun-beta-heat2)Heatmap of log-expression values in the Grun dataset for all marker genes upregulated in beta cells in the Muraro reference dataset, pruned to those that are also upregulated in the assigned cells in the Grun dataset. Assigned labels for each cell are shown at the top of the plot.
It is straightforward to repeat this process for all labels by wrapping this code in a loop,
as shown below in Figure \@ref(fig:grun-beta-heat-all).
Note that `plotHeatmap()` is not the only function that can be used for this visualization;
we could also use `plotDots()` to create a *[Seurat](https://CRAN.R-project.org/package=Seurat)*-style dot plot,
or we could use other heatmap plotting functions such as `dittoHeatmap()` from *[dittoSeq](https://bioconductor.org/packages/3.18/dittoSeq)*.
```r
collected <- list()
for (lab in unique(pred.grun$labels)) {
lab.markers <- unique(unlist(all.markers[[lab]]))
m <- match(lab.markers, rownames(empirical.markers[[lab]]))
m <- lab.markers[rank(m) <= 20]
collected[[lab]] <- plotHeatmap(sceG, silent=TRUE,
order_columns_by="labels", main=lab, features=m)[[4]]
}
do.call(gridExtra::grid.arrange, collected)
```
(\#fig:grun-beta-heat-all)Heatmaps of log-expression values in the Grun dataset for all marker genes upregulated in each label in the Muraro reference dataset. Assigned labels for each cell are shown at the top of each plot.
In general, the heatmap provides a more interpretable diagnostic visualization than the plots of scores and deltas.
However, it does require more effort to inspect and may not be feasible for large numbers of labels.
It is also difficult to use a heatmap to determine the correctness of assignment for closely related labels.
## Comparing to unsupervised clustering
It can also be instructive to compare the assigned labels to the groupings generated from unsupervised clustering algorithms.
The assumption is that the differences between reference labels are also the dominant factor of variation in the test dataset;
this implies that we should expect strong agreement between the clusters and the assigned labels.
To demonstrate, we'll use the `sceG` from Chapter \@ref(pancreas-case-study)
where clusters have generated using a graph-based method [@xu2015identification].
We compare these clusters to the labels generated by *[SingleR](https://bioconductor.org/packages/3.18/SingleR)*.
Any similarity can be quantified with the adjusted rand index (ARI) with `pairwiseRand()` from the *[bluster](https://bioconductor.org/packages/3.18/bluster)* package.
Large ARIs indicate that the two partitionings are in agreement, though an acceptable definition of "large" is difficult to gauge;
experience suggests that a reasonable level of consistency is achieved at ARIs above 0.5.
```r
library(bluster)
pairwiseRand(sceG$cluster, pred.grun$labels, mode="index")
```
```
## [1] 0.3922
```
In practice, it is more informative to examine the distribution of cells across each cluster/label combination.
Figure \@ref(fig:grun-label-clusters) shows that most clusters are nested within labels,
a difference in resolution that is likely responsible for reducing the ARI.
Clusters containing multiple labels are particularly interesting for diagnostic purposes,
as this suggests that the differences between labels are not strong enough to drive formation of distinct clusters in the test.
```r
tab <- table(cluster=sceG$cluster, label=pred.grun$labels)
pheatmap::pheatmap(log10(tab+10)) # using a larger pseudo-count for smoothing.
```
(\#fig:grun-label-clusters)Heatmap of the log-transformed number of cells in each combination of label (column) and cluster (row) in the Grun dataset.
The underlying assumption is somewhat reasonable in most scenarios where the labels relate to cell type identity.
However, disagreements between the clusters and labels should not be cause for much concern.
The whole point of unsupervised clustering is to identify novel variation that, by definition, is not in the reference.
It is entirely possible for the clustering and labels to be different without compromising the validity or utility of either;
the former captures new heterogeneity while the latter facilitates interpretation in the context of existing knowledge.
## Session information {-}