---
bibliography: ref.bib
---
# Advanced options
## Preconstructed indices
Advanced users can split the `SingleR()` workflow into two separate training and classification steps.
This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once
for any reference.
The resulting data structure can then be re-used across multiple classifications with different test datasets,
provided the gene annotation in the test dataset is identical to or a superset of the genes in the training set.
To illustrate, we will consider the DICE reference dataset [@diceRef] from the *[celldex](https://bioconductor.org/packages/3.21/celldex)* package.
``` r
library(celldex)
dice <- DatabaseImmuneCellExpressionData(ensembl=TRUE)
dice
```
```
## class: SummarizedExperiment
## dim: 29914 1561
## metadata(0):
## assays(1): logcounts
## rownames(29914): ENSG00000121410 ENSG00000268895 ... ENSG00000159840
## ENSG00000074755
## rowData names(0):
## colnames(1561): TPM_1 TPM_2 ... TPM_101 TPM_102
## colData names(3): label.main label.fine label.ont
```
``` r
table(dice$label.fine)
```
```
##
## B cells, naive Monocytes, CD14+
## 106 106
## Monocytes, CD16+ NK cells
## 105 105
## T cells, CD4+, TFH T cells, CD4+, Th1
## 104 104
## T cells, CD4+, Th17 T cells, CD4+, Th1_17
## 104 104
## T cells, CD4+, Th2 T cells, CD4+, memory TREG
## 104 104
## T cells, CD4+, naive T cells, CD4+, naive TREG
## 103 104
## T cells, CD4+, naive, stimulated T cells, CD8+, naive
## 102 104
## T cells, CD8+, naive, stimulated
## 102
```
Let's say we want to use the DICE reference to annotate the PBMC dataset from Chapter \@ref(introduction).
``` r
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc3k")
```
We use the `trainSingleR()` function to do all the necessary calculations that are independent of the test dataset.
This yields a list of various components that contains all identified marker genes
and precomputed rank indices to be used in the score calculation.
We can also turn on aggregation with `aggr.ref=TRUE` (Section \@ref(pseudo-bulk-aggregation))
to further reduce computational work.
Note that we need the identities of the genes in the test dataset (hence, `test.genes=`) to ensure that our chosen markers will actually be present in the test.
``` r
library(SingleR)
set.seed(2000)
trained <- trainSingleR(dice, labels=dice$label.fine,
test.genes=rownames(sce), aggr.ref=TRUE)
```
We then use the `trained` object to annotate our dataset of interest through the `classifySingleR()` function.
As we can see, this yields exactly the same result as applying `SingleR()` directly.
The advantage here is that `trained` can be re-used for multiple `classifySingleR()` calls -
possibly on different datasets - without having to repeat unnecessary steps when the reference is unchanged.
``` r
pred <- classifySingleR(sce, trained, assay.type=1)
table(pred$labels)
```
```
##
## B cells, naive Monocytes, CD14+
## 344 516
## Monocytes, CD16+ NK cells
## 185 313
## T cells, CD4+, TFH T cells, CD4+, Th1
## 456 220
## T cells, CD4+, Th17 T cells, CD4+, Th1_17
## 59 64
## T cells, CD4+, Th2 T cells, CD4+, memory TREG
## 45 146
## T cells, CD4+, naive T cells, CD4+, naive TREG
## 117 25
## T cells, CD8+, naive
## 210
```
``` r
# Comparing to the direct approach.
set.seed(2000)
direct <- SingleR(sce, ref=dice, labels=dice$label.fine,
assay.type.test=1, aggr.ref=TRUE)
identical(pred$labels, direct$labels)
```
```
## [1] TRUE
```
## Parallelization
Parallelization is an obvious approach to increasing annotation throughput.
This is done using the framework in the *[BiocParallel](https://bioconductor.org/packages/3.21/BiocParallel)* package,
which provides several options for parallelization depending on the available hardware.
On POSIX-compliant systems (i.e., Linux and MacOS), the simplest method is to use forking
by passing `MulticoreParam()` to the `BPPARAM=` argument:
``` r
library(BiocParallel)
pred2a <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine,
BPPARAM=MulticoreParam(8)) # 8 CPUs.
```
Alternatively, one can use separate processes with `SnowParam()`,
which is slower but can be used on all systems - including Windows, our old nemesis.
``` r
pred2b <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine,
BPPARAM=SnowParam(8))
identical(pred2a$labels, pred2b$labels)
```
```
## [1] TRUE
```
When working on a cluster, passing `BatchtoolsParam()` to `SingleR()` allows us to
seamlessly interface with various job schedulers like SLURM, LSF and so on.
This permits heavy-duty parallelization across hundreds of CPUs for highly intensive jobs,
though often some configuration is required -
see the [vignette](https://bioconductor.org/packages/3.21/BiocParallel/vignettes/BiocParallel_BatchtoolsParam.pdf) for more details.
## Approximate algorithms
It is possible to sacrifice accuracy to squeeze more speed out of *[SingleR](https://bioconductor.org/packages/3.21/SingleR)*.
The most obvious approach is to simply turn off the fine-tuning with `fine.tune=FALSE`,
which avoids the time-consuming fine-tuning iterations.
When the reference labels are well-separated, this is probably an acceptable trade-off.
``` r
pred3a <- SingleR(sce, ref=dice, assay.type.test=1,
labels=dice$label.main, fine.tune=FALSE)
table(pred3a$labels)
```
```
##
## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
## 348 705 357 950 340
```
Another approximation is based on the fact that the initial score calculation is done using a nearest-neighbors search.
By default, this is an exact seach but we can switch to an approximate algorithm via the `BNPARAM=` argument.
In the example below, we use the [Annoy algorithm](https://github.com/spotify/annoy)
via the *[BiocNeighbors](https://bioconductor.org/packages/3.21/BiocNeighbors)* framework, which yields mostly similar results.
(Note, though, that the Annoy method does involve a considerable amount of overhead,
so for small jobs it will actually be slower than the exact search.)
``` r
library(BiocNeighbors)
pred3b <- SingleR(sce, ref=dice, assay.type.test=1,
labels=dice$label.main, fine.tune=FALSE, # for comparison with pred3a.
BNPARAM=AnnoyParam())
table(pred3a$labels, pred3b$labels)
```
```
##
## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
## B cells 348 0 0 0 0
## Monocytes 0 705 0 0 0
## NK cells 0 0 357 0 0
## T cells, CD4+ 0 0 0 950 0
## T cells, CD8+ 0 0 0 0 340
```
## Cluster-level annotation
The default philosophy of *[SingleR](https://bioconductor.org/packages/3.21/SingleR)* is to perform annotation of each individual cell in the test dataset.
An alternative strategy is to perform annotation of aggregated profiles for groups or clusters of cells.
To demonstrate, we will perform a quick-and-dirty clustering of our PBMC dataset with a variety of Bioconductor packages.
``` r
library(scuttle)
sce <- logNormCounts(sce)
library(scran)
dec <- modelGeneVarByPoisson(sce)
sce <- denoisePCA(sce, dec, subset.row=getTopHVGs(dec, n=5000))
library(bluster)
colLabels(sce) <- clusterRows(reducedDim(sce), NNGraphParam())
library(scater)
set.seed(117)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="label")
```
By passing `clusters=` to `SingleR()`, we direct the function to compute an aggregated profile per cluster.
Annotation is then performed on the cluster-level profiles rather than on the single-cell level.
This has the major advantage of being much faster to compute as there are obviously fewer clusters than cells;
it is also easier to interpret as it directly returns the likely cell type identity of each cluster.
``` r
SingleR(sce, dice, clusters=colLabels(sce), labels=dice$label.main)
```
```
## DataFrame with 12 rows and 4 columns
## scores labels delta.next pruned.labels
##
## 1 0.2064515:0.234911:0.365014:... T cells, CD4+ 0.0477942 T cells, CD4+
## 2 0.2252281:0.623581:0.205190:... Monocytes 0.3983524 Monocytes
## 3 0.0550041:0.270787:0.728557:... NK cells 0.3343725 NK cells
## 4 0.1427138:0.781610:0.209363:... Monocytes 0.5722475 Monocytes
## 5 0.1740008:0.756285:0.254398:... Monocytes 0.5018874 Monocytes
## ... ... ... ... ...
## 8 0.1527166:0.235676:0.533110:... T cells, CD4+ 0.0614282 T cells, CD4+
## 9 0.2055489:0.277134:0.405779:... T cells, CD4+ 0.1168394 T cells, CD4+
## 10 0.1403238:0.258656:0.605894:... NK cells 0.0820843 NK cells
## 11 0.2535745:0.258933:0.328738:... T cells, CD4+ 0.0569244 T cells, CD4+
## 12 0.0713926:0.223101:0.117047:... Monocytes 0.1060540 NA
```
This approach assumes that each cluster in the test dataset corresponds to exactly one reference label.
If a cluster actually contains a mixture of multiple labels, this will not be reflected in its lone assigned label.
(We note that it would be very difficult to determine the composition of the mixture from the `SingleR()` scores.)
Indeed, there is no guarantee that the clustering is driven by the same factors that distinguish the reference labels,
decreasing the reliability of the annotations when novel heterogeneity is present in the test dataset.
The default per-cell strategy is safer and provides more information about the ambiguity of the annotations,
which is important for closely related labels where a close correspondence between clusters and labels cannot be expected.
## Session information {-}