---
output:
html_document
bibliography: ref.bib
---
# (PART) Differential abundance {-}
# Changes in cluster abundance {#differential-abundance}
## Overview
In a DA analysis, we test for significant changes in per-label cell abundance across conditions.
This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type.
The DA analysis has a long history in flow cytometry [@finak2014opencyto;@lun2017testing] where it is routinely used to examine the effects of different conditions on the composition of complex cell populations.
By performing it here, we effectively treat scRNA-seq as a "super-FACS" technology for defining relevant subpopulations using the entire transcriptome.
We prepare for the DA analysis by quantifying the number of cells assigned to each label (or cluster) in our WT chimeric experiment [@pijuansala2019single].
In this case, we are aiming to identify labels that change in abundance among the compartment of injected cells compared to the background.
```r
abundances <- table(merged$celltype.mapped, merged$sample)
abundances <- unclass(abundances)
head(abundances)
```
```
##
## 5 6 7 8 9 10
## Allantois 97 15 139 127 318 259
## Blood progenitors 1 6 3 16 6 8 17
## Blood progenitors 2 31 8 28 21 43 114
## Cardiomyocytes 85 21 79 31 174 211
## Caudal Mesoderm 10 10 9 3 10 29
## Caudal epiblast 2 2 0 0 22 45
```
## Performing the DA analysis
Our DA analysis will again be performed with the *[edgeR](https://bioconductor.org/packages/3.19/edgeR)* package.
This allows us to take advantage of the NB GLM methods to model overdispersed count data in the presence of limited replication -
except that the counts are not of reads per gene, but of cells per label [@lun2017testing].
The aim is to share information across labels to improve our estimates of the biological variability in cell abundance between replicates.
```r
library(edgeR)
# Attaching some column metadata.
extra.info <- colData(merged)[match(colnames(abundances), merged$sample),]
y.ab <- DGEList(abundances, samples=extra.info)
y.ab
```
```
## An object of class "DGEList"
## $counts
##
## 5 6 7 8 9 10
## Allantois 97 15 139 127 318 259
## Blood progenitors 1 6 3 16 6 8 17
## Blood progenitors 2 31 8 28 21 43 114
## Cardiomyocytes 85 21 79 31 174 211
## Caudal Mesoderm 10 10 9 3 10 29
## 29 more rows ...
##
## $samples
## group lib.size norm.factors batch cell barcode sample stage
## 5 1 2298 1 5 cell_9769 AAACCTGAGACTGTAA 5 E8.5
## 6 1 1026 1 6 cell_12180 AAACCTGCAGATGGCA 6 E8.5
## 7 1 2740 1 7 cell_13227 AAACCTGAGACAAGCC 7 E8.5
## 8 1 2904 1 8 cell_16234 AAACCTGCAAACCCAT 8 E8.5
## 9 1 4057 1 9 cell_19332 AAACCTGCAACGATCT 9 E8.5
## 10 1 6401 1 10 cell_23875 AAACCTGAGGCATGTG 10 E8.5
## tomato pool stage.mapped celltype.mapped closest.cell
## 5 TRUE 3 E8.25 Mesenchyme cell_24159
## 6 FALSE 3 E8.25 Somitic mesoderm cell_63247
## 7 TRUE 4 E8.5 Somitic mesoderm cell_25454
## 8 FALSE 4 E8.25 ExE mesoderm cell_139075
## 9 TRUE 5 E8.0 ExE mesoderm cell_116116
## 10 FALSE 5 E8.5 Forebrain/Midbrain/Hindbrain cell_39343
## doub.density sizeFactor label
## 5 0.029850 1.6349 1
## 6 0.291916 2.5981 12
## 7 0.601740 1.5939 11
## 8 0.004733 0.8707 17
## 9 0.079415 0.8933 3
## 10 0.040747 0.3947 7
```
We filter out low-abundance labels as previously described.
This avoids cluttering the result table with very rare subpopulations that contain only a handful of cells.
For a DA analysis of cluster abundances, filtering is generally not required as most clusters will not be of low-abundance (otherwise there would not have been enough evidence to define the cluster in the first place).
```r
keep <- filterByExpr(y.ab, group=y.ab$samples$tomato)
y.ab <- y.ab[keep,]
summary(keep)
```
```
## Mode FALSE TRUE
## logical 10 24
```
Unlike DE analyses, we do not perform an additional normalization step with `calcNormFactors()`.
This means that we are only normalizing based on the "library size", i.e., the total number of cells in each sample.
Any changes we detect between conditions will subsequently represent differences in the proportion of cells in each cluster.
The motivation behind this decision is discussed in more detail in Section \@ref(composition-effects).
We formulate the design matrix with a blocking factor for the batch of origin for each sample and an additive term for the td-Tomato status (i.e., injection effect).
Here, the log-fold change in our model refers to the change in cell abundance after injection, rather than the change in gene expression.
```r
design <- model.matrix(~factor(pool) + factor(tomato), y.ab$samples)
```
We use the `estimateDisp()` function to estimate the NB dispersion for each cluster (Figure \@ref(fig:abplotbcv)).
We turn off the trend as we do not have enough points for its stable estimation.
```r
y.ab <- estimateDisp(y.ab, design, trend="none")
summary(y.ab$common.dispersion)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0614 0.0614 0.0614 0.0614 0.0614 0.0614
```
```r
plotBCV(y.ab, cex=1)
```
(\#fig:abplotbcv)Biological coefficient of variation (BCV) for each label with respect to its average abundance. BCVs are defined as the square root of the NB dispersion. Common dispersion estimates are shown in red.
We repeat this process with the QL dispersion, again disabling the trend (Figure \@ref(fig:abplotql)).
```r
fit.ab <- glmQLFit(y.ab, design, robust=TRUE, abundance.trend=FALSE)
summary(fit.ab$var.prior)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.3 4.3 4.3 4.3 4.3 4.3
```
```r
summary(fit.ab$df.prior)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Inf Inf Inf Inf Inf Inf
```
```r
plotQLDisp(fit.ab, cex=1)
```
(\#fig:abplotql)QL dispersion estimates for each label with respect to its average abundance. Quarter-root values of the raw estimates are shown in black while the shrunken estimates are shown in red. Shrinkage is performed towards the common dispersion in blue.
We test for differences in abundance between td-Tomato-positive and negative samples using `glmQLFTest()`.
We see that extra-embryonic ectoderm is strongly depleted in the injected cells.
This is consistent with the expectation that cells injected into the blastocyst should not contribute to extra-embryonic tissue.
The injected cells also contribute more to the mesenchyme, which may also be of interest.
```r
res <- glmQLFTest(fit.ab, coef=ncol(design))
summary(decideTests(res))
```
```
## factor(tomato)TRUE
## Down 1
## NotSig 22
## Up 1
```
```r
topTags(res)
```
```
## Coefficient: factor(tomato)TRUE
## logFC logCPM F PValue FDR
## ExE ectoderm -6.6233 13.02 35.440 2.956e-07 7.094e-06
## Mesenchyme 1.1767 16.29 18.286 8.994e-05 1.079e-03
## Erythroid3 -0.6352 17.28 6.779 1.224e-02 9.792e-02
## Allantois 0.7617 15.51 6.177 1.648e-02 9.887e-02
## Neural crest -0.8109 14.76 5.157 2.768e-02 1.329e-01
## Cardiomyocytes 0.7306 14.86 4.427 4.065e-02 1.626e-01
## Endothelium 0.7572 14.29 3.589 6.419e-02 2.201e-01
## Haematoendothelial progenitors 0.6129 14.72 3.032 8.803e-02 2.641e-01
## Pharyngeal mesoderm 0.3365 15.72 1.324 2.555e-01 5.718e-01
## Forebrain/Midbrain/Hindbrain -0.2959 16.55 1.296 2.606e-01 5.718e-01
```
## Handling composition effects {#composition-effects}
### Background
As mentioned above, we do not use `calcNormFactors()` in our default DA analysis.
This normalization step assumes that most of the input features are not different between conditions.
While this assumption is reasonable for most types of gene expression data, it is generally too strong for cell type abundance - most experiments consist of only a few cell types that may all change in abundance upon perturbation.
Thus, our default approach is to only normalize based on the total number of cells in each sample, which means that we are effectively testing for differential proportions between conditions.
Unfortunately, the use of the total number of cells leaves us susceptible to composition effects.
For example, a large increase in abundance for one cell subpopulation will introduce decreases in proportion for all other subpopulations - which is technically correct, but may be misleading if one concludes that those other subpopulations are decreasing in abundance of their own volition.
If composition biases are proving problematic for interpretation of DA results, we have several avenues for removing them or mitigating their impact by leveraging _a priori_ biological knowledge.
### Assuming most labels do not change
If it is possible to assume that most labels (i.e., cell types) do not change in abundance, we can use `calcNormFactors()` to compute normalization factors.
This seems to be a fairly reasonable assumption for the WT chimeras where the injection is expected to have only a modest effect at most.
```r
y.ab2 <- calcNormFactors(y.ab)
y.ab2$samples$norm.factors
```
```
## [1] 1.0055 1.0833 1.1658 0.7614 1.0616 0.9743
```
We then proceed with the remainder of the *[edgeR](https://bioconductor.org/packages/3.19/edgeR)* analysis, shown below in condensed format.
Many of the positive log-fold changes are shifted towards zero, consistent with the removal of composition biases from the presence of extra-embryonic ectoderm in only background cells.
In particular, the mesenchyme is no longer significantly DA after injection.
```r
y.ab2 <- estimateDisp(y.ab2, design, trend="none")
# We use `legacy=TRUE` to ensure consistency with previous versions of OSCA.
fit.ab2 <- glmQLFit(y.ab2, design, robust=TRUE, abundance.trend=FALSE, legacy=TRUE)
res2 <- glmQLFTest(fit.ab2, coef=ncol(design))
topTags(res2, n=10)
```
```
## Coefficient: factor(tomato)TRUE
## logFC logCPM F PValue FDR
## ExE ectoderm -6.9215 13.17 70.364 5.738e-11 1.377e-09
## Mesenchyme 0.9513 16.27 6.787 1.219e-02 1.143e-01
## Neural crest -1.0032 14.78 6.464 1.429e-02 1.143e-01
## Erythroid3 -0.8504 17.35 5.517 2.299e-02 1.380e-01
## Cardiomyocytes 0.6400 14.84 2.735 1.047e-01 4.809e-01
## Allantois 0.6054 15.51 2.503 1.202e-01 4.809e-01
## Forebrain/Midbrain/Hindbrain -0.4943 16.55 1.928 1.713e-01 5.178e-01
## Endothelium 0.5482 14.27 1.917 1.726e-01 5.178e-01
## Erythroid2 -0.4818 16.00 1.677 2.015e-01 5.373e-01
## Haematoendothelial progenitors 0.4262 14.73 1.185 2.818e-01 6.240e-01
```
### Removing the offending labels
Another approach is to repeat the analysis after removing DA clusters containing many cells.
This provides a clearer picture of the changes in abundance among the remaining clusters.
Here, we remove the extra-embryonic ectoderm and reset the total number of cells for all samples with `keep.lib.sizes=FALSE`.
```r
offenders <- "ExE ectoderm"
y.ab3 <- y.ab[setdiff(rownames(y.ab), offenders),, keep.lib.sizes=FALSE]
y.ab3$samples
```
```
## group lib.size norm.factors batch cell barcode sample stage
## 5 1 2268 1 5 cell_9769 AAACCTGAGACTGTAA 5 E8.5
## 6 1 993 1 6 cell_12180 AAACCTGCAGATGGCA 6 E8.5
## 7 1 2708 1 7 cell_13227 AAACCTGAGACAAGCC 7 E8.5
## 8 1 2749 1 8 cell_16234 AAACCTGCAAACCCAT 8 E8.5
## 9 1 4009 1 9 cell_19332 AAACCTGCAACGATCT 9 E8.5
## 10 1 6224 1 10 cell_23875 AAACCTGAGGCATGTG 10 E8.5
## tomato pool stage.mapped celltype.mapped closest.cell
## 5 TRUE 3 E8.25 Mesenchyme cell_24159
## 6 FALSE 3 E8.25 Somitic mesoderm cell_63247
## 7 TRUE 4 E8.5 Somitic mesoderm cell_25454
## 8 FALSE 4 E8.25 ExE mesoderm cell_139075
## 9 TRUE 5 E8.0 ExE mesoderm cell_116116
## 10 FALSE 5 E8.5 Forebrain/Midbrain/Hindbrain cell_39343
## doub.density sizeFactor label
## 5 0.029850 1.6349 1
## 6 0.291916 2.5981 12
## 7 0.601740 1.5939 11
## 8 0.004733 0.8707 17
## 9 0.079415 0.8933 3
## 10 0.040747 0.3947 7
```
```r
y.ab3 <- estimateDisp(y.ab3, design, trend="none")
fit.ab3 <- glmQLFit(y.ab3, design, robust=TRUE, abundance.trend=FALSE)
res3 <- glmQLFTest(fit.ab3, coef=ncol(design))
topTags(res3, n=10)
```
```
## Coefficient: factor(tomato)TRUE
## logFC logCPM F PValue FDR
## Mesenchyme 1.1381 16.32 17.499 0.0001278 0.00294
## Erythroid3 -0.6744 17.32 7.663 0.0080961 0.09311
## Neural crest -0.8427 14.80 5.840 0.0196905 0.11654
## Allantois 0.7255 15.54 5.783 0.0202671 0.11654
## Cardiomyocytes 0.7047 14.90 4.310 0.0435138 0.20016
## Endothelium 0.7220 14.32 3.482 0.0684233 0.26229
## Haematoendothelial progenitors 0.5758 14.76 2.815 0.1001708 0.32913
## Forebrain/Midbrain/Hindbrain -0.3313 16.59 1.654 0.2048299 0.58889
## Erythroid2 -0.3113 15.96 1.149 0.2893825 0.63508
## Pharyngeal mesoderm 0.3016 15.76 1.095 0.3007484 0.63508
```
A similar strategy can be used to focus on proportional changes within a single subpopulation of a very heterogeneous data set.
For example, if we collected a whole blood data set, we could subset to T cells and test for changes in T cell subtypes (memory, killer, regulatory, etc.) using the total number of T cells in each sample as the library size.
This avoids detecting changes in T cell subsets that are driven by compositional effects from changes in abundance of, say, B cells in the same sample.
### Testing against a log-fold change threshold
Here, we assume that composition bias introduces a spurious log~2~-fold change of no more than $\tau$ for a non-DA label.
This can be roughly interpreted as the maximum log-fold change in the total number of cells caused by DA in other labels.
(By comparison, fold-differences in the totals due to differences in capture efficiency or the size of the original cell population are not attributable to composition bias and should not be considered when choosing $\tau$.)
We then mitigate the effect of composition biases by testing each label for changes in abundance beyond $\tau$ [@mccarthy2009treat;@lun2017testing].
```r
res.lfc <- glmTreat(fit.ab, coef=ncol(design), lfc=1)
summary(decideTests(res.lfc))
```
```
## factor(tomato)TRUE
## Down 1
## NotSig 23
## Up 0
```
```r
topTags(res.lfc)
```
```
## Coefficient: factor(tomato)TRUE
## logFC unshrunk.logFC logCPM PValue
## ExE ectoderm -6.6233 -7.0539 13.02 3.158e-06
## Mesenchyme 1.1767 1.1774 16.29 1.038e-01
## Neural crest -0.8109 -0.8121 14.76 4.216e-01
## Endothelium 0.7572 0.7590 14.29 4.478e-01
## Cardiomyocytes 0.7306 0.7316 14.86 5.080e-01
## Def. endoderm 0.6113 0.6165 12.40 5.095e-01
## Allantois 0.7617 0.7624 15.51 5.102e-01
## Caudal Mesoderm -0.4881 -0.4935 12.09 6.106e-01
## Haematoendothelial progenitors 0.6129 0.6138 14.72 6.226e-01
## Erythroid3 -0.6352 -0.6353 17.28 7.500e-01
## FDR
## ExE ectoderm 0.0000758
## Mesenchyme 0.9733903
## Neural crest 0.9733903
## Endothelium 0.9733903
## Cardiomyocytes 0.9733903
## Def. endoderm 0.9733903
## Allantois 0.9733903
## Caudal Mesoderm 0.9733903
## Haematoendothelial progenitors 0.9733903
## Erythroid3 0.9733903
```
The choice of $\tau$ can be loosely motivated by external experimental data.
For example, if we observe a doubling of cell numbers in an _in vitro_ system after treatment, we might be inclined to set $\tau=1$.
This ensures that any non-DA subpopulation is not reported as being depleted after treatment.
Some caution is still required, though - even if the external numbers are accurate, we need to assume that cell capture efficiency is (on average) equal between conditions to justify their use as $\tau$.
And obviously, the use of a non-zero $\tau$ will reduce power to detect real changes when the composition bias is not present.
## Comments on interpretation
### DE or DA? Two sides of the same coin {#de-da-duality}
While useful, the distinction between DA and DE analyses is inherently artificial for scRNA-seq data.
This is because the labels used in the former are defined based on the genes to be tested in the latter.
To illustrate, consider a scRNA-seq experiment involving two biological conditions with several shared cell types.
We focus on a cell type $X$ that is present in both conditions but contains some DEGs between conditions.
This leads to two possible outcomes:
1. The DE between conditions causes $X$ to form two separate clusters (say, $X_1$ and $X_2$) in expression space.
This manifests as DA where $X_1$ is enriched in one condition and $X_2$ is enriched in the other condition.
2. The DE between conditions is not sufficient to split $X$ into two separate clusters,
e.g., because the data integration procedure identifies them as corresponding cell types and merges them together.
This means that the differences between conditions manifest as DE within the single cluster corresponding to $X$.
We have described the example above in terms of clustering, but the same arguments apply for any labelling strategy based on the expression profiles, e.g., automated cell type assignment ([Basic Chapter 7](http://bioconductor.org/books/3.19/OSCA.basic/cell-type-annotation.html#cell-type-annotation)).
Moreover, the choice between outcomes 1 and 2 is made implicitly by the combined effect of the data merging, clustering and label assignment procedures.
For example, differences between conditions are more likely to manifest as DE for coarser clusters and as DA for finer clusters, but this is difficult to predict reliably.
The moral of the story is that DA and DE analyses are simply two different perspectives on the same phenomena.
For any comprehensive characterization of differences between populations, it is usually necessary to consider both analyses.
Indeed, they complement each other almost by definition, e.g., clustering parameters that reduce DE will increase DA and vice versa.
### Sacrificing biology by integration {#sacrificing-differences}
Earlier in this chapter, we defined clusters from corrected values after applying `fastMNN()` to cells from all samples in the chimera dataset.
Alert readers may realize that this would result in the removal of biological differences between our conditions.
Any systematic difference in expression caused by injection would be treated as a batch effect and lost when cells from different samples are aligned to the same coordinate space.
Now, one may not consider injection to be an interesting biological effect, but the same reasoning applies for other conditions, e.g., integration of wild-type and knock-out samples (Section \@ref(ambient-problems)) would result in the loss of any knock-out effect in the corrected values.
This loss is both expected and desirable.
As we mentioned in Section \@ref(using-corrected-values), the main motivation for performing batch correction is to enable us to characterize population heterogeneity in a consistent manner across samples.
This remains true in situations with multiple conditions where we would like one set of clusters and annotations that can be used as common labels for the DE or DA analyses described above.
The alternative would be to cluster each condition separately and to attempt to identify matching clusters across conditions - not straightforward for poorly separated clusters in contexts like differentiation.
It may seem distressing to some that a (potentially very interesting) biological difference between conditions is lost during correction.
However, this concern is largely misplaced as the correction is only ever used for defining common clusters and annotations.
The DE analysis itself is performed on pseudo-bulk samples created from the uncorrected counts, preserving the biological difference and ensuring that it manifests in the list of DE genes for affected cell types.
Of course, if the DE is strong enough, it may result in a new condition-specific cluster that would be captured by a DA analysis as discussed in Section \@ref(de-da-duality).
One final consideration is the interaction of condition-specific expression with the assumptions of each batch correction method.
For example, MNN correction assumes that the differences between samples are orthogonal to the variation within samples.
Arguably, this assumption is becomes more questionable if the between-sample differences are biological in nature, e.g., a treatment effect that makes one cell type seem more transcriptionally similar to another may cause the wrong clusters to be aligned across conditions.
As usual, users will benefit from the diagnostics described in Chapter \@ref(integrating-datasets) and a healthy dose of skepticism.
## Session Info {-}