---
output:
html_document
bibliography: ref.bib
---
# Problems with ambient RNA {#ambient-problems}
## Background
Ambient contamination is a phenomenon that is generally most pronounced in massively multiplexed scRNA-seq protocols.
Briefly, extracellular RNA (most commonly released upon cell lysis) is captured along with each cell in its reaction chamber, contributing counts to genes that are not otherwise expressed in that cell (see [Advanced Section 7.2](http://bioconductor.org/books/3.19/OSCA.advanced/droplet-processing.html#qc-droplets)).
Differences in the ambient profile across samples are not uncommon when dealing with strong experimental perturbations where strong expression of a gene in a condition-specific cell type can "bleed over" into all other cell types in the same sample.
This is problematic for DE analyses between conditions, as DEGs detected for a particular cell type may be driven by differences in the ambient profiles rather than any intrinsic change in gene regulation.
To illustrate, we consider the _Tal1_-knockout (KO) chimera data from @pijuansala2019single.
This is very similar to the WT chimera dataset we previously examined, only differing in that the _Tal1_ gene was knocked out in the injected cells.
_Tal1_ is a transcription factor that has known roles in erythroid differentiation; the aim of the experiment was to determine if blocking of the erythroid lineage diverted cells to other developmental fates.
(To cut a long story short: yes, it did.)
```r
library(MouseGastrulationData)
sce.tal1 <- Tal1ChimeraData()
counts(sce.tal1) <- as(counts(sce.tal1), "CsparseMatrix")
library(scuttle)
rownames(sce.tal1) <- uniquifyFeatureNames(
rowData(sce.tal1)$ENSEMBL,
rowData(sce.tal1)$SYMBOL
)
sce.tal1
```
```
## class: SingleCellExperiment
## dim: 29453 56122
## metadata(0):
## assays(1): counts
## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(56122): cell_1 cell_2 ... cell_56121 cell_56122
## colData names(9): cell barcode ... pool sizeFactor
## reducedDimNames(1): pca.corrected
## mainExpName: NULL
## altExpNames(0):
```
We will perform a DE analysis between WT and KO cells labelled as "neural crest".
We observe that the strongest DEGs are the hemoglobins, which are downregulated in the injected cells.
This is rather surprising as these cells are distinct from the erythroid lineage and should not express hemoglobins at all.
The most sober explanation is that the background samples contain more hemoglobin transcripts in the ambient solution due to leakage from erythrocytes (or their precursors) during sorting and dissociation.
```r
library(scran)
summed.tal1 <- aggregateAcrossCells(sce.tal1,
ids=DataFrame(sample=sce.tal1$sample,
label=sce.tal1$celltype.mapped)
)
summed.tal1$block <- summed.tal1$sample %% 2 == 0 # Add blocking factor.
# Subset to our neural crest cells.
summed.neural <- summed.tal1[,summed.tal1$label=="Neural crest"]
summed.neural
```
```
## class: SingleCellExperiment
## dim: 29453 4
## metadata(0):
## assays(1): counts
## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames: NULL
## colData names(13): cell barcode ... ncells block
## reducedDimNames(1): pca.corrected
## mainExpName: NULL
## altExpNames(0):
```
```r
# Standard edgeR analysis, as described in previous chapters.
res.neural <- pseudoBulkDGE(summed.neural,
label=summed.neural$label,
design=~factor(block) + tomato,
coef="tomatoTRUE",
condition=summed.neural$tomato)
summarizeTestsPerLabel(decideTestsPerLabel(res.neural))
```
```
## -1 0 1 NA
## Neural crest 351 9820 479 18803
```
```r
# Summary of the direction of log-fold changes.
tab.neural <- res.neural[[1]]
tab.neural <- tab.neural[order(tab.neural$PValue),]
head(tab.neural, 10)
```
```
## DataFrame with 10 rows and 5 columns
## logFC logCPM F PValue FDR
##
## Xist -7.555614 8.21232 6439.909 0.00000e+00 0.00000e+00
## Hbb-bh1 -8.091096 9.15972 10479.986 0.00000e+00 0.00000e+00
## Hbb-y -8.415613 8.35705 7129.226 0.00000e+00 0.00000e+00
## Hba-x -7.724812 8.53284 7658.595 0.00000e+00 0.00000e+00
## Hba-a1 -8.596642 6.74429 2748.738 0.00000e+00 0.00000e+00
## Hba-a2 -8.866213 5.81300 1468.407 4.31982e-301 7.66768e-298
## Erdr1 1.892460 7.61593 1424.560 1.05681e-292 1.60787e-289
## Cdkn1c -8.864554 4.96097 799.181 1.34505e-169 1.79060e-166
## Uba52 -0.879675 8.38618 418.189 3.36978e-91 3.98758e-88
## Grb10 -1.402937 6.58314 409.715 2.00893e-89 2.13951e-86
```
As an aside, it is worth mentioning that the "replicates" in this study are more technical than biological,
so some exaggeration of the significance of the effects is to be expected.
Nonetheless, it is a useful dataset to demonstrate some strategies for mitigating issues caused by ambient contamination.
## Filtering out affected DEGs
### By estimating ambient contamination
As shown above, the presence of ambient contamination makes it difficult to interpret multi-condition DE analyses.
To mitigate its effects, we need to obtain an estimate of the ambient "expression" profile from the raw count matrix for each sample.
We follow the approach used in `emptyDrops()` [@lun2018distinguishing] and consider all barcodes with total counts below 100 to represent empty droplets.
We then sum the counts for each gene across these barcodes to obtain an expression vector representing the ambient profile for each sample.
```r
library(DropletUtils)
ambient <- vector("list", ncol(summed.neural))
# Looping over all raw (unfiltered) count matrices and
# computing the ambient profile based on its low-count barcodes.
# Turning off rounding, as we know this is count data.
for (s in seq_along(ambient)) {
raw.tal1 <- Tal1ChimeraData(type="raw", samples=s)[[1]]
counts(raw.tal1) <- as(counts(raw.tal1), "CsparseMatrix")
ambient[[s]] <- ambientProfileEmpty(counts(raw.tal1),
good.turing=FALSE, round=FALSE)
}
# Cleaning up the output for pretty printing.
ambient <- do.call(cbind, ambient)
colnames(ambient) <- seq_len(ncol(ambient))
rownames(ambient) <- uniquifyFeatureNames(
rowData(raw.tal1)$ENSEMBL,
rowData(raw.tal1)$SYMBOL
)
head(ambient)
```
```
## 1 2 3 4
## Xkr4 1 0 0 0
## Gm1992 0 0 0 0
## Gm37381 1 0 1 0
## Rp1 0 1 0 1
## Sox17 76 76 31 53
## Gm37323 0 0 0 0
```
For each sample, we determine the maximum proportion of the count for each gene that could be attributed to ambient contamination.
This is done by scaling the ambient profile in `ambient` to obtain a per-gene expected count from ambient contamination, with which we compute the $p$-value for observing a count equal to or lower than that in `summed.neural`.
We perform this for a range of scaling factors and identify the largest factor that yields a $p$-value above a given threshold.
The scaled ambient profile represents the upper bound of the contribution to each sample from ambient contamination.
We deliberately use an upper bound so that our next step will aggressively remove any gene that is potentially problematic.
```r
max.ambient <- ambientContribMaximum(counts(summed.neural),
ambient, mode="proportion")
head(max.ambient)
```
```
## [,1] [,2] [,3] [,4]
## Xkr4 NaN NaN NaN NaN
## Gm1992 NaN NaN NaN NaN
## Gm37381 NaN NaN NaN NaN
## Rp1 NaN NaN NaN NaN
## Sox17 0.1775 0.1833 0.468 1
## Gm37323 NaN NaN NaN NaN
```
Genes in which over 10% of the counts are ambient-derived are subsequently discarded from our analysis.
For balanced designs, this threshold prevents ambient contribution from biasing the true fold-change by more than 10%, which is a tolerable margin of error for most applications.
(Unbalanced designs may warrant the use of a weighted average to account for sample size differences between groups.)
This approach yields a slightly smaller list of DEGs without the hemoglobins, which is encouraging as it suggests that any other, less obvious effects of ambient contamination have also been removed.
```r
# Averaging the ambient contribution across samples.
contamination <- rowMeans(max.ambient, na.rm=TRUE)
non.ambient <- contamination <= 0.1
summary(non.ambient)
```
```
## Mode FALSE TRUE NA's
## logical 1475 15306 12672
```
```r
okay.genes <- names(non.ambient)[which(non.ambient)]
tab.neural2 <- tab.neural[rownames(tab.neural) %in% okay.genes,]
table(Direction=tab.neural2$logFC > 0, Significant=tab.neural2$FDR <= 0.05)
```
```
## Significant
## Direction FALSE TRUE
## FALSE 4818 317
## TRUE 4785 450
```
```r
head(tab.neural2, 10)
```
```
## DataFrame with 10 rows and 5 columns
## logFC logCPM F PValue FDR
##
## Xist -7.555614 8.21232 6439.909 0.00000e+00 0.00000e+00
## Erdr1 1.892460 7.61593 1424.560 1.05681e-292 1.60787e-289
## Uba52 -0.879675 8.38618 418.189 3.36978e-91 3.98758e-88
## Grb10 -1.402937 6.58314 409.715 2.00893e-89 2.13951e-86
## Fdps 0.981454 7.21805 347.794 2.09662e-76 2.02991e-73
## Gt(ROSA)26Sor 1.481292 5.71617 340.040 9.06070e-75 8.04137e-72
## Mest 0.549361 10.98269 317.909 4.29608e-70 3.51948e-67
## Mcts2 1.137657 6.42689 309.522 2.55733e-68 1.94540e-65
## Impact 1.396653 5.71801 304.143 3.52176e-67 2.50045e-64
## H13 -1.481673 5.90902 296.431 1.51643e-65 1.00937e-62
```
A softer approach is to simply report the average contaminating percentage for each gene in the table of DE statistics.
Readers can then make up their own minds as to whether a particular DEG's effect is driven by ambient contamination.
Indeed, it is worth remembering that `maximumAmbience()` will report the maximum possible contamination rather than attempting to estimate the actual level of contamination, and filtering on the former may be too conservative.
This is especially true for cell populations that are contributing to the differences in the ambient pool; in the most extreme case, the reported maximum contamination would be 100% for cell types with an expression profile that is identical to the ambient pool.
```r
tab.neural3 <- tab.neural
tab.neural3$contamination <- contamination[rownames(tab.neural3)]
head(tab.neural3)
```
```
## DataFrame with 6 rows and 6 columns
## logFC logCPM F PValue FDR contamination
##
## Xist -7.55561 8.21232 6439.91 0.00000e+00 0.00000e+00 0.0605735
## Hbb-bh1 -8.09110 9.15972 10479.99 0.00000e+00 0.00000e+00 0.9900717
## Hbb-y -8.41561 8.35705 7129.23 0.00000e+00 0.00000e+00 0.9674483
## Hba-x -7.72481 8.53284 7658.59 0.00000e+00 0.00000e+00 0.9945348
## Hba-a1 -8.59664 6.74429 2748.74 0.00000e+00 0.00000e+00 0.8626846
## Hba-a2 -8.86621 5.81300 1468.41 4.31982e-301 7.66768e-298 0.7351403
```
### With prior knowledge
Another strategy to estimating the ambient proportions involves the use of prior knowledge of mutually exclusive gene expression profiles [@young2018soupx].
In this case, we assume (reasonably) that hemoglobins should not be expressed in neural crest cells and use this to estimate the contamination in each sample.
This is achieved with the `controlAmbience()` function, which scales the ambient profile so that the hemoglobin coverage is the same as the corresponding sample of `summed.neural`.
From these profiles, we compute proportions of ambient contamination that are used to mark or filter out affected genes in the same manner as described above.
```r
is.hbb <- grep("^Hb[ab]-", rownames(summed.neural))
ctrl.ambient <- ambientContribNegative(counts(summed.neural), ambient,
features=is.hbb, mode="proportion")
head(ctrl.ambient)
```
```
## [,1] [,2] [,3] [,4]
## Xkr4 NaN NaN NaN NaN
## Gm1992 NaN NaN NaN NaN
## Gm37381 NaN NaN NaN NaN
## Rp1 NaN NaN NaN NaN
## Sox17 0.06774 0.08798 0.4796 1
## Gm37323 NaN NaN NaN NaN
```
```r
ctrl.non.ambient <- rowMeans(ctrl.ambient, na.rm=TRUE) <= 0.1
summary(ctrl.non.ambient)
```
```
## Mode FALSE TRUE NA's
## logical 1388 15393 12672
```
```r
okay.genes <- names(ctrl.non.ambient)[which(ctrl.non.ambient)]
tab.neural4 <- tab.neural[rownames(tab.neural) %in% okay.genes,]
head(tab.neural4)
```
```
## DataFrame with 6 rows and 5 columns
## logFC logCPM F PValue FDR
##
## Xist -7.555614 8.21232 6439.909 0.00000e+00 0.00000e+00
## Erdr1 1.892460 7.61593 1424.560 1.05681e-292 1.60787e-289
## Uba52 -0.879675 8.38618 418.189 3.36978e-91 3.98758e-88
## Grb10 -1.402937 6.58314 409.715 2.00893e-89 2.13951e-86
## Fdps 0.981454 7.21805 347.794 2.09662e-76 2.02991e-73
## Gt(ROSA)26Sor 1.481292 5.71617 340.040 9.06070e-75 8.04137e-72
```
Any highly expressed cell type-specific gene is a candidate for this procedure,
most typically in cell types that are highly specialized towards manufacturing a protein product.
Aside from hemoglobin, we could use immunoglobulins in populations containing B cells,
or insulin and glucagon in pancreas datasets ([Advanced Figure 6.3](http://bioconductor.org/books/3.19/OSCA.advanced/marker-detection-redux.html#fig:viol-gcg-lawlor)).
The experimental setting may also provide some genes that must only be present in the ambient solution;
for example, the mitochondrial transcripts can be used to estimate ambient contamination in single-nucleus RNA-seq,
while _Xist_ can be used for datasets involving mixtures of male and female cells
(where the contaminating percentages are estimated from the profiles of male cells only).
If appropriate control features are available, this approach allows us to obtain a more accurate estimate of the contamination in each pseudo-bulk sample compared to the upper bound provided by `maximumAmbience()`.
This avoids the removal of genuine DEGs due to overestimation fo the ambient contamination from the latter.
However, the performance of this approach is fully dependent on the suitability of the control features - if a "control" feature is actually genuinely expressed in a cell type, the ambient contribution will be overestimated.
A simple mitigating strategy is to simply take the lower of the proportions from `controlAmbience()` and `maximumAmbience()`, with the idea being that the latter will avoid egregious overestimation when the control set is misspecified.
### Without an ambient profile
An estimate of the ambient profile is rarely available for public datasets where only the per-cell count matrices are provided.
In such cases, we must instead use the rest of the dataset to infer something about the effects of ambient contamination.
The most obvious approach is construct a proxy ambient profile by summing the counts for all cells from each sample, which can be used in place of the actual profile in the previous calculations.
```r
proxy.ambient <- aggregateAcrossCells(summed.tal1,
ids=summed.tal1$sample)
# Using 'proxy.ambient' instead of the estimaed 'ambient'.
max.ambient.proxy <- ambientContribMaximum(counts(summed.neural),
counts(proxy.ambient), mode="proportion")
head(max.ambient.proxy)
```
```
## [,1] [,2] [,3] [,4]
## Xkr4 NaN NaN NaN NaN
## Gm1992 NaN NaN NaN NaN
## Gm37381 NaN NaN NaN NaN
## Rp1 NaN NaN NaN NaN
## Sox17 0.7427 0.9891 0.5283 0.9067
## Gm37323 NaN NaN NaN NaN
```
```r
con.ambient.proxy <- ambientContribNegative(counts(summed.neural),
counts(proxy.ambient), features=is.hbb, mode="proportion")
head(con.ambient.proxy)
```
```
## [,1] [,2] [,3] [,4]
## Xkr4 NaN NaN NaN NaN
## Gm1992 NaN NaN NaN NaN
## Gm37381 NaN NaN NaN NaN
## Rp1 NaN NaN NaN NaN
## Sox17 1 1 0.6032 1
## Gm37323 NaN NaN NaN NaN
```
This assumes equal contributions from all labels to the ambient pool, which is not entirely unrealistic (Figure \@ref(fig:proxy-ambience)) though some discrepancies can be expected due to the presence of particularly fragile cell types or extracellular RNA.
```r
par(mfrow=c(2,2))
for (i in seq_len(ncol(proxy.ambient))) {
true <- ambient[,i]
proxy <- assay(proxy.ambient)[,i]
logged <- edgeR::cpm(cbind(proxy, true), log=TRUE, prior.count=2)
logFC <- logged[,1] - logged[,2]
abundance <- rowMeans(logged)
plot(abundance, logFC, main=paste("Sample", i))
}
```
(\#fig:proxy-ambience)MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the _Tal1_ chimera dataset.
Alternatively, we may choose to mitigate the effect of ambient contamination by focusing on label-specific DEGs.
Contamination-driven DEGs should be systematically present in comparisons for all labels, and thus can be eliminated by simply ignoring all genes that are significant in a majority of these comparisons (Section \@ref(cross-label-meta-analyses)).
The obvious drawback of this approach is that it discounts genuine DEGs that have a consistent effect in most/all labels, though one could perhaps argue that such "global" DEGs are not particularly interesting anyway.
It is also complicated by fluctuations in detection power across comparisons involving different numbers of cells - or replicates, after filtering pseudo-bulk profiles by the number of cells.
```r
res.tal1 <- pseudoBulkSpecific(summed.tal1,
label=summed.tal1$label,
design=~factor(block) + tomato,
coef="tomatoTRUE",
condition=summed.tal1$tomato)
# Inspecting our neural crest results again.
tab.neural.again <- res.tal1[["Neural crest"]]
head(tab.neural.again[order(tab.neural.again$PValue),], 10)
```
```
## DataFrame with 10 rows and 6 columns
## logFC logCPM F PValue FDR
##
## Fdps 0.981454 7.21805 347.7937 2.09662e-76 2.23290e-72
## Msmo1 1.493754 5.43923 290.1148 5.37126e-62 2.86019e-58
## Hmgcs1 1.250254 5.70837 243.4674 2.74677e-54 9.75105e-51
## Idi1 1.173814 5.37688 174.9294 1.27276e-39 3.38874e-36
## Gt(ROSA)26Sor 1.481292 5.71617 340.0404 2.53519e-31 5.39996e-28
## Sox9 0.537660 7.17373 102.4739 5.61048e-24 9.95861e-21
## Nkd1 0.719092 5.92690 92.9524 4.34496e-20 6.61054e-17
## Insig1 1.257329 4.06887 82.7426 1.10092e-19 1.46559e-16
## Fdft1 0.841060 5.32293 87.1582 3.61959e-19 4.28318e-16
## Acat2 0.508872 6.80012 76.0489 3.17888e-18 3.38551e-15
## OtherAverage
##
## Fdps -0.0902359
## Msmo1 0.0269848
## Hmgcs1 -0.0616961
## Idi1 -0.0820029
## Gt(ROSA)26Sor 0.5408830
## Sox9 -0.0426835
## Nkd1 0.0332958
## Insig1 -0.2879254
## Fdft1 0.0336545
## Acat2 -0.0520168
```
```r
# By comparison, the hemoglobins are all the way at the bottom.
head(tab.neural.again[is.hbb,], 10)
```
```
## DataFrame with 8 rows and 6 columns
## logFC logCPM F PValue FDR OtherAverage
##
## Hbb-bt -7.76723 1.33059 57.2819 1.0000000 1.000000 -7.86818
## Hbb-bs -5.84809 3.42835 236.0230 1.0000000 1.000000 -8.15478
## Hbb-bh2 NA NA NA NA NA -8.98450
## Hbb-bh1 -8.09110 9.15972 10479.9859 0.9298044 1.000000 -8.08037
## Hbb-y -8.41561 8.35705 7129.2261 0.4498399 1.000000 -8.21261
## Hba-x -7.72481 8.53284 7658.5945 0.0953835 0.819654 -7.38506
## Hba-a1 -8.59664 6.74429 2748.7378 0.2942747 1.000000 -8.05722
## Hba-a2 -8.86621 5.81300 1468.4069 0.2377481 1.000000 -7.96511
```
The common theme here is that, in the absence of an ambient profile, we are using all labels as a proxy for the ambient effect.
This can have unpredictable consequences as the results for each label are now dependent on the behavior of the entire dataset.
For example, the metrics are susceptible to the idiosyncrasies of clustering where one cell type may be represented in multple related clusters that distort the percentages in `up.de` and `down.de` or the average log-fold change.
The metrics may also be invalidated in analyses of a subset of the data - for example, a subclustering analysis focusing on a particular cell type may mark all relevant DEGs as problematic because they are consistently DE in all subtypes.
## Subtracting ambient counts
It is worth commenting on the seductive idea of subtracting the ambient counts from the pseudo-bulk samples.
This may seem like the most obvious approach for removing ambient contamination, but unfortunately, subtracted counts have unpredictable statistical properties due the distortion of the mean-variance relationship.
Minor relative fluctuations at very large counts become large fold-changes after subtraction, manifesting as spurious DE in genes where a substantial proportion of counts is derived from the ambient solution.
For example, several hemoglobin genes retain strong DE even after subtraction of the scaled ambient profile.
```r
scaled.ambient <- controlAmbience(counts(summed.neural), ambient,
features=is.hbb, mode="profile")
subtracted <- counts(summed.neural) - scaled.ambient
subtracted <- round(subtracted)
subtracted[subtracted < 0] <- 0
subtracted[is.hbb,]
```
```
## [,1] [,2] [,3] [,4]
## Hbb-bt 0 0 7 18
## Hbb-bs 1 2 31 42
## Hbb-bh2 0 0 0 0
## Hbb-bh1 2 0 0 0
## Hbb-y 0 0 39 107
## Hba-x 1 1 0 0
## Hba-a1 0 0 365 452
## Hba-a2 0 0 314 329
```
Another tempting approach is to use interaction models to implicitly subtract the ambient effect during GLM fitting.
The assumption is that, for a genuine DEG, the log-fold change within cells is larger in magnitude than that in the ambient solution.
This is based on the expectation that any DE in the latter is "diluted" by contributions from cell types where that gene is not DE.
Unfortunately, this is not always the case; a DE analysis of the ambient counts indicates that the hemoglobin log-fold change is actually stronger in the neural crest cells compared to the ambient solution, which leads to the rather awkward conclusion that the WT neural crest cells are expressing hemoglobin beyond that explained by ambient contamination.
(This is probably an artifact of how cell calling is performed.)
```r
library(edgeR)
y.ambient <- DGEList(ambient, samples=colData(summed.neural))
y.ambient <- y.ambient[filterByExpr(y.ambient, group=y.ambient$samples$tomato),]
y.ambient <- calcNormFactors(y.ambient)
design <- model.matrix(~factor(block) + tomato, y.ambient$samples)
y.ambient <- estimateDisp(y.ambient, design)
fit.ambient <- glmQLFit(y.ambient, design, robust=TRUE)
res.ambient <- glmQLFTest(fit.ambient, coef=ncol(design))
summary(decideTests(res.ambient))
```
```
## tomatoTRUE
## Down 1882
## NotSig 7717
## Up 1639
```
```r
topTags(res.ambient, n=10)
```
```
## Coefficient: tomatoTRUE
## logFC logCPM F PValue FDR
## Hbb-y -5.267 12.803 14972 1.344e-57 1.510e-53
## Hbb-bh1 -5.075 13.725 13520 1.272e-56 7.148e-53
## Hba-x -4.827 13.122 13120 2.469e-56 9.249e-53
## Hba-a1 -4.662 10.734 10625 2.567e-54 7.213e-51
## Hba-a2 -4.521 9.480 7956 1.491e-51 3.350e-48
## Blvrb -4.319 7.649 3965 6.417e-45 1.202e-41
## Xist -4.376 7.484 3796 1.666e-44 2.675e-41
## Gypa -5.138 7.213 3754 2.122e-44 2.981e-41
## Car2 -3.499 8.534 4089 8.048e-44 1.005e-40
## Hbb-bs -4.941 7.209 3514 8.991e-44 1.010e-40
```
In addition, there are other issues with implicit subtraction in the fitted GLM that warrant caution with its use.
This strategy precludes detection of DEGs that are common to all cell types as there is no longer a dilution effect being applied to the log-fold change in the ambient solution.
It requires inclusion of the ambient profiles in the model, which is cause for at least some concern as they are unlikely to have the same degree of variability as the cell-derived pseudo-bulk profiles.
Interpretation is also complicated by the fact that we are only interested in log-fold changes that are more extreme in the cells compared to the ambient solution; a non-zero interaction term is not sufficient for removing spurious DE.
See also comments in [Advanced Section 7.3](http://bioconductor.org/books/3.19/OSCA.advanced/droplet-processing.html#removing-ambient-contamination) for more comments on the removal of ambient contamination, mostly for visualization purposes.
## Session Info {-}