Chapter 5 Problems with ambient RNA

5.1 Background

Ambient contamination is a phenomenon that is generally most pronounced in massively multiplexed scRNA-seq protocols. Briefly, extracellular RNA (most commonly released upon cell lysis) is captured along with each cell in its reaction chamber, contributing counts to genes that are not otherwise expressed in that cell (see Advanced Section 7.2). Differences in the ambient profile across samples are not uncommon when dealing with strong experimental perturbations where strong expression of a gene in a condition-specific cell type can “bleed over” into all other cell types in the same sample. This is problematic for DE analyses between conditions, as DEGs detected for a particular cell type may be driven by differences in the ambient profiles rather than any intrinsic change in gene regulation.

To illustrate, we consider the Tal1-knockout (KO) chimera data from Pijuan-Sala et al. (2019). This is very similar to the WT chimera dataset we previously examined, only differing in that the Tal1 gene was knocked out in the injected cells. Tal1 is a transcription factor that has known roles in erythroid differentiation; the aim of the experiment was to determine if blocking of the erythroid lineage diverted cells to other developmental fates. (To cut a long story short: yes, it did.)

library(MouseGastrulationData)
sce.tal1 <- Tal1ChimeraData()
counts(sce.tal1) <- as(counts(sce.tal1), "CsparseMatrix") 

library(scuttle)
rownames(sce.tal1) <- uniquifyFeatureNames(
    rowData(sce.tal1)$ENSEMBL, 
    rowData(sce.tal1)$SYMBOL
)
sce.tal1
## class: SingleCellExperiment 
## dim: 29453 56122 
## metadata(0):
## assays(1): counts
## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(56122): cell_1 cell_2 ... cell_56121 cell_56122
## colData names(9): cell barcode ... pool sizeFactor
## reducedDimNames(1): pca.corrected
## mainExpName: NULL
## altExpNames(0):

We will perform a DE analysis between WT and KO cells labelled as “neural crest”. We observe that the strongest DEGs are the hemoglobins, which are downregulated in the injected cells. This is rather surprising as these cells are distinct from the erythroid lineage and should not express hemoglobins at all. The most sober explanation is that the background samples contain more hemoglobin transcripts in the ambient solution due to leakage from erythrocytes (or their precursors) during sorting and dissociation.

library(scran)
summed.tal1 <- aggregateAcrossCells(sce.tal1, 
    ids=DataFrame(sample=sce.tal1$sample,
        label=sce.tal1$celltype.mapped)
)
summed.tal1$block <- summed.tal1$sample %% 2 == 0 # Add blocking factor.

# Subset to our neural crest cells.
summed.neural <- summed.tal1[,summed.tal1$label=="Neural crest"]
summed.neural
## class: SingleCellExperiment 
## dim: 29453 4 
## metadata(0):
## assays(1): counts
## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames: NULL
## colData names(13): cell barcode ... ncells block
## reducedDimNames(1): pca.corrected
## mainExpName: NULL
## altExpNames(0):
# Standard edgeR analysis, as described in previous chapters.
res.neural <- pseudoBulkDGE(summed.neural, 
    label=summed.neural$label,
    design=~factor(block) + tomato,
    coef="tomatoTRUE",
    condition=summed.neural$tomato)
summarizeTestsPerLabel(decideTestsPerLabel(res.neural))
##               -1    0   1    NA
## Neural crest 351 9820 479 18803
# Summary of the direction of log-fold changes.
tab.neural <- res.neural[[1]]
tab.neural <- tab.neural[order(tab.neural$PValue),]
head(tab.neural, 10)
## DataFrame with 10 rows and 5 columns
##             logFC    logCPM         F       PValue          FDR
##         <numeric> <numeric> <numeric>    <numeric>    <numeric>
## Xist    -7.555614   8.21232  6439.909  0.00000e+00  0.00000e+00
## Hbb-bh1 -8.091096   9.15972 10479.986  0.00000e+00  0.00000e+00
## Hbb-y   -8.415613   8.35705  7129.226  0.00000e+00  0.00000e+00
## Hba-x   -7.724812   8.53284  7658.595  0.00000e+00  0.00000e+00
## Hba-a1  -8.596642   6.74429  2748.738  0.00000e+00  0.00000e+00
## Hba-a2  -8.866213   5.81300  1468.407 4.31982e-301 7.66768e-298
## Erdr1    1.892460   7.61593  1424.560 1.05681e-292 1.60787e-289
## Cdkn1c  -8.864554   4.96097   799.181 1.34505e-169 1.79060e-166
## Uba52   -0.879675   8.38618   418.189  3.36978e-91  3.98758e-88
## Grb10   -1.402937   6.58314   409.715  2.00893e-89  2.13951e-86

As an aside, it is worth mentioning that the “replicates” in this study are more technical than biological, so some exaggeration of the significance of the effects is to be expected. Nonetheless, it is a useful dataset to demonstrate some strategies for mitigating issues caused by ambient contamination.

5.2 Filtering out affected DEGs

5.2.1 By estimating ambient contamination

As shown above, the presence of ambient contamination makes it difficult to interpret multi-condition DE analyses. To mitigate its effects, we need to obtain an estimate of the ambient “expression” profile from the raw count matrix for each sample. We follow the approach used in emptyDrops() (Lun et al. 2019) and consider all barcodes with total counts below 100 to represent empty droplets. We then sum the counts for each gene across these barcodes to obtain an expression vector representing the ambient profile for each sample.

library(DropletUtils)
ambient <- vector("list", ncol(summed.neural))

# Looping over all raw (unfiltered) count matrices and
# computing the ambient profile based on its low-count barcodes.
# Turning off rounding, as we know this is count data.
for (s in seq_along(ambient)) {
    raw.tal1 <- Tal1ChimeraData(type="raw", samples=s)[[1]]
    counts(raw.tal1) <- as(counts(raw.tal1), "CsparseMatrix")
    ambient[[s]] <- ambientProfileEmpty(counts(raw.tal1), 
        good.turing=FALSE, round=FALSE)
}

# Cleaning up the output for pretty printing.
ambient <- do.call(cbind, ambient)
colnames(ambient) <- seq_len(ncol(ambient))
rownames(ambient) <- uniquifyFeatureNames(
    rowData(raw.tal1)$ENSEMBL, 
    rowData(raw.tal1)$SYMBOL
)
head(ambient)
##          1  2  3  4
## Xkr4     1  0  0  0
## Gm1992   0  0  0  0
## Gm37381  1  0  1  0
## Rp1      0  1  0  1
## Sox17   76 76 31 53
## Gm37323  0  0  0  0

For each sample, we determine the maximum proportion of the count for each gene that could be attributed to ambient contamination. This is done by scaling the ambient profile in ambient to obtain a per-gene expected count from ambient contamination, with which we compute the \(p\)-value for observing a count equal to or lower than that in summed.neural. We perform this for a range of scaling factors and identify the largest factor that yields a \(p\)-value above a given threshold. The scaled ambient profile represents the upper bound of the contribution to each sample from ambient contamination. We deliberately use an upper bound so that our next step will aggressively remove any gene that is potentially problematic.

max.ambient <- ambientContribMaximum(counts(summed.neural), 
    ambient, mode="proportion")
head(max.ambient)
##           [,1]   [,2]  [,3] [,4]
## Xkr4       NaN    NaN   NaN  NaN
## Gm1992     NaN    NaN   NaN  NaN
## Gm37381    NaN    NaN   NaN  NaN
## Rp1        NaN    NaN   NaN  NaN
## Sox17   0.1775 0.1833 0.468    1
## Gm37323    NaN    NaN   NaN  NaN

Genes in which over 10% of the counts are ambient-derived are subsequently discarded from our analysis. For balanced designs, this threshold prevents ambient contribution from biasing the true fold-change by more than 10%, which is a tolerable margin of error for most applications. (Unbalanced designs may warrant the use of a weighted average to account for sample size differences between groups.) This approach yields a slightly smaller list of DEGs without the hemoglobins, which is encouraging as it suggests that any other, less obvious effects of ambient contamination have also been removed.

# Averaging the ambient contribution across samples.
contamination <- rowMeans(max.ambient, na.rm=TRUE)
non.ambient <- contamination <= 0.1
summary(non.ambient)
##    Mode   FALSE    TRUE    NA's 
## logical    1475   15306   12672
okay.genes <- names(non.ambient)[which(non.ambient)]
tab.neural2 <- tab.neural[rownames(tab.neural) %in% okay.genes,]

table(Direction=tab.neural2$logFC > 0, Significant=tab.neural2$FDR <= 0.05)
##          Significant
## Direction FALSE TRUE
##     FALSE  4818  317
##     TRUE   4785  450
head(tab.neural2, 10)
## DataFrame with 10 rows and 5 columns
##                   logFC    logCPM         F       PValue          FDR
##               <numeric> <numeric> <numeric>    <numeric>    <numeric>
## Xist          -7.555614   8.21232  6439.909  0.00000e+00  0.00000e+00
## Erdr1          1.892460   7.61593  1424.560 1.05681e-292 1.60787e-289
## Uba52         -0.879675   8.38618   418.189  3.36978e-91  3.98758e-88
## Grb10         -1.402937   6.58314   409.715  2.00893e-89  2.13951e-86
## Fdps           0.981454   7.21805   347.794  2.09662e-76  2.02991e-73
## Gt(ROSA)26Sor  1.481292   5.71617   340.040  9.06070e-75  8.04137e-72
## Mest           0.549361  10.98269   317.909  4.29608e-70  3.51948e-67
## Mcts2          1.137657   6.42689   309.522  2.55733e-68  1.94540e-65
## Impact         1.396653   5.71801   304.143  3.52176e-67  2.50045e-64
## H13           -1.481673   5.90902   296.431  1.51643e-65  1.00937e-62

A softer approach is to simply report the average contaminating percentage for each gene in the table of DE statistics. Readers can then make up their own minds as to whether a particular DEG’s effect is driven by ambient contamination. Indeed, it is worth remembering that maximumAmbience() will report the maximum possible contamination rather than attempting to estimate the actual level of contamination, and filtering on the former may be too conservative. This is especially true for cell populations that are contributing to the differences in the ambient pool; in the most extreme case, the reported maximum contamination would be 100% for cell types with an expression profile that is identical to the ambient pool.

tab.neural3 <- tab.neural
tab.neural3$contamination <- contamination[rownames(tab.neural3)]
head(tab.neural3)
## DataFrame with 6 rows and 6 columns
##             logFC    logCPM         F       PValue          FDR contamination
##         <numeric> <numeric> <numeric>    <numeric>    <numeric>     <numeric>
## Xist     -7.55561   8.21232   6439.91  0.00000e+00  0.00000e+00     0.0605735
## Hbb-bh1  -8.09110   9.15972  10479.99  0.00000e+00  0.00000e+00     0.9900717
## Hbb-y    -8.41561   8.35705   7129.23  0.00000e+00  0.00000e+00     0.9674483
## Hba-x    -7.72481   8.53284   7658.59  0.00000e+00  0.00000e+00     0.9945348
## Hba-a1   -8.59664   6.74429   2748.74  0.00000e+00  0.00000e+00     0.8626846
## Hba-a2   -8.86621   5.81300   1468.41 4.31982e-301 7.66768e-298     0.7351403

5.2.2 With prior knowledge

Another strategy to estimating the ambient proportions involves the use of prior knowledge of mutually exclusive gene expression profiles (Young and Behjati 2018). In this case, we assume (reasonably) that hemoglobins should not be expressed in neural crest cells and use this to estimate the contamination in each sample. This is achieved with the controlAmbience() function, which scales the ambient profile so that the hemoglobin coverage is the same as the corresponding sample of summed.neural. From these profiles, we compute proportions of ambient contamination that are used to mark or filter out affected genes in the same manner as described above.

is.hbb <- grep("^Hb[ab]-", rownames(summed.neural))
ctrl.ambient <- ambientContribNegative(counts(summed.neural), ambient,
    features=is.hbb,  mode="proportion")
head(ctrl.ambient)
##            [,1]    [,2]   [,3] [,4]
## Xkr4        NaN     NaN    NaN  NaN
## Gm1992      NaN     NaN    NaN  NaN
## Gm37381     NaN     NaN    NaN  NaN
## Rp1         NaN     NaN    NaN  NaN
## Sox17   0.06774 0.08798 0.4796    1
## Gm37323     NaN     NaN    NaN  NaN
ctrl.non.ambient <- rowMeans(ctrl.ambient, na.rm=TRUE) <= 0.1
summary(ctrl.non.ambient)
##    Mode   FALSE    TRUE    NA's 
## logical    1388   15393   12672
okay.genes <- names(ctrl.non.ambient)[which(ctrl.non.ambient)]
tab.neural4 <- tab.neural[rownames(tab.neural) %in% okay.genes,]
head(tab.neural4)
## DataFrame with 6 rows and 5 columns
##                   logFC    logCPM         F       PValue          FDR
##               <numeric> <numeric> <numeric>    <numeric>    <numeric>
## Xist          -7.555614   8.21232  6439.909  0.00000e+00  0.00000e+00
## Erdr1          1.892460   7.61593  1424.560 1.05681e-292 1.60787e-289
## Uba52         -0.879675   8.38618   418.189  3.36978e-91  3.98758e-88
## Grb10         -1.402937   6.58314   409.715  2.00893e-89  2.13951e-86
## Fdps           0.981454   7.21805   347.794  2.09662e-76  2.02991e-73
## Gt(ROSA)26Sor  1.481292   5.71617   340.040  9.06070e-75  8.04137e-72

Any highly expressed cell type-specific gene is a candidate for this procedure, most typically in cell types that are highly specialized towards manufacturing a protein product. Aside from hemoglobin, we could use immunoglobulins in populations containing B cells, or insulin and glucagon in pancreas datasets (Advanced Figure 6.3). The experimental setting may also provide some genes that must only be present in the ambient solution; for example, the mitochondrial transcripts can be used to estimate ambient contamination in single-nucleus RNA-seq, while Xist can be used for datasets involving mixtures of male and female cells (where the contaminating percentages are estimated from the profiles of male cells only).

If appropriate control features are available, this approach allows us to obtain a more accurate estimate of the contamination in each pseudo-bulk sample compared to the upper bound provided by maximumAmbience(). This avoids the removal of genuine DEGs due to overestimation fo the ambient contamination from the latter. However, the performance of this approach is fully dependent on the suitability of the control features - if a “control” feature is actually genuinely expressed in a cell type, the ambient contribution will be overestimated. A simple mitigating strategy is to simply take the lower of the proportions from controlAmbience() and maximumAmbience(), with the idea being that the latter will avoid egregious overestimation when the control set is misspecified.

5.2.3 Without an ambient profile

An estimate of the ambient profile is rarely available for public datasets where only the per-cell count matrices are provided. In such cases, we must instead use the rest of the dataset to infer something about the effects of ambient contamination. The most obvious approach is construct a proxy ambient profile by summing the counts for all cells from each sample, which can be used in place of the actual profile in the previous calculations.

proxy.ambient <- aggregateAcrossCells(summed.tal1,
    ids=summed.tal1$sample)

# Using 'proxy.ambient' instead of the estimaed 'ambient'.
max.ambient.proxy <- ambientContribMaximum(counts(summed.neural), 
    counts(proxy.ambient), mode="proportion")
head(max.ambient.proxy)
##           [,1]   [,2]   [,3]   [,4]
## Xkr4       NaN    NaN    NaN    NaN
## Gm1992     NaN    NaN    NaN    NaN
## Gm37381    NaN    NaN    NaN    NaN
## Rp1        NaN    NaN    NaN    NaN
## Sox17   0.7427 0.9891 0.5283 0.9067
## Gm37323    NaN    NaN    NaN    NaN
con.ambient.proxy <- ambientContribNegative(counts(summed.neural), 
    counts(proxy.ambient), features=is.hbb,  mode="proportion")
head(con.ambient.proxy)
##         [,1] [,2]   [,3] [,4]
## Xkr4     NaN  NaN    NaN  NaN
## Gm1992   NaN  NaN    NaN  NaN
## Gm37381  NaN  NaN    NaN  NaN
## Rp1      NaN  NaN    NaN  NaN
## Sox17      1    1 0.6032    1
## Gm37323  NaN  NaN    NaN  NaN

This assumes equal contributions from all labels to the ambient pool, which is not entirely unrealistic (Figure 5.1) though some discrepancies can be expected due to the presence of particularly fragile cell types or extracellular RNA.

par(mfrow=c(2,2))
for (i in seq_len(ncol(proxy.ambient))) {
    true <- ambient[,i]
    proxy <- assay(proxy.ambient)[,i]
    logged <- edgeR::cpm(cbind(proxy, true), log=TRUE, prior.count=2)
    logFC <- logged[,1] - logged[,2]
    abundance <- rowMeans(logged)
    plot(abundance, logFC, main=paste("Sample", i))
}
MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the _Tal1_ chimera dataset.

Figure 5.1: MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the Tal1 chimera dataset.

Alternatively, we may choose to mitigate the effect of ambient contamination by focusing on label-specific DEGs. Contamination-driven DEGs should be systematically present in comparisons for all labels, and thus can be eliminated by simply ignoring all genes that are significant in a majority of these comparisons (Section 4.5.2). The obvious drawback of this approach is that it discounts genuine DEGs that have a consistent effect in most/all labels, though one could perhaps argue that such “global” DEGs are not particularly interesting anyway. It is also complicated by fluctuations in detection power across comparisons involving different numbers of cells - or replicates, after filtering pseudo-bulk profiles by the number of cells.

res.tal1 <- pseudoBulkSpecific(summed.tal1, 
    label=summed.tal1$label,
    design=~factor(block) + tomato,
    coef="tomatoTRUE",
    condition=summed.tal1$tomato)

# Inspecting our neural crest results again.
tab.neural.again <- res.tal1[["Neural crest"]]
head(tab.neural.again[order(tab.neural.again$PValue),], 10)
## DataFrame with 10 rows and 6 columns
##                   logFC    logCPM         F      PValue         FDR
##               <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Fdps           0.981454   7.21805  347.7937 2.09662e-76 2.23290e-72
## Msmo1          1.493754   5.43923  290.1148 5.37126e-62 2.86019e-58
## Hmgcs1         1.250254   5.70837  243.4674 2.74677e-54 9.75105e-51
## Idi1           1.173814   5.37688  174.9294 1.27276e-39 3.38874e-36
## Gt(ROSA)26Sor  1.481292   5.71617  340.0404 2.53519e-31 5.39996e-28
## Sox9           0.537660   7.17373  102.4739 5.61048e-24 9.95861e-21
## Nkd1           0.719092   5.92690   92.9524 4.34496e-20 6.61054e-17
## Insig1         1.257329   4.06887   82.7426 1.10092e-19 1.46559e-16
## Fdft1          0.841060   5.32293   87.1582 3.61959e-19 4.28318e-16
## Acat2          0.508872   6.80012   76.0489 3.17888e-18 3.38551e-15
##               OtherAverage
##                  <numeric>
## Fdps            -0.0902359
## Msmo1            0.0269848
## Hmgcs1          -0.0616961
## Idi1            -0.0820029
## Gt(ROSA)26Sor    0.5408830
## Sox9            -0.0426835
## Nkd1             0.0332958
## Insig1          -0.2879254
## Fdft1            0.0336545
## Acat2           -0.0520168
# By comparison, the hemoglobins are all the way at the bottom.
head(tab.neural.again[is.hbb,], 10)
## DataFrame with 8 rows and 6 columns
##             logFC    logCPM          F    PValue       FDR OtherAverage
##         <numeric> <numeric>  <numeric> <numeric> <numeric>    <numeric>
## Hbb-bt   -7.76723   1.33059    57.2819 1.0000000  1.000000     -7.86818
## Hbb-bs   -5.84809   3.42835   236.0230 1.0000000  1.000000     -8.15478
## Hbb-bh2        NA        NA         NA        NA        NA     -8.98450
## Hbb-bh1  -8.09110   9.15972 10479.9859 0.9298044  1.000000     -8.08037
## Hbb-y    -8.41561   8.35705  7129.2261 0.4498399  1.000000     -8.21261
## Hba-x    -7.72481   8.53284  7658.5945 0.0953835  0.819654     -7.38506
## Hba-a1   -8.59664   6.74429  2748.7378 0.2942747  1.000000     -8.05722
## Hba-a2   -8.86621   5.81300  1468.4069 0.2377481  1.000000     -7.96511

The common theme here is that, in the absence of an ambient profile, we are using all labels as a proxy for the ambient effect. This can have unpredictable consequences as the results for each label are now dependent on the behavior of the entire dataset. For example, the metrics are susceptible to the idiosyncrasies of clustering where one cell type may be represented in multple related clusters that distort the percentages in up.de and down.de or the average log-fold change. The metrics may also be invalidated in analyses of a subset of the data - for example, a subclustering analysis focusing on a particular cell type may mark all relevant DEGs as problematic because they are consistently DE in all subtypes.

5.3 Subtracting ambient counts

It is worth commenting on the seductive idea of subtracting the ambient counts from the pseudo-bulk samples. This may seem like the most obvious approach for removing ambient contamination, but unfortunately, subtracted counts have unpredictable statistical properties due the distortion of the mean-variance relationship. Minor relative fluctuations at very large counts become large fold-changes after subtraction, manifesting as spurious DE in genes where a substantial proportion of counts is derived from the ambient solution. For example, several hemoglobin genes retain strong DE even after subtraction of the scaled ambient profile.

scaled.ambient <- controlAmbience(counts(summed.neural), ambient,
    features=is.hbb,  mode="profile")
subtracted <- counts(summed.neural) - scaled.ambient
subtracted <- round(subtracted)
subtracted[subtracted < 0] <- 0
subtracted[is.hbb,]
##         [,1] [,2] [,3] [,4]
## Hbb-bt     0    0    7   18
## Hbb-bs     1    2   31   42
## Hbb-bh2    0    0    0    0
## Hbb-bh1    2    0    0    0
## Hbb-y      0    0   39  107
## Hba-x      1    1    0    0
## Hba-a1     0    0  365  452
## Hba-a2     0    0  314  329

Another tempting approach is to use interaction models to implicitly subtract the ambient effect during GLM fitting. The assumption is that, for a genuine DEG, the log-fold change within cells is larger in magnitude than that in the ambient solution. This is based on the expectation that any DE in the latter is “diluted” by contributions from cell types where that gene is not DE. Unfortunately, this is not always the case; a DE analysis of the ambient counts indicates that the hemoglobin log-fold change is actually stronger in the neural crest cells compared to the ambient solution, which leads to the rather awkward conclusion that the WT neural crest cells are expressing hemoglobin beyond that explained by ambient contamination. (This is probably an artifact of how cell calling is performed.)

library(edgeR)
y.ambient <- DGEList(ambient, samples=colData(summed.neural))
y.ambient <- y.ambient[filterByExpr(y.ambient, group=y.ambient$samples$tomato),]
y.ambient <- calcNormFactors(y.ambient)

design <- model.matrix(~factor(block) + tomato, y.ambient$samples)
y.ambient <- estimateDisp(y.ambient, design)
fit.ambient <- glmQLFit(y.ambient, design, robust=TRUE)
res.ambient <- glmQLFTest(fit.ambient, coef=ncol(design))

summary(decideTests(res.ambient))
##        tomatoTRUE
## Down         1882
## NotSig       7717
## Up           1639
topTags(res.ambient, n=10)
## Coefficient:  tomatoTRUE 
##          logFC logCPM     F    PValue       FDR
## Hbb-y   -5.267 12.803 14972 1.344e-57 1.510e-53
## Hbb-bh1 -5.075 13.725 13520 1.272e-56 7.148e-53
## Hba-x   -4.827 13.122 13120 2.469e-56 9.249e-53
## Hba-a1  -4.662 10.734 10625 2.567e-54 7.213e-51
## Hba-a2  -4.521  9.480  7956 1.491e-51 3.350e-48
## Blvrb   -4.319  7.649  3965 6.417e-45 1.202e-41
## Xist    -4.376  7.484  3796 1.666e-44 2.675e-41
## Gypa    -5.138  7.213  3754 2.122e-44 2.981e-41
## Car2    -3.499  8.534  4089 8.048e-44 1.005e-40
## Hbb-bs  -4.941  7.209  3514 8.991e-44 1.010e-40

In addition, there are other issues with implicit subtraction in the fitted GLM that warrant caution with its use. This strategy precludes detection of DEGs that are common to all cell types as there is no longer a dilution effect being applied to the log-fold change in the ambient solution. It requires inclusion of the ambient profiles in the model, which is cause for at least some concern as they are unlikely to have the same degree of variability as the cell-derived pseudo-bulk profiles. Interpretation is also complicated by the fact that we are only interested in log-fold changes that are more extreme in the cells compared to the ambient solution; a non-zero interaction term is not sufficient for removing spurious DE.

See also comments in Advanced Section 7.3 for more comments on the removal of ambient contamination, mostly for visualization purposes.

Session Info

R version 4.4.0 beta (2024-04-15 r86425)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] edgeR_4.2.0                  limma_3.60.0                
 [3] DropletUtils_1.24.0          scran_1.32.0                
 [5] scuttle_1.14.0               MouseGastrulationData_1.17.1
 [7] SpatialExperiment_1.14.0     SingleCellExperiment_1.26.0 
 [9] SummarizedExperiment_1.34.0  Biobase_2.64.0              
[11] GenomicRanges_1.56.0         GenomeInfoDb_1.40.0         
[13] IRanges_2.38.0               S4Vectors_0.42.0            
[15] BiocGenerics_0.50.0          MatrixGenerics_1.16.0       
[17] matrixStats_1.3.0            BiocStyle_2.32.0            
[19] rebook_1.14.0               

loaded via a namespace (and not attached):
  [1] DBI_1.2.2                 CodeDepends_0.6.6        
  [3] rlang_1.1.3               magrittr_2.0.3           
  [5] compiler_4.4.0            RSQLite_2.3.6            
  [7] dir.expiry_1.12.0         DelayedMatrixStats_1.26.0
  [9] png_0.1-8                 vctrs_0.6.5              
 [11] pkgconfig_2.0.3           crayon_1.5.2             
 [13] fastmap_1.1.1             dbplyr_2.5.0             
 [15] magick_2.8.3              XVector_0.44.0           
 [17] utf8_1.2.4                rmarkdown_2.26           
 [19] graph_1.82.0              UCSC.utils_1.0.0         
 [21] purrr_1.0.2               bit_4.0.5                
 [23] xfun_0.43                 bluster_1.14.0           
 [25] zlibbioc_1.50.0           cachem_1.0.8             
 [27] beachmat_2.20.0           jsonlite_1.8.8           
 [29] blob_1.2.4                highr_0.10               
 [31] rhdf5filters_1.16.0       DelayedArray_0.30.0      
 [33] Rhdf5lib_1.26.0           BiocParallel_1.38.0      
 [35] cluster_2.1.6             irlba_2.3.5.1            
 [37] parallel_4.4.0            R6_2.5.1                 
 [39] bslib_0.7.0               jquerylib_0.1.4          
 [41] Rcpp_1.0.12               bookdown_0.39            
 [43] knitr_1.46                R.utils_2.12.3           
 [45] splines_4.4.0             igraph_2.0.3             
 [47] Matrix_1.7-0              tidyselect_1.2.1         
 [49] abind_1.4-5               yaml_2.3.8               
 [51] codetools_0.2-20          curl_5.2.1               
 [53] lattice_0.22-6            tibble_3.2.1             
 [55] withr_3.0.0               KEGGREST_1.44.0          
 [57] BumpyMatrix_1.12.0        evaluate_0.23            
 [59] BiocFileCache_2.12.0      ExperimentHub_2.12.0     
 [61] Biostrings_2.72.0         pillar_1.9.0             
 [63] BiocManager_1.30.22       filelock_1.0.3           
 [65] generics_0.1.3            BiocVersion_3.19.1       
 [67] sparseMatrixStats_1.16.0  glue_1.7.0               
 [69] metapod_1.12.0            tools_4.4.0              
 [71] AnnotationHub_3.12.0      BiocNeighbors_1.22.0     
 [73] ScaledMatrix_1.12.0       locfit_1.5-9.9           
 [75] XML_3.99-0.16.1           rhdf5_2.48.0             
 [77] grid_4.4.0                AnnotationDbi_1.66.0     
 [79] GenomeInfoDbData_1.2.12   HDF5Array_1.32.0         
 [81] BiocSingular_1.20.0       cli_3.6.2                
 [83] rsvd_1.0.5                rappdirs_0.3.3           
 [85] fansi_1.0.6               S4Arrays_1.4.0           
 [87] dplyr_1.1.4               R.methodsS3_1.8.2        
 [89] sass_0.4.9                digest_0.6.35            
 [91] dqrng_0.3.2               SparseArray_1.4.0        
 [93] rjson_0.2.21              R.oo_1.26.0              
 [95] memoise_2.0.1             htmltools_0.5.8.1        
 [97] lifecycle_1.0.4           httr_1.4.7               
 [99] statmod_1.5.0             mime_0.12                
[101] bit64_4.0.5              

References

Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1st Human Cell Atlas Jamboree, and J. Marioni. 2019. “EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data.” Genome Biol. 20 (1): 63.

Pijuan-Sala, B., J. A. Griffiths, C. Guibentif, T. W. Hiscock, W. Jawaid, F. J. Calero-Nieto, C. Mulas, et al. 2019. “A Single-Cell Molecular Map of Mouse Gastrulation and Early Organogenesis.” Nature 566 (7745): 490–95.

Young, M. D., and S. Behjati. 2018. “SoupX Removes Ambient RNA Contamination from Droplet Based Single Cell RNA Sequencing Data.” bioRxiv.