## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 1
## ENSG00000087086 FTL 6.24421 1.000000 2.67120 1
## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 1
## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 2
## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 2
## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 2
## ENSG00000101439 CST3 3.93963 0.998729 2.38272 2
## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 4
## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 5
## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 5
```
``` r
# min.rank <= 5 means represents the union of the top 5 genes from each
# pairwise comparison between our chosen cluster and every other cluster.
chosen.markers.10x$Symbol[chosen.markers.10x$cohens.d.min.rank <= 5]
```
```
## [1] "LYZ" "FTL" "TYROBP" "S100A9"
## [5] "S100A8" "CTSS" "CST3" "FCN1"
## [9] "RP11-1143G9.4" "S100A12" "NEAT1"
```
The flexibility to choose between different summary statistics is one of the strengths of our pairwise strategy.
This allows us to explore different rankings of markers depending on our preferences.
For example, the min-rank is a conservative choice as it guarantees separation of our cluster of interest, at the cost of including weaker DE genes in the top set of markers;
while the minimum provides an aggressive ranking that focuses on markers that are uniquely expressed in our cluster (if any exist).
Pairwise comparisons are also robust to differences in the relative number of cells between clusters,
which ensures that a single large cluster does not dominate the calculation of effect sizes for all other clusters.
## Visualizing marker genes
At this point, we suppose that we ought to create some figures to keep everyone entertained.
We have already demonstrated how we can examine the distribution of expression values with violin plots in Figure \@ref(fig:marker-violins-pbmc).
Another option is to color our $t$-SNE plot according to the log-expression values of a specific marker in each cell (Figure \@ref(fig:marker-tsne-pbmc)).
Any heterogeneity in expression within our cluster might be indicative of internal structure.
``` r
gridExtra::grid.arrange(
plotReducedDim(sce.nn.10x, "TSNE", colour_by="clusters"),
plotReducedDim(sce.nn.10x, "TSNE", colour_by=chosen.markers.10x$Symbol[1], swap_rownames="Symbol"),
ncol=2
)
```
The heatmap is a classic visualization in genomics, and scRNA-seq is no exception (Figure \@ref(fig:marker-heat-pbmc)).
This provides a compact summary of the relative expression of multiple markers across the cell population.
Ideally, each marker should be consistently upregulated within our cluster of interest compared to the rest of the cells in the population.
``` r
plotHeatmap(sce.nn.10x, features=chosen.markers.10x$Symbol[1:10], order_columns_by="clusters", swap_rownames="Symbol", center=TRUE)
```
Another popular visualization is the *[seurat](https://CRAN.R-project.org/package=seurat)*-style "dot plot"^[
Back in my day, a "dot plot" referred to a visualization of a pairwise sequence alignment.
Bet most of you young'uns don't know about that.],
also known as a bubble plot (Figure \@ref(fig:marker-dot-pbmc)).
This is more concise than the heatmap as it uses the size of each dot/bubble to represent the proportion of cells with detected expression.
Personally, we disapprove of using variable areas to represent data as this can generate misleading visualizations^[
For starters, an $X$-fold change to the area of the bubble is only a $\sqrt{X}$-change to the radius, which requires more mental arithmetic when comparing adjacent bubbles.
Also, if a bubble is too small, we can't see its contents clearly, effectively discarding the data encoded by its color.
It gets even worse for certain configurations of the bubble plot where the color represents the mean of the _non-zero_ expression values;
here, we need to mentally integrate the color of each bubble by its area to get the mean expression for comparisons between clusters.],
but hey, to each their own.
``` r
plotDots(sce.nn.10x, features=chosen.markers.10x$Symbol[1:10], group="clusters", swap_rownames="Symbol", center=TRUE)
```
## Using a log-fold change threshold
The Cohen's $d$ and AUC consider both the magnitude of the difference between clusters as well as the variability within each cluster.
If the variability is low, it is possible for a gene to have a large effect size even if the magnitude of the difference is small.
These genes tend to be uninformative for cell type identification, e.g., ribosomal protein genes.
We would prefer genes with larger log-fold changes between clusters, even if they have higher variability [@mccarthy2009treat].
To favor the detection of such genes, we can compute the effect sizes relative to a log-fold change threshold.
The definition of Cohen's $d$ is generalized to the standardized difference between the observed log-fold change and the specified threshold.
Similarly, the AUC is redefined as the probability of randomly picking an expression value from one cluster that is greater than a random value from the other cluster plus the threshold.
A large positive Cohen's $d$ and an AUC above 0.5 can only be obtained if the observed log-fold change between clusters is significantly greater than the threshold.
(However, a negative Cohen's $d$ or AUC below 0.5 may not represent downregulation; it may just indicate that the observed log-fold change is less than the specified threshold.)
``` r
markers.threshold.10x <- scoreMarkers.se(
sce.nn.10x,
sce.nn.10x$clusters,
extra.columns='Symbol',
more.marker.args=list(threshold=2)
)
chosen.markers.threshold.10x <- markers.threshold.10x[[chosen.cluster]]
# Default ordering by the mean of Cohen's d.
previewMarkers(chosen.markers.threshold.10x, pre.columns="Symbol", post.columns="cohens.d.mean")
```
```
## DataFrame with 10 rows and 5 columns
## Symbol mean detected lfc cohens.d.mean
##
## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 3.777706
## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 2.556023
## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 2.411830
## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 1.190628
## ENSG00000087086 FTL 6.24421 1.000000 2.67120 1.158359
## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 0.836300
## ENSG00000101439 CST3 3.93963 0.998729 2.38272 0.703171
## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.686771
## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.614933
## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.517954
```
``` r
# Also looking at the order by the mean of the AUCs.
previewMarkers(chosen.markers.threshold.10x, pre.columns="Symbol", order.by="auc.mean")
```
```
## DataFrame with 10 rows and 5 columns
## Symbol mean detected lfc auc.mean
##
## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 0.929055
## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 0.928191
## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 0.880932
## ENSG00000087086 FTL 6.24421 1.000000 2.67120 0.807194
## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 0.713752
## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.686833
## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.678680
## ENSG00000101439 CST3 3.93963 0.998729 2.38272 0.662480
## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.656001
## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 0.640162
```
In general, we only use a threshold if irrelevant genes with low variances are interfering with our interpretation of the clusters.
Weakly expressed genes will often have low log-fold changes due to the pseudo-count shrinkage (Chapter \@ref(normalization)),
and genes that separate closely-related clusters will usually have smaller log-fold changes.
Prematurely using a large threshold will prevent the detection of these potentially interesting genes.
## Blocking on uninteresting factors {#marker-block}
Larger datasets may contain multiple blocks of cells where the differences between blocks are uninteresting, e.g., batch effects, variability between donors.
These differences can interfere with marker gene detection by
(i) inflating the variance within each cluster and
(ii) distorting the log-fold changes if the cluster composition varies between blocks.
To avoid these issues, we block on any uninteresting factors when computing the effect sizes.
Let's demonstrate on a mouse trophoblast dataset [@lun2017assessing] generated across two plates,
where any differences between plates are technical and should not be allowed to influence the marker statistics.
``` r
library(scRNAseq)
sce.tropho <- LunSpikeInData("tropho")
sce.tropho$block <- factor(sce.tropho$block)
table(sce.tropho$block) # i.e., plate of origin.
```
```
##
## 20160906 20170201
## 96 96
```
``` r
# Computing the QC metrics. For brevity, we'll skip the spike-ins.
library(scrapper)
is.mito.tropho <- which(any(seqnames(rowRanges(sce.tropho))=="MT"))
sce.qc.tropho <- quickRnaQc.se(sce.tropho, subsets=list(MT=is.mito.tropho), block=sce.tropho$block)
sce.qc.tropho <- sce.tropho[,sce.qc.tropho$keep]
# Computing log-normalized expression values.
sce.norm.tropho <- normalizeRnaCounts.se(sce.qc.tropho, size.factors=sce.qc.tropho$sum, block=sce.qc.tropho$block)
# We now choose the top HVGs.
sce.var.tropho <- chooseRnaHvgs.se(sce.norm.tropho, block=sce.norm.tropho$block)
# Running the PCA on the HVG submatrix.
sce.pca.tropho <- runPca.se(sce.var.tropho, features=rowData(sce.var.tropho)$hvg, block=sce.var.tropho$block)
# Doing some graph-based clustering.
sce.nn.tropho <- runAllNeighborSteps.se(sce.pca.tropho)
```
We set `block=` to instruct `scoreMarkers.se()` to perform the pairwise comparisons separately in each block, i.e., plate.
Specifically, for a comparison between two clusters, we compute one effect size per plate where we only use cells in that plate.
By performing comparisons within each plate, we cancel out any differences between plates so that they do not interfere with our effect sizes.
The per-plate effect sizes are then averaged across plates to obtain a single value per comparison,
using a weighted mean that accounts for the number of cells involved in the comparison in each plate.
A similar average across plates is computed for the mean log-expression and proportion of detected cells.
``` r
markers.tropho <- scoreMarkers.se(sce.nn.tropho, sce.nn.tropho$clusters, block=sce.nn.tropho$block)
previewMarkers(markers.tropho[["1"]])
```
```
## DataFrame with 10 rows and 3 columns
## mean detected lfc
##
## ENSMUSG00000027306 6.50487 0.914894 4.19102
## ENSMUSG00000006398 10.20393 1.000000 1.33125
## ENSMUSG00000084301 6.90226 1.000000 1.30211
## ENSMUSG00000083407 3.02553 0.936170 1.63986
## ENSMUSG00000083907 4.10552 1.000000 1.59130
## ENSMUSG00000001403 9.20855 1.000000 1.99050
## ENSMUSG00000074802 5.14600 0.872340 2.90355
## ENSMUSG00000030867 8.95573 1.000000 2.10757
## ENSMUSG00000048574 8.75497 1.000000 1.03204
## ENSMUSG00000030654 9.94568 1.000000 1.18929
```
By default, we do not explicitly penalize genes that behave inconsistently across blocks.
This is generally unnecessary as the average favors genes with large effect sizes in the same direction in all blocks.
That said, it is theoretically possible for a top marker to have highly variable effect sizes across plates, as long as the average is large.
We can check this by visualizing the expression profile of a gene of interest with respect to the plate (Figure \@ref(fig:markers-violin-block)).
Ideally, our gene would behave consistently across plates.
``` r
plotExpression(
sce.nn.tropho,
features=rownames(markers.tropho[["1"]])[1],
x="clusters",
colour_by="clusters",
other_fields="block"
) +
facet_grid(~block)
```
If we really want to enforce consistent DE across blocks, we can ask `scoreMarkers.se()` to instead compute a quantile instead of a weighted mean.
For example, rankings derived from the minimum effect size across blocks will focus on genes that exhibit large changes in the same direction within each block.
``` r
markers.min.tropho <- scoreMarkers.se(
sce.nn.tropho,
sce.nn.tropho$clusters,
block=sce.nn.tropho$block,
more.marker.args=list(
block.average.policy="quantile",
block.quantile=0 # i.e., minimum.
)
)
previewMarkers(markers.min.tropho[["1"]])
```
```
## DataFrame with 10 rows and 3 columns
## mean detected lfc
##
## ENSMUSG00000027306 6.46814 0.909091 4.083865
## ENSMUSG00000006398 9.99817 1.000000 1.114116
## ENSMUSG00000001403 9.19752 1.000000 1.782523
## ENSMUSG00000084301 6.78050 1.000000 1.113203
## ENSMUSG00000030867 8.73720 1.000000 1.855625
## ENSMUSG00000083907 3.94614 1.000000 1.346573
## ENSMUSG00000029177 8.42776 1.000000 0.690333
## ENSMUSG00000083407 2.61290 0.880000 1.247344
## ENSMUSG00000048574 8.52426 1.000000 0.933816
## ENSMUSG00000084133 7.38314 1.000000 0.894224
```
Blocking in `scoreMarkers.se()` assumes that each pair of clusters is present in at least one block in order to perform a comparison within the block.
In scenarios where cells from two clusters never co-occur in the same block,
the associated pairwise comparison will be impossible and is ignored during calculation of summary statistics.
This can be problematic in rare situations where the blocks are perfectly confounded with the clusters,
though marker detection is likely to be the least of our concerns with an experimental design of this calibre.
## More uses for the marker scores
Our discussion above focuses on genes that are upregulated in our cluster of interest, as these are the easiest to interpret and experimentally validate.
However, a cluster may occasionally be defined by downregulation of some genes relative to the rest of the cell population.
In such cases, we can reverse the rankings to see if there is any consistent downregulation compared to other clusters.
Alternatively, we can recognize that any downregulated genes in cluster $X$ should manifest as upregulated genes in other clusters when compared to $X$.
By using a summary like the min-rank, we guarantee that these genes will show up somewhere, i.e., as markers of other clusters.
(Other summaries are less effective as the upregulation only applies to comparisons against $X$ and may not cause a noticeable increase in the mean/median summary.)
``` r
# Ordering by increasing Cohen's d.
reversed.chosen.markers.10x <- chosen.markers.10x[order(chosen.markers.10x$cohens.d.mean),]
previewMarkers(reversed.chosen.markers.10x, pre.columns="Symbol", post.columns="cohens.d.mean")
```
```
## DataFrame with 10 rows and 5 columns
## Symbol mean detected lfc cohens.d.mean
##
## ENSG00000213741 RPS29 3.946430 1.000000 -1.062005 -2.31360
## ENSG00000177954 RPS27 5.043689 1.000000 -0.830939 -2.26035
## ENSG00000100316 RPL3 3.650995 0.997459 -0.953517 -1.94125
## ENSG00000198242 RPL23A 3.480505 0.996188 -0.996558 -1.93683
## ENSG00000168028 RPSA 1.923204 0.869123 -1.394956 -1.79849
## ENSG00000149273 RPS3 3.569838 0.996188 -0.870643 -1.70630
## ENSG00000227507 LTB 0.579215 0.428208 -1.448969 -1.67086
## ENSG00000071082 RPL31 3.228407 0.987294 -0.893312 -1.57149
## ENSG00000105372 RPS19 4.003734 0.998729 -0.727010 -1.53771
## ENSG00000137154 RPS6 4.130116 0.997459 -0.699234 -1.48973
```
Occasionally, we are only interested in markers for a subset of the clusters.
Imagine that we have a set of closely-related clusters and we want to identify the genes that distinguish these clusters from each other.
The summary statistics generated from all clusters might not be satisfactory as they will not prioritize genes with weak upregulation between related clusters.
(Except for min-rank, where these genes would at least show up near the top of the ranking.
But they would be surrounded by many irrelevant genes from comparisons to other clusters.)
Instead, we can just compute marker scores from the cells in the selected subset of clusters:
``` r
# Let's pretend that clusters 1, 2 and 3 are of particular interest and
# we want to find markers between them.
subset.clusters.10x <- sce.nn.10x$clusters %in% c("1", "2", "3")
subset.markers.10x <- scoreMarkers.se(
sce.nn.10x[,subset.clusters.10x],
sce.nn.10x$clusters[subset.clusters.10x],
extra.columns="Symbol"
)
# Now let's have a look at the top markers for cluster 2.
previewMarkers(subset.markers.10x[["2"]], pre.columns="Symbol")
```
```
## DataFrame with 10 rows and 4 columns
## Symbol mean detected lfc
##
## ENSG00000008517 IL32 3.02931 0.979466 1.977099
## ENSG00000227507 LTB 3.37330 0.991786 1.778829
## ENSG00000277734 TRAC 2.53751 0.973306 1.308343
## ENSG00000168685 IL7R 1.73999 0.839836 1.205725
## ENSG00000167286 CD3D 1.91705 0.942505 0.896106
## ENSG00000166710 B2M 6.25505 1.000000 0.487200
## ENSG00000213741 RPS29 5.40264 1.000000 0.632544
## ENSG00000206503 HLA-A 3.74897 0.997947 0.836841
## ENSG00000116824 CD2 1.21460 0.770021 0.801813
## ENSG00000234745 HLA-B 4.70730 1.000000 0.618141
```
For more control on the selected markers, we can filter our data frame on the available statistics.
For example, we might only consider genes as markers if they have detected proportions above 50%,
a mean log-expression greater than 1,
an average difference in the detected proportions above 50%,
and an average log-fold change above 1.
We tend to avoid _a priori_ filtering as it is difficult to choose thresholds that are generally applicable.
Nonetheless, it can be useful to refine the set of markers once we know what we're interested in.
``` r
filtered.markers.10x <- chosen.markers.10x[
chosen.markers.10x$detected >= 0.5 &
chosen.markers.10x$mean >= 1 &
chosen.markers.10x$delta.detected.mean >= 0.5 &
chosen.markers.10x$delta.mean.mean >= 1,
]
previewMarkers(filtered.markers.10x, pre.columns="Symbol", post.columns=c("delta.detected.mean"))
```
```
## DataFrame with 10 rows and 5 columns
## Symbol mean detected lfc delta.detected.mean
##
## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.718534
## ENSG00000204482 LST1 2.90740 0.988564 2.05035 0.581743
## ENSG00000121552 CSTA 2.30972 0.949174 1.97753 0.729536
## ENSG00000204472 AIF1 2.83022 0.978399 2.09265 0.615761
## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.758067
## ENSG00000163563 MNDA 2.50935 0.960610 2.13786 0.733194
## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.797889
## ENSG00000038427 VCAN 1.95980 0.885642 1.84785 0.787284
## ENSG00000100079 LGALS2 2.10228 0.885642 1.82007 0.705982
## ENSG00000025708 TYMP 2.15622 0.932656 1.56581 0.567270
```
## Invalidity of $p$-values {#marker-p-value-invalidity}
In the old days, we used to report $p$-values along with the effect sizes for the detected markers.
After all, Cohen's $d$ and the AUC are closely related to $t$-tests and the Wilcoxon ranked sum test, respectively.
Unfortunately, the statistical interpretation of $p$-values is compromised when identifying cluster-specific markers.
The first issue is that of "data dredging" (also known as fishing or data snooping) when the DE analysis is performed on the same data used to define the clusters.
We are more likely to get a positive result when we use a dataset to test a hypothesis generated from that data.
Or more simply - clustering will separate cells by expression, so of course we will get low $p$-values when we compare between clusters!
To illustrate, let's simulate i.i.d. normal values, perform $k$-means clustering and test for DE between clusters of cells with Wilcoxon ranked sum tests.
Our input data is random so we'd expect a uniform distribution of $p$-values under the null hypothesis, but instead it is skewed towards low values (Figure \@ref(fig:pval-dist)).
This means that we can detect "significant" differences between clusters even in the absence of any real substructure in the data.
``` r
set.seed(0)
y <- matrix(rnorm(1000000), ncol=200)
clusters <- kmeans(t(y), centers=2)$cluster
out <- apply(y, 1, FUN=function(x) {
wilcox.test(x[clusters==1], x[clusters==2])$p.value
})
hist(out, col="grey80", xlab="p-value", main="")
```
Another problem is that many $p$-value calculations treat counts from different cells in the same cluster as replicate observations.
This is not the most relevant level of replication when cells are derived from the same biological sample, i.e., cell culture, animal or patient.
DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability [@lun2017overcoming].
This is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be repeated.
In other words, if the experiment involved a single biological sample, the sample size is actually just 1, regardless of how many individual cells were assayed.
By treating cells as replicates, we overstate our sample size and obtain much lower $p$-values than would be appropriate.
In short, the $p$-values for marker genes don't make much sense from a statistical perspective.
We can still use them for ranking, but at that point, we might as well make our life simpler and use the effect sizes directly.
If we really want to determine whether some markers are "real", the best approach is to perform a separate validation experiment with an independent replicate cell population.
A typical strategy is to use different experimental techniques like FACS, FISH, qPCR or IHC to find a subpopulation that expresses the marker(s) of interest.
This confirms that the subpopulation actually exists and is not an artifact of the scRNA-seq protocol or the computational analysis.
## Gene set enrichment
We can summarize the biological functions of our top-ranked marker genes with gene set enrichment analyses.
Here, we extract predefined sets of genes for specific pathways or processes and check if any gene set is overrepresented among our set of top markers.
This reduces some of the hassle of manually examining the annotation for each gene to assign biological meaning to each cluster.
To illustrate, we'll use the gene ontology (GO)'s biological process (BP) subcategory,
which defines gene sets associated with known biological processes [@ashburner2000gene].
We might also consider other useful gene set collections like KEGG and REACTOME - see the [MSigDB overview](https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp) for details.
``` r
library(msigdbr)
go.bp.df <- msigdbr(species="Homo sapiens", collection="C5", subcollection="GO:BP")
go.bp.sets <- split(go.bp.df$ensembl_gene, go.bp.df$gs_name)
```
We use the hypergeometric test to quantify enrichment of each gene set among the top markers for cluster 1.
The $p$-value of each gene set is determined by the number of shared genes between each GO set and the top markers, relative to the size of the GO set.
More strongly enriched sets will have lower $p$-values and should be prioritized for interpretation.
Here, we have chosen the top 100 markers but different values can be used depending on how many markers are of interest.
``` r
# Choosing the top 100 genes with positive Cohen's d values.
top.chosen.10x <- head(rownames(chosen.markers.10x)[chosen.markers.10x$cohens.d.mean > 0], 100)
library(scrapper)
enrich.chosen.10x <- testEnrichment(top.chosen.10x, go.bp.sets, universe=rownames(chosen.markers.10x))
# Prioritizing the most enriched gene sets for examination.
enrich.chosen.10x <- enrich.chosen.10x[order(enrich.chosen.10x$p.value),,drop=FALSE]
head(enrich.chosen.10x)
```
```
## DataFrame with 6 rows and 3 columns
## overlap
##
## GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS 42
## GOBP_INFLAMMATORY_RESPONSE 30
## GOBP_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 31
## GOBP_REGULATION_OF_DEFENSE_RESPONSE 28
## GOBP_REGULATION_OF_IMMUNE_SYSTEM_PROCESS 35
## GOBP_POSITIVE_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 25
## size
##
## GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS 1863
## GOBP_INFLAMMATORY_RESPONSE 934
## GOBP_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 1146
## GOBP_REGULATION_OF_DEFENSE_RESPONSE 890
## GOBP_REGULATION_OF_IMMUNE_SYSTEM_PROCESS 1649
## GOBP_POSITIVE_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 671
## p.value
##
## GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS 1.21353e-26
## GOBP_INFLAMMATORY_RESPONSE 5.65548e-23
## GOBP_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 1.42341e-21
## GOBP_REGULATION_OF_DEFENSE_RESPONSE 3.51677e-21
## GOBP_REGULATION_OF_IMMUNE_SYSTEM_PROCESS 4.79163e-21
## GOBP_POSITIVE_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 1.16081e-20
```
``` r
# Focusing on some of the smaller, more specific gene sets.
head(enrich.chosen.10x[enrich.chosen.10x$size < 100,,drop=FALSE])
```
```
## DataFrame with 6 rows and 3 columns
## overlap
##
## GOBP_RESPIRATORY_BURST 7
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN 8
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN 7
## GOBP_RESPONSE_TO_FUNGUS 8
## GOBP_NEUTROPHIL_CHEMOTAXIS 8
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_ANTIGEN 7
## size
##
## GOBP_RESPIRATORY_BURST 40
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN 70
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN 43
## GOBP_RESPONSE_TO_FUNGUS 80
## GOBP_NEUTROPHIL_CHEMOTAXIS 83
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_ANTIGEN 52
## p.value
##
## GOBP_RESPIRATORY_BURST 2.81859e-11
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN 3.67052e-11
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN 4.83654e-11
## GOBP_RESPONSE_TO_FUNGUS 1.10001e-10
## GOBP_NEUTROPHIL_CHEMOTAXIS 1.48591e-10
## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_ANTIGEN 1.96477e-10
```
If we need more detail about a particular set, we can examine the behavior of its constituent genes.
``` r
top.set.10x <- rownames(enrich.chosen.10x)[1]
overlaps.top.set.10x <- intersect(go.bp.sets[[top.set.10x]], top.chosen.10x)
previewMarkers(
chosen.markers.10x[rownames(chosen.markers.10x) %in% overlaps.top.set.10x,],
pre.column="Symbol",
rows=NULL # List all of the top genes in this set.
)
```
```
## DataFrame with 42 rows and 4 columns
## Symbol mean detected lfc
##
## ENSG00000090382 LYZ 5.80862 0.998729 4.47263
## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311
## ENSG00000163220 S100A9 5.63461 0.996188 4.48435
## ENSG00000143546 S100A8 5.37071 1.000000 4.50586
## ENSG00000163131 CTSS 3.81179 0.997459 2.58808
## ... ... ... ... ...
## ENSG00000051523 CYBA 3.266806 0.993647 0.810757
## ENSG00000135046 ANXA1 1.701281 0.836086 0.846713
## ENSG00000116701 NCF2 0.778700 0.564168 0.594712
## ENSG00000103490 PYCARD 1.388051 0.815756 0.733532
## ENSG00000111729 CLEC4A 0.691981 0.519695 0.566143
```
Alternatively, we can aggregate the expression profiles of each set's genes into a single per-cell score.
Scores are defined as the column sums of a rank-1 approximation of the submatrix of the log-expression values corresponding to the genes in the set [@bueno2016comprehensive].
This effectively performs a PCA to collapse the submatrix into a single dimension, enriching for the biological signal associated with the set's annotated function.
The resulting scores are primarily useful for visualizing set activity (Figure \@ref(fig:marker-tsne-gsea)).
We tend not to use gene set scores for quantitative analyses as they are difficult to interpret.
Should the score be higher in a cell that weakly upregulates many genes in the set, or a cell that strongly upregulates a few genes in the set?
What if two cells have the same score but express different subsets of genes in the set?
These complications can be minimized by operating on individual genes whenever possible -
for example, instead of testing for differences in gene set scores between subpopulations,
we could examine the distribution of effect sizes for the same comparison across all genes in the set,
which is easier to interpret and more informative.
``` r
top.set.score.10x <- scoreGeneSet.se(sce.nn.10x, go.bp.sets[[top.set.10x]])
plotReducedDim(sce.nn.10x, "TSNE", colour_by=data.frame(Score=top.set.score.10x$scores))
```
Note that many Bioconductor packages implement methods for quantifying gene set enrichment, e.g., *[fgsea](https://bioconductor.org/packages/3.23/fgsea)*, *[goseq](https://bioconductor.org/packages/3.23/goseq)*, *[limma](https://bioconductor.org/packages/3.23/limma)*, to name a few.
We like the hypergeometric test as it is simple and focuses on the top markers, but any function can be used as long as it can accept a ranking of genes.
In all cases, we would recommend only using the enrichment $p$-values to rank the gene sets, not to make any statements about statistical significance.
Many of these methods compute their $p$-values by assuming that genes are independent under the null hypothesis.
In a biological system with highly coordinated pathways and processes, this is unlikely to be true, potentially inflating the type I error beyond the threshold for significance.
## Session information {-}
``` r
sessionInfo()
```
```
## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] msigdbr_26.1.0 ensembldb_2.35.0
## [3] AnnotationFilter_1.35.0 GenomicFeatures_1.63.2
## [5] AnnotationDbi_1.73.1 scRNAseq_2.25.0
## [7] scater_1.39.4 ggplot2_4.0.2
## [9] scuttle_1.21.6 scrapper_1.5.17
## [11] DropletUtils_1.31.1 SingleCellExperiment_1.33.2
## [13] SummarizedExperiment_1.41.1 Biobase_2.71.0
## [15] GenomicRanges_1.63.2 Seqinfo_1.1.0
## [17] IRanges_2.45.0 S4Vectors_0.49.1
## [19] BiocGenerics_0.57.0 generics_0.1.4
## [21] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [23] DropletTestFiles_1.21.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0
## [3] magrittr_2.0.5 gypsum_1.7.0
## [5] ggbeeswarm_0.7.3 farver_2.1.2
## [7] rmarkdown_2.31 BiocIO_1.21.0
## [9] vctrs_0.7.3 memoise_2.0.1
## [11] Rsamtools_2.27.2 DelayedMatrixStats_1.33.0
## [13] RCurl_1.98-1.18 htmltools_0.5.9
## [15] S4Arrays_1.11.1 AnnotationHub_4.1.0
## [17] curl_7.0.0 BiocNeighbors_2.5.4
## [19] Rhdf5lib_1.33.6 SparseArray_1.11.13
## [21] rhdf5_2.55.16 alabaster.base_1.11.4
## [23] sass_0.4.10 bslib_0.10.0
## [25] alabaster.sce_1.11.0 httr2_1.2.2
## [27] cachem_1.1.0 GenomicAlignments_1.47.0
## [29] lifecycle_1.0.5 pkgconfig_2.0.3
## [31] rsvd_1.0.5 Matrix_1.7-5
## [33] R6_2.6.1 fastmap_1.2.0
## [35] digest_0.6.39 dqrng_0.4.1
## [37] irlba_2.3.7 ExperimentHub_3.1.0
## [39] RSQLite_2.4.6 beachmat_2.27.5
## [41] filelock_1.0.3 labeling_0.4.3
## [43] httr_1.4.8 abind_1.4-8
## [45] compiler_4.6.0 bit64_4.6.0-1
## [47] withr_3.0.2 S7_0.2.1
## [49] BiocParallel_1.45.0 viridis_0.6.5
## [51] DBI_1.3.0 alabaster.ranges_1.11.0
## [53] alabaster.schemas_1.11.0 HDF5Array_1.39.1
## [55] R.utils_2.13.0 rappdirs_0.3.4
## [57] DelayedArray_0.37.1 rjson_0.2.23
## [59] tools_4.6.0 vipor_0.4.7
## [61] otel_0.2.0 beeswarm_0.4.0
## [63] R.oo_1.27.1 glue_1.8.0
## [65] h5mread_1.3.3 restfulr_0.0.16
## [67] rhdf5filters_1.23.3 grid_4.6.0
## [69] gtable_0.3.6 R.methodsS3_1.8.2
## [71] BiocSingular_1.27.1 ScaledMatrix_1.19.0
## [73] XVector_0.51.0 ggrepel_0.9.8
## [75] BiocVersion_3.23.1 pillar_1.11.1
## [77] babelgene_22.9 limma_3.67.1
## [79] dplyr_1.2.1 BiocFileCache_3.1.0
## [81] lattice_0.22-9 rtracklayer_1.71.3
## [83] bit_4.6.0 tidyselect_1.2.1
## [85] locfit_1.5-9.12 Biostrings_2.79.5
## [87] knitr_1.51 gridExtra_2.3
## [89] bookdown_0.46 ProtGenerics_1.43.0
## [91] edgeR_4.9.7 xfun_0.57
## [93] statmod_1.5.1 pheatmap_1.0.13
## [95] UCSC.utils_1.7.1 lazyeval_0.2.3
## [97] yaml_2.3.12 cigarillo_1.1.0
## [99] evaluate_1.0.5 codetools_0.2-20
## [101] tibble_3.3.1 alabaster.matrix_1.11.0
## [103] BiocManager_1.30.27 cli_3.6.6
## [105] jquerylib_0.1.4 GenomeInfoDb_1.47.2
## [107] dichromat_2.0-0.1 Rcpp_1.1.1
## [109] dbplyr_2.5.2 png_0.1-9
## [111] XML_3.99-0.23 parallel_4.6.0
## [113] assertthat_0.2.1 blob_1.3.0
## [115] sparseMatrixStats_1.23.0 bitops_1.0-9
## [117] alabaster.se_1.11.0 viridisLite_0.4.3
## [119] scales_1.4.0 purrr_1.2.2
## [121] crayon_1.5.3 rlang_1.2.0
## [123] cowplot_1.2.0 KEGGREST_1.51.1
```