[["index.html", "Single-cell RNA-seq analysis with scrapper Introduction", " Single-cell RNA-seq analysis with scrapper Aaron Lun Version: 0.99.3 Last updated: 2026-02-11 Last built: 2026-04-15 Introduction Single-cell RNA-sequencing (scRNA-seq) - the name says it all, really. Long story short, we isolate single cells and we sequence their transcriptomes to quantify the expression of each gene in each cell (Kołodziejczyk et al. 2015). Our aim is to explore heterogeneity in a cell population at the resolution of individual cells, typically to identify subpopulations or states that would not be apparent from population-level (i.e., “bulk”) assays. Since its inception, scRNA-seq has emerged as one of the premier techniques for publishing genomics papers. Occasionally, it is even used to do some actual science. This book describes a computational workflow for analyzing scRNA-seq data using the R/Bioconductor ecosystem (Huber et al. 2015). Most of the heavy lifting is performed using the scrapper package, while scater handles the plotting (McCarthy et al. 2017). We rely heavily on Bioconductor data structures like the SingleCellExperiment class, so readers should check out the associated documentation if they haven’t already. Each chapter is devoted to a particular step in the analysis where we provide its theoretical rationale, the associated code, and some typical results from real public datasets. This includes: Quality control, to filter out cells that were damaged or not properly sequenced. Normalization, to remove cell-specific biases. Feature selection, to identify genes with interesting biological variation. Principal components analysis, to compact and denoise the data. Visualization, to generate the all-important Figure 1 of our manuscript. Clustering, to summarize the data into groups of similar cells. Marker detection, to assign biological meaning to each cluster based on its upregulated genes. 
Much of this content was scraped together from the “Orchestrating Single-Cell Analysis with Bioconductor” (OSCA) series of books (Amezquita et al. 2020) that were primarily based on the older scran package. scrapper is just a rewrite of the most important parts of scran with improved efficiency and less historical baggage. Similarly, this book is a more streamlined rewrite of the OSCA books that (hopefully) will be easier to read and run. Truth be told, you don’t actually need to read this book if you don’t care about how/why things are done. Just copy and paste the following into your R session: # Pulling out an example dataset. library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() # Running the full analysis pipeline. library(scrapper) is.mito.zeisel &lt;- grep(&quot;^mt-&quot;, rownames(sce.zeisel)) res.zeisel &lt;- analyze.se(sce.zeisel, rna.qc.subsets=list(MT=is.mito.zeisel)) # Visualizing the cluster assignments for each cell: library(scater) plotReducedDim(res.zeisel$x, &quot;TSNE&quot;, colour_by=&quot;graph.cluster&quot;) # Looking at the top markers for cluster 1: previewMarkers(res.zeisel$markers$rna[[&quot;1&quot;]]) ## DataFrame with 10 rows and 3 columns ## mean detected lfc ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Gad1 4.79503 1.000000 4.56949 ## Gad2 4.44192 0.996503 4.25766 ## Ndrg4 4.40310 0.996503 2.59179 ## Vstm2a 2.94119 0.965035 2.67985 ## Stmn3 4.71546 0.993007 2.64538 ## Slc6a1 3.75820 0.993007 3.08908 ## Tspyl4 3.36568 1.000000 2.15128 ## Nap1l5 4.32495 1.000000 3.09812 ## Rab3c 3.91161 0.982517 2.98746 ## Slc32a1 2.04411 0.909091 2.01340 And that’s it. Sometimes, ignorance is bliss, and it’s better to not know how the sausage is made. But hey - you’re already here, so why not keep reading? Any questions can be posted at the Bioconductor support site or the GitHub page for this book. 
References "],["quality-control.html", "Chapter 1 Quality Control 1.1 Motivation 1.2 Common choices of QC metrics 1.3 Identifying low-quality cells 1.4 Creating diagnostic plots 1.5 Blocking on experimental factors 1.6 Skipping quality control Session Info", " Chapter 1 Quality Control 1.1 Motivation In a typical scRNA-seq dataset, some of the cells will be of “low quality” for one reason or another. Perhaps the cells were damaged during dissociation, or maybe library preparation was not performed efficiently. Such low-quality libraries are problematic as they can contribute to misleading results in downstream analyses: They form their own distinct cluster(s), complicating interpretation of the results. Low-quality libraries generated from different cell types can cluster together based on similarities in the damage-induced expression profiles, creating artificial intermediate states or trajectories between otherwise distinct subpopulations. They interfere with quantification of population heterogeneity during variance estimation or principal components analysis. The first few principal components will capture differences in quality rather than biology, reducing the effectiveness of dimensionality reduction. Similarly, genes with the largest variances will be driven by differences between low- and high-quality cells. Many low-quality libraries have small total counts and are scaled up aggressively during normalization. This inflates the apparent expression of genes with non-zero counts in such libraries, which further contributes to inflated variances and formation of artificial clusters. To mitigate these problems, we remove the problematic cells before proceeding with the rest of our analysis. This step is commonly referred to as quality control (QC) on the cells. (We will use “library” and “cell” interchangeably in this chapter; the distinction is more important for droplet-based data, where some libraries may not contain cells.) 
1.2 Common choices of QC metrics We use several common QC metrics to identify low-quality cells based on their expression profiles. These metrics are described below in terms of reads for Smart-seq2 data (Picelli et al. 2014), but the same definitions apply to UMI data generated by other technologies like MARS-seq and droplet-based protocols (Islam et al. 2014; Jaitin et al. 2014; Macosko et al. 2015; Klein et al. 2015). The library size is defined as the total sum of counts across all relevant features for each cell. Typically, the relevant features are the endogenous genes, excluding other feature types (e.g., spike-ins) or modalities (e.g., antibody-derived tags). Cells with small library sizes are likely to be of low quality as the RNA has been lost at some point during library preparation, either due to cell lysis or inefficient cDNA capture and amplification. The number of expressed features in each cell is defined as the number of endogenous genes with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. The proportion of reads mapped to genes in the mitochondrial genome is defined relative to the library size in each cell (Islam et al. 2014; Ilicic et al. 2016). The reasoning is that, in the presence of modest damage, perforations in the cell membrane permit efflux of cytoplasmic transcript molecules but are too small to allow mitochondria to escape. This leads to a relative enrichment of mitochondrial transcripts in libraries corresponding to damaged cells. (For single-nuclei RNA-seq experiments, high proportions are also useful as they represent cells where the cytoplasm has not been successfully stripped.) If spike-in transcripts were used in the experiment, the proportion of reads mapped to spike-ins is defined relative to the library size plus the total spike-in count in each cell. 
As the same amount of spike-in RNA should have been added to each cell, any enrichment in spike-in counts indicates that endogenous RNA was lost. Thus, high spike-in proportions are indicative of poor-quality cells where endogenous RNA has been lost due to, e.g., partial cell lysis or RNA degradation during dissociation. To demonstrate, we’ll use a small scRNA-seq dataset from Lun et al. (2017), which is provided with no prior QC steps. Happily enough, this dataset also contains spike-in transcripts so we can compute the spike-in proportions for each cell. These days, spike-ins are rare as they don’t work well in high-throughput scRNA-seq protocols; but this 416B dataset was generated from a good old-fashioned plate-based protocol, so if we’ve got spike-in data, we might as well use it. library(scRNAseq) sce.416b &lt;- LunSpikeInData(&quot;416b&quot;) sce.416b ## class: SingleCellExperiment ## dim: 46604 192 ## metadata(0): ## assays(1): counts ## rownames(46604): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000095742 CBFB-MYH11-mcherry ## rowData names(1): Length ## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 ## colData names(8): cell line cell type ... spike-in addition block ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(2): ERCC SIRV # Finding all the mitochondrial transcripts first. location.416b &lt;- rowRanges(sce.416b) is.mito.416b &lt;- which(any(seqnames(location.416b)==&quot;MT&quot;)) length(is.mito.416b) ## [1] 37 # And then computing the QC metrics from our count matrix. library(scrapper) sce.qc.416b &lt;- quickRnaQc.se( sce.416b, subsets=list(MT=is.mito.416b), altexp.proportions=&quot;ERCC&quot; # omit this if no spike-ins are present. ) summary(sce.qc.416b$sum) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 27084 856350 1111252 1165865 1328301 4398883 summary(sce.qc.416b$detected) ## Min. 
1st Qu. Median Mean 3rd Qu. Max. ## 5609 7502 8341 8397 9208 11380 summary(sce.qc.416b$subset.proportion.MT) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.04593 0.07294 0.08139 0.08146 0.09035 0.15617 summary(sce.qc.416b$subset.proportion.ERCC) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.02263 0.04389 0.06161 0.06622 0.08451 0.20765 A key assumption here is that these QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, such that the removal of the corresponding cells will not misrepresent the biology in downstream analyses. In heterogeneous datasets, this assumption is unlikely to be true: Some cell types have systematically less RNA content or more mitochondria (see Figure 3A of Germain, Sonrel, and Robinson (2020)). For example, CD8+ T cells increase RNA synthesis upon stimulation while hepatocytes contain more mitochondria to power their metabolic activities. Even if all cell types have the same total RNA content and mitochondria counts, certain cell types may be less amenable to the scRNA-seq protocol. The obvious example is that of neurons, which are easily damaged during dissociation and often have poorer values for the QC metrics. Major violations of this assumption could result in the loss of entire cell types prior to downstream analysis. We can check for such violations using diagnostic plots described in Section 1.4, but for now, let’s just hope for the best and proceed through this chapter. 1.3 Identifying low-quality cells 1.3.1 With adaptive thresholds Once we have some QC metrics, we need to define thresholds with which we can separate low- and high-quality cells. With the adaptive threshold strategy, we assume that most of our dataset consists of high-quality cells. 
This is usually reasonable and can be experimentally verified in some situations, e.g., by visually checking that each cell is intact on a microwell plate. We then identify cells that are outliers for any of the QC metrics, based on the median absolute deviation (MAD) from the median value of each metric across all cells. By default, we consider a value to be an outlier if it is more than 3 MADs from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution. qc.thresh.416b &lt;- metadata(sce.qc.416b)$qc$thresholds qc.thresh.416b ## $sum ## [1] 434082.9 ## ## $detected ## [1] 5231.468 ## ## $subset.proportion ## MT ERCC ## 0.1191734 0.1454460 quickRnaQc.se() computes a MAD-based outlier threshold for each metric. For the library sizes and number of expressed genes, a lower threshold is defined after log-transforming the metrics. This improves the normality of right-skewed distributions to justify the 99% rationale mentioned above. It also avoids defining a negative threshold that would be meaningless for a non-negative metric. Note that only the MAD calculations are on the log-scale - the thresholds reported in qc.thresh.416b are always on the original scale. For the mitochondrial/spike-in proportions, an upper threshold is defined without any transformation of the metrics. In particular, we do not log-transform as this would inflate the MAD by converting near-zero proportions to large negative log-values. We apply these thresholds to filter for high-quality cells where the relevant metrics are above/below their respective lower/upper thresholds: # The &#39;keep&#39; column in the colData is added by quickRnaQc.se() and indicates # whether a cell should be kept after QC filtering. summary(sce.qc.416b$keep) ## Mode FALSE TRUE ## logical 6 186 # Subsetting our SingleCellExperiment to only retain high-quality # cells in the downstream analysis steps. 
sce.filt.416b &lt;- sce.qc.416b[,sce.qc.416b$keep] ncol(sce.filt.416b) ## [1] 186 These outlier-based thresholds adapt to both the location and spread of the distribution of values for a given metric. This enables the QC procedure to adjust to changes in sequencing depth, cDNA capture efficiency, mitochondrial content, etc. without any user intervention or prior experience. The use of the MAD also improves robustness to dependencies between the QC metrics and the underlying biology, where some cell types have extreme QC metrics due to their biology. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the risk of inadvertently removing those cell types (at the cost of reducing power to remove actual low-quality cells). Keep in mind that the underlying assumption of a high-quality majority may not always be appropriate. If most cells are of (unacceptably) low quality, the adaptive thresholds will fail as - by definition - they cannot remove the majority of cells. Of course, what is “acceptable” or not is rather context-dependent, e.g., small library sizes for embryonic stem cells might be problematic but the same distribution would be perfectly satisfactory for a dataset of naive T cells. In practice, this assumption is convenient as it ensures that we always retain most cells for our downstream analyses. 1.3.2 With fixed thresholds A simpler approach to identify low-quality cells involves applying fixed thresholds to the QC metrics. For example, we might consider cells to be low quality if they have library sizes below 100000 reads; express fewer than 5000 genes; have spike-in proportions above 10%; or have mitochondrial proportions above 10%. 
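As a rough illustration of the rule just described, the sketch below applies the same four cut-offs to a handful of made-up metric values in base R. The data frame toy.qc and its column names are hypothetical and only mimic the kind of per-cell metrics discussed in this chapter; it is not meant to show how quickRnaQc.se() works internally.

```r
# Hypothetical per-cell QC metrics for five libraries (made-up values).
toy.qc <- data.frame(
    sum = c(1200000, 80000, 950000, 400000, 1500000),
    detected = c(8300, 4200, 7900, 6100, 9000),
    mito = c(0.08, 0.09, 0.15, 0.07, 0.06),
    ercc = c(0.05, 0.06, 0.04, 0.12, 0.03)
)

# A cell is kept only if it passes every fixed threshold:
# lower bounds for the counts, upper bounds for the proportions.
toy.keep <- toy.qc$sum >= 1e5 &
    toy.qc$detected >= 5e3 &
    toy.qc$mito <= 0.1 &
    toy.qc$ercc <= 0.1

table(toy.keep)
```

Failing any single metric is enough to discard a cell, which is why fixed thresholds can remove many more cells than expected if even one cut-off is poorly chosen for the protocol at hand.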
We can supply these numbers directly to quickRnaQc.se() to force the function to use our thresholds: sce.fixed.416b &lt;- quickRnaQc.se( sce.416b, subsets=list(MT=is.mito.416b), altexp.proportions=&quot;ERCC&quot;, thresholds=list( sum = 1e5, detected = 5e3, subsets = c(MT = 0.1, ERCC = 0.1) ) ) summary(sce.fixed.416b$keep) ## Mode FALSE TRUE ## logical 39 153 This strategy is intuitive but requires some experience to determine appropriate thresholds for each experimental protocol and biological system. Thresholds for read count-based data are not applicable for UMI-based data, and vice versa. Differences in mitochondrial activity or total RNA content require constant adjustment of the mitochondrial and spike-in thresholds, respectively, for different biological systems. Even with the same protocol and system, the appropriate threshold can vary between runs due to fluctuations in cDNA capture efficiency and sequencing depth per cell. 1.4 Creating diagnostic plots It’s prudent to inspect the distributions of QC metrics (Figure 1.1) to identify possible problems. In the most ideal case, we would see normal distributions that would justify the 3 MAD threshold used in outlier detection. A large proportion of cells in another mode suggests that the QC metrics might be correlated with some biological state, potentially leading to the loss of distinct cell types during filtering; or that there were inconsistencies with library preparation for a subset of cells, which is not uncommon in plate-based protocols. 
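To make the 3 MAD rationale a little more concrete, here is a small base R sketch on simulated library sizes that are normally distributed on the log-scale. It assumes that the MAD is scaled by the usual constant of 1.4826 (the default in stats::mad()) so that it estimates the standard deviation under normality. The simulated values and the names sim.sum, log.sum and lower.thresh are purely illustrative, and the calculation is not guaranteed to match what scrapper does internally.

```r
# Under a normal distribution, a one-sided cut at 3 standard deviations
# keeps this fraction of values.
pnorm(3)

# Simulated library sizes that are normal on the log-scale (made-up parameters).
set.seed(42)
sim.sum <- exp(rnorm(10000, mean = log(1e6), sd = 0.3))

# Lower threshold computed on the log-scale, then mapped back to the
# original scale so that it can never be negative.
log.sum <- log(sim.sum)
lower.thresh <- exp(median(log.sum) - 3 * mad(log.sum))

# Fraction of simulated cells retained by the threshold.
mean(sim.sum >= lower.thresh)
```

With real data, a retained fraction far below this expectation would hint at a second mode or a heavy tail, which is exactly what the diagnostic plots are meant to reveal.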
library(scater) gridExtra::grid.arrange( plotColData(sce.qc.416b, y=&quot;sum&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.416b$sum, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(sce.qc.416b, y=&quot;detected&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.416b$detected, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(sce.qc.416b, y=&quot;subset.proportion.MT&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.416b$subset.proportion[&quot;MT&quot;], linetype=&quot;dashed&quot;, color=&quot;red&quot;) + ggtitle(&quot;Mito prop&quot;), plotColData(sce.qc.416b, y=&quot;subset.proportion.ERCC&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.416b$subset.proportion[&quot;ERCC&quot;], linetype=&quot;dashed&quot;, color=&quot;red&quot;) + ggtitle(&quot;ERCC prop&quot;), ncol=2 ) Figure 1.1: Distribution of QC metrics in the 416B dataset. Each point represents a cell and is colored according to whether it was retained after QC filtering. Dashed lines represent thresholds for each metric. For comparison, let’s look at a different dataset with stronger biological heterogeneity (Grun et al. 2016). This dataset contains a mixture of pancreatic cell types from different donors, resulting in a more complex distribution for the metrics in Figure 1.2. We might contemplate whether the clump of discarded cells corresponds to a genuine subpopulation, though all things considered, they are probably just damaged cells and removing them is the correct choice. library(scRNAseq) sce.grun &lt;- GrunPancreasData() # This dataset doesn&#39;t include any of the mitochondrial genes, unfortunately. # But it does contain some nice ERCC spike-ins, so let&#39;s compute those proportions. 
library(scrapper) sce.qc.grun &lt;- quickRnaQc.se(sce.grun, subsets=list(), altexp.proportions=&quot;ERCC&quot;) qc.thresh.grun &lt;- metadata(sce.qc.grun)$qc$thresholds qc.thresh.grun ## $sum ## [1] 99.01313 ## ## $detected ## [1] 133.9956 ## ## $subset.proportion ## ERCC ## 0.2666693 summary(sce.qc.grun$keep) ## Mode FALSE TRUE ## logical 437 1291 library(scater) gridExtra::grid.arrange( plotColData(sce.qc.grun, y=&quot;sum&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.grun$sum, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(sce.qc.grun, y=&quot;detected&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.grun$detected, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(sce.qc.grun, y=&quot;subset.proportion.ERCC&quot;, colour_by=&quot;keep&quot;) + geom_hline(yintercept=qc.thresh.grun$subset.proportion[&quot;ERCC&quot;], linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() + ggtitle(&quot;ERCC proportion&quot;), ncol=3 ) Figure 1.2: Distribution of QC metrics in the Grun dataset. Each point represents a cell and is colored according to whether it was retained after QC filtering. Dashed lines represent thresholds for each metric. Another useful diagnostic involves comparing the proportion of mitochondrial counts against some of the other QC metrics. Libraries with both large total counts and large mitochondrial counts may represent high-quality cells that happen to be highly metabolically active (e.g., hepatocytes, muscle cells). A similar interpretation can be applied to libraries with high mitochondrial percentages and low spike-in percentages, if these are available. 
Low-quality cells with small mitochondrial percentages, large spike-in percentages and small library sizes are likely to be stripped nuclei, i.e., they have been so extensively damaged that they have lost all cytoplasmic content. For single-nuclei studies, the stripped nuclei become the libraries of interest while the undamaged cells are of low quality. We demonstrate on data from a larger experiment involving the mouse brain (Zeisel et al. 2015). Figure 1.3 shows that the mitochondrial proportion is negatively correlated with the total count and positively correlated with the spike-in proportion. This is consistent with a common underlying effect of cell damage and indicates that we are not removing metabolically active, undamaged cells. library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() is.mito.zeisel &lt;- rowData(sce.zeisel)$featureType==&quot;mito&quot; summary(is.mito.zeisel) ## Mode FALSE TRUE ## logical 19972 34 # This dataset also contains spike-ins, so we might as well use them. library(scrapper) sce.qc.zeisel &lt;- quickRnaQc.se(sce.zeisel, subsets=list(MT=is.mito.zeisel), altexp.proportions=&quot;ERCC&quot;) qc.thresh.zeisel &lt;- metadata(sce.qc.zeisel)$qc$thresholds qc.thresh.zeisel ## $sum ## [1] 1928.56 ## ## $detected ## [1] 845.7155 ## ## $subset.proportion ## MT ERCC ## 0.2022321 0.7627049 summary(sce.qc.zeisel$keep) ## Mode FALSE TRUE ## logical 139 2866 library(scater) gridExtra::grid.arrange( plotColData(sce.qc.zeisel, x=&quot;sum&quot;, y=&quot;subset.proportion.MT&quot;, colour_by=&quot;keep&quot;), plotColData(sce.qc.zeisel, x=&quot;subset.proportion.ERCC&quot;, y=&quot;subset.proportion.MT&quot;, colour_by=&quot;keep&quot;), ncol=2 ) Figure 1.3: Percentage of UMIs assigned to mitochondrial transcripts in the Zeisel brain dataset, plotted against the total number of UMIs (left) or the ERCC proportions (right). Each point represents a cell and is colored according to whether it was considered high-quality. 
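The relationships in this kind of plot can also be summarized numerically. The following sketch uses a toy data frame with invented values (not the Zeisel data) to compute rank correlations and to flag libraries that combine a high mitochondrial proportion with a large total count, which might deserve a second look before being discarded. The cut-off of 0.2 and the use of the median total count are arbitrary choices for illustration only.

```r
# Made-up metrics for six libraries; the last one mimics a metabolically
# active cell with a large total count and a high mitochondrial proportion.
toy <- data.frame(
    sum = c(20000, 15000, 12000, 3000, 2000, 25000),
    mito = c(0.05, 0.06, 0.08, 0.25, 0.30, 0.28),
    ercc = c(0.10, 0.12, 0.15, 0.60, 0.70, 0.08)
)

# Rank correlations between the mitochondrial proportion and other metrics.
cor(toy$mito, toy$sum, method = 'spearman')
cor(toy$mito, toy$ercc, method = 'spearman')

# Libraries with high mitochondrial proportions but above-median total counts.
suspect <- toy$mito > 0.2 & toy$sum > median(toy$sum)
which(suspect)
```

Any flagged libraries are not automatically high-quality, but they do not fit the usual damage pattern and are worth inspecting before they are thrown away.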
We can also check for inappropriate removal of cell types by comparing the expression profiles of the discarded and retained cells. If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. To illustrate, we’ll use the classic PBMC dataset from 10X Genomics (Zheng et al. 2017) where we perform some additional QC after cell calling. Examination of the upregulated genes in Figure 1.4 reveals PF4, PPBP and SDPR, which (spoiler alert!) indicates that there is a platelet subpopulation that was removed by our QC filter. # Loading in raw data from the 10X output files. library(DropletTestFiles) raw.path.10x &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) dir.path.10x &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path.10x, exdir=dir.path.10x) library(DropletUtils) fname.10x &lt;- file.path(dir.path.10x, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.10x &lt;- read10xCounts(fname.10x, col.names=TRUE) # Cell calling to distinguish real cells from empty droplets. Normally this # would be handled by the CellRanger pipeline, but older versions of CellRanger # would already remove interesting cells with low total counts before we could # even make any QC decisions. So for the purposes of this example, we&#39;ll handle # cell calling ourselves using the unfiltered count data. set.seed(100) ed.10x &lt;- emptyDrops(counts(sce.10x)) sce.10x &lt;- sce.10x[,which(ed.10x$FDR &lt;= 0.001)] sce.10x ## class: SingleCellExperiment ## dim: 33694 4402 ## metadata(1): Samples ## assays(1): counts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(2): ID Symbol ## colnames(4402): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): # Applying our default QC with outlier-based thresholds. 
is.mito.10x &lt;- grepl(&quot;^MT-&quot;, rowData(sce.10x)$Symbol) sce.qc.10x &lt;- quickRnaQc.se(sce.10x, subsets=list(MT=is.mito.10x)) # Summing counts for the pools of retained or discarded cells. aggregate.10x &lt;- aggregateAcrossCells.se( sce.10x, list(status=ifelse(sce.qc.10x$keep, &quot;retain&quot;, &quot;discard&quot;)) ) sums.10x &lt;- assay(aggregate.10x, &quot;sums&quot;) colnames(sums.10x) &lt;- aggregate.10x$factor.status head(sums.10x) ## discard retain ## ENSG00000243485 0 0 ## ENSG00000237613 0 0 ## ENSG00000186092 0 0 ## ENSG00000238009 0 9 ## ENSG00000239945 0 2 ## ENSG00000239906 0 0 # Computing log-fold changes between retained and discarded pools. library(edgeR) logged.10x &lt;- cpm(sums.10x, log=TRUE, prior.count=2) logFC.10x &lt;- logged.10x[,&quot;discard&quot;] - logged.10x[,&quot;retain&quot;] abundance.10x &lt;- rowMeans(logged.10x) plot(abundance.10x, logFC.10x, xlab=&quot;Average abundance&quot;, ylab=&quot;Log-FC (discarded/retained)&quot;, pch=16, cex=0.5) platelet &lt;- match(c(&quot;PF4&quot;, &quot;PPBP&quot;, &quot;SDPR&quot;), rowData(sce.10x)$Symbol) points(abundance.10x[platelet], logFC.10x[platelet], col=&quot;red&quot;, pch=16) Figure 1.4: Log-fold changes between discarded and retained cells in the PBMC dataset against the average abundance. Each point represents a gene, with platelet-related genes highlighted in red. If we suspect that cell types have been incorrectly discarded by our QC procedure, the most direct solution is to relax the QC filters. This is easily achieved for the outlier-based thresholds by increasing num.mads= in the quickRnaQc.se() call. Alternatively, we can disable filtering for particular metrics by setting the threshold to Inf or -Inf for upper and lower thresholds, respectively. We might even think about skipping the filtering altogether, as discussed in Section 1.6. # Effectively just filtering on the mitochondrial proportions. 
relaxed.thresh.10x &lt;- metadata(sce.qc.10x)$qc$thresholds relaxed.thresh.10x$sum &lt;- -Inf relaxed.thresh.10x$detected &lt;- -Inf sce.relaxed.10x &lt;- quickRnaQc.se( sce.10x, subsets=list(MT=is.mito.10x), thresholds=relaxed.thresh.10x ) summary(sce.relaxed.10x$keep) ## Mode FALSE TRUE ## logical 322 4080 1.5 Blocking on experimental factors More complex studies may involve multiple blocks of cells generated with different experimental parameters, e.g., sequencing depth. In such cases, it makes little sense to compute medians and MADs from a mixture distribution containing samples from multiple blocks. For example, if the sequencing coverage is lower in one block compared to the others, the median will be dragged down and the MAD will be inflated. This will reduce the suitability of the adaptive threshold for each block. A possibly better approach is to compute an adaptive threshold separately for each block, under the assumption that most cells in each block are of high quality. We illustrate using our 416B dataset again, which actually contains two experimental factors that we previously ignored: the microwell plate in which each cell was processed, and whether the expression of a CBFB-MYH11 oncogene was induced by doxycycline treatment. For the purposes of QC, we will consider each unique combination of these factors to be an experimental block, as both have the potential to alter the QC metrics, e.g., different sequencing coverage in each run or different RNA content after treatment. Setting block= in quickRnaQc.se() yields a separate threshold for each block (Figure 1.5), which may be more appropriate than a common threshold across all blocks. # Making a combined factor for easier reading. plate.416b &lt;- sce.416b$block # i.e., the plate of origin. 
pheno.416b &lt;- ifelse(sce.416b$phenotype == &quot;wild type phenotype&quot;, &quot;WT&quot;, &quot;induced&quot;) block.416b &lt;- paste0(pheno.416b, &quot;-&quot;, plate.416b) sce.block.416b &lt;- quickRnaQc.se( sce.416b, subsets=list(MT=is.mito.416b), altexp.proportions=&quot;ERCC&quot;, block=block.416b ) qc.thresh.block.416b &lt;- metadata(sce.block.416b)$qc$thresholds qc.thresh.block.416b ## $sum ## WT-20160113 WT-20160325 induced-20160113 induced-20160325 ## 599794.9 370316.5 461073.1 399133.7 ## ## $detected ## WT-20160113 WT-20160325 induced-20160113 induced-20160325 ## 7215.887 7586.402 5399.240 6519.740 ## ## $subset.proportion ## $subset.proportion$MT ## WT-20160113 WT-20160325 induced-20160113 induced-20160325 ## 0.1175331 0.1169890 0.1175679 0.1289473 ## ## $subset.proportion$ERCC ## WT-20160113 WT-20160325 induced-20160113 induced-20160325 ## 0.08995810 0.08105749 0.15504768 0.12718583 ## ## ## $block.ids ## [1] &quot;WT-20160113&quot; &quot;WT-20160325&quot; &quot;induced-20160113&quot; &quot;induced-20160325&quot; summary(sce.block.416b$keep) ## Mode FALSE TRUE ## logical 9 183 library(scater) sce.block.416b$combined.block &lt;- block.416b gridExtra::grid.arrange( plotColData(sce.block.416b, y=&quot;sum&quot;, x=&quot;combined.block&quot;, colour_by=&quot;keep&quot;) + categoricalHlinesNamed(qc.thresh.block.416b$sum, levels=NULL, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_x_discrete(guide = guide_axis(angle = 45)) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(sce.block.416b, y=&quot;detected&quot;, x=&quot;combined.block&quot;, colour_by=&quot;keep&quot;) + categoricalHlinesNamed(qc.thresh.block.416b$detected, levels=NULL, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_x_discrete(guide = guide_axis(angle = 45)) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(sce.block.416b, y=&quot;subset.proportion.MT&quot;, x=&quot;combined.block&quot;, colour_by=&quot;keep&quot;) + 
categoricalHlinesNamed(qc.thresh.block.416b$subset.proportion$MT, levels=NULL, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_x_discrete(guide = guide_axis(angle = 45)) + ggtitle(&quot;Mito proportion&quot;), plotColData(sce.block.416b, y=&quot;subset.proportion.ERCC&quot;, x=&quot;combined.block&quot;, colour_by=&quot;keep&quot;) + categoricalHlinesNamed(qc.thresh.block.416b$subset.proportion$ERCC, levels=NULL, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_x_discrete(guide = guide_axis(angle = 45)) + ggtitle(&quot;ERCC proportion&quot;), ncol=2 ) Figure 1.5: Distribution of QC metrics in the 416B dataset, separated according to each cell’s combination of experimental factors. Each point represents a cell and is colored according to whether it was retained after QC filtering. Dashed lines represent thresholds for each metric in each combination of factors. That said, outlier detection will not be effective if a block does not contain a majority of high-quality cells. For example, some donors in the Grun et al. (2016) human pancreas dataset have higher ERCC proportions (Figure 1.6), probably corresponding to damaged cells. This inflates the median and MAD and reduces the effectiveness of the QC filtering in those blocks. 
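The effect of a low-quality majority on a block-specific threshold can be seen in a small base R sketch. The ERCC proportions below are invented, and the threshold is computed as the median plus 3 MADs using stats::mad() with its default scaling. This is an assumption about the general approach and not a reproduction of what quickRnaQc.se() does.

```r
# Invented ERCC proportions for two blocks of ten cells each. Block 'good' has
# one damaged cell; block 'bad' has a majority of damaged cells.
toy.ercc <- c(
    0.05, 0.06, 0.04, 0.05, 0.07, 0.06, 0.05, 0.04, 0.06, 0.60,
    0.05, 0.06, 0.55, 0.60, 0.70, 0.65, 0.58, 0.62, 0.68, 0.72
)
toy.block <- rep(c('good', 'bad'), each = 10)

# Upper threshold per block: median + 3 MADs.
upper <- function(x) median(x) + 3 * mad(x)
toy.thresh <- tapply(toy.ercc, toy.block, upper)
toy.thresh

# Comparing each cell to the threshold of its own block.
toy.discard <- toy.ercc > as.vector(toy.thresh[toy.block])
table(toy.block, toy.discard)
```

In the 'bad' block, the damaged cells define the median, so none of them are flagged as outliers, while the single damaged cell in the 'good' block is removed as expected.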
sce.block.grun &lt;- quickRnaQc.se( sce.grun, subsets=list(), altexp.proportions=&quot;ERCC&quot;, block=sce.grun$donor ) qc.thresh.block.grun &lt;- metadata(sce.block.grun)$qc$thresholds qc.thresh.block.grun ## $sum ## D10 D17 D2 D3 D7 ## 1.222967 1002.354420 943.017540 3.977602 883.043512 ## ## $detected ## D10 D17 D2 D3 D7 ## 2.002534 815.811991 866.701785 5.936999 656.102224 ## ## $subset.proportion ## $subset.proportion$ERCC ## D10 D17 D2 D3 D7 ## 0.73610696 0.07599947 0.06010975 1.13105828 0.15216956 ## ## ## $block.ids ## [1] &quot;D10&quot; &quot;D17&quot; &quot;D2&quot; &quot;D3&quot; &quot;D7&quot; summary(sce.block.grun$keep) ## Mode FALSE TRUE ## logical 132 1596 plotColData(sce.block.grun, x=&quot;donor&quot;, y=&quot;subset.proportion.ERCC&quot;, colour_by=&quot;keep&quot;) + categoricalHlinesNamed(qc.thresh.block.grun$subset.proportion$ERCC, levels=NULL, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + ggtitle(&quot;ERCC prop&quot;) Figure 1.6: Distribution of the proportion of ERCC transcripts in each donor of the Grun pancreas dataset. Each point represents a cell and is colored according to whether it was considered high-quality across all metrics. Dashed lines represent donor-specific thresholds. For such problematic blocks, some manual intervention may be necessary to set an appropriate threshold. A simple solution is to just derive a threshold from the other blocks, e.g., by taking the average (Figure 1.7). This restores some semblance of QC to remove the bulk of damaged cells in the affected donors. Hopefully, those cells really are damaged and we aren’t accidentally removing a real subpopulation of small cells that are unique to those donors. 
okay.donors &lt;- c(&quot;D17&quot;, &quot;D2&quot;, &quot;D7&quot;) bad.donors &lt;- setdiff(unique(sce.grun$donor), okay.donors) qc.thresh.fixed.grun &lt;- qc.thresh.block.grun qc.thresh.fixed.grun$sum[bad.donors] &lt;- mean(qc.thresh.block.grun$sum[okay.donors]) qc.thresh.fixed.grun$detected[bad.donors] &lt;- mean(qc.thresh.block.grun$detected[okay.donors]) qc.thresh.fixed.grun$subset.proportion$ERCC[bad.donors] &lt;- mean(qc.thresh.block.grun$subset.proportion$ERCC[okay.donors]) sce.fixed.grun &lt;- quickRnaQc.se( sce.grun, subsets=list(), altexp.proportions=&quot;ERCC&quot;, block=sce.grun$donor, thresholds=qc.thresh.fixed.grun ) plotColData(sce.fixed.grun, x=&quot;donor&quot;, y=&quot;subset.proportion.ERCC&quot;, colour_by=&quot;keep&quot;) + categoricalHlinesNamed(qc.thresh.fixed.grun$subset.proportion$ERCC, levels=NULL, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + ggtitle(&quot;ERCC prop&quot;) Figure 1.7: Distribution of the proportion of ERCC transcripts in each donor of the Grun pancreas dataset. Each point represents a cell and is coloured according to whether it was considered high-quality across all metrics. Dashed lines represent donor-specific thresholds, some of which are manually set for donors with a majority of low-quality cells. 1.6 Skipping quality control If we don’t want to risk discarding real cell types, we could simply mark the low-quality cells as such and retain them in the downstream analysis. The aim here is to allow clusters of low-quality cells to form, and then to identify and ignore such clusters during interpretation of the results. This approach avoids discarding cell types that have poor values for the QC metrics, deferring the decision on whether a cluster of such cells represents a genuine biological state. So, in our 416B example, we would just continue with the unfiltered sce.416b for downstream analysis instead of using sce.qc.416b. 
The downside is that it shifts the burden of QC to the manual interpretation of the clusters, which is already a major bottleneck in scRNA-seq data analysis (Chapters 6 and 7). If we don’t trust the QC metrics, we would have to distinguish between genuine cell types and low-quality cells based only on the cluster-specific marker genes… but if we had good markers for low-quality cells, we would have already used them as QC metrics! In practice, this usually becomes a time-consuming process of elimination whereby the clusters of low-quality cells are identified because they don’t fit any other characterization. Additionally, retention of low-quality cells may compromise the accuracy of other steps in the analysis, as discussed in Section 1.1. Personally, I’d suggest removing low-quality cells by default to avoid complications. This allows most of the population structure to be characterized with fewer concerns about its validity. Once the initial analysis is done, and if there are any concerns about discarded cell types, a more thorough re-analysis can be performed where the low-quality cells are only marked. This recovers cell types with low RNA content, high mitochondrial proportions, etc. that only need to be interpreted insofar as they “fill the gaps” in the initial analysis. 
Session Info sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] edgeR_4.9.7 limma_3.67.1 ## [3] DropletUtils_1.31.1 DropletTestFiles_1.21.0 ## [5] scater_1.39.4 ggplot2_4.0.2 ## [7] scuttle_1.21.6 scrapper_1.5.17 ## [9] ensembldb_2.35.0 AnnotationFilter_1.35.0 ## [11] GenomicFeatures_1.63.2 AnnotationDbi_1.73.1 ## [13] scRNAseq_2.25.0 SingleCellExperiment_1.33.2 ## [15] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [17] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [19] IRanges_2.45.0 S4Vectors_0.49.1 ## [21] BiocGenerics_0.57.0 generics_0.1.4 ## [23] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 ## [3] magrittr_2.0.5 ggbeeswarm_0.7.3 ## [5] gypsum_1.7.0 farver_2.1.2 ## [7] rmarkdown_2.31 BiocIO_1.21.0 ## [9] vctrs_0.7.3 DelayedMatrixStats_1.33.0 ## [11] memoise_2.0.1 Rsamtools_2.27.2 ## [13] RCurl_1.98-1.18 htmltools_0.5.9 ## [15] S4Arrays_1.11.1 AnnotationHub_4.1.0 ## [17] curl_7.0.0 BiocNeighbors_2.5.4 ## [19] Rhdf5lib_1.33.6 SparseArray_1.11.13 ## [21] rhdf5_2.55.16 sass_0.4.10 ## [23] alabaster.base_1.11.4 bslib_0.10.0 ## [25] alabaster.sce_1.11.0 httr2_1.2.2 ## [27] cachem_1.1.0 GenomicAlignments_1.47.0 ## [29] lifecycle_1.0.5 pkgconfig_2.0.3 ## [31] rsvd_1.0.5 
Matrix_1.7-5 ## [33] R6_2.6.1 fastmap_1.2.0 ## [35] digest_0.6.39 dqrng_0.4.1 ## [37] irlba_2.3.7 ExperimentHub_3.1.0 ## [39] RSQLite_2.4.6 beachmat_2.27.5 ## [41] labeling_0.4.3 filelock_1.0.3 ## [43] httr_1.4.8 abind_1.4-8 ## [45] compiler_4.6.0 bit64_4.6.0-1 ## [47] withr_3.0.2 S7_0.2.1 ## [49] BiocParallel_1.45.0 viridis_0.6.5 ## [51] DBI_1.3.0 R.utils_2.13.0 ## [53] HDF5Array_1.39.1 alabaster.ranges_1.11.0 ## [55] alabaster.schemas_1.11.0 rappdirs_0.3.4 ## [57] DelayedArray_0.37.1 rjson_0.2.23 ## [59] tools_4.6.0 vipor_0.4.7 ## [61] otel_0.2.0 beeswarm_0.4.0 ## [63] R.oo_1.27.1 glue_1.8.0 ## [65] h5mread_1.3.3 restfulr_0.0.16 ## [67] rhdf5filters_1.23.3 grid_4.6.0 ## [69] gtable_0.3.6 R.methodsS3_1.8.2 ## [71] BiocSingular_1.27.1 ScaledMatrix_1.19.0 ## [73] XVector_0.51.0 ggrepel_0.9.8 ## [75] BiocVersion_3.23.1 pillar_1.11.1 ## [77] dplyr_1.2.1 BiocFileCache_3.1.0 ## [79] lattice_0.22-9 rtracklayer_1.71.3 ## [81] bit_4.6.0 tidyselect_1.2.1 ## [83] locfit_1.5-9.12 Biostrings_2.79.5 ## [85] knitr_1.51 gridExtra_2.3 ## [87] bookdown_0.46 ProtGenerics_1.43.0 ## [89] xfun_0.57 statmod_1.5.1 ## [91] UCSC.utils_1.7.1 lazyeval_0.2.3 ## [93] yaml_2.3.12 evaluate_1.0.5 ## [95] codetools_0.2-20 cigarillo_1.1.0 ## [97] tibble_3.3.1 alabaster.matrix_1.11.0 ## [99] BiocManager_1.30.27 cli_3.6.6 ## [101] jquerylib_0.1.4 dichromat_2.0-0.1 ## [103] Rcpp_1.1.1 GenomeInfoDb_1.47.2 ## [105] dbplyr_2.5.2 png_0.1-9 ## [107] XML_3.99-0.23 parallel_4.6.0 ## [109] blob_1.3.0 sparseMatrixStats_1.23.0 ## [111] bitops_1.0-9 viridisLite_0.4.3 ## [113] alabaster.se_1.11.0 scales_1.4.0 ## [115] purrr_1.2.2 crayon_1.5.3 ## [117] rlang_1.2.0 cowplot_1.2.0 ## [119] KEGGREST_1.51.1 References "],["normalization.html", "Chapter 2 Normalization 2.1 Motivation 2.2 Library size factors 2.3 Normalized expression values 2.4 Blocking on experimental batches 2.5 Normalization by spike-ins Session information", " Chapter 2 Normalization 2.1 Motivation Large, systematic differences in sequencing 
coverage between cells are often present in single-cell RNA sequencing datasets (Stegle, Teichmann, and Marioni 2015). These are typically caused by variation in the efficiency of various experimental steps (e.g., cDNA capture, PCR amplification) across libraries. Normalization removes these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population is driven by biology and not technical biases. Here, we’ll focus on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). Our assumption is that any cell-specific bias (caused by a change in efficiency of, e.g., library preparation or sequencing) affects all genes equally by scaling the expected count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. 2.2 Library size factors The simplest definition of a size factor is based on the library size for each cell, i.e., the total sum of counts across all genes. Larger library sizes are attributed to technical differences in library preparation or sequencing that should be equalized across cells. Alternatively, a cell may have a larger library size because it contains more total RNA; this is also treated as an uninteresting biological effect that should be normalized away. To demonstrate, let’s fetch our old friend, the Zeisel et al. 
(2015) dataset of the mouse brain: library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() is.mito.zeisel &lt;- rowData(sce.zeisel)$featureType==&quot;mito&quot; # Performing some QC to set up the dataset prior to normalization. library(scrapper) sce.qc.zeisel &lt;- quickRnaQc.se( sce.zeisel, subsets=list(MT=is.mito.zeisel), altexp.proportions=&quot;ERCC&quot; ) sce.qc.zeisel &lt;- sce.qc.zeisel[,sce.qc.zeisel$keep] sce.qc.zeisel ## class: SingleCellExperiment ## dim: 20006 2866 ## metadata(1): qc ## assays(1): counts ## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l ## rowData names(1): featureType ## colnames(2866): 1772071015_C02 1772071017_G12 ... 1772063068_D01 ## 1772066098_A12 ## colData names(14): tissue group # ... subset.proportion.ERCC keep ## reducedDimNames(0): ## mainExpName: gene ## altExpNames(2): repeat ERCC We use the normalizeRnaCounts.se() function to derive library size factors from the column sums of the count matrix. This involves centering the library sizes so that the mean size factor across all cells is equal to 1. Centering ensures that the normalized expression values are on the same scale as the original counts, which is useful for interpretation. We see that the library size factors differ by up to 10-fold across cells (Figure 2.1), which is typical of the variability in coverage in scRNA-seq data. # As it happens, we get the library sizes for free from the QC metrics, but if # we didn&#39;t, we could set size.factors=NULL and the function will compute it for us. sce.norm.zeisel &lt;- normalizeRnaCounts.se(sce.qc.zeisel, size.factors=sce.qc.zeisel$sum) summary(sce.norm.zeisel$sizeFactor) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1676 0.5584 0.8710 1.0000 1.2819 4.1354 hist(log10(sce.norm.zeisel$sizeFactor), xlab=&quot;Log10[Size factor]&quot;, col=&#39;grey80&#39;) Figure 2.1: Distribution of size factors derived from the library size in the Zeisel brain dataset. 
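As a rough illustration of what centering does, the library size factors can be reproduced by hand on a toy matrix. The toy.counts values below are invented for illustration, and the calculation is only a sketch of the idea described above, not necessarily how normalizeRnaCounts.se() works internally:

```r
# Toy count matrix (hypothetical values): 3 genes in rows, 4 cells in columns.
toy.counts <- matrix(
    c(1, 2, 7,
      5, 5, 10,
      10, 10, 20,
      2, 3, 5),
    nrow=3
)

# Library size = total count across all genes for each cell.
toy.lib.sizes <- colSums(toy.counts)

# Centering: divide by the mean so that the average size factor is 1.
toy.size.factors <- toy.lib.sizes / mean(toy.lib.sizes)
toy.size.factors
mean(toy.size.factors)
```

Because the size factors average to 1, dividing by them leaves the typical magnitude of the counts unchanged, which is what keeps the normalized values on a scale comparable to the input.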
Strictly speaking, the use of library size factors assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells. That is, any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes. This avoids composition biases where upregulation of some genes reduces the sequencing resources available for other genes. If balanced DE is not present, more sophisticated methods for computing size factors may be required (Robinson and Oshlack 2010; Anders and Huber 2010; Lun, Bach, and Marioni 2016). In practice, library size factors perform well for exploratory scRNA-seq data analyses despite their theoretical inaccuracy: Gene expression profiles often contain constitutively-expressed genes with large counts, e.g., histones, ribosomal proteins. These stabilize the library sizes in the presence of imbalanced DE. Composition biases do not usually affect the partitioning of cells into clusters (Chapter 6). If the DE was strong enough to introduce a significant composition bias, it is also strong enough to separate the cell subpopulations during clustering. Experience from bulk RNA-seq suggests that composition biases are actually quite small, typically introducing a spurious 10-20% difference in expression. This is negligible compared to the order-of-magnitude fold-changes for the marker genes (Chapter 7). 2.3 Normalized expression values 2.3.1 Scaling and log-transforming The normalizeRnaCounts.se() function uses the size factors to compute normalized expression values from the count matrix. This is done by dividing all counts for each cell with its corresponding size factor and then applying a log-transformation. These (log-transformed) normalized expression values will be the basis of our downstream analyses in the following chapters. 
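The scale-then-log calculation can be sketched by hand in a few lines. The toy counts and size factors below are made up, and the choices of base 2 and a pseudo-count of 1 are assumptions for illustration rather than a statement of the function's actual defaults:

```r
# Toy counts (hypothetical): 3 genes in rows, 4 cells in columns.
toy.counts <- matrix(
    c(1, 2, 7,
      5, 5, 10,
      10, 10, 20,
      2, 3, 5),
    nrow=3
)

# Pre-centered size factors for the 4 cells (mean of 1).
toy.sf <- c(0.5, 1, 2, 0.5)

# Divide each column (cell) by its size factor...
toy.normalized <- sweep(toy.counts, 2, toy.sf, '/')

# ... then add a pseudo-count and log-transform.
toy.pseudo <- 1
toy.logged <- log2(toy.normalized + toy.pseudo)
toy.logged

# Differences between log-values are (shrunken) log-fold changes.
# The same 2-fold change shrinks more at low abundance than at high abundance.
lfc.low <- log2(2 + toy.pseudo) - log2(1 + toy.pseudo)
lfc.high <- log2(20 + toy.pseudo) - log2(10 + toy.pseudo)
c(low=lfc.low, high=lfc.high)
```

After scaling, every toy cell has the same total normalized count, which is exactly the equalization that library size normalization is meant to achieve.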
logcounts(sce.norm.zeisel) ## &lt;20006 x 2866&gt; sparse LogNormalizedMatrix object of type &quot;double&quot;: ## 1772071015_C02 1772071017_G12 ... 1772063068_D01 1772066098_A12 ## Tspan12 0.0000000 0.0000000 . 0 0 ## Tshz1 1.6139535 0.7411389 . 0 0 ## Fnbp1l 1.6139535 0.7411389 . 0 0 ## Adamts15 0.0000000 0.0000000 . 0 0 ## Cldn12 0.7544300 0.7411389 . 0 0 ## ... . . . . . ## mt-Co2 5.512629 6.047342 . 7.930569 7.176952 ## mt-Co1 5.714860 6.239091 . 8.339569 6.645379 ## mt-Rnr2 6.775320 7.780752 . 9.215748 7.486956 ## mt-Rnr1 4.478707 5.909090 . 7.274070 5.917989 ## mt-Nd4l 3.311873 3.378640 . 5.816615 3.987852 The log-transformation is useful as the difference between log-values (i.e., from subtraction) is an estimate of the log-fold change in normalized expression. When comparing expression values, log-fold changes are usually preferable to absolute differences in expression. To illustrate, which one is more interesting - a gene that is expressed at an average count of 50 in cell type \\(A\\) and 10 in cell type \\(B\\), or a gene that is expressed at an average count of 1100 in \\(A\\) and 1000 in \\(B\\)? Hopefully, we can all agree that the former is more interesting for explaining differences between \\(A\\) and \\(B\\). The interpretation of the difference in log-values is important in downstream procedures based on distances between cells, which includes many clustering and dimensionality reduction algorithms. By operating on log-transformed data, we ensure that the distances reflect log-fold changes in expression, e.g., the squared Euclidean distance between two cells is the sum of squared log-fold changes across all genes. During the log-transformation, normalizeRnaCounts.se() will add a pseudo-count to avoid undefined values at zero counts. This addition has the side effect of shrinking log-fold changes between cells towards zero for low-abundance genes. 
Such shrinkage is desirable as the log-differences are not reliable when the counts are low due to random sampling noise. So, if we were to use a pseudo-count of 1, the log-fold change between normalized expression values of 1 and 2 would be \\(\\log_2((2 + 1)/(1 + 1)) \\approx 0.58\\), while the log-fold change between normalized expression values of 10 and 20 would be \\(\\log_2((20 + 1)/(10 + 1)) \\approx 0.93\\), i.e., larger counts will contribute more to the distances between cells. A pseudo-count of 1 is common for the pragmatic reason that it preserves sparsity, i.e., zeroes in the input remain zeroes after transformation. However, larger values can also be used for greater shrinkage if low-abundance genes are problematic (Lun 2018). 2.3.2 Why center the size factors? As previously mentioned, centering ensures that the normalized expression values are on roughly the same scale as the original counts for easier interpretation. For example, Figure 2.2 shows that interneurons have a median Snap25 log-expression from 5 - 6; this roughly translates to an original count of 30 - 60 UMIs in each cell, which gives us some confidence that it is actually expressed. This relationship to the original data would be less obvious if the centering were not performed. library(scater) plotExpression(sce.norm.zeisel, x=&quot;level1class&quot;, features=&quot;Snap25&quot;, colour=&quot;level1class&quot;) + scale_x_discrete(guide = guide_axis(angle = 45)) Figure 2.2: Distribution of log-expression values for Snap25 in each cell type of the Zeisel brain dataset. Centering also ensures that the effect of the pseudo-count decreases with greater sequencing coverage across the dataset. Because we preserve the scale of the input data, the normalized expression values will increase with deeper coverage while the pseudo-count remains the same. This reduces the shrinkage effect and improves the accuracy of the log-fold changes between cells. 
Which is important, because otherwise, why would we invest in deeper sequencing if our analysis won’t take advantage of it? For comparison, consider the situation where we applied a constant pseudo-count to some counts-per-million (CPM)-like measure. The accuracy of the subsequent log-fold changes would never improve regardless of how much additional sequencing was performed; scaling to a constant library size of a million means that the pseudo-count will have the same shrinkage effect for all datasets. The same criticism applies to popular metrics like the “counts-per-10K” used in, e.g., Seurat. Personally, we rarely use CPM-like measures in scRNA-seq analyses; they are occasionally convenient for rough comparisons between datasets that were processed separately, but in such cases, normalization is the least of our problems (Chapter 8). 2.3.3 Comments on other transformations We might see a variety of other transformations in the wild: The square root, which is the variance stabilizing transformation for Poisson-distributed counts. This is motivated by the observation that sequencing noise is typically Poisson-distributed (Marioni et al. 2008). In practice, this transformation gives too much weight to small differences at high-abundance genes. The inverse hyperbolic sine (arcsinh), which is very similar to the log-transformation on non-negative values. This is commonly used in flow cytometry as it handles negative values after compensation. The main practical difference for scRNA-seq is a larger initial jump from zero to non-zero values. Variance-stabilizing transformations such as DESeq2::vst() or sctransform, which aim to remove the mean-variance dependency in (sc)RNA-seq count data. This ensures that genes of varying abundance contribute equally to downstream analyses. However, stabilization is challenging in heterogeneous datasets where biological variation interferes with the distributional assumptions of these methods. 
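To get a feel for the first two transformations in the list above, we can apply them to a handful of made-up values using base R only. This is a toy comparison on the natural log scale with a pseudo-count of 1, both of which are assumptions chosen for convenience; it says nothing about how any particular package implements these transformations:

```r
# Hypothetical normalized expression values spanning low to high abundance.
toy.values <- c(0, 1, 10, 100, 1000)

toy.transformed <- data.frame(
    value=toy.values,
    sqrt=sqrt(toy.values),    # square root
    asinh=asinh(toy.values),  # inverse hyperbolic sine
    log1p=log1p(toy.values)   # natural log with a pseudo-count of 1
)
toy.transformed

# The first step from 0 to 1 is larger for arcsinh than for the log...
jump.asinh <- asinh(1) - asinh(0)
jump.log <- log1p(1) - log1p(0)

# ... but at high values, arcsinh is close to log(2x), i.e., a log-transform
# with a constant offset.
gap.high <- asinh(1000) - log(2 * 1000)

# The same 10% difference is a much bigger step on the square root scale at
# high abundance than at low abundance.
step.sqrt.low <- sqrt(11) - sqrt(10)
step.sqrt.high <- sqrt(1100) - sqrt(1000)
```

The last two values show why the square root is said to give too much weight to small differences at high-abundance genes: the step grows with abundance even though the fold-change is the same.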
In practice, the log-transformation is a good default choice due to its simplicity and interpretability, and is what we will be using for all downstream analyses. 2.4 Blocking on experimental batches Sometimes, our dataset consists of multiple batches of cells that were generated at different sequencing depths. Library size normalization is still applicable but some care is required if the coverage differs dramatically between batches. Specifically, we want to scale down all batches to match the coverage of the lowest-coverage batch. This sacrifices some of the information in high-coverage batches to make them more comparable to the low-coverage batches. By comparison, scaling up the low-coverage batches just amplifies noise as they don’t have the available information to match the high-coverage batches. Differences in coverage between batches are most obvious when attempting to analyze datasets involving both read and UMI counts. For two brain datasets generated with different technologies (Zeisel et al. 2015; Tasic et al. 2016), library sizes differ by several orders of magnitude (Figure 2.3). If we scaled up the Zeisel counts by ~100-fold, we would overstate the reliability of low-abundance genes by reducing the pseudo-count shrinkage (Section 2.3.1). This inflates the sampling noise and masks biological signal from higher-abundance genes. library(scRNAseq) sce.tasic &lt;- TasicBrainData() common.brain &lt;- intersect(rownames(sce.zeisel), rownames(sce.tasic)) sce.brain &lt;- combineCols(sce.zeisel[common.brain,], sce.tasic[common.brain,]) sce.brain$study &lt;- rep(c(&quot;zeisel&quot;, &quot;tasic&quot;), c(ncol(sce.zeisel), ncol(sce.tasic))) # Batch-specific filters, as discussed in the QC chapter. # Tasic doesn&#39;t have mitochondrial genes so we&#39;ll skip that. 
library(scrapper) sce.qc.brain &lt;- quickRnaQc.se(sce.brain, subsets=list(), block=sce.brain$study) sce.qc.brain &lt;- sce.qc.brain[,sce.qc.brain$keep] library(scater) plotColData(sce.qc.brain, x=&#39;study&#39;, y=&#39;sum&#39;) + scale_y_log10() Figure 2.3: Distribution of library sizes for the Tasic and Zeisel brain datasets. Each point represents a cell, separated by the study of origin. A more conservative approach is to scale down the Tasic counts to match the coverage of the Zeisel data. We specify the study of origin for each cell in block=, which adjusts the size factors to scale all cells down to the lowest-coverage batch (Figure 2.4). This avoids amplification of sampling noise at the cost of greater shrinkage when counts are scaled down relative to the pseudo-count (Section 2.3.2). In effect, we are forcibly reducing the contribution of low-abundance genes in the Tasic data to be consistent with their behavior in the Zeisel data. Hopefully, there’s still enough signal among the higher-abundance genes to find something interesting in later analysis steps. sce.norm.brain &lt;- normalizeRnaCounts.se( sce.qc.brain, size.factors=sce.qc.brain$sum, block=sce.qc.brain$study ) plotColData(sce.norm.brain, x=&#39;study&#39;, y=&#39;sizeFactor&#39;) + geom_hline(yintercept=1, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() Figure 2.4: Distribution of block-centered size factors for the Tasic and Zeisel brain datasets. Each point represents a cell, separated by the study of origin. Note that our batch-aware normalization is not the same as batch correction. The former will only remove scaling biases between cells whereas the latter considers more potential axes of uninteresting variation between batches. For example, differences in processing of one batch may result in more or less activity in certain pathways that cannot be removed with per-cell scaling factors. 
Normalization with block= is still helpful as it eliminates one of the differences between batches, but it usually needs to be followed by more sophisticated methods (Chapter 8) to avoid batch effects in later analysis steps. 2.5 Normalization by spike-ins Occasionally, we might come across an scRNA-seq dataset with spike-in counts, which gives us the opportunity to perform spike-in normalization (Lun et al. 2017). This assumes that (i) the same amount of spike-in RNA was added to each cell and (ii) the spike-in transcripts respond to experimental biases in the same relative manner as endogenous genes. Thus, any systematic difference in the coverage of the spike-in transcripts can be attributed to cell-specific biases, e.g., in capture efficiency or sequencing depth. To remove these biases, we equalize spike-in coverage across cells by scaling with “spike-in size factors”. Let’s demonstrate on an scRNA-seq dataset of T cells after stimulation with T cell receptor ligands of varying affinity (Richard et al. 2018). Specifically, we compute the total count across all spike-in transcripts in each cell and center them to a mean of 1 across all cells (see Section 2.3.2). These size factors can then be plugged into normalizeRnaCounts.se() to obtain normalized expression values for endogenous genes. library(scRNAseq) sce.richard &lt;- RichardTCellData() # For brevity, we&#39;ll just re-use the authors&#39; QC calls. sce.qc.richard &lt;- sce.richard[,sce.richard$`single cell quality`==&quot;OK&quot;] sce.qc.richard ## class: SingleCellExperiment ## dim: 46603 528 ## metadata(0): ## assays(1): counts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ... ## SLX-12612.i712_i522. SLX-12612.i714_i522. ## colData names(13): age individual ... 
stimulus time ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC # Pulling the spike-in counts out of the alternative experiment. sce.ercc.richard &lt;- altExp(sce.qc.richard, &quot;ERCC&quot;) sce.ercc.richard ## class: SingleCellExperiment ## dim: 92 528 ## metadata(0): ## assays(1): counts ## rownames(92): ERCC-00002 ERCC-00003 ... ERCC-00170 ERCC-00171 ## rowData names(3): subgroup concentration molecules ## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ... ## SLX-12612.i712_i522. SLX-12612.i714_i522. ## colData names(0): ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): library(scrapper) ercc.sums.richard &lt;- colSums(counts(sce.ercc.richard)) spike.factor.richard &lt;- centerSizeFactors(ercc.sums.richard) summary(spike.factor.richard) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1247 0.4282 0.6274 1.0000 1.0699 23.3161 hist(log10(spike.factor.richard), xlab=&quot;Log10[Size factor]&quot;, col=&#39;grey80&#39;) Figure 2.5: Distribution of size factors derived from the spike-in counts in the Richard T cell dataset. Practically, spike-in normalization should be used if differences in the total RNA content between cells are of interest and must be preserved in downstream analyses. For a given cell, an increase in its overall amount of endogenous RNA will not increase its spike-in size factor. This ensures that the effects of total RNA content on expression across the population will not be removed upon scaling. By comparison, library size normalization will consider any change in total RNA content as part of the bias and remove it. The differences between these two normalization strategies are illustrated in Figure 2.6. The spike-in size factors and library size factors are positively correlated within each treatment condition, indicating that they are capturing similar technical biases in sequencing depth and capture efficiency. 
However, increasing stimulation of the T cell receptor (in terms of increasing affinity or time) results in a decrease in the spike-in factors relative to the library size factors. This is consistent with an increase in total RNA content during stimulation, which increases the coverage of endogenous genes at the expense of the spike-in transcripts. rna.sums.richard &lt;- colSums(counts(sce.qc.richard)) lib.factor.richard &lt;- centerSizeFactors(rna.sums.richard) to.plot &lt;- data.frame( LibFactor=lib.factor.richard, SpikeFactor=spike.factor.richard, Stimulus=sce.qc.richard$stimulus, Time=sce.qc.richard$time ) library(ggplot2) ggplot(to.plot, aes(x=LibFactor, y=SpikeFactor, color=Time)) + geom_point() + facet_wrap(~Stimulus) + scale_x_log10() + scale_y_log10() + geom_abline(intercept=0, slope=1, color=&quot;red&quot;) Figure 2.6: Size factors from spike-in normalization, plotted against the library size factors for all cells in the T cell dataset. Each plot represents a different ligand treatment and each point is a cell coloured according to time from stimulation. These differences have real consequences for downstream interpretation. If the spike-in size factors were applied to the counts, the expression values in unstimulated cells would be scaled up while expression in stimulated cells would be scaled down. However, the opposite would occur if the library size factors were used. This can manifest as shifts in the magnitude and direction of DE between conditions when we switch between normalization strategies, as shown below for our most beloved gene Malat1 (Figure 2.7). # Setting center=FALSE because we already centered the size factors. 
sce.lib.richard &lt;- normalizeRnaCounts.se(sce.qc.richard, size.factors=lib.factor.richard, center=FALSE) sce.spike.richard &lt;- normalizeRnaCounts.se(sce.qc.richard, size.factors=spike.factor.richard, center=FALSE) library(scater) gridExtra::grid.arrange( plotExpression(sce.lib.richard, x=&quot;stimulus&quot;, colour_by=&quot;time&quot;, features=&quot;ENSMUSG00000092341&quot;) + scale_x_discrete(guide = guide_axis(angle = 70)) + ggtitle(&quot;After library size normalization&quot;), plotExpression(sce.spike.richard, x=&quot;stimulus&quot;, colour_by=&quot;time&quot;, features=&quot;ENSMUSG00000092341&quot;) + scale_x_discrete(guide = guide_axis(angle = 70)) + ggtitle(&quot;After spike-in normalization&quot;), ncol=2 ) Figure 2.7: Distribution of log-normalized expression values for Malat1 after normalization with the library size factors (left) or spike-in size factors (right). Cells are stratified by the ligand affinity and colored by the time after stimulation. Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In most cases, changes in total RNA content are not interesting and can be normalized out with the library size factors. However, this may not always be appropriate if differences in total RNA are associated with a biological process of interest, e.g., cell cycle activity or T cell activation. Spike-in normalization will preserve these differences such that any changes in expression between biological groups have the correct sign. 
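A small numerical sketch may help to show how the sign of a log-fold change can flip between the two strategies. All numbers below are invented: we imagine one unstimulated and one stimulated cell with identical technical efficiency (hence equal spike-in totals), where the stimulated cell has twice the total endogenous RNA. The base-2 log and pseudo-count of 1 are likewise assumptions for illustration:

```r
# Hypothetical totals for two cells with the same capture efficiency.
endo.total <- c(unstim=1000, stim=2000)   # total endogenous counts
spike.total <- c(unstim=500, stim=500)    # total spike-in counts
gene.count <- c(unstim=10, stim=15)       # counts for one gene of interest

# Centered size factors under each strategy.
lib.sf <- endo.total / mean(endo.total)
spike.sf <- spike.total / mean(spike.total)

# Log-normalized expression of the gene under each strategy.
log.lib <- log2(gene.count / lib.sf + 1)
log.spike <- log2(gene.count / spike.sf + 1)

# Log-fold change of stimulated over unstimulated.
lfc.lib <- unname(log.lib['stim'] - log.lib['unstim'])
lfc.spike <- unname(log.spike['stim'] - log.spike['unstim'])
c(library=lfc.lib, spike=lfc.spike)
```

In this toy setup, the gene has more molecules in the stimulated cell, which spike-in normalization reports as upregulation. Library size normalization instead reports downregulation because the gene increased less than the total RNA content did.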
Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] ensembldb_2.35.0 AnnotationFilter_1.35.0 ## [3] GenomicFeatures_1.63.2 AnnotationDbi_1.73.1 ## [5] scater_1.39.4 ggplot2_4.0.2 ## [7] scuttle_1.21.6 scrapper_1.5.17 ## [9] scRNAseq_2.25.0 SingleCellExperiment_1.33.2 ## [11] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [13] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [15] IRanges_2.45.0 S4Vectors_0.49.1 ## [17] BiocGenerics_0.57.0 generics_0.1.4 ## [19] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [21] BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.5 ## [4] ggbeeswarm_0.7.3 gypsum_1.7.0 farver_2.1.2 ## [7] rmarkdown_2.31 BiocIO_1.21.0 vctrs_0.7.3 ## [10] memoise_2.0.1 Rsamtools_2.27.2 RCurl_1.98-1.18 ## [13] htmltools_0.5.9 S4Arrays_1.11.1 AnnotationHub_4.1.0 ## [16] curl_7.0.0 BiocNeighbors_2.5.4 Rhdf5lib_1.33.6 ## [19] SparseArray_1.11.13 rhdf5_2.55.16 sass_0.4.10 ## [22] alabaster.base_1.11.4 bslib_0.10.0 alabaster.sce_1.11.0 ## [25] httr2_1.2.2 cachem_1.1.0 GenomicAlignments_1.47.0 ## [28] lifecycle_1.0.5 pkgconfig_2.0.3 rsvd_1.0.5 ## [31] Matrix_1.7-5 R6_2.6.1 fastmap_1.2.0 ## [34] digest_0.6.39 irlba_2.3.7 ExperimentHub_3.1.0 ## [37] RSQLite_2.4.6 
beachmat_2.27.5 labeling_0.4.3 ## [40] filelock_1.0.3 httr_1.4.8 abind_1.4-8 ## [43] compiler_4.6.0 bit64_4.6.0-1 withr_3.0.2 ## [46] S7_0.2.1 BiocParallel_1.45.0 viridis_0.6.5 ## [49] DBI_1.3.0 HDF5Array_1.39.1 alabaster.ranges_1.11.0 ## [52] alabaster.schemas_1.11.0 rappdirs_0.3.4 DelayedArray_0.37.1 ## [55] rjson_0.2.23 tools_4.6.0 vipor_0.4.7 ## [58] otel_0.2.0 beeswarm_0.4.0 glue_1.8.0 ## [61] h5mread_1.3.3 restfulr_0.0.16 rhdf5filters_1.23.3 ## [64] grid_4.6.0 gtable_0.3.6 BiocSingular_1.27.1 ## [67] ScaledMatrix_1.19.0 XVector_0.51.0 ggrepel_0.9.8 ## [70] BiocVersion_3.23.1 pillar_1.11.1 dplyr_1.2.1 ## [73] BiocFileCache_3.1.0 lattice_0.22-9 rtracklayer_1.71.3 ## [76] bit_4.6.0 tidyselect_1.2.1 Biostrings_2.79.5 ## [79] knitr_1.51 gridExtra_2.3 bookdown_0.46 ## [82] ProtGenerics_1.43.0 xfun_0.57 UCSC.utils_1.7.1 ## [85] lazyeval_0.2.3 yaml_2.3.12 evaluate_1.0.5 ## [88] codetools_0.2-20 cigarillo_1.1.0 tibble_3.3.1 ## [91] alabaster.matrix_1.11.0 BiocManager_1.30.27 cli_3.6.6 ## [94] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.1.1 ## [97] GenomeInfoDb_1.47.2 dbplyr_2.5.2 png_0.1-9 ## [100] XML_3.99-0.23 parallel_4.6.0 blob_1.3.0 ## [103] bitops_1.0-9 viridisLite_0.4.3 alabaster.se_1.11.0 ## [106] scales_1.4.0 purrr_1.2.2 crayon_1.5.3 ## [109] rlang_1.2.0 cowplot_1.2.0 KEGGREST_1.51.1 References "],["feature-selection.html", "Chapter 3 Feature selection 3.1 Motivation 3.2 Selecting highly variable genes 3.3 Blocking on uninteresting factors 3.4 Refining the trend fit 3.5 Selecting a priori genes of interest 3.6 Quantifying technical noise Session information", " Chapter 3 Feature selection 3.1 Motivation We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across the cell population. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. 
The choice of genes in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. In addition to improving signal, this reduces the size of the data to improve computational efficiency of later steps. 3.2 Selecting highly variable genes 3.2.1 Modelling the mean-variance trend Our aim is to select the most highly variable genes (HVGs) based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). To demonstrate, we’ll load the classic PBMC dataset from 10X Genomics (Zheng et al. 2017): # Loading in raw data from the 10X output files. library(DropletTestFiles) raw.path.10x &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/filtered.tar.gz&quot;) dir.path.10x &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path.10x, exdir=dir.path.10x) library(DropletUtils) fname.10x &lt;- file.path(dir.path.10x, &quot;filtered_gene_bc_matrices/GRCh38&quot;) sce.10x &lt;- read10xCounts(fname.10x, col.names=TRUE) # Applying our default QC with outlier-based thresholds. library(scrapper) is.mito.10x &lt;- grepl(&quot;^MT-&quot;, rowData(sce.10x)$Symbol) sce.qc.10x &lt;- quickRnaQc.se(sce.10x, subsets=list(MT=is.mito.10x)) sce.qc.10x &lt;- sce.qc.10x[,sce.qc.10x$keep] # Computing log-normalized expression values. sce.norm.10x &lt;- normalizeRnaCounts.se(sce.qc.10x, size.factors=sce.qc.10x$sum) sce.norm.10x ## class: SingleCellExperiment ## dim: 33694 4147 ## metadata(2): Samples qc ## assays(2): counts logcounts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... 
ENSG00000277475 ## ENSG00000268674 ## rowData names(2): ID Symbol ## colnames(4147): AAACCTGAGACAGACC-1 AAACCTGAGCGCCTCA-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(7): Sample Barcode ... keep sizeFactor ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): We compute the variance of the log-normalized expression values for each gene across all cells (Lun, McCarthy, and Marioni 2016). We then fit a trend to the variances with respect to the mean (Figure 3.1). HVGs are identified as the top \\(H\\) genes with the largest residuals above the trend, where \\(H\\) is typically between 1000 and 5000. We use the variances of the log-transformed values to ensure that the feature selection is based on the same matrix values that are used in downstream steps. Genes with the largest variances will contribute most to the distances between cells during procedures like clustering and dimensionality reduction. sce.var.10x &lt;- chooseRnaHvgs.se(sce.norm.10x) # Let&#39;s have a peek at the statistics for the top HVGs. rd.10x &lt;- rowData(sce.var.10x) ordered.residual.10x &lt;- order(rd.10x$residuals, decreasing=TRUE) rd.10x[head(ordered.residual.10x),c(&quot;Symbol&quot;, &quot;means&quot;, &quot;variances&quot;, &quot;fitted&quot;, &quot;residuals&quot;, &quot;hvg&quot;)] ## DataFrame with 6 rows and 6 columns ## Symbol means variances fitted residuals hvg ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;logical&gt; ## ENSG00000090382 LYZ 1.86235 4.92849 0.700584 4.22790 TRUE ## ENSG00000163220 S100A9 1.82739 4.34678 0.703548 3.64323 TRUE ## ENSG00000143546 S100A8 1.60171 4.18943 0.714048 3.47538 TRUE ## ENSG00000204287 HLA-DRA 2.08390 3.74121 0.676334 3.06487 TRUE ## ENSG00000019582 CD74 2.89021 3.41935 0.590209 2.82914 TRUE ## ENSG00000101439 CST3 1.40312 2.89004 0.714812 2.17523 TRUE # Look at the number of top HVGs. 
is.hvg.10x &lt;- rowData(sce.var.10x)$hvg sum(is.hvg.10x) ## [1] 4000 plot(rd.10x$means, rd.10x$variances, col=ifelse(is.hvg.10x, &quot;black&quot;, &quot;grey&quot;), pch=16, cex=1, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) legend(&quot;topright&quot;, col=c(&quot;black&quot;, &quot;grey&quot;), pch=16, legend=c(&quot;HVG&quot;, &quot;not HVG&quot;)) # Just using approxfun() to make a nice-looking curve for us. trend.10x &lt;- approxfun(rd.10x$means, rd.10x$fitted) curve(trend.10x, add=TRUE, col=&quot;dodgerblue&quot;, lwd=2) Figure 3.1: Variance of the log-normalized expression values across all genes in the PBMC data set, as a function of the mean. Each point represents a gene, colored according to whether it was chosen as a HVG. The blue line represents the trend fitted to all genes. We use residuals to select HVGs to account for the mean-variance relationship in scRNA-seq data. Our assumption is that, at any given mean, the variation in expression for most genes is driven by uninteresting processes like sampling noise. The fitted value of the trend at any given gene’s mean represents a mean-dependent estimate of its uninteresting variation, while the residuals represent the “interesting” variation for each gene and can be used as the metric for HVG selection. By comparison, if we just used the total variance without any trend, the choice of HVGs would be driven more by the gene’s abundance than its biological heterogeneity. (In other words, the log-transformation is not a variance-stabilizing transformation.) This would cause us to neglect lower-abundance genes that exhibit increased variation. Once we have our top HVGs, we can use them in downstream steps like principal components analysis.
We’ll discuss this more in Chapter 4, but it is as simple as using only the subset of HVGs in the analysis: sce.pcs.10x &lt;- runPca.se(sce.var.10x, features=is.hvg.10x) 3.2.2 Choosing the number of HVGs How many HVGs should we use in our downstream analyses, i.e., what is the “best” value of \\(H\\)? A larger set of HVGs will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of adding noise from irrelevant genes that might obscure that signal. It’s difficult to determine the optimal trade-off for any given application as the distinction between noise and signal is context-dependent. For example, variation in the activation status of certain immune cells may not be interesting when we only want to identify the cell types; the former can even interfere with the latter by encouraging the formation of clusters based on activation strength instead. Our recommendation is to simply pick a “reasonable” \\(H\\) - usually somewhere between 1000 and 5000 - and proceed with the rest of the analysis. If we can answer our scientific question, then our choice is good enough; if not, we can just try another value. There’s nothing wrong with trying different parameters during data exploration. In fact, different choices of \\(H\\) can provide new perspectives on the same dataset by changing the balance between signal and noise, so we might discover new population structure that would not be apparent with other parameters. Don’t spend too much time worrying about obtaining the “optimal” value. If we really want to ensure that all biological structure is preserved, we could define the set of HVGs as all genes with variances above the trend. This avoids any judgement calls about the definition of “interesting” variation, giving an opportunity for weaker population structure to manifest.
It is most useful for rare and/or weakly-separated subpopulations where the relevant marker genes are not variable enough to sneak into the top \\(H\\) genes. The obvious cost is that more noise is also captured, which can reduce the resolution of subpopulations; and we need to perform more computational work in each downstream step, as more genes are involved. # Setting top=Inf to select all genes with positive residuals. hvgs.all.10x &lt;- chooseHighlyVariableGenes(rd.10x$residuals, top=Inf) length(hvgs.all.10x) ## [1] 11679 # This can be used just like our other HVGs. sce.all.10x &lt;- runPca.se(sce.var.10x, features=hvgs.all.10x) 3.3 Blocking on uninteresting factors Larger datasets may contain multiple blocks of cells that exhibit uninteresting differences in gene expression, e.g., batch effects, variability between donors. We are not interested in HVGs that are driven by these differences; instead, we want to focus on genes that are highly variable within each block. We demonstrate using some trophoblast scRNA-seq data generated across two plates (Lun et al. 2017): library(scRNAseq) sce.tropho &lt;- LunSpikeInData(&quot;tropho&quot;) table(sce.tropho$block) # i.e., the plate of origin. ## ## 20160906 20170201 ## 96 96 # Computing the QC metrics. library(scrapper) is.mito.tropho &lt;- which(any(seqnames(rowRanges(sce.tropho))==&quot;MT&quot;)) sce.qc.tropho &lt;- quickRnaQc.se( sce.tropho, subsets=list(MT=is.mito.tropho), altexp.proportions=&quot;ERCC&quot;, block=sce.tropho$block ) sce.qc.tropho &lt;- sce.qc.tropho[,sce.qc.tropho$keep] # Computing log-normalized expression values. sce.norm.tropho &lt;- normalizeRnaCounts.se( sce.qc.tropho, size.factors=sce.qc.tropho$sum, block=sce.qc.tropho$block ) Setting block= instructs chooseRnaHvgs.se() to compute the mean and variance for each gene within each plate. This ensures that any systematic technical differences between plates (e.g., in sequencing depth) will not inflate the variance estimates. 
It will also fit a separate trend for each plate, which accommodates differences in the mean-variance relationships between plates. In this case, there are only minor differences between the trends in Figure 3.2, which indicates that the experiment was tightly replicated across plates. sce.var.tropho &lt;- chooseRnaHvgs.se( sce.norm.tropho, block=sce.qc.tropho$block, include.per.block=TRUE # only needed for plotting. ) per.block &lt;- rowData(sce.var.tropho)$per.block par(mfrow=c(1,2)) for (block in colnames(per.block)) { current &lt;- per.block[[block]] plot(current$means, current$variances, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;, main=block, pch=16, cex=0.5) trend &lt;- approxfun(current$means, current$fitted) curve(trend, add=TRUE, col=&quot;dodgerblue&quot;, lwd=2) } Figure 3.2: Variance of the log-normalized expression values across all genes in the trophoblast data set, as a function of the mean after blocking on the plate of origin. Each plot represents the results for a single plate. Each point represents a gene and the fitted trend is shown in blue. chooseRnaHvgs.se() also combines information across blocks by reporting the weighted mean of each statistic, where the weight is determined by the number of cells in each plate. We use the mean residuals to select our top HVGs as described previously. This ensures that each block contributes some information about the variability of each gene. High variability in any block can increase the residual for a gene, giving it an opportunity to be selected as a HVG. rowData(sce.var.tropho)[,c(&quot;means&quot;, &quot;variances&quot;, &quot;fitted&quot;, &quot;residuals&quot;)] # mean across blocks.
## DataFrame with 46603 rows and 4 columns ## means variances fitted residuals ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSMUSG00000102693 0.0000000 0.000000 0.000000 0.0000000 ## ENSMUSG00000064842 0.0000000 0.000000 0.000000 0.0000000 ## ENSMUSG00000051951 0.0351964 0.190773 0.145564 0.0452092 ## ENSMUSG00000102851 0.0000000 0.000000 0.000000 0.0000000 ## ENSMUSG00000103377 0.0993010 0.536793 0.389798 0.1469946 ## ... ... ... ... ... ## ENSMUSG00000094431 0 0 0 0 ## ENSMUSG00000094621 0 0 0 0 ## ENSMUSG00000098647 0 0 0 0 ## ENSMUSG00000096730 0 0 0 0 ## ENSMUSG00000095742 0 0 0 0 hvgs.tropho &lt;- rowData(sce.var.tropho)$hvg sum(hvgs.tropho) ## [1] 4000 Alternatively, we could focus on genes that are consistently variable within each block by asking chooseRnaHvgs.se() to compute a quantile instead of a weighted mean. For example, we could report the minimum residual across blocks, which means that genes will only be considered as HVGs if they have large positive residuals in each block. This tends to scale poorly as it becomes too stringent with a large number of blocks. sce.var.min.tropho &lt;- chooseRnaHvgs.se( sce.norm.tropho, block=sce.qc.tropho$block, more.var.args=list( block.average.policy=&quot;quantile&quot;, block.quantile=0 # i.e., minimum. ) ) rowData(sce.var.min.tropho)[,c(&quot;means&quot;, &quot;variances&quot;, &quot;fitted&quot;, &quot;residuals&quot;)] # minimum across blocks. ## DataFrame with 46603 rows and 4 columns ## means variances fitted residuals ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSMUSG00000102693 0.0000000 0.000000 0.000000 0.000000 ## ENSMUSG00000064842 0.0000000 0.000000 0.000000 0.000000 ## ENSMUSG00000051951 0.0000000 0.000000 0.000000 0.000000 ## ENSMUSG00000102851 0.0000000 0.000000 0.000000 0.000000 ## ENSMUSG00000103377 0.0780524 0.469098 0.322909 0.146189 ## ... ... ... ... ...
## ENSMUSG00000094431 0 0 0 0 ## ENSMUSG00000094621 0 0 0 0 ## ENSMUSG00000098647 0 0 0 0 ## ENSMUSG00000096730 0 0 0 0 ## ENSMUSG00000095742 0 0 0 0 It is generally expected that block= will be used for uninteresting factors of variation. In this case, the plate of origin is a technical factor that should be ignored. However, imagine instead that each plate corresponds to a different treatment condition. In such cases, we might not use block= to ensure that our variance estimates can capture the differences between treatments. We discuss these considerations in more detail in Chapter 8. As an aside, the wave-like shape observed in Figure 3.2 is typical of the mean-variance relationship in log-expression values. A linear increase in the variance is observed as the mean increases from zero, as larger variances are obviously possible when the counts are not all equal to zero. In contrast, the relative contribution of sampling noise decreases at high abundances, resulting in a downward trend. The peak represents the point at which these two competing effects cancel each other out. 3.4 Refining the trend fit The trend fit in chooseRnaHvgs.se() uses the LOWESS non-parametric smoother (Cleveland 1979) by default. This slides a window across the x-coordinates and performs a linear regression within each window to obtain the fitted value for the point at the window’s center. The size of the window varies between points, expanding or contracting until it contains some proportion of all points in the dataset (by default, 30%). This mostly works well but can be suboptimal in very sparse intervals of the x-axis. To demonstrate, let’s have a look at a human pancreas dataset from Segerstolpe et al. (2016): library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() # For simplicity, we&#39;ll focus on one of the donors. 
sce.seger &lt;- sce.seger[,sce.seger$individual==&quot;H2&quot;] # For reasons unknown to us, the data supplied by the authors contain # duplicated row names, so we&#39;ll just get rid of those to avoid confusion. sce.seger &lt;- sce.seger[!duplicated(rownames(sce.seger)),] # Running QC. Seems like they don&#39;t have any data for the mitochondrial genes, # unfortunately, but they do have spike-ins so we&#39;ll just use those instead. library(scrapper) sce.qc.seger &lt;- quickRnaQc.se(sce.seger, subsets=list(), altexp.proportions=&quot;ERCC&quot;) # Computing log-normalized expression values. sce.norm.seger &lt;- normalizeRnaCounts.se(sce.qc.seger, size.factors=sce.qc.seger$sum) There are very few genes at high abundances, which forces the LOWESS window to expand to contain more points. This reduces the sensitivity of the fitted trend to the behavior of the high-abundance genes (Figure 3.3). We can improve the fit by tinkering with some of the trend fitting options. In particular, setting use.min.width=TRUE will switch to a different strategy for defining the window around each point, which improves sensitivity in sparse intervals at the risk of overfitting.
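To get a feel for why the window size matters in sparse intervals, here is a minimal base-R sketch on simulated data. Everything in it is an assumption for illustration: the wave-like true.trend() curve is made up, the "genes" are random draws, and shrinking the span f of stats::lowess() is only a stand-in for the general idea of a narrower window. It is not the algorithm behind use.min.width=TRUE in scrapper.

```r
set.seed(100)

# Simulated mean-variance data: many low-abundance genes and only a
# handful of high-abundance ones, mimicking a sparse right-hand tail.
means <- c(runif(2000, 0, 2), runif(20, 2, 8))

# Hypothetical wave-like trend, chosen purely for illustration.
true.trend <- function(x) 2 * x * exp(-x / 1.5) + 0.05
variances <- true.trend(means) * exp(rnorm(length(means), sd = 0.2))

# lowess() puts a fixed proportion 'f' of all points into each window,
# so windows in the sparse tail must stretch back into the dense region.
fit.wide <- lowess(means, variances, f = 0.3)
fit.narrow <- lowess(means, variances, f = 0.05)

# Turn each fit into a function and compare it to the known truth,
# looking only at the high-abundance genes.
trend.wide <- approxfun(fit.wide$x, fit.wide$y, rule = 2)
trend.narrow <- approxfun(fit.narrow$x, fit.narrow$y, rule = 2)
high <- means > 3
err.wide <- mean(abs(trend.wide(means[high]) - true.trend(means[high])))
err.narrow <- mean(abs(trend.narrow(means[high]) - true.trend(means[high])))
c(wide = err.wide, narrow = err.narrow)
```

A narrower window would typically track the sparse tail more closely, but each fitted value then rests on fewer points, which is exactly the overfitting risk mentioned above.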
sce.default.seger &lt;- chooseRnaHvgs.se(sce.norm.seger) sce.minw.seger &lt;- chooseRnaHvgs.se(sce.norm.seger, more.var.args=list(use.min.width=TRUE)) rd.default.seger &lt;- rowData(sce.default.seger) plot(rd.default.seger$means, rd.default.seger$variances, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;, pch=16, cex=0.5) trend.default.seger &lt;- approxfun(rd.default.seger$means, rd.default.seger$fitted) curve(trend.default.seger, add=TRUE, col=&quot;dodgerblue&quot;, lwd=2) rd.minw.seger &lt;- rowData(sce.minw.seger) trend.minw.seger &lt;- approxfun(rd.minw.seger$means, rd.minw.seger$fitted) curve(trend.minw.seger, add=TRUE, col=&quot;salmon&quot;, lwd=2) legend(&quot;topright&quot;, lwd=2, col=c(&quot;dodgerblue&quot;, &quot;salmon&quot;), legend=c(&quot;default&quot;, &quot;min-width&quot;)) Figure 3.3: Variance of the log-normalized expression values across all genes in one donor of the Segerstolpe pancreas data set, as a function of the mean. Each point represents a gene while the lines represent trends fitted with different parameters. In practice, the parametrization of the trend fitting doesn’t matter all that much. There are so few genes in these sparse intervals that their (lack of) selection as HVGs won’t have a major effect on downstream analyses. But sometimes it’s just nice to look at some well-fitted curves. 3.5 Selecting a priori genes of interest A blunt yet effective feature selection strategy is to use pre-defined sets of interesting genes. The aim is to focus on specific aspects of biological heterogeneity that may be masked by other factors when using unsupervised methods for HVG selection. For example, to study transcriptional changes during the earliest stages of cell fate commitment (Messmer et al. 2019), we might focus only on lineage markers to avoid interference from variability in other pathways (e.g., cell cycle, metabolism).
Using scRNA-seq data in this manner is conceptually equivalent to a fluorescence activated cell sorting (FACS) experiment, with the convenience of being able to (re)define the features of interest at any time. We provide some examples of a priori selection based on MSigDB gene sets (Liberzon et al. 2015) below: library(msigdbr) c7.sets &lt;- msigdbr(species = &quot;Homo sapiens&quot;, category = &quot;C7&quot;) head(unique(c7.sets$gs_name)) ## [1] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN&quot; ## [2] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_UP&quot; ## [3] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_3DY_DN&quot; ## [4] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_3DY_UP&quot; ## [5] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_DN&quot; ## [6] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP&quot; # Using the Goldrath sets to distinguish CD8 subtypes cd8.sets &lt;- c7.sets[grep(&quot;GOLDRATH&quot;, c7.sets$gs_name),] cd8.genes &lt;- rownames(sce.10x) %in% cd8.sets$ensembl_gene summary(cd8.genes) ## Mode FALSE TRUE ## logical 32851 843 # Using GSE11924 to distinguish between T helper subtypes th.sets &lt;- c7.sets[grep(&quot;GSE11924&quot;, c7.sets$gs_name),] th.genes &lt;- rownames(sce.10x) %in% th.sets$ensembl_gene summary(th.genes) ## Mode FALSE TRUE ## logical 31722 1972 # Using GSE11961 to distinguish between B cell subtypes b.sets &lt;- c7.sets[grep(&quot;GSE11961&quot;, c7.sets$gs_name),] b.genes &lt;- rownames(sce.10x) %in% b.sets$ensembl_gene summary(b.genes) ## Mode FALSE TRUE ## logical 27995 5699 Don’t be ashamed to take advantage of prior biological knowledge during feature selection to address specific hypotheses! We say this because a common refrain in genomics is that the data analysis should be “unbiased”, i.e., free from any biological preconceptions. 
Which is fine and all, but such “biases” are already present at every stage, starting with experimental design and ending with the interpretation of the data. So if we already know what we’re looking for, why not make life simpler and just go for it? Of course, the downside of focusing on pre-defined genes is that it will limit our capacity to detect novel or unexpected aspects of variation. Thus, this kind of focused analysis should be complementary to (rather than a replacement for) the unsupervised feature selection strategies discussed above. Alternatively, we can invert this reasoning to remove genes that are unlikely to be of interest prior to downstream analyses. This eliminates unwanted variation that could mask relevant biology and interfere with interpretation of the results. Ribosomal protein genes or mitochondrial genes are common candidates for removal, especially in situations with varying levels of cell damage within a population. For immune cell subsets, we might also be inclined to remove immunoglobulin genes and T cell receptor genes for which clonal expression introduces (possibly irrelevant) population structure. 
# Identifying ribosomal proteins: ribo.discard &lt;- grepl(&quot;^RP[SL]\\\\d+&quot;, rowData(sce.10x)$Symbol) sum(ribo.discard) ## [1] 99 # A more curated approach for identifying ribosomal protein genes: c2.sets &lt;- msigdbr(species = &quot;Homo sapiens&quot;, category = &quot;C2&quot;) ribo.set &lt;- c2.sets[c2.sets$gs_name==&quot;KEGG_RIBOSOME&quot;,]$ensembl_gene ribo.discard &lt;- rownames(sce.10x) %in% ribo.set sum(ribo.discard) ## [1] 87 library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rowData(sce.10x)$ID, keytype=&quot;GENEID&quot;, columns=&quot;TXBIOTYPE&quot;) # Removing immunoglobulin variable chains: igv.set &lt;- anno$GENEID[anno$TXBIOTYPE %in% c(&quot;IG_V_gene&quot;, &quot;IG_V_pseudogene&quot;)] igv.discard &lt;- rownames(sce.10x) %in% igv.set sum(igv.discard) ## [1] 326 # Removing TCR variable chains: tcr.set &lt;- anno$GENEID[anno$TXBIOTYPE %in% c(&quot;TR_V_gene&quot;, &quot;TR_V_pseudogene&quot;)] tcr.discard &lt;- rownames(sce.10x) %in% tcr.set sum(tcr.discard) ## [1] 138 In practice, we tend to err on the side of caution and abstain from preemptive filtering on biological function until these genes are demonstrably problematic in downstream analyses. 3.6 Quantifying technical noise Back in the old days, everyone was obsessed with modelling the gene-wise variability in scRNA-seq data (Brennecke et al. 2013; Vallejos, Marioni, and Richardson 2015; Kim et al. 2015). Spike-in transcripts were critical to this effort as they allowed us to decompose each gene’s variance into technical and biological components. As spike-ins should not be subject to biological effects, the variance in spike-in expression could be used as an estimate of the technical component. Subtracting this from the variance of an endogenous gene at a similar abundance would then yield an estimate of the biological component. Sadly, those days are gone and people don’t care about variance decomposition anymore. 
But for old times’ sake, we’ll demonstrate how to do this with the Zeisel et al. (2015) dataset: library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() is.mito.zeisel &lt;- rowData(sce.zeisel)$featureType==&quot;mito&quot; # Performing some QC to set up the dataset prior to normalization. library(scrapper) sce.qc.zeisel &lt;- quickRnaQc.se(sce.zeisel, subsets=list(MT=is.mito.zeisel), altexp.proportions=&quot;ERCC&quot;) sce.qc.zeisel &lt;- sce.qc.zeisel[,sce.qc.zeisel$keep] We compute log-normalized expression values for endogenous genes and spike-in transcripts with their respective size factors. We still use the library size factors for the endogenous genes as we are not interested in changes in total RNA content. Both sets of size factors are centered to preserve the scale of the original counts, ensuring that normalized abundances are comparable between genes and spike-ins. (This is a bit more complicated with blocking, as the mean of the spike-in factors within each block must be scaled to the mean of the library size factors in that block; for brevity, we won’t show that here.) sce.norm.zeisel &lt;- normalizeRnaCounts.se(sce.qc.zeisel, size.factors=sce.qc.zeisel$sum) sce.ercc &lt;- altExp(sce.norm.zeisel, &quot;ERCC&quot;) sce.norm.ercc &lt;- normalizeRnaCounts.se(sce.ercc, size.factors=sce.ercc$sum) We fit separate mean-dependent trends to the endogenous genes and spike-in transcripts (Figure 3.4). At any given mean, the fitted value of the spike-in trend represents an estimate of the technical component of the variance. This assumes that an endogenous gene is subject to the same technical noise as a spike-in transcript of the same abundance.
sce.var.zeisel &lt;- chooseRnaHvgs.se(sce.norm.zeisel, more.var.args=list(use.min.width=TRUE)) sce.var.ercc &lt;- chooseRnaHvgs.se(sce.norm.ercc) var.zeisel &lt;- rowData(sce.var.zeisel) plot(var.zeisel$means, var.zeisel$variances, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;, pch=16, cex=0.5) trend.gene.zeisel &lt;- approxfun(var.zeisel$means, var.zeisel$fitted) curve(trend.gene.zeisel, add=TRUE, col=&quot;dodgerblue&quot;, lwd=2) var.ercc &lt;- rowData(sce.var.ercc) points(var.ercc$means, var.ercc$variances, pch=4) trend.spike.zeisel &lt;- approxfun(var.ercc$means, var.ercc$fitted, rule=2) curve(trend.spike.zeisel, add=TRUE, col=&quot;salmon&quot;, lwd=2, lty=2) legend(&quot;topright&quot;, c(&quot;gene&quot;, &quot;spike-in&quot;), pch=c(16, 4)) Figure 3.4: Variance of endogenous genes and spike-in transcripts in the Zeisel brain dataset, as a function of the mean. Separate trends are fitted to the genes (blue) and spike-ins (red). To decompose each gene’s variance into the technical and biological components, we estimate the fitted value of the spike-in trend at that gene’s mean and subtract it from the gene’s total variance. We could then use the biological component to select HVGs via chooseHighlyVariableGenes(). tech.var.zeisel &lt;- trend.spike.zeisel(var.zeisel$means) summary(tech.var.zeisel) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0398 0.1453 0.1478 0.2691 0.3582 0.7162 bio.var.zeisel &lt;- var.zeisel$variances - tech.var.zeisel summary(bio.var.zeisel) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -0.14529 -0.13304 -0.00356 0.07844 0.12841 15.08152 In practice, this doesn’t provide much benefit over the residuals from the trend fitted to the endogenous genes. Oh well. 
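For readers who want to see the decomposition arithmetic with a known ground truth, here is a small base-R sketch on simulated data. The technical trend tech.fun(), the simulated spike-ins and genes, and the choice of stats::lowess() as the smoother are all assumptions made for illustration; none of this is scrapper's internal code.

```r
set.seed(42)

# Hypothetical technical trend: variance as a function of the mean.
tech.fun <- function(m) 1.5 * m * exp(-m) + 0.02

# Simulated spike-ins: purely technical variance plus estimation noise.
spike.means <- seq(0.2, 6, length.out = 50)
spike.vars <- tech.fun(spike.means) * exp(rnorm(50, sd = 0.1))

# Simulated endogenous genes: technical variance plus a biological
# component that is non-zero for the first 100 genes only.
n.genes <- 1000
gene.means <- runif(n.genes, 0, 7)
true.bio <- c(runif(100, 0.5, 2), rep(0, n.genes - 100))
gene.vars <- tech.fun(gene.means) * exp(rnorm(n.genes, sd = 0.1)) + true.bio

# Fit a smooth trend to the spike-ins and evaluate it at each gene's mean.
# rule=2 holds the trend constant outside the range of the spike-in means.
spike.fit <- lowess(spike.means, spike.vars, f = 0.3)
spike.trend <- approxfun(spike.fit$x, spike.fit$y, rule = 2)
tech.var <- spike.trend(gene.means)

# Biological component = total variance - technical component.
bio.var <- gene.vars - tech.var

# Rank genes by their biological component and keep the top H.
H <- 100
top <- head(order(bio.var, decreasing = TRUE), H)

# Proportion of selected genes that truly have a biological component.
mean(top <= 100)
```

Note that genes with means outside the range covered by the spike-ins rely entirely on the rule=2 extrapolation, so their technical estimates are the least trustworthy part of the calculation.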
Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] AnnotationHub_4.1.0 BiocFileCache_3.1.0 ## [3] dbplyr_2.5.2 msigdbr_26.1.0 ## [5] ensembldb_2.35.0 AnnotationFilter_1.35.0 ## [7] GenomicFeatures_1.63.2 AnnotationDbi_1.73.1 ## [9] scRNAseq_2.25.0 scrapper_1.5.17 ## [11] DropletUtils_1.31.1 SingleCellExperiment_1.33.2 ## [13] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [15] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [17] IRanges_2.45.0 S4Vectors_0.49.1 ## [19] BiocGenerics_0.57.0 generics_0.1.4 ## [21] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [23] DropletTestFiles_1.21.0 ## ## loaded via a namespace (and not attached): ## [1] DBI_1.3.0 bitops_1.0-9 ## [3] httr2_1.2.2 rlang_1.2.0 ## [5] magrittr_2.0.5 otel_0.2.0 ## [7] gypsum_1.7.0 compiler_4.6.0 ## [9] RSQLite_2.4.6 DelayedMatrixStats_1.33.0 ## [11] png_0.1-9 vctrs_0.7.3 ## [13] ProtGenerics_1.43.0 pkgconfig_2.0.3 ## [15] crayon_1.5.3 fastmap_1.2.0 ## [17] XVector_0.51.0 scuttle_1.21.6 ## [19] Rsamtools_2.27.2 rmarkdown_2.31 ## [21] UCSC.utils_1.7.1 purrr_1.2.2 ## [23] bit_4.6.0 xfun_0.57 ## [25] cachem_1.1.0 beachmat_2.27.5 ## [27] cigarillo_1.1.0 GenomeInfoDb_1.47.2 ## [29] jsonlite_2.0.0 blob_1.3.0 ## [31] rhdf5filters_1.23.3 DelayedArray_0.37.1 ## [33] Rhdf5lib_1.33.6 
BiocParallel_1.45.0 ## [35] parallel_4.6.0 R6_2.6.1 ## [37] bslib_0.10.0 limma_3.67.1 ## [39] rtracklayer_1.71.3 jquerylib_0.1.4 ## [41] assertthat_0.2.1 Rcpp_1.1.1 ## [43] bookdown_0.46 knitr_1.51 ## [45] R.utils_2.13.0 Matrix_1.7-5 ## [47] tidyselect_1.2.1 abind_1.4-8 ## [49] yaml_2.3.12 codetools_0.2-20 ## [51] curl_7.0.0 alabaster.sce_1.11.0 ## [53] lattice_0.22-9 tibble_3.3.1 ## [55] withr_3.0.2 KEGGREST_1.51.1 ## [57] evaluate_1.0.5 alabaster.schemas_1.11.0 ## [59] ExperimentHub_3.1.0 Biostrings_2.79.5 ## [61] pillar_1.11.1 BiocManager_1.30.27 ## [63] filelock_1.0.3 RCurl_1.98-1.18 ## [65] BiocVersion_3.23.1 alabaster.base_1.11.4 ## [67] sparseMatrixStats_1.23.0 alabaster.ranges_1.11.0 ## [69] glue_1.8.0 lazyeval_0.2.3 ## [71] alabaster.matrix_1.11.0 tools_4.6.0 ## [73] BiocIO_1.21.0 BiocNeighbors_2.5.4 ## [75] GenomicAlignments_1.47.0 locfit_1.5-9.12 ## [77] babelgene_22.9 XML_3.99-0.23 ## [79] rhdf5_2.55.16 grid_4.6.0 ## [81] edgeR_4.9.7 HDF5Array_1.39.1 ## [83] restfulr_0.0.16 cli_3.6.6 ## [85] rappdirs_0.3.4 S4Arrays_1.11.1 ## [87] dplyr_1.2.1 alabaster.se_1.11.0 ## [89] R.methodsS3_1.8.2 sass_0.4.10 ## [91] digest_0.6.39 SparseArray_1.11.13 ## [93] dqrng_0.4.1 rjson_0.2.23 ## [95] memoise_2.0.1 htmltools_0.5.9 ## [97] R.oo_1.27.1 lifecycle_1.0.5 ## [99] h5mread_1.3.3 httr_1.4.8 ## [101] statmod_1.5.1 bit64_4.6.0-1 References "],["principal-components-analysis.html", "Chapter 4 Principal components analysis 4.1 Motivation 4.2 Getting the top PCs 4.3 How many PCs? 4.4 Blocking on uninteresting factors 4.5 Visualizing the PCs Session information", " Chapter 4 Principal components analysis 4.1 Motivation Principal components analysis (PCA) is commonly used to clean up and compact the log-normalized expression matrix. Consider each gene as a dimension of our dataset where the cells are the observations, i.e., each cell’s expression profile defines its location in the high-dimensional expression space. 
PCA discovers axes in this high-dimensional space that capture the largest amount of variation (Pearson 1901). Each principal component (PC) corresponds to an axis in this space, where the earliest PCs capture the dominant factors of heterogeneity in our data. The idea is to use the first few PCs to approximate our original dataset. Similarly, the Euclidean distances between cells in the PC space approximate the same distances in the original dataset. This effectively compresses multiple genes into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007), and allows us to use a much smaller matrix in downstream steps like clustering. 4.2 Getting the top PCs Our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. In contrast, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. By retaining the earlier PCs, we can focus on the biological signal while removing random noise. To demonstrate, we’ll pull out our favorite mouse brain dataset from Zeisel et al. (2015): library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() is.mito.zeisel &lt;- rowData(sce.zeisel)$featureType==&quot;mito&quot; # Performing some QC to set up the dataset prior to normalization. library(scrapper) sce.qc.zeisel &lt;- quickRnaQc.se(sce.zeisel, subsets=list(MT=is.mito.zeisel), altexp.proportions=&quot;ERCC&quot;) sce.qc.zeisel &lt;- sce.qc.zeisel[,sce.qc.zeisel$keep] # Computing log-normalized expression values. sce.norm.zeisel &lt;- normalizeRnaCounts.se(sce.qc.zeisel, size.factors=sce.qc.zeisel$sum) # Computing the variances.
sce.var.zeisel &lt;- chooseRnaHvgs.se(sce.norm.zeisel, more.var.args=list(use.min.width=TRUE)) We run the PCA on our HVG-filtered log-normalized expression matrix, compacting the dataset into the top 25 PCs. This yields a matrix of “PC scores”, i.e., the coordinates for each cell in the new low-dimensional space, which can be used in clustering, visualization, etc. As discussed in Chapter 3, we restrict this step to the top HVGs to reduce the impact of random noise. While PCA is robust to noise, too much of it may cause the earlier PCs to ignore meaningful structure (Johnstone and Lu 2009). hvgs.zeisel &lt;- rowData(sce.var.zeisel)$hvg summary(hvgs.zeisel) ## Mode FALSE TRUE ## logical 16006 4000 sce.pca.zeisel &lt;- runPca.se(sce.var.zeisel, features=hvgs.zeisel, number=25) dim(reducedDim(sce.pca.zeisel, &quot;PCA&quot;)) ## [1] 2866 25 4.3 How many PCs? The million dollar question is, how many of the top PCs should we retain for downstream analyses? Using more PCs will retain more biological signal at the cost of including more noise that might mask that signal. As with the number of HVGs, it is hard to determine whether an “optimal” choice exists here. Sure, technical variation is almost always uninteresting, but there is no straightforward way to automatically determine which aspects of biological variation are relevant to a particular scientific question. For example, heterogeneity within a population might be interesting when studying continuous processes like metabolic flux or differentiation potential, but could be considered noise in applications that only aim to distinguish between distinct cell types. Most practitioners will simply set the number of retained PCs, \\(d\\), to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory as the later PCs explain so little variance that their inclusion or omission has no major effect. For example, in the Zeisel dataset, few PCs explain more than 1% of the variance in the entire dataset (Figure 4.1).
Choosing between, say, 20 and 40 PCs would not even amount to 5 percentage points’ worth of difference in variance. In fact, the main consequence of using more PCs is simply that downstream calculations take longer as they need to compute over more dimensions, but most PC-related calculations are fast enough that this is not a practical concern. sce.more.zeisel &lt;- runPca.se(sce.var.zeisel, features=hvgs.zeisel, number=50) pca.meta &lt;- metadata(sce.more.zeisel)$PCA percent.var &lt;- pca.meta$variance.explained / pca.meta$total.variance * 100 plot(percent.var, log=&quot;y&quot;, xlab=&quot;PC&quot;, ylab=&quot;Variance explained (%)&quot;, type=&quot;b&quot;) Figure 4.1: Percentage of variance explained by successive PCs in the Zeisel dataset, shown on a log-scale. If we really must try to guess the “best” number of PCs, here are a few approaches: We can choose the elbow point in the scree plot (Figure 4.1), e.g., using the findElbowPoint() function from the PCAtools package. The assumption is that there should be a sharp drop in the percentage of variance explained when we move past the last PC corresponding to biological structure. However, the ideal cut-off can be difficult to gauge when there are sources of weaker biological variation. We can keep the number of PCs that cumulatively explain variance equal to the sum of the biological components among the HVGs. This relies on the decomposition of each gene’s variance into biological and technical components (see Chapter 3). In practice, the distinction between biological and technical variation is usually not so clear as they will not be isolated to the earlier and later PCs, respectively. We can use random matrix theory to select an appropriate number of PCs. This might involve the Marchenko-Pastur limit (Shekhar et al. 2016), Horn’s parallel analysis (Horn 1965), or the Gavish-Donoho threshold for optimal reconstruction (Gavish and Donoho 2014) (see relevant functions in PCAtools).
Each of these methods has its own limitations, e.g., a requirement for i.i.d. noise. But if we’re really concerned about the number of PCs, it’s probably just better to repeat the analysis with different numbers of PCs. This allows us to explore other perspectives of the data at different trade-offs between biological signal and technical noise. 4.4 Blocking on uninteresting factors Larger datasets typically consist of multiple blocks of cells with uninteresting differences between them, e.g., batch effects, variability between donors. We don’t want to waste our top PCs on capturing these differences - instead, we want our PCA to focus on the biological structure within each block. To demonstrate, let’s look at a dataset consisting of two plates of wild-type and oncogene-induced 416B cells (Lun et al. 2017). Differences in expression due to the plate of origin are obviously technical and should be ignored. To make life more exciting, we will also consider the oncogene induction status to be an uninteresting experimental factor that should not be allowed to dominate the PCA. library(scRNAseq) sce.416b &lt;- LunSpikeInData(&quot;416b&quot;) # Combining the plate of origin and oncogene induction status into a single # blocking factor of &#39;uninteresting&#39; variation. plate.416b &lt;- sce.416b$block pheno.416b &lt;- ifelse(sce.416b$phenotype == &quot;wild type phenotype&quot;, &quot;WT&quot;, &quot;induced&quot;) sce.416b$block &lt;- factor(paste0(pheno.416b, &quot;-&quot;, plate.416b)) # Computing the QC metrics. library(scrapper) is.mito.416b &lt;- which(any(seqnames(rowRanges(sce.416b)) == &quot;MT&quot;)) sce.qc.416b &lt;- quickRnaQc.se( sce.416b, subsets=list(MT=is.mito.416b), altexp.proportions=&quot;ERCC&quot;, block=sce.416b$block ) sce.qc.416b &lt;- sce.qc.416b[,sce.qc.416b$keep] # Computing log-normalized expression values.
sce.norm.416b &lt;- normalizeRnaCounts.se( sce.qc.416b, size.factors=sce.qc.416b$sum, block=sce.qc.416b$block ) # Choosing the top VGs after blocking on the uninteresting factors. sce.var.416b &lt;- chooseRnaHvgs.se( sce.norm.416b, more.choose.args=list(top=1000), # just picking a cool-looking number of top genes here. block=sce.norm.416b$block ) We set block= to instruct runPca.se() to focus on the variation within each block. This is equivalent to centering each block at the origin and then finding the axes of largest variation among the residuals. The expression values for each cell are then projected onto these axes to obtain that cell’s PC scores. Blocking removes the shift between the induced and wild-type subpopulations on the first two PCs (Figure 4.2), allowing our subsequent analyses to focus on heterogeneity within each subpopulation. is.hvg.416b &lt;- rowData(sce.var.416b)$hvg sce.pca.416b &lt;- runPca.se(sce.var.416b, features=is.hvg.416b, number=20) sce.block.416b &lt;- runPca.se( sce.var.416b, features=is.hvg.416b, number=20, block=sce.var.416b$block ) library(scater) gridExtra::grid.arrange( plotReducedDim(sce.pca.416b, dimred=&quot;PCA&quot;, colour_by=&quot;block&quot;) + ggtitle(&quot;Without blocking&quot;), plotReducedDim(sce.block.416b, dimred=&quot;PCA&quot;, colour_by=&quot;block&quot;) + ggtitle(&quot;Blocked&quot;), ncol=2 ) Figure 4.2: First two PCs for the 416B dataset, before and after blocking on uninteresting experimental factors. Each point represents a cell, colored by its combination of experimental factors. If we’re lucky, all of the uninteresting differences between blocks are orthogonal to the major biological variation, such that taking the first few PCs will focus on the latter and remove the former. In practice, blocking during PCA is usually not sufficient to remove differences between blocks, as they tend to have some biological component that will be preserved within the first few PCs. 
Removal requires some additional effort (see Chapter 8) prior to downstream steps like clustering. Nonetheless, blocking is still helpful as it eliminates at least some of these differences and preserves more biological signal in the top PCs. To be more precise, the default behavior of block= is to use the residuals to compute the rotation matrix but not the PC scores. This reduces the influence of block-to-block differences on the low-dimensional embedding but does not explicitly remove it. In contrast, setting components.from.residuals=TRUE yields PC scores that are also derived from the residuals. This removes the differences between blocks but is only correct in very limiting circumstances, e.g., assuming all blocks have the same subpopulation composition and the difference between blocks is consistent for all cell subpopulations. Such assumptions may be appropriate in some situations (e.g., technical replicates) but are not generally applicable. In our 416B example, the two subpopulations are now forced together (Figure 4.3) for better or worse. sce.resid.416b &lt;- runPca.se( sce.var.416b, features=is.hvg.416b, number=20, block=sce.var.416b$block, more.pca.args=list(components.from.residuals=TRUE) ) library(scater) plotReducedDim(sce.resid.416b, dimred=&quot;PCA&quot;, colour_by=&quot;block&quot;) Figure 4.3: First two PCs for the 416B dataset with blocking, where PC scores are computed from residuals. Each point represents a cell, colored by its combination of experimental factors. As with HVGs, we should only use block= for experimental factors that are not interesting. If we were interested in the effects of oncogene induction, we should not block on it to ensure that the PCA can capture the associated changes in expression. Sometimes, though, it is not obvious whether something is “interesting” or not, as we may wish to ignore some biological differences to obtain a consistent set of clusters across treatment conditions, tissues, etc. 
Check out Chapter 8 for a more detailed discussion. 4.5 Visualizing the PCs We might as well touch on another common use of PCA, which is visualization of high-dimensional data. This is used in a variety of fields and applications (including bulk RNA-seq) but is not so effective for scRNA-seq data. If we’re lucky, our population structure is simple enough that the first two PCs capture most of the relevant biology (Figures 4.2 and 4.3). However, in most cases, relevant biological heterogeneity is spread throughout 10-50 PCs that are much harder to visualize. For example, examination of the top 4 PCs is still insufficient to resolve all subpopulations identified by Zeisel et al. (2015) (Figure 4.4). library(scater) plotReducedDim( sce.pca.zeisel, dimred=&quot;PCA&quot;, ncomponents=4, colour_by=&quot;level1class&quot; ) Figure 4.4: PCA plot of the first 4 PCs in the Zeisel brain data. Each point is a cell, coloured according to the annotation provided by the original authors. The problem here is that PCA is a linear technique, i.e., only variation along a line in high-dimensional space is captured by each PC. As such, it cannot efficiently represent high-dimensional differences in the first 2 PCs. If the first PC is devoted to resolving the biggest difference between subpopulations, and the second PC is devoted to resolving the next biggest difference, then the remaining differences will not be visible in the plot. That said, PCA is still useful as the top PCs are often used as input to more sophisticated algorithms for dimensionality reduction (Chapter 5). 
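As a rough illustration of this limitation, here is a minimal sketch in base R that is not part of the scrapper workflow. It simulates four hypothetical subpopulations sitting on the vertices of a regular tetrahedron, so that every pair of subpopulations is equally far apart, and then checks how much variance the first two PCs can capture. The simulated centers, group sizes and noise level are all assumptions chosen purely for illustration.

```r
set.seed(42)

# Hypothetical centers for four subpopulations, placed on the vertices
# of a regular tetrahedron so that all pairs are equally distant.
centers <- 5 * rbind(c(1, 1, 1), c(1, -1, -1), c(-1, 1, -1), c(-1, -1, 1))

# 100 simulated cells per subpopulation, with unit-variance noise
# added around each center.
group <- rep(1:4, each = 100)
coords <- centers[group, ] + matrix(rnorm(length(group) * 3), ncol = 3)

# Ordinary PCA on the simulated coordinates.
pca <- prcomp(coords)

# Proportion of variance explained by each PC, and by the first two combined.
prop.var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop.var, 2)
sum(prop.var[1:2])
```

We would expect each of the three PCs to explain roughly a third of the variance in this setup, so a plot of the first two PCs necessarily leaves out about a third of the differences between subpopulations.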
Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] scater_1.39.4 ggplot2_4.0.2 ## [3] scuttle_1.21.6 ensembldb_2.35.0 ## [5] AnnotationFilter_1.35.0 GenomicFeatures_1.63.2 ## [7] AnnotationDbi_1.73.1 scrapper_1.5.17 ## [9] scRNAseq_2.25.0 SingleCellExperiment_1.33.2 ## [11] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [13] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [15] IRanges_2.45.0 S4Vectors_0.49.1 ## [17] BiocGenerics_0.57.0 generics_0.1.4 ## [19] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [21] BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.5 ## [4] ggbeeswarm_0.7.3 gypsum_1.7.0 farver_2.1.2 ## [7] rmarkdown_2.31 BiocIO_1.21.0 vctrs_0.7.3 ## [10] memoise_2.0.1 Rsamtools_2.27.2 RCurl_1.98-1.18 ## [13] htmltools_0.5.9 S4Arrays_1.11.1 AnnotationHub_4.1.0 ## [16] curl_7.0.0 BiocNeighbors_2.5.4 Rhdf5lib_1.33.6 ## [19] SparseArray_1.11.13 rhdf5_2.55.16 sass_0.4.10 ## [22] alabaster.base_1.11.4 bslib_0.10.0 alabaster.sce_1.11.0 ## [25] httr2_1.2.2 cachem_1.1.0 GenomicAlignments_1.47.0 ## [28] lifecycle_1.0.5 pkgconfig_2.0.3 rsvd_1.0.5 ## [31] Matrix_1.7-5 R6_2.6.1 fastmap_1.2.0 ## [34] digest_0.6.39 irlba_2.3.7 ExperimentHub_3.1.0 ## [37] RSQLite_2.4.6 
beachmat_2.27.5 labeling_0.4.3 ## [40] filelock_1.0.3 httr_1.4.8 abind_1.4-8 ## [43] compiler_4.6.0 bit64_4.6.0-1 withr_3.0.2 ## [46] S7_0.2.1 BiocParallel_1.45.0 viridis_0.6.5 ## [49] DBI_1.3.0 HDF5Array_1.39.1 alabaster.ranges_1.11.0 ## [52] alabaster.schemas_1.11.0 rappdirs_0.3.4 DelayedArray_0.37.1 ## [55] rjson_0.2.23 tools_4.6.0 vipor_0.4.7 ## [58] otel_0.2.0 beeswarm_0.4.0 glue_1.8.0 ## [61] h5mread_1.3.3 restfulr_0.0.16 rhdf5filters_1.23.3 ## [64] grid_4.6.0 gtable_0.3.6 BiocSingular_1.27.1 ## [67] ScaledMatrix_1.19.0 XVector_0.51.0 ggrepel_0.9.8 ## [70] BiocVersion_3.23.1 pillar_1.11.1 dplyr_1.2.1 ## [73] BiocFileCache_3.1.0 lattice_0.22-9 rtracklayer_1.71.3 ## [76] bit_4.6.0 tidyselect_1.2.1 Biostrings_2.79.5 ## [79] knitr_1.51 gridExtra_2.3 bookdown_0.46 ## [82] ProtGenerics_1.43.0 xfun_0.57 UCSC.utils_1.7.1 ## [85] lazyeval_0.2.3 yaml_2.3.12 evaluate_1.0.5 ## [88] codetools_0.2-20 cigarillo_1.1.0 tibble_3.3.1 ## [91] alabaster.matrix_1.11.0 BiocManager_1.30.27 cli_3.6.6 ## [94] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.1.1 ## [97] GenomeInfoDb_1.47.2 dbplyr_2.5.2 png_0.1-9 ## [100] XML_3.99-0.23 parallel_4.6.0 blob_1.3.0 ## [103] bitops_1.0-9 viridisLite_0.4.3 alabaster.se_1.11.0 ## [106] scales_1.4.0 purrr_1.2.2 crayon_1.5.3 ## [109] rlang_1.2.0 cowplot_1.2.0 KEGGREST_1.51.1 References "],["visualization.html", "Chapter 5 Visualization 5.1 Motivation 5.2 \\(t\\)-stochastic neighbor embedding 5.3 More comments on interpretation 5.4 Other visualization methods Session information", " Chapter 5 Visualization 5.1 Motivation One of the major aims of scRNA-seq data analysis is to generate a pretty figure that visualizes the distribution of cells. This is not straightforward as our data contains too many dimensions, even after PCA (Chapter 4). We can’t just create FACS-style biaxial plots of each gene/PC against another as there would be too many plots to examine.
Instead, we use more aggressive dimensionality reduction methods that can represent our population structure in a two-dimensional embedding. The idea is to facilitate interpretation of the data by creating a visual “map” of its heterogeneity, where similar cells are placed next to each other in the embedding while dissimilar cells are further apart. 5.2 \\(t\\)-stochastic neighbor embedding Historically, scRNA-seq data analyses were synonymous with \\(t\\)-stochastic neighbor embedding (\\(t\\)-SNE) (Van der Maaten and Hinton 2008). This method attempts to find a low-dimensional representation of the data that preserves the relationships between each point and its neighbors in the high-dimensional space. Unlike PCA, \\(t\\)-SNE is not restricted to linear transformations, nor is it obliged to accurately represent distances between distant populations. This means that it has much more freedom in how it arranges cells in low-dimensional space, enabling it to separate many distinct clusters in a complex population. To demonstrate, let’s pull out the Zeisel et al. (2015) dataset again: library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() is.mito.zeisel &lt;- rowData(sce.zeisel)$featureType==&quot;mito&quot; # Performing some QC to set up the dataset prior to normalization. library(scrapper) sce.qc.zeisel &lt;- quickRnaQc.se(sce.zeisel, subsets=list(MT=is.mito.zeisel), altexp.proportions=&quot;ERCC&quot;) sce.qc.zeisel &lt;- sce.qc.zeisel[,sce.qc.zeisel$keep] # Computing log-normalized expression values. sce.norm.zeisel &lt;- normalizeRnaCounts.se(sce.qc.zeisel, size.factors=sce.qc.zeisel$sum) # Computing the variances and choosing top HVGs. sce.var.zeisel &lt;- chooseRnaHvgs.se( sce.norm.zeisel, more.var.args=list(use.min.width=TRUE), more.choose.args=list(top=2000) ) # Performing the PCA. 
sce.pcs.zeisel &lt;- runPca.se(sce.var.zeisel, features=rowData(sce.var.zeisel)$hvg, number=25) We compute the \\(t\\)-SNE from the PC scores to take advantage of the data compaction and noise removal of the PCA. Specifically, we calculate distances in the PC space to identify the nearest neighbors for each cell, which are then used for \\(t\\)-SNE layout optimization. This yields a 2-dimensional embedding in which neighboring cells have similar expression profiles. In Figure 5.1, we see that cells organize into distinct subpopulations corresponding to their cell types. library(scater) sce.tsne.zeisel &lt;- runTsne.se(sce.pcs.zeisel) plotReducedDim(sce.tsne.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) Figure 5.1: \\(t\\)-SNE plot constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, colored according to the authors’ published annotation. As with everything in scRNA-seq, \\(t\\)-SNE is sensitive to a variety of different parameter choices (discussed here in some depth). One obvious parameter is the random seed used to initialize the coordinates for each cell in the two-dimensional space. Changing the seed will yield a different embedding (Figure 5.2), though they usually have enough qualitative similarities that the interpretation of the plot is unaffected. sce.tsne.seed.zeisel &lt;- runTsne.se(sce.pcs.zeisel, more.tsne.args=list(seed=123456)) plotReducedDim(sce.tsne.seed.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) Figure 5.2: \\(t\\)-SNE plot constructed from the top PCs in the Zeisel brain dataset with a different seed. Each point represents a cell, colored according to the authors’ published annotation. The perplexity is another important parameter that determines the granularity of the visualization. Low perplexities will favor resolution of finer structure while higher values focus on the broad organization of subpopulations.
We can test different perplexity values to obtain different perspectives of the data (Figure 5.3), depending on whether we are interested in local or global structure. sce.tsne.p5.zeisel &lt;- runTsne.se(sce.pcs.zeisel, perplexity=5) sce.tsne.p50.zeisel &lt;- runTsne.se(sce.pcs.zeisel, perplexity=50) gridExtra::grid.arrange( plotReducedDim(sce.tsne.p5.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 5&quot;), plotReducedDim(sce.tsne.p50.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 50&quot;), ncol=2 ) Figure 5.3: \\(t\\)-SNE plots constructed from the top PCs in the Zeisel brain dataset, using a range of perplexity values. Each point represents a cell, coloured according to its annotation. We’d recommend interpreting these \\(t\\)-SNE plots with a grain of salt. \\(t\\)-SNE will inflate dense clusters and compress sparse ones, so we cannot use the relative size on the plot as a measure of subpopulation heterogeneity. The algorithm is not obliged to preserve the relative locations of non-neighboring clusters, so we cannot use their positions to determine relationships between distant clusters. Many liberties were taken with the data in order to squish it into a two-dimensional representation, so it’s worth being skeptical of the fidelity of that representation. That said, the \\(t\\)-SNE plots are pretty and historically popular so get used to seeing them. 5.2.1 Uniform manifold approximation and projection These days, \\(t\\)-SNE has largely been supplanted in the community’s consciousness by uniform manifold approximation and projection (UMAP) (McInnes and Healy 2018). UMAP is roughly similar to \\(t\\)-SNE in that it also tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space. However, the two methods are based on different theoretical principles that manifest as different visualizations.
Compared to \\(t\\)-SNE, UMAP tends to produce more compact visual clusters with more empty space between them. We demonstrate on the Zeisel dataset where the UMAP is calculated on nearest neighbors identified from the PCs (Figure 5.4). sce.umap.zeisel &lt;- runUmap.se(sce.pcs.zeisel) plotReducedDim(sce.umap.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) Figure 5.4: UMAP plot constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, coloured according to the published annotation. Like \\(t\\)-SNE, UMAP has its own suite of parameters that affect the visualization (see the documentation here). The number of neighbors is most analogous to \\(t\\)-SNE’s perplexity, where lower values focus on the local structure around each cell (Figure 5.5). sce.umap.n3.zeisel &lt;- runUmap.se(sce.pcs.zeisel, num.neighbors=3) sce.umap.n30.zeisel &lt;- runUmap.se(sce.pcs.zeisel, num.neighbors=30) gridExtra::grid.arrange( plotReducedDim(sce.umap.n3.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;neighbors = 3&quot;), plotReducedDim(sce.umap.n30.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;neighbors = 30&quot;), ncol=2 ) Figure 5.5: UMAP plots constructed from the top PCs in the Zeisel brain dataset, using a range of neighbors. Each point represents a cell, coloured according to its annotation. Another influential parameter is the minimum distance between points in the embedding. Larger values will generally inflate the visual clusters and reduce the amount of whitespace in the plot (Figure 5.6). Sometimes it’s worth fiddling around with some of these parameters to get a prettier plot. 
sce.umap.d01.zeisel &lt;- runUmap.se(sce.pcs.zeisel, min.dist=0.01) sce.umap.d5.zeisel &lt;- runUmap.se(sce.pcs.zeisel, min.dist=0.5) gridExtra::grid.arrange( plotReducedDim(sce.umap.d01.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;mindist = 0.01&quot;), plotReducedDim(sce.umap.d5.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;mindist = 0.5&quot;), ncol=2 ) Figure 5.6: UMAP plots constructed from the top PCs in the Zeisel brain dataset, using a range of minimum distances. Each point represents a cell, coloured according to its annotation. The choice between UMAP and \\(t\\)-SNE is mostly down to personal preference. It seems that most people find the UMAP to be nicer to look at, possibly because of the cleaner separation between clusters. UMAP also has an advantage in that its default initialization (derived from the nearest-neighbors graph) is better at capturing the global structure (Kobak and Linderman 2021); however, this is not applicable when distant subpopulations are completely disconnected in the graph. From a practical perspective, UMAP is often faster than \\(t\\)-SNE, which is an important consideration for large datasets. In any case, much of the same skepticism that we expressed for \\(t\\)-SNE is still applicable to UMAP, as a great deal of information is lost when flattening the data into two dimensions. If we can’t decide between a UMAP and a \\(t\\)-SNE, we can just compute them both via the runAllNeighborSteps.se() function. This runs both functions in parallel, along with the graph-based clustering described in Chapter 6. It also optimizes the nearest neighbor search by performing it once and re-using the results across multiple graph-related functions.
sce.nn.zeisel &lt;- runAllNeighborSteps.se(sce.pcs.zeisel) reducedDimNames(sce.nn.zeisel) ## [1] &quot;PCA&quot; &quot;TSNE&quot; &quot;UMAP&quot; 5.3 More comments on interpretation All of these visualizations necessarily distort the relationships between cells to fit high-dimensional data into a 2-dimensional space. It’s fair to question whether the results of such distortions can be trusted. As a general rule, focusing on local neighborhoods provides the safest interpretation of \\(t\\)-SNE and UMAP plots. These methods spend considerable effort to ensure that each cell’s nearest neighbors in the input high-dimensional space are still its neighbors in the output two-dimensional embedding. Thus, if we see multiple cell types or clusters in a single unbroken “island” in the embedding, we could infer that those populations were also close neighbors in higher-dimensional space. Less can be said about non-neighboring cells/clusters as there is no guarantee that large distances are faithfully recapitulated in the embedding. We can conclude that cells in distinct visual clusters are indeed different, but comparing distances between clusters is usually pointless. As a thought exercise, imagine a dataset with 4 cell types arranged in three-dimensional space as a regular tetrahedron. All cell types are equally distant from each other, but it is impossible to preserve this property in a two-dimensional embedding. This can lead to some incorrect conclusions about the relative (dis)similarity of the different cell types if we are not careful with our interpretation of the plot. Personally, we only use the \\(t\\)-SNE/UMAP coordinates for visualization. Other steps like clustering still use the higher-rank representation (i.e., the PCs) to leverage all of the information in the data without any of the compromises required to obtain a two-dimensional embedding.
In theory, we could use the \\(t\\)-SNE/UMAP coordinates directly for clustering to ensure that any results are directly consistent with the visualization. We don’t do this as we don’t want our analysis results to change whenever we tweak the parameters to beautify our visualizations. 5.4 Other visualization methods Here’s a non-exhaustive list of other visualization methods in R/Bioconductor packages: Interpolation-based \\(t\\)-SNE (Linderman et al. 2019) from the snifter package. Density-preserving \\(t\\)-SNE and UMAP (Narayan, Berger, and Cho 2021) from the densvis package. All of these packages will happily accept a matrix of PC scores and are plug-and-play replacements for runTsne.se() and runUmap.se(). Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] scater_1.39.4 ggplot2_4.0.2 ## [3] scuttle_1.21.6 scrapper_1.5.17 ## [5] scRNAseq_2.25.0 SingleCellExperiment_1.33.2 ## [7] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [9] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [11] IRanges_2.45.0 S4Vectors_0.49.1 ## [13] BiocGenerics_0.57.0 generics_0.1.4 ## [15] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [17] BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.5 ##
[4] ggbeeswarm_0.7.3 GenomicFeatures_1.63.2 gypsum_1.7.0 ## [7] farver_2.1.2 rmarkdown_2.31 BiocIO_1.21.0 ## [10] vctrs_0.7.3 memoise_2.0.1 Rsamtools_2.27.2 ## [13] RCurl_1.98-1.18 htmltools_0.5.9 S4Arrays_1.11.1 ## [16] AnnotationHub_4.1.0 curl_7.0.0 BiocNeighbors_2.5.4 ## [19] Rhdf5lib_1.33.6 SparseArray_1.11.13 rhdf5_2.55.16 ## [22] sass_0.4.10 alabaster.base_1.11.4 bslib_0.10.0 ## [25] alabaster.sce_1.11.0 httr2_1.2.2 cachem_1.1.0 ## [28] GenomicAlignments_1.47.0 lifecycle_1.0.5 pkgconfig_2.0.3 ## [31] rsvd_1.0.5 Matrix_1.7-5 R6_2.6.1 ## [34] fastmap_1.2.0 digest_0.6.39 AnnotationDbi_1.73.1 ## [37] irlba_2.3.7 ExperimentHub_3.1.0 RSQLite_2.4.6 ## [40] beachmat_2.27.5 labeling_0.4.3 filelock_1.0.3 ## [43] httr_1.4.8 abind_1.4-8 compiler_4.6.0 ## [46] bit64_4.6.0-1 withr_3.0.2 S7_0.2.1 ## [49] BiocParallel_1.45.0 viridis_0.6.5 DBI_1.3.0 ## [52] HDF5Array_1.39.1 alabaster.ranges_1.11.0 alabaster.schemas_1.11.0 ## [55] rappdirs_0.3.4 DelayedArray_0.37.1 rjson_0.2.23 ## [58] tools_4.6.0 vipor_0.4.7 otel_0.2.0 ## [61] beeswarm_0.4.0 glue_1.8.0 h5mread_1.3.3 ## [64] restfulr_0.0.16 rhdf5filters_1.23.3 grid_4.6.0 ## [67] gtable_0.3.6 ensembldb_2.35.0 BiocSingular_1.27.1 ## [70] ScaledMatrix_1.19.0 XVector_0.51.0 ggrepel_0.9.8 ## [73] BiocVersion_3.23.1 pillar_1.11.1 dplyr_1.2.1 ## [76] BiocFileCache_3.1.0 lattice_0.22-9 rtracklayer_1.71.3 ## [79] bit_4.6.0 tidyselect_1.2.1 Biostrings_2.79.5 ## [82] knitr_1.51 gridExtra_2.3 bookdown_0.46 ## [85] ProtGenerics_1.43.0 xfun_0.57 UCSC.utils_1.7.1 ## [88] lazyeval_0.2.3 yaml_2.3.12 evaluate_1.0.5 ## [91] codetools_0.2-20 cigarillo_1.1.0 tibble_3.3.1 ## [94] alabaster.matrix_1.11.0 BiocManager_1.30.27 cli_3.6.6 ## [97] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.1.1 ## [100] GenomeInfoDb_1.47.2 dbplyr_2.5.2 png_0.1-9 ## [103] XML_3.99-0.23 parallel_4.6.0 blob_1.3.0 ## [106] AnnotationFilter_1.35.0 bitops_1.0-9 viridisLite_0.4.3 ## [109] alabaster.se_1.11.0 scales_1.4.0 crayon_1.5.3 ## [112] rlang_1.2.0 cowplot_1.2.0 
KEGGREST_1.51.1 References "],["clustering.html", "Chapter 6 Clustering 6.1 Motivation 6.2 Graph-based clustering 6.3 \\(k\\)-means clustering 6.4 Choosing the clustering parameters 6.5 Clustering diagnostics 6.6 Subclustering Session information", " Chapter 6 Clustering 6.1 Motivation Clustering is an unsupervised learning technique that partitions a dataset into groups (clusters) based on the similarities between observations. In the context of scRNA-seq, cells in the same cluster will have similar expression profiles while cells in different clusters will be less similar. By assigning cells into clusters, we summarize our complex scRNA-seq data into discrete categories for easier human interpretation. The idea is to attribute some biological meaning to each cluster, typically based on its upregulated marker genes (Chapter 7). We can then treat the clusters as proxies for actual cell types/states in the rest of the analysis, which is more intuitive than describing population heterogeneity as some high-dimensional distribution. 6.2 Graph-based clustering Popularized by its use in Seurat, graph-based clustering is a flexible and scalable technique for clustering large scRNA-seq datasets. We build a graph where each node is a cell that is connected to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related. Clusters are then identified as “communities” of nodes that are more strongly interconnected in the graph, i.e., edges are concentrated between cells in the same cluster. To demonstrate, let’s use the PBMC dataset from 10X Genomics (Zheng et al. 2017): # Loading in raw data from the 10X output files. 
library(DropletTestFiles) raw.path.10x &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/filtered.tar.gz&quot;) dir.path.10x &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path.10x, exdir=dir.path.10x) library(DropletUtils) fname.10x &lt;- file.path(dir.path.10x, &quot;filtered_gene_bc_matrices/GRCh38&quot;) sce.10x &lt;- read10xCounts(fname.10x, col.names=TRUE) # Applying our default QC with outlier-based thresholds. library(scrapper) is.mito.10x &lt;- grepl(&quot;^MT-&quot;, rowData(sce.10x)$Symbol) sce.qc.10x &lt;- quickRnaQc.se(sce.10x, subsets=list(MT=is.mito.10x)) sce.qc.10x &lt;- sce.qc.10x[,sce.qc.10x$keep] # Computing log-normalized expression values. sce.norm.10x &lt;- normalizeRnaCounts.se(sce.qc.10x, size.factors=sce.qc.10x$sum) # We now choose the top HVGs. sce.var.10x &lt;- chooseRnaHvgs.se(sce.norm.10x) # Running the PCA on the HVG submatrix. sce.pca.10x &lt;- runPca.se(sce.var.10x, features=rowData(sce.var.10x)$hvg) # Running a t-SNE for visualization purposes. sce.tsne.10x &lt;- runTsne.se(sce.pca.10x) We build a “shared nearest neighbor” (SNN) graph where the cells are the nodes. Each cell’s set of nearest neighbors is identified based on distances in the low-dimensional PC space, taking advantage of the compaction and denoising of the PCA (Chapter 4). Two cells are connected by an edge if they share any of their nearest neighbors, where the weight of the edge is defined from the number/rank of the shared neighbors (Xu and Su 2015). We then apply a community detection algorithm on the SNN graph - in this case, the “multi-level” algorithm, also known as Louvain clustering. Each node in the graph becomes a member of a community, giving us a cluster assignment for each cell (Figure 6.1). 
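To make the graph construction a bit more concrete, here is a minimal sketch in base R on simulated data. This is not what clusterGraph.se() does internally: we assume a brute-force neighbor search and simply weight each edge by the number of shared neighbors, and all object names (pcs, nn, membership, snn) are made up for illustration.

```r
# Toy data: two well-separated groups of 10 'cells' in a 2-dimensional 'PC space'.
set.seed(42)
pcs <- rbind(
    matrix(rnorm(20, mean=0, sd=0.5), ncol=2),
    matrix(rnorm(20, mean=10, sd=0.5), ncol=2)
)

# Brute-force search for the k nearest neighbors of each cell.
k <- 5
d <- as.matrix(dist(pcs))
diag(d) <- Inf # a cell is not its own neighbor.
nn <- t(apply(d, 1, function(x) order(x)[seq_len(k)]))

# membership[i, j] is 1 if cell j is in the neighbor set of cell i.
# Each cell is also included in its own set.
n <- nrow(pcs)
membership <- matrix(0, n, n)
membership[cbind(rep(seq_len(n), k), as.vector(nn))] <- 1
diag(membership) <- 1

# Assumed weighting: the edge weight is the number of shared neighbors.
# A weight of zero means that there is no edge between the two cells.
snn <- tcrossprod(membership)
diag(snn) <- 0

# Edges should be concentrated within each group of cells.
within <- sum(snn[1:10, 1:10]) + sum(snn[11:20, 11:20])
between <- sum(snn[1:10, 11:20])
c(within=within, between=between)
```

In this toy example, all of the edge weight falls within the two groups, which is the kind of structure that a community detection algorithm would pick up as two clusters.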
sce.louvain.10x &lt;- clusterGraph.se(sce.tsne.10x, method=&quot;multilevel&quot;) table(sce.louvain.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 ## 797 570 1007 127 382 503 224 36 183 118 200 library(scater) plotReducedDim(sce.louvain.10x, &quot;TSNE&quot;, colour_by=&quot;clusters&quot;) Figure 6.1: \\(t\\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from graph-based clustering. If we’re not satisfied with this clustering, we can fiddle with a large variety of parameters until we get what we want. (Also see discussion in Section 6.4.) These include: The number of neighbors used in SNN graph construction (num.neighbors=). More neighbors increase the connectivity of the graph, resulting in broader clusters. The edge weighting scheme used in SNN graph construction. For example, we could mimic Seurat’s behavior by using the Jaccard index to weight the edges. The resolution used by the community detection algorithm. Higher values will favor the creation of smaller, finer clusters. The community detection algorithm itself. For example, we could switch to the Leiden algorithm, which typically results in finer clusters. 
sce.louvain20.10x &lt;- clusterGraph.se(sce.tsne.10x, num.neighbors=20, method=&quot;multilevel&quot;) table(sce.louvain20.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 ## 808 508 1013 124 606 547 36 185 110 210 sce.jaccard.10x &lt;- clusterGraph.se(sce.tsne.10x, more.build.args=list(weight.scheme=&quot;jaccard&quot;)) table(sce.jaccard.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 742 537 1010 126 383 527 232 36 183 126 198 47 sce.lowres.10x &lt;- clusterGraph.se(sce.tsne.10x, method=&quot;multilevel&quot;, resolution=0.1) table(sce.lowres.10x$clusters) ## ## 1 2 3 4 5 6 ## 1042 534 1746 606 36 183 sce.leiden.10x &lt;- clusterGraph.se(sce.tsne.10x, method=&quot;leiden&quot;, more.cluster.args=list(leiden.objective=&quot;cpm&quot;)) table(sce.leiden.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 130 126 149 46 57 74 16 171 151 151 102 72 115 125 91 88 36 152 161 140 ## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ## 23 45 85 123 129 190 71 56 80 81 95 8 3 111 116 133 34 47 75 9 ## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ## 45 104 4 48 85 13 1 5 71 29 35 13 4 3 15 1 1 1 1 1 Graph-based clustering has several appealing features that contribute to its popularity. It only requires a nearest neighbor search and is relatively efficient compared to, say, hierarchical clustering methods that need a full distance matrix. Each cell is always connected to some neighbors in the graph, reducing the risk of generating many uninformative clusters consisting of one or two outlier cells. Community detection does not need a priori specification of the number of clusters, making it more robust for use across multiple datasets with different numbers of cell subpopulations. 
(Note that the number of clusters is still dependent on an arbitrary resolution parameter, so this shouldn’t be treated as an objective truth; but at least we avoid egregious cases of over- or underclustering that we might encounter with other methods like \\(k\\)-means.) One drawback of graph-based methods is that, after graph construction, no information is retained about relationships beyond the neighboring cells. This has some practical consequences in datasets that exhibit differences in cell density. More steps through the graph are required to traverse a region of higher cell density. During community detection, this effect “inflates” the high-density regions such that any internal substructure is more likely to cause formation of subclusters. Thus, the resolution of the clustering becomes dependent on the density of cells, which can occasionally be misleading if it overstates the heterogeneity in the data. On a practical note, the runAllNeighborSteps.se() function performs graph-based clustering alongside the \\(t\\)-SNE and UMAP. This is more efficient than calling each function separately, though results may be slightly different due to how the neighbor search results are shared across steps. We can sacrifice some speed for exact equality to the clusterGraph.se() results by setting collapse.search=FALSE. sce.nn.10x &lt;- runAllNeighborSteps.se(sce.pca.10x) table(sce.nn.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 ## 787 487 1022 135 383 563 223 36 182 129 200 reducedDimNames(sce.nn.10x) ## [1] &quot;PCA&quot; &quot;TSNE&quot; &quot;UMAP&quot; 6.3 \\(k\\)-means clustering \\(k\\)-means clustering is a classic technique for partitioning cells into a pre-specified number of clusters. Briefly, \\(k\\) cluster centroids are selected during initialization, each cell is assigned to its closest centroid, each centroid is then updated to the mean of its assigned cells, and this is repeated until convergence. 
This is simple, fast, and gives us exactly the desired number of clusters (Figure 6.2). Again, we use the per-cell PC scores for efficiency and denoising. sce.kmeans.10x &lt;- clusterKmeans.se(sce.tsne.10x, k=10) table(sce.kmeans.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 ## 484 177 226 1059 247 267 517 446 345 379 plotReducedDim(sce.kmeans.10x, &quot;TSNE&quot;, colour_by=&quot;clusters&quot;) Figure 6.2: \\(t\\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \\(k\\)-means clustering. If we’re not satisfied with the results, we can just tinker with the parameters. Most obviously, we could just increase \\(k\\) to obtain a greater number of smaller clusters. We could also alter the initialization and refinement strategies, though the effects of doing so are less clear. (By default, our initialization uses variance partitioning (Su and Dy 2007), which avoids the randomness of other approaches.) sce.kmeans20.10x &lt;- clusterKmeans.se(sce.tsne.10x, k=20) table(sce.kmeans20.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 432 172 154 402 117 206 217 277 63 266 157 122 308 286 36 134 185 170 64 379 sce.kpp.10x &lt;- clusterKmeans.se(sce.tsne.10x, k=10, more.kmeans.args=list(init.method=&quot;kmeans++&quot;)) table(sce.kpp.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 ## 444 225 267 335 704 1059 177 518 36 382 sce.lloyd.10x &lt;- clusterKmeans.se(sce.tsne.10x, k=10, more.kmeans.args=list(refine.method=&quot;lloyd&quot;)) table(sce.lloyd.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 ## 483 177 226 1055 245 283 510 442 346 380 The major drawback of \\(k\\)-means clustering is that we need to specify \\(k\\) in advance. It is difficult to select a default value that works well for a variety of datasets. 
If \\(k\\) is larger than the number of distinct subpopulations, we will overcluster, i.e., split subpopulations into smaller clusters; but if \\(k\\) is smaller than the number of subpopulations, we will undercluster, i.e., group multiple subpopulations into a single cluster. We might consider some methods to automatically determine a “suitable” value for \\(k\\), e.g., by maximizing the gap statistic (Tibshirani, Walther, and Hastie 2001). This can be computationally intensive as it involves repeated clusterings at a variety of possible \\(k\\). # Gap statistic involves the random generation of a simulated dataset, # so we need to set the seed to get a reproducible result. set.seed(999) library(cluster) gap.10x &lt;- clusGap( reducedDim(sce.tsne.10x, &quot;PCA&quot;), FUNcluster=function(x, k) { # clusterKmeans() is the low-level function used by clusterKmeans.se(). # We transpose the input as clusterKmeans() expects cells in the columns. list(cluster=as.integer(clusterKmeans(t(x), k)$clusters)) }, K.max=50, B=5, verbose=FALSE ) # Choosing the number of clusters that maximizes the gap statistic. maxSE(f = gap.10x$Tab[,&quot;gap&quot;], SE.f = gap.10x$Tab[,&quot;SE.sim&quot;]) ## [1] 26 In practice, we mostly use \\(k\\)-means clustering for vector quantization. Instead of attempting to interpret the clusters, we treat each centroid as a “pseudo-cell” that represents all of its assigned cells. These representatives are used as the input data of computationally intensive procedures, which is more efficient than operating on the per-cell data. We usually set \\(k\\) to a large value such as the square root of the number of cells. This yields a set of fine-grained clusters that approximates the underlying distribution of cells in downstream steps, e.g., hierarchical clustering (Figure 6.3). A similar approach is used in SingleR to compact large references prior to cell type annotation. 
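The vector quantization idea can be sketched with base R alone. The example below is only an illustration under some assumptions: the data are simulated, we use stats::kmeans() instead of scrapper's implementation, and we cut the dendrogram into a fixed number of groups rather than using a dynamic tree cut. All object names are hypothetical.

```r
set.seed(100)

# Simulated 'PC scores' for 400 cells from three populations (cells in the rows).
centers <- rbind(c(0, 0), c(8, 0), c(0, 8))
truth <- rep(1:3, c(200, 120, 80))
pcs <- centers[truth, ] + matrix(rnorm(800, sd=0.5), ncol=2)

# Vector quantization with k set to the square root of the number of cells.
k <- round(sqrt(nrow(pcs)))
vq <- kmeans(pcs, centers=k, nstart=5, iter.max=50)

# The expensive step only sees the k centroids instead of every cell,
# so the distance matrix is k-by-k rather than 400-by-400.
hc <- hclust(dist(vq$centers), method='ward.D2')
centroid.groups <- cutree(hc, k=3)

# Extrapolating the centroid-level groups back to the cells.
cell.groups <- centroid.groups[vq$cluster]
table(cell.groups, truth)
```

The trade-off is that cells assigned to the same centroid can never be separated afterwards, so k should be large enough that each centroid only represents a small and homogeneous set of cells.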
sce.vq.10x &lt;- clusterKmeans.se(sce.tsne.10x, k=sqrt(ncol(sce.tsne.10x)), meta.name=&quot;kmeans&quot;) table(sce.vq.10x$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 69 71 68 85 83 43 65 51 38 54 63 56 81 143 11 56 64 51 79 112 ## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ## 15 29 15 74 11 93 10 39 176 44 53 2 86 65 128 118 30 43 7 104 ## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ## 19 69 41 45 129 66 31 107 16 50 142 116 89 1 57 58 117 119 59 89 ## 61 62 63 64 ## 2 66 72 102 vq.centers.10x &lt;- metadata(sce.vq.10x)$kmeans$centers dim(vq.centers.10x) ## [1] 25 64 # Using centroids for something expensive, e.g., hierarchical clustering. This # involves creating a distance matrix that would be too large if we did it for # each pair of cells; so instead we do it between pairs of k-means centroids. dist.vq.10x &lt;- dist(t(vq.centers.10x)) hclust.vq.10x &lt;- hclust(dist.vq.10x, method=&quot;ward.D2&quot;) plot(hclust.vq.10x, xlab=&quot;&quot;, sub=&quot;&quot;) Figure 6.3: Dendrogram of the \\(k\\)-means cluster centroids from the PBMC dataset. Each leaf represents a centroid from \\(k\\)-means clustering. # Cutting the dendrogram at a dynamic height to cluster our centroids. library(dynamicTreeCut) cutree.vq.10x &lt;- cutreeDynamic( hclust.vq.10x, distM=as.matrix(dist.vq.10x), minClusterSize=1, verbose=0 ) table(cutree.vq.10x) ## cutree.vq.10x ## 1 2 3 4 5 6 7 8 ## 14 11 11 9 9 4 3 3 # Now extrapolating to all cells assigned to each k-means cluster. hclust.full.10x &lt;- cutree.vq.10x[sce.vq.10x$clusters] table(hclust.full.10x) ## hclust.full.10x ## 1 2 3 4 5 6 7 8 ## 1030 1180 607 606 518 12 36 158 6.4 Choosing the clustering parameters What is the “right” number of clusters? Which clustering algorithm is “correct”? These thoughts have haunted us ever since we did our first scRNA-seq analysis. 
But with a decade of experience under our belt, our advice is to not worry too much about an “optimal” clustering. Just proceed with the rest of the analysis and attempt to assign biological meaning to each cluster (Chapter 7). If the clusters represent our cell types/states of interest, great; if not, we can always come back here and fiddle with the parameters. It is helpful to realize that clustering, like a microscope, is simply a tool to explore the data. We can zoom in and out by changing the resolution-related clustering parameters, and we can experiment with different clustering algorithms to obtain alternative perspectives. Perhaps we just want to resolve the major cell types, in which case a lower resolution would be appropriate; or maybe we want to distinguish finer subtypes or cell states (e.g., metabolic activity, stress), which would require higher resolution. The best clustering really depends on the scientific aims, which are difficult to translate into an a priori choice of parameters or algorithms. So, we can just try again if we don’t get what we want on the first pass. For what it’s worth, there exist many more clustering algorithms that we have not discussed here. Off the top of our head, we could suggest hierarchical clustering, a classic technique that builds a dendrogram to summarize relationships between clusters; density-based clustering, which adapts to unusual cluster shapes and can ignore outlier points; and affinity propagation, which identifies exemplars based on similarities between points. All of these methods have been applied successfully to scRNA-seq data and might be worth considering if graph-based clustering isn’t satisfactory. For larger datasets, any scalability issues for these methods can be overcome by clustering on \\(k\\)-means centroids instead (Section 6.3). 
6.5 Clustering diagnostics If we really need some “objective” metric of cluster quality, we can evaluate the stability of each cluster using bootstrap replicates. Ideally, the clustering should be stable to perturbations to the input data (Von Luxburg 2010), which increases the likelihood that the clusters can be reproduced in an independent study. To quantify stability, we create a “bootstrap replicate” dataset by sampling cells with replacement from the original dataset. The same clustering procedure is applied to this replicate to determine if the clusters from the original dataset can be reproduced. In Figure 6.4, a diagonal entry near 1 indicates that the corresponding cluster is not split apart in the bootstrap replicates, while an off-diagonal entry near 1 indicates that the corresponding pair of clusters are always separated. Unstable clusters or unstable separation between pairs of clusters warrant some caution during interpretation. # Bootstrapping involves random sampling so we need to set # the seed to get a reproducible result. set.seed(888) library(bluster) bootstrap.10x &lt;- bootstrapStability( reducedDim(sce.louvain.10x, &quot;PCA&quot;), FUN=function(x) { # i.e., our clustering procedure. These are the low-level functions # that are called by clusterGraph.se(). Note that we transpose the # input as buildSnnGraph expects the cells to be in the columns. g &lt;- buildSnnGraph(t(x)) clusterGraph(g, method=&quot;multilevel&quot;)$membership }, clusters=sce.louvain.10x$clusters, adjusted=FALSE ) library(pheatmap) pheatmap( bootstrap.10x, cluster_row=FALSE, cluster_col=FALSE, color=viridis::magma(100), breaks=seq(0, 1, length.out=101) ) Figure 6.4: Heatmap of probabilities of co-clustering from bootstrapping of graph-based clustering in the PBMC dataset. 
Each row and column represents an original cluster and each entry is colored according to the probability that two cells from their respective row/column clusters are clustered together (diagonal) or separated (off-diagonal) in the bootstrap replicates. If even more clustering diagnostics are required, we can choose from a variety of measures of cluster “quality” in the bluster package: The silhouette width, as implemented in the approxSilhouette() function. For each cell, we compute the average distance to all cells in the same cluster. We also find the minimum of the average distances to all cells in any other cluster. The silhouette width for each cell is defined as the difference between these two values (other-cluster minimum minus same-cluster average) divided by their maximum. Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in the nearest other cluster. Thus, clusters with large positive silhouette widths are well-separated from other clusters. The clustering purity, as implemented in the clusterPurity() function. The purity is defined for each cell as the proportion of neighboring cells that are assigned to the same cluster, after some weighting to adjust for differences in the number of cells between clusters. This quantifies the degree to which cells from multiple clusters intermingle in expression space. Well-separated clusters should exhibit little intermingling and thus high purity values for all member cells. The root mean-squared deviation (RMSD), as implemented in the clusterRMSD() function. This is the root of the mean of the squared differences from the cluster centroid across all cells in the cluster. It is closely related to the within-cluster sum of squares (WCSS) and is a natural diagnostic for \\(k\\)-means clustering. A large RMSD suggests that a cluster has some internal structure and should be prioritized for further subclustering. 
The modularity scores of the communities in the graph, as implemented in the pairwiseModularity() function. For each community, this is defined as the difference between the observed and expected number of edges between cells in that community. The expected number of edges is computed from a null model where edges are randomly distributed among cells. Communities with high modularity scores are mostly disconnected from other communities in the graph. In general, we find these diagnostics to be more helpful for understanding the properties of each cluster than for identifying “good” or “bad” clusters. For example, a low average silhouette width indicates that the cluster is weakly separated from its nearest neighboring clusters. This is not necessarily a bad thing if we’re looking at subtypes or states that exhibit relatively subtle changes in expression. One might be tempted to objectively define a “best” clustering by adjusting the clustering parameters to optimize one of these metrics, e.g., maximum silhouette width. While there’s nothing wrong with this approach, it may not yield clusters that correspond to our cell types/states of interest. Anecdotally, we have observed that these optimal clusterings only separate broad cell types as any attempt to define weakly-separated clusters will be penalized. 6.6 Subclustering On occasion, we may want to investigate internal structure within a particular cluster, e.g., to find fine-grained cell subtypes. We could just increase the resolution of our clustering algorithm but (i) this is not guaranteed to split our cluster of interest and (ii) it could alter the distribution of cells in other clusters that we did not want to change. In such cases, a simple alternative is to repeat the feature selection and clustering within the cluster of interest. This selects HVGs and PCs that are more relevant to the cluster’s internal variation, improving resolution by avoiding noise from unnecessary features. 
The absence of distinct subpopulations also encourages clustering methods to separate cells according to more modest intra-cluster heterogeneity. Let’s demonstrate on cluster 2 of our PBMC dataset: chosen.cluster &lt;- &quot;2&quot; sce.sub.10x &lt;- sce.louvain.10x[,sce.louvain.10x$clusters == chosen.cluster] dim(sce.sub.10x) ## [1] 33694 570 sce.subvar.10x &lt;- chooseRnaHvgs.se(sce.sub.10x) sce.subpca.10x &lt;- runPca.se(sce.subvar.10x, features=rowData(sce.subvar.10x)$hvg) sce.subtsne.10x &lt;- runTsne.se(sce.subpca.10x) We perform a new round of clustering on all of the cells in this subset of the data (Figure 6.5). This effectively increases our resolution of cluster 2 by breaking it into further subclusters. Importantly, we can increase resolution without changing the parameters of the parent clustering, which is convenient if we’re already satisfied with those clusters. # We don&#39;t necessarily have to use the same parameters that we used to cluster # the full dataset, but there&#39;s no reason to change either, so whatever. sce.subgraph.10x &lt;- clusterGraph.se(sce.subtsne.10x) table(sce.subgraph.10x$clusters) ## ## 1 2 3 4 5 ## 155 145 34 146 90 plotReducedDim(sce.subgraph.10x, &quot;TSNE&quot;, colour_by=&quot;clusters&quot;) Figure 6.5: \\(t\\)-SNE plot of cells in cluster 2 of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned subcluster from graph-based clustering. Subclustering can simplify the interpretation of the subclusters, as these only need to be considered in the context of the parent cluster’s biological identity. For example, if we knew that the parent cluster contained T cells, we could treat all of the subclusters as T cell subtypes. However, this requires some care if there is any uncertainty in the identification for the parent cluster. 
If the underlying cell types/states span cluster boundaries, conditioning on the putative identity of the parent cluster may be premature, e.g., a subcluster actually represents contamination from a cell type in a neighboring parent cluster. Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] pheatmap_1.0.13 bluster_1.21.1 ## [3] dynamicTreeCut_1.63-1 cluster_2.1.8.2 ## [5] scater_1.39.4 ggplot2_4.0.2 ## [7] scuttle_1.21.6 scrapper_1.5.17 ## [9] DropletUtils_1.31.1 SingleCellExperiment_1.33.2 ## [11] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [13] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [15] IRanges_2.45.0 S4Vectors_0.49.1 ## [17] BiocGenerics_0.57.0 generics_0.1.4 ## [19] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [21] DropletTestFiles_1.21.0 BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] DBI_1.3.0 gridExtra_2.3 ## [3] httr2_1.2.2 rlang_1.2.0 ## [5] magrittr_2.0.5 otel_0.2.0 ## [7] compiler_4.6.0 RSQLite_2.4.6 ## [9] DelayedMatrixStats_1.33.0 png_0.1-9 ## [11] vctrs_0.7.3 pkgconfig_2.0.3 ## [13] crayon_1.5.3 fastmap_1.2.0 ## [15] dbplyr_2.5.2 XVector_0.51.0 ## [17] labeling_0.4.3 rmarkdown_2.31 ## [19] ggbeeswarm_0.7.3 purrr_1.2.2 ## [21] bit_4.6.0 xfun_0.57 ## [23] cachem_1.1.0 beachmat_2.27.5 ## 
[25] jsonlite_2.0.0 blob_1.3.0 ## [27] rhdf5filters_1.23.3 DelayedArray_0.37.1 ## [29] Rhdf5lib_1.33.6 BiocParallel_1.45.0 ## [31] irlba_2.3.7 parallel_4.6.0 ## [33] R6_2.6.1 bslib_0.10.0 ## [35] RColorBrewer_1.1-3 limma_3.67.1 ## [37] jquerylib_0.1.4 Rcpp_1.1.1 ## [39] bookdown_0.46 knitr_1.51 ## [41] R.utils_2.13.0 igraph_2.2.3 ## [43] Matrix_1.7-5 tidyselect_1.2.1 ## [45] viridis_0.6.5 dichromat_2.0-0.1 ## [47] abind_1.4-8 yaml_2.3.12 ## [49] codetools_0.2-20 curl_7.0.0 ## [51] lattice_0.22-9 tibble_3.3.1 ## [53] S7_0.2.1 withr_3.0.2 ## [55] KEGGREST_1.51.1 evaluate_1.0.5 ## [57] BiocFileCache_3.1.0 ExperimentHub_3.1.0 ## [59] Biostrings_2.79.5 pillar_1.11.1 ## [61] BiocManager_1.30.27 filelock_1.0.3 ## [63] BiocVersion_3.23.1 sparseMatrixStats_1.23.0 ## [65] scales_1.4.0 glue_1.8.0 ## [67] tools_4.6.0 AnnotationHub_4.1.0 ## [69] BiocNeighbors_2.5.4 ScaledMatrix_1.19.0 ## [71] locfit_1.5-9.12 cowplot_1.2.0 ## [73] rhdf5_2.55.16 grid_4.6.0 ## [75] AnnotationDbi_1.73.1 edgeR_4.9.7 ## [77] beeswarm_0.4.0 BiocSingular_1.27.1 ## [79] HDF5Array_1.39.1 vipor_0.4.7 ## [81] rsvd_1.0.5 cli_3.6.6 ## [83] rappdirs_0.3.4 viridisLite_0.4.3 ## [85] S4Arrays_1.11.1 dplyr_1.2.1 ## [87] gtable_0.3.6 R.methodsS3_1.8.2 ## [89] sass_0.4.10 digest_0.6.39 ## [91] ggrepel_0.9.8 SparseArray_1.11.13 ## [93] dqrng_0.4.1 farver_2.1.2 ## [95] memoise_2.0.1 htmltools_0.5.9 ## [97] R.oo_1.27.1 lifecycle_1.0.5 ## [99] h5mread_1.3.3 httr_1.4.8 ## [101] statmod_1.5.1 bit64_4.6.0-1 References "],["marker-detection.html", "Chapter 7 Marker gene detection 7.1 Motivation 7.2 Scoring marker genes 7.3 Visualizing marker genes 7.4 Using a log-fold change threshold 7.5 Blocking on uninteresting factors 7.6 More uses for the marker scores 7.7 Invalidity of \\(p\\)-values 7.8 Gene set enrichment Session information", " Chapter 7 Marker gene detection 7.1 Motivation Now that we’ve got a clustering from Chapter 6, our next step is to identify the genes that drive separation between clusters. 
Genes that are strongly upregulated in a particular cluster are called “markers” as they define the corresponding cell type/state relative to other cells in the population. By examining the annotated functions of the marker genes, we can assign biological meaning to each cluster. In the simplest case, if we know that certain genes are upregulated in a particular cell type, a cluster with increased expression of those genes can be treated as a proxy for that cell type. More subtle cell states (e.g., activation status, stress) can also be identified based on the behavior of genes in the affected pathways. 7.2 Scoring marker genes 7.2.1 Comparing pairs of clusters Our general strategy is to test for differential expression (DE) between clusters and examine the top DE genes from each comparison. Specifically, we quantify the DE between each pair of clusters by computing an effect size for each gene (Section 7.2.2). We then summarize the effect sizes across comparisons for each cluster into a single statistic per gene (Section 7.2.3). Sorting on one of the effect size summaries yields a ranking of potential marker genes for each cluster. To illustrate, let’s load our old friend, the PBMC dataset from 10X Genomics (Zheng et al. 2017). # Loading in raw data from the 10X output files. library(DropletTestFiles) raw.path.10x &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/filtered.tar.gz&quot;) dir.path.10x &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path.10x, exdir=dir.path.10x) library(DropletUtils) fname.10x &lt;- file.path(dir.path.10x, &quot;filtered_gene_bc_matrices/GRCh38&quot;) sce.10x &lt;- read10xCounts(fname.10x, col.names=TRUE) # Applying our default QC with outlier-based thresholds. library(scrapper) is.mito.10x &lt;- grepl(&quot;^MT-&quot;, rowData(sce.10x)$Symbol) sce.qc.10x &lt;- quickRnaQc.se(sce.10x, subsets=list(MT=is.mito.10x)) sce.qc.10x &lt;- sce.qc.10x[,sce.qc.10x$keep] # Computing log-normalized expression values. 
sce.norm.10x &lt;- normalizeRnaCounts.se(sce.qc.10x, size.factors=sce.qc.10x$sum) # We now choose the top HVGs. sce.var.10x &lt;- chooseRnaHvgs.se(sce.norm.10x, more.choose.args=list(top=4000)) # Running the PCA on the HVG submatrix. sce.pca.10x &lt;- runPca.se(sce.var.10x, features=rowData(sce.var.10x)$hvg, number=25) # Doing some graph-based clustering, t-SNEs, etc. sce.nn.10x &lt;- runAllNeighborSteps.se(sce.pca.10x) sce.nn.10x ## class: SingleCellExperiment ## dim: 33694 4147 ## metadata(3): Samples qc PCA ## assays(2): counts logcounts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(7): ID Symbol ... residuals hvg ## colnames(4147): AAACCTGAGACAGACC-1 AAACCTGAGCGCCTCA-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(8): Sample Barcode ... sizeFactor clusters ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): Given the clustering and the log-expression values, the scoreMarkers.se() function returns a data frame of marker statistics for each cluster. Each data frame contains the mean log-expression and the proportion of cells with detected (i.e., non-zero) expression in a particular cluster. It also contains multiple columns representing effect size summaries, where each column is named as &lt;effect size&gt;.&lt;summary type&gt;, e.g., cohens.d.mean contains the mean of the Cohen’s \\(d\\) across all comparisons involving that cluster. Genes with larger cohens.d.mean values exhibit stronger upregulation in the current cluster compared to the average of the other clusters. By default, scoreMarkers.se() orders the rows on cohens.d.mean, which represents one possible ranking of markers for each cluster. 
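To make the &lt;effect size&gt;.&lt;summary type&gt; naming more concrete, here is a minimal sketch in base R. The Cohen's d values and gene names are made up for illustration and are not scrapper output; we only assume that each summary is computed across the pairwise comparisons involving our cluster of interest, as described above.

```r
# Hypothetical Cohen's d values for three genes (rows) in our cluster of interest,
# with one column for each pairwise comparison against another cluster.
pairwise.d <- rbind(
    geneA=c(2.1, 1.8, 2.5, 1.9),  # upregulated against every other cluster.
    geneB=c(4.0, 3.5, -0.2, 3.8), # upregulated against most other clusters.
    geneC=c(0.1, -0.3, 0.2, 0.0)  # no consistent difference.
)
colnames(pairwise.d) <- c('vs.2', 'vs.3', 'vs.4', 'vs.5')

# One summary statistic per gene, named in the same style as the real columns.
summaries <- data.frame(
    cohens.d.min=apply(pairwise.d, 1, min),
    cohens.d.mean=rowMeans(pairwise.d),
    cohens.d.median=apply(pairwise.d, 1, median),
    cohens.d.max=apply(pairwise.d, 1, max)
)

# Ranking genes by the mean, i.e., the default ordering described above.
summaries[order(summaries$cohens.d.mean, decreasing=TRUE), ]
```

In this toy example, geneB tops the ranking on the mean even though it is not upregulated against one of the other clusters, while geneA would come first if we ranked on the minimum instead.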
markers.10x &lt;- scoreMarkers.se(sce.nn.10x, sce.nn.10x$clusters, extra.columns=&quot;Symbol&quot;) names(markers.10x) ## [1] &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; &quot;5&quot; &quot;6&quot; &quot;7&quot; &quot;8&quot; &quot;9&quot; &quot;10&quot; &quot;11&quot; # Examining the statistics for cluster 1. chosen.cluster &lt;- &quot;1&quot; chosen.markers.10x &lt;- markers.10x[[chosen.cluster]] head(chosen.markers.10x) ## DataFrame with 6 rows and 23 columns ## Symbol mean detected cohens.d.min cohens.d.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 0.8145154 6.63007 ## ENSG00000087086 FTL 6.24421 1.000000 -0.3996035 4.98157 ## ENSG00000011600 TYROBP 4.05728 0.997459 0.4176424 4.65609 ## ENSG00000163220 S100A9 5.63461 0.996188 2.1347048 4.56463 ## ENSG00000143546 S100A8 5.37071 1.000000 2.4422557 4.30814 ## ENSG00000163131 CTSS 3.81179 0.997459 -0.0492655 3.88013 ## cohens.d.median cohens.d.max cohens.d.min.rank auc.min ## &lt;numeric&gt; &lt;numeric&gt; &lt;integer&gt; &lt;numeric&gt; ## ENSG00000090382 7.52540 8.84770 1 0.760525 ## ENSG00000087086 5.59914 7.23427 1 0.406228 ## ENSG00000011600 6.13868 7.15137 2 0.639057 ## ENSG00000163220 4.98685 5.33239 2 0.927441 ## ENSG00000143546 4.61885 4.73656 1 0.944957 ## ENSG00000163131 4.11857 5.45536 2 0.494780 ## auc.mean auc.median auc.max auc.min.rank delta.mean.min ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;integer&gt; &lt;numeric&gt; ## ENSG00000090382 0.973853 0.998880 0.998974 1 0.8230802 ## ENSG00000087086 0.940033 0.999995 1.000000 1 -0.1740122 ## ENSG00000011600 0.941868 0.997856 0.998416 3 0.2037323 ## ENSG00000163220 0.988210 0.996637 0.996751 3 3.0134895 ## ENSG00000143546 0.990912 0.997315 0.997937 2 3.3958255 ## ENSG00000163131 0.944007 0.995314 0.997834 2 -0.0269158 ## delta.mean.mean delta.mean.median delta.mean.max ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 4.47263 
5.12259 5.24414 ## ENSG00000087086 2.67120 3.09281 3.28471 ## ENSG00000011600 2.65311 3.68333 3.78321 ## ENSG00000163220 4.48435 4.80883 4.85246 ## ENSG00000143546 4.50586 4.71901 4.77685 ## ENSG00000163131 2.58808 2.89583 3.34762 ## delta.mean.min.rank delta.detected.min delta.detected.mean ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 1 -0.00127065 0.38767601 ## ENSG00000087086 4 0.00000000 0.00365274 ## ENSG00000011600 4 -0.00254130 0.44956424 ## ENSG00000163220 2 0.04063250 0.31324397 ## ENSG00000143546 1 0.05925926 0.42422861 ## ENSG00000163131 5 -0.00254130 0.34497914 ## delta.detected.median delta.detected.max ## &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 0.47919580 0.586641 ## ENSG00000087086 0.00102669 0.010989 ## ENSG00000011600 0.71614818 0.755306 ## ENSG00000163220 0.37554637 0.512672 ## ENSG00000143546 0.51712877 0.604396 ## ENSG00000163131 0.36907130 0.623682 ## delta.detected.min.rank ## &lt;integer&gt; ## ENSG00000090382 43 ## ENSG00000087086 33694 ## ENSG00000011600 14 ## ENSG00000163220 66 ## ENSG00000143546 37 ## ENSG00000163131 41 # A more concise overview. previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;) ## DataFrame with 10 rows and 4 columns ## Symbol mean detected lfc ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 ## ENSG00000204482 LST1 2.90740 0.988564 2.05035 ## ENSG00000197956 S100A6 4.49784 0.998729 2.28517 We examine the top genes for some annotated biological function or cell type specificity that could be used to identify cluster 1. 
Usually the first 10-20 genes are sufficient to assign some biological meaning to the cluster, though we can always perform a more detailed characterization by considering additional genes from the ranking. We can also visualize the distribution of their expression values in each cluster (Figure 7.1), where the top markers should be upregulated in cluster 1 compared to most of the other clusters. library(scater) # Displaying symbols instead of Ensembl IDs for easier interpretation. plotExpression( sce.nn.10x, x=&quot;clusters&quot;, colour_by=&quot;clusters&quot;, # for some verisimilitude features=chosen.markers.10x$Symbol[1:6], swap_rownames=&quot;Symbol&quot; ) Figure 7.1: Distribution of log-expression values for the top marker genes of cluster 1 in the PBMC dataset. Of course, the default cohens.d.mean is just one of many possible choices for ranking potential marker genes. Different effect sizes or summary statistics can yield alternative rankings that may be more or less useful. 7.2.2 Choice of effect size For each pairwise comparison, we compute several effect sizes to quantify the magnitude of differential expression between two clusters. The choice of effect size influences the types of markers that are prioritized in rankings based on that effect size. To demonstrate, we’ll look at the different rankings obtained for cluster 1 with each effect size. (For consistency and simplicity, we will use the .mean summary in each example.) Cohen’s \\(d\\) is defined as the difference in the mean between groups divided by the average standard deviation across groups. In other words, it is the number of standard deviations that separate the means of the two groups. When applied to log-expression values, Cohen’s \\(d\\) can be interpreted as a standardized log-fold change. Positive values indicate that the gene is upregulated in our cluster of interest, negative values indicate downregulation and values close to zero indicate that there is little difference. 
Cohen’s \\(d\\) is roughly analogous to the \\(t\\)-statistic in a two-sample \\(t\\)-test. previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;cohens.d.mean&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc cohens.d.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 6.63007 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 4.98157 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 4.65609 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 4.56463 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 4.30814 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 3.88013 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 3.76730 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 3.43847 ## ENSG00000204482 LST1 2.90740 0.988564 2.05035 3.33141 ## ENSG00000197956 S100A6 4.49784 0.998729 2.28517 3.19315 The area under the curve (AUC) is the probability that a randomly chosen observation from our cluster of interest is greater than a randomly chosen observation from the other cluster. A value of 1 corresponds to upregulation, where all values of our cluster of interest are greater than any value from the other cluster; a value of 0.5 means that there is no difference in the location of the distributions; and a value of 0 corresponds to downregulation. The AUC is closely related to the \\(U\\) statistic in the Wilcoxon rank sum test (a.k.a. the Mann-Whitney \\(U\\)-test). The AUC and Cohen’s \\(d\\) tend to detect similar markers. The AUC is more robust to outliers but less sensitive to the magnitude of the differences between clusters, i.e., a greater difference between clusters will usually result in a larger Cohen’s \\(d\\) but may not change the AUC much if it’s already close to 0 or 1.
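To make these two definitions concrete, here is a rough base R sketch on made-up values for a single gene. It is not how scrapper computes things internally. In particular, taking the average standard deviation as the root of the mean of the two variances is our assumption for illustration.

```r
# Made-up log-expression values for one gene in two clusters.
set.seed(42)
x <- rnorm(50, mean = 3, sd = 1)  # cluster of interest
y <- rnorm(60, mean = 1, sd = 1)  # other cluster

# Cohen's d: difference in means divided by the average standard deviation.
# Assumption: the average is taken on the variance scale.
cohens.d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)

# AUC: proportion of (x, y) pairs where x is greater than y, with ties counted as half.
auc <- mean(outer(x, y, '>') + 0.5 * outer(x, y, '=='))

# The same value can be recovered from the Wilcoxon rank sum statistic,
# which counts the number of pairs where x exceeds y.
u <- unname(wilcox.test(x, y)$statistic)
auc.from.u <- u / (length(x) * length(y))
```

Dividing the rank sum statistic by the total number of pairs gives the AUC, which is why the two are described as closely related.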
previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;auc.mean&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc auc.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 0.990912 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 0.988210 ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 0.973853 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.957043 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.949031 ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.948650 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 0.944007 ## ENSG00000197956 S100A6 4.49784 0.998729 2.28517 0.943195 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 0.941868 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 0.940033 The “delta-detected” is the difference in the proportion of cells with detected (non-zero) expression between two clusters. A value of 1 indicates that all cells in the cluster of interest express a gene, while all cells in the other cluster do not; a value of zero indicates that there is no difference in the proportion; and a value of -1 indicates that expression is only found in the cells of the other cluster. Rankings based on the delta-detected value will prioritize genes that are near-silent in the other cluster (Figure 7.2). When available, these genes are often very effective markers as they are only expressed in our cluster of interest. However, it is also possible that strong markers will not have a large delta-detected value, e.g., because they are expressed at a low constitutive level in the other cluster. 
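As a small illustration of the delta-detected, consider the following sketch on invented values for one gene in two clusters. The numbers are hypothetical and only serve to show the arithmetic.

```r
# Invented log-expression values for one gene; zeros denote no detected expression.
x <- c(0, 1.2, 2.5, 0.8, 3.1, 0, 1.9, 2.2)  # cluster of interest
y <- c(0, 0, 0, 0.4, 0, 0, 0, 0)            # other cluster

# Proportion of cells with detected (non-zero) expression in each cluster.
detected.x <- mean(x > 0)  # 6 of 8 cells
detected.y <- mean(y > 0)  # 1 of 8 cells

# The delta-detected lies between -1 and 1;
# positive values favour the cluster of interest.
delta.detected <- detected.x - detected.y
```

Here the gene is detected in most cells of our cluster but is near-silent in the other, so the delta-detected is large even though the expression values themselves are modest.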
top.detected.10x &lt;- previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;delta.detected.mean&quot;) top.detected.10x ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc delta.detected.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.797889 ## ENSG00000038427 VCAN 1.95980 0.885642 1.84785 0.787284 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.758067 ## ENSG00000163563 MNDA 2.50935 0.960610 2.13786 0.733194 ## ENSG00000121552 CSTA 2.30972 0.949174 1.97753 0.729536 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.718534 ## ENSG00000100079 LGALS2 2.10228 0.885642 1.82007 0.705982 ## ENSG00000170458 CD14 1.39897 0.771283 1.32663 0.692697 ## ENSG00000110077 MS4A6A 1.54823 0.829733 1.24222 0.643400 ## ENSG00000127951 FGL2 1.48230 0.846252 1.22767 0.638580 plotExpression( sce.nn.10x, features=top.detected.10x$Symbol[1:6], swap_rownames=&quot;Symbol&quot;, x=&quot;clusters&quot;, colour_by=&quot;clusters&quot; ) Figure 7.2: Distribution of log-expression values for the top marker genes of cluster 1 in the PBMC dataset, ranked by the mean delta-detected. Finally, the “delta-mean” is the difference in the mean between two clusters. When computed on the log-normalized expression values, this is simply a fancy name for the log-fold change between clusters. In most cases, Cohen’s \\(d\\) or the AUC are better choices as they account for the variance within each cluster. Nonetheless, the log-fold changes are still useful as their values are easier to interpret. They can also help to diagnose pathological situations where large Cohen’s \\(d\\) or AUC values are driven by small variances instead of large differences between clusters. # Same as the &#39;lfc&#39; column. 
previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;delta.mean.mean&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc delta.mean.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 4.50586 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 4.48435 ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 4.47263 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 2.67120 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 2.65311 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 2.58808 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 2.54102 ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 2.48297 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 2.38272 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 2.36554 Keep in mind that effect sizes are defined relative to other clusters in the same dataset. Biologically meaningful genes will not be detected as markers if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. This is usually not a problem as we should have some prior knowledge about the identity of the cell population, e.g., we should know we’ve isolated T cells for our experiment. Nonetheless, if “absolute” identification of cell types is desired, we need to use cell type annotation methods like SingleR. 7.2.3 Summarizing pairwise effects As mentioned above, we perform pairwise comparisons between clusters to find differentially expressed genes. For each gene, we obtain one effect size of a given type from each pair of clusters; so in a dataset with \\(N\\) clusters, each cluster will have \\(N-1\\) effect sizes for consideration. To simplify interpretation, we summarize the effect sizes for each cluster into key statistics such as the mean and median.
This allows us to create a ranking of potential marker genes based on one of the summary statistics for a given effect size. The mean and median are the most obvious and general-purpose summary statistics. For cluster \\(X\\), a large mean effect size indicates that the gene is upregulated in \\(X\\) compared to the average of the other groups. Similarly, a large median effect size indicates that the gene is upregulated in \\(X\\) compared to most (&gt;50%) other clusters. The median is more robust (or less sensitive, depending on one’s perspective) than the mean to large effect sizes in a minority of comparisons, which may or may not be desirable. In practice, these summaries usually generate similar rankings. previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;cohens.d.median&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc cohens.d.median ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 7.52540 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 6.13868 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 5.59914 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 5.33218 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 4.98685 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 4.61885 ## ENSG00000204482 LST1 2.90740 0.988564 2.05035 4.17765 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 4.11857 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 3.96423 ## ENSG00000204472 AIF1 2.83022 0.978399 2.09265 3.80569 The minimum value is the most stringent summary for identifying upregulated genes. A large minimum value indicates that the gene is upregulated in \\(X\\) compared to all other clusters. Ranking on the minimum is a high-risk, high-reward approach; it can yield a concise set of excellent markers that are unique to \\(X\\), but can also overlook interesting genes if they are expressed at a similar level in any other cluster. 
The latter effect is not uncommon if the clusters correspond to closely-related cell types. To give a concrete example, consider a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells. Neither Cd4 nor Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;cohens.d.min&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc cohens.d.min ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000143546 S100A8 5.37071 1.000000 4.505863 2.442256 ## ENSG00000163221 S100A12 2.59398 0.916137 2.482974 2.174899 ## ENSG00000163220 S100A9 5.63461 0.996188 4.484354 2.134705 ## ENSG00000038427 VCAN 1.95980 0.885642 1.847852 1.519953 ## ENSG00000170458 CD14 1.39897 0.771283 1.326635 1.509846 ## ENSG00000085265 FCN1 2.75854 0.970775 2.365537 1.071372 ## ENSG00000121552 CSTA 2.30972 0.949174 1.977535 1.008715 ## ENSG00000198886 MT-ND4 4.27689 1.000000 0.742434 0.935270 ## ENSG00000163563 MNDA 2.50935 0.960610 2.137860 0.926516 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.541024 0.903887 Another interesting summary statistic is the minimum rank, a.k.a., “min-rank”. The min-rank is the smallest rank of each gene across all pairwise comparisons involving our cluster of interest \\(X\\). Specifically, genes are ranked within each pairwise comparison based on decreasing effect size, and then the smallest rank across all comparisons is reported for each gene. A gene with a small min-rank is one of the top upregulated genes in at least one comparison between \\(X\\) and another cluster. Or in other words: the set of all genes with a min-rank less than or equal to \\(R\\) is equal to the union of the top \\(R\\) genes from all pairwise comparisons for \\(X\\).
This guarantees that our set contains at least \\(R\\) genes that can distinguish our cluster of interest from any other cluster, which enables a comprehensive determination of a cluster’s identity. previewMarkers(chosen.markers.10x, pre.columns=&#39;Symbol&#39;, order.by=&quot;cohens.d.min.rank&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc cohens.d.min.rank ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;integer&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 1 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 1 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 1 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 2 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 2 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 2 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 2 ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 4 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 5 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 5 # min.rank &lt;= 5 represents the union of the top 5 genes from each # pairwise comparison between our chosen cluster and every other cluster. chosen.markers.10x$Symbol[chosen.markers.10x$cohens.d.min.rank &lt;= 5] ## [1] &quot;LYZ&quot; &quot;FTL&quot; &quot;TYROBP&quot; &quot;S100A9&quot; ## [5] &quot;S100A8&quot; &quot;CTSS&quot; &quot;CST3&quot; &quot;FCN1&quot; ## [9] &quot;RP11-1143G9.4&quot; &quot;S100A12&quot; &quot;NEAT1&quot; The flexibility to choose between different summary statistics is one of the strengths of our pairwise strategy. This allows us to explore different rankings of markers depending on our preferences. For example, the min-rank is a conservative choice as it guarantees separation of our cluster of interest, at the cost of including weaker DE genes in the top set of markers; while the minimum provides an aggressive ranking that focuses on markers that are uniquely expressed in our cluster (if any exist).
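To see how the summaries can disagree, here is a toy sketch in base R. The matrix of pairwise effect sizes is invented, with hypothetical gene and comparison names, and the code is only meant to mirror the definitions above rather than the actual scrapper implementation.

```r
# Invented Cohen's d values for 4 genes (rows) from comparisons of cluster X
# against 3 other clusters (columns).
pairwise <- rbind(
    geneA = c(3.0, 2.5, 2.8),
    geneB = c(5.0, 0.1, 0.2),
    geneC = c(1.0, 4.0, 0.5),
    geneD = c(0.5, 0.4, 3.5)
)
colnames(pairwise) <- c('vs.2', 'vs.3', 'vs.4')

# Mean, median and minimum across comparisons for each gene.
summary.mean <- rowMeans(pairwise)
summary.median <- apply(pairwise, 1, median)
summary.min <- apply(pairwise, 1, min)

# Min-rank: rank genes within each comparison by decreasing effect size,
# then take the smallest rank across comparisons for each gene.
ranks <- apply(-pairwise, 2, rank, ties.method = 'min')
min.rank <- apply(ranks, 1, min)

# Genes with a min-rank of at most R form the union of the top R genes
# from every comparison.
R <- 1
top.union <- names(min.rank)[min.rank <= R]
```

In this toy example, geneA is the best marker by the minimum as it is upregulated against every other cluster, yet it is never the single top gene in any comparison. The genes with a min-rank of 1 are instead those that dominate one comparison each.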
Pairwise comparisons are also robust to differences in the relative number of cells between clusters, which ensures that a single large cluster does not dominate the calculation of effect sizes for all other clusters. 7.3 Visualizing marker genes At this point, we suppose that we ought to create some figures to keep everyone entertained. We have already demonstrated how we can examine the distribution of expression values with violin plots in Figure 7.1. Another option is to color our \\(t\\)-SNE plot according to the log-expression values of a specific marker in each cell (Figure 7.3). Any heterogeneity in expression within our cluster might be indicative of internal structure. gridExtra::grid.arrange( plotReducedDim(sce.nn.10x, &quot;TSNE&quot;, colour_by=&quot;clusters&quot;), plotReducedDim(sce.nn.10x, &quot;TSNE&quot;, colour_by=chosen.markers.10x$Symbol[1], swap_rownames=&quot;Symbol&quot;), ncol=2 ) Figure 7.3: \\(t\\)-SNE plot of the cells in the PBMC dataset, colored by the assigned cluster (left) or the log-expression of the top marker gene in cluster 1 (right). The heatmap is a classic visualization in genomics, and scRNA-seq is no exception (Figure 7.4). This provides a compact summary of the relative expression of multiple markers across the cell population. Ideally, each marker should be consistently upregulated within our cluster of interest compared to the rest of the cells in the population. plotHeatmap(sce.nn.10x, features=chosen.markers.10x$Symbol[1:10], order_columns_by=&quot;clusters&quot;, swap_rownames=&quot;Symbol&quot;, center=TRUE) Figure 7.4: Heatmap of the top markers for cluster 1 in the PBMC dataset. Each row represents a gene and each column represents a cell. Each entry is colored by the log-fold change for each cell from the mean log-expression for that gene. Another popular visualization is the Seurat-style “dot plot”, also known as a bubble plot (Figure 7.5).
This is more concise than the heatmap as it uses the size of each dot/bubble to represent the proportion of cells with detected expression. Personally, we disapprove of using variable areas to represent data as this can generate misleading visualizations, but hey, to each their own. plotDots(sce.nn.10x, features=chosen.markers.10x$Symbol[1:10], group=&quot;clusters&quot;, swap_rownames=&quot;Symbol&quot;, center=TRUE) Figure 7.5: Dot plot of the top markers for cluster 1 in the PBMC dataset. The size of each point represents the proportion of cells that express each gene in each cluster, while the color of each point represents the log-fold change between the cluster and the average across all clusters. 7.4 Using a log-fold change threshold Cohen’s \\(d\\) and the AUC consider both the magnitude of the difference between clusters as well as the variability within each cluster. If the variability is low, it is possible for a gene to have a large effect size even if the magnitude of the difference is small. These genes tend to be uninformative for cell type identification, e.g., ribosomal protein genes. We would prefer genes with larger log-fold changes between clusters, even if they have higher variability (McCarthy and Smyth 2009). To favor the detection of such genes, we can compute the effect sizes relative to a log-fold change threshold. The definition of Cohen’s \\(d\\) is generalized to the standardized difference between the observed log-fold change and the specified threshold. Similarly, the AUC is redefined as the probability of randomly picking an expression value from one cluster that is greater than a random value from the other cluster plus the threshold. A large positive Cohen’s \\(d\\) and an AUC above 0.5 can only be obtained if the observed log-fold change between clusters is significantly greater than the threshold.
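The thresholded definitions can be sketched in a few lines of base R. This uses simulated values and a hypothetical helper, compute.effects(), that simply follows the description above. It should not be taken as the code that scrapper actually runs.

```r
# Simulated log-expression values for a low-variance gene with a small shift.
set.seed(1)
x <- rnorm(100, mean = 3.0, sd = 0.2)  # cluster of interest
y <- rnorm(100, mean = 2.5, sd = 0.2)  # other cluster

# Hypothetical helper that computes both effect sizes relative to a threshold.
compute.effects <- function(x, y, threshold = 0) {
    # Assumption: average the standard deviations on the variance scale.
    avg.sd <- sqrt((var(x) + var(y)) / 2)
    c(
        cohens.d = (mean(x) - mean(y) - threshold) / avg.sd,
        # Probability that a value from x exceeds a value from y plus the threshold.
        auc = mean(outer(x, y + threshold, '>'))
    )
}

no.threshold <- compute.effects(x, y)
with.threshold <- compute.effects(x, y, threshold = 1)
```

Without a threshold, the small variance gives this gene a large Cohen's d and an AUC near 1, despite a log-fold change of only about 0.5. With a threshold of 1, both effect sizes fall below their neutral values of 0 and 0.5.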
(However, a negative Cohen’s \\(d\\) or AUC below 0.5 may not represent downregulation; it may just indicate that the observed log-fold change is less than the specified threshold.) markers.threshold.10x &lt;- scoreMarkers.se( sce.nn.10x, sce.nn.10x$clusters, extra.columns=&#39;Symbol&#39;, more.marker.args=list(threshold=2) ) chosen.markers.threshold.10x &lt;- markers.threshold.10x[[chosen.cluster]] # Default ordering by the mean of Cohen&#39;s d. previewMarkers(chosen.markers.threshold.10x, pre.columns=&quot;Symbol&quot;, post.columns=&quot;cohens.d.mean&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc cohens.d.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 3.777706 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 2.556023 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 2.411830 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 1.190628 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 1.158359 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 0.836300 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 0.703171 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.686771 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.614933 ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.517954 # Also looking at the order by the mean of the AUCs. 
previewMarkers(chosen.markers.threshold.10x, pre.columns=&quot;Symbol&quot;, order.by=&quot;auc.mean&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc auc.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 0.929055 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 0.928191 ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 0.880932 ## ENSG00000087086 FTL 6.24421 1.000000 2.67120 0.807194 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 0.713752 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.686833 ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.678680 ## ENSG00000101439 CST3 3.93963 0.998729 2.38272 0.662480 ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.656001 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 0.640162 In general, we only use a threshold if irrelevant genes with low variances are interfering with our interpretation of the clusters. Weakly expressed genes will often have low log-fold changes due to the pseudo-count shrinkage (Chapter 2), and genes that separate closely-related clusters will usually have smaller log-fold changes. Prematurely using a large threshold will prevent the detection of these potentially interesting genes. 7.5 Blocking on uninteresting factors Larger datasets may contain multiple blocks of cells where the differences between blocks are uninteresting, e.g., batch effects, variability between donors. These differences can interfere with marker gene detection by (i) inflating the variance within each cluster and (ii) distorting the log-fold changes if the cluster composition varies between blocks. To avoid these issues, we block on any uninteresting factors when computing the effect sizes. Let’s demonstrate on a mouse trophoblast dataset (Lun et al. 2017) generated across two plates, where any differences between plates are technical and should not be allowed to influence the marker statistics. 
library(scRNAseq) sce.tropho &lt;- LunSpikeInData(&quot;tropho&quot;) sce.tropho$block &lt;- factor(sce.tropho$block) table(sce.tropho$block) # i.e., plate of origin. ## ## 20160906 20170201 ## 96 96 # Computing the QC metrics. For brevity, we&#39;ll skip the spike-ins. library(scrapper) is.mito.tropho &lt;- which(any(seqnames(rowRanges(sce.tropho))==&quot;MT&quot;)) sce.qc.tropho &lt;- quickRnaQc.se(sce.tropho, subsets=list(MT=is.mito.tropho), block=sce.tropho$block) sce.qc.tropho &lt;- sce.tropho[,sce.qc.tropho$keep] # Computing log-normalized expression values. sce.norm.tropho &lt;- normalizeRnaCounts.se(sce.qc.tropho, size.factors=sce.qc.tropho$sum, block=sce.qc.tropho$block) # We now choose the top HVGs. sce.var.tropho &lt;- chooseRnaHvgs.se(sce.norm.tropho, block=sce.norm.tropho$block) # Running the PCA on the HVG submatrix. sce.pca.tropho &lt;- runPca.se(sce.var.tropho, features=rowData(sce.var.tropho)$hvg, block=sce.var.tropho$block) # Doing some graph-based clustering. sce.nn.tropho &lt;- runAllNeighborSteps.se(sce.pca.tropho) We set block= to instruct scoreMarkers.se() to perform the pairwise comparisons separately in each block, i.e., plate. Specifically, for a comparison between two clusters, we compute one effect size per plate where we only use cells in that plate. By performing comparisons within each plate, we cancel out any differences between plates so that they do not interfere with our effect sizes. The per-plate effect sizes are then averaged across plates to obtain a single value per comparison, using a weighted mean that accounts for the number of cells involved in the comparison in each plate. A similar average across plates is computed for the mean log-expression and proportion of detected cells. 
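The averaging step can be illustrated with a toy calculation in base R. All numbers are invented, and weighting each plate by the total number of cells in the comparison is our assumption for illustration. The exact weights used by scoreMarkers.se() may well be different.

```r
# Invented per-plate Cohen's d values for one gene in one pairwise comparison,
# along with the number of cells from each cluster in each plate.
per.block <- data.frame(
    block = c('plate1', 'plate2'),
    cohens.d = c(2.0, 1.0),
    n.cluster1 = c(40, 10),
    n.cluster2 = c(30, 20)
)

# Assumption: weight each plate by the total number of cells in the comparison.
weights <- per.block$n.cluster1 + per.block$n.cluster2

# Weighted mean of the per-plate effect sizes.
averaged <- weighted.mean(per.block$cohens.d, weights)
```

The first plate contributes 70 of the 100 cells, so the averaged effect size lands closer to its value of 2 than to the second plate's value of 1.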
markers.tropho &lt;- scoreMarkers.se(sce.nn.tropho, sce.nn.tropho$clusters, block=sce.nn.tropho$block) previewMarkers(markers.tropho[[&quot;1&quot;]]) ## DataFrame with 10 rows and 3 columns ## mean detected lfc ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSMUSG00000027306 6.50487 0.914894 4.19102 ## ENSMUSG00000006398 10.20393 1.000000 1.33125 ## ENSMUSG00000084301 6.90226 1.000000 1.30211 ## ENSMUSG00000083407 3.02553 0.936170 1.63986 ## ENSMUSG00000083907 4.10552 1.000000 1.59130 ## ENSMUSG00000001403 9.20855 1.000000 1.99050 ## ENSMUSG00000074802 5.14600 0.872340 2.90355 ## ENSMUSG00000030867 8.95573 1.000000 2.10757 ## ENSMUSG00000048574 8.75497 1.000000 1.03204 ## ENSMUSG00000030654 9.94568 1.000000 1.18929 By default, we do not explicitly penalize genes that behave inconsistently across blocks. This is generally unnecessary as the average favors genes with large effect sizes in the same direction in all blocks. That said, it is theoretically possible for a top marker to have highly variable effect sizes across plates, as long as the average is large. We can check this by visualizing the expression profile of a gene of interest with respect to the plate (Figure 7.6). Ideally, our gene would behave consistently across plates. plotExpression( sce.nn.tropho, features=rownames(markers.tropho[[&quot;1&quot;]])[1], x=&quot;clusters&quot;, colour_by=&quot;clusters&quot;, other_fields=&quot;block&quot; ) + facet_grid(~block) Figure 7.6: Distribution of expression values for the top-ranked marker gene (ENSMUSG00000027306) of cluster 1 in the trophoblast dataset. Distributions are shown for each cluster (x-axis) in each plate (panel). If we really want to enforce consistent DE across blocks, we can ask scoreMarkers.se() to compute a quantile instead of a weighted mean. For example, rankings derived from the minimum effect size across blocks will focus on genes that exhibit large changes in the same direction within each block.
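The difference between the two policies is easy to see on a toy example. The per-plate effect sizes below are invented, with hypothetical gene names, and an unweighted mean stands in for the weighted mean for simplicity.

```r
# Invented per-plate effect sizes for two genes (rows) in one pairwise comparison.
per.block <- rbind(
    consistent = c(plate1 = 1.5, plate2 = 1.4),
    inconsistent = c(plate1 = 4.0, plate2 = -0.5)
)

# A mean rewards the inconsistent gene for its single large effect.
by.mean <- rowMeans(per.block)

# The minimum (i.e., the 0% quantile) only rewards genes that are
# upregulated in every plate.
by.min <- apply(per.block, 1, quantile, probs = 0, names = FALSE)
```

The inconsistent gene ranks first by the mean but drops below the consistent gene once we take the minimum across plates.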
markers.min.tropho &lt;- scoreMarkers.se( sce.nn.tropho, sce.nn.tropho$clusters, block=sce.nn.tropho$block, more.marker.args=list( block.average.policy=&quot;quantile&quot;, block.quantile=0 # i.e., minimum. ) ) previewMarkers(markers.min.tropho[[&quot;1&quot;]]) ## DataFrame with 10 rows and 3 columns ## mean detected lfc ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSMUSG00000027306 6.46814 0.909091 4.083865 ## ENSMUSG00000006398 9.99817 1.000000 1.114116 ## ENSMUSG00000001403 9.19752 1.000000 1.782523 ## ENSMUSG00000084301 6.78050 1.000000 1.113203 ## ENSMUSG00000030867 8.73720 1.000000 1.855625 ## ENSMUSG00000083907 3.94614 1.000000 1.346573 ## ENSMUSG00000029177 8.42776 1.000000 0.690333 ## ENSMUSG00000083407 2.61290 0.880000 1.247344 ## ENSMUSG00000048574 8.52426 1.000000 0.933816 ## ENSMUSG00000084133 7.38314 1.000000 0.894224 Blocking in scoreMarkers.se() assumes that each pair of clusters is present in at least one block in order to perform a comparison within the block. In scenarios where cells from two clusters never co-occur in the same block, the associated pairwise comparison will be impossible and is ignored during calculation of summary statistics. This can be problematic in rare situations where the blocks are perfectly confounded with the clusters, though marker detection is likely to be the least of our concerns with an experimental design of this calibre. 7.6 More uses for the marker scores Our discussion above focuses on genes that are upregulated in our cluster of interest, as these are the easiest to interpret and experimentally validate. However, a cluster may occasionally be defined by downregulation of some genes relative to the rest of the cell population. In such cases, we can reverse the rankings to see if there is any consistent downregulation compared to other clusters. 
Alternatively, we can recognize that any downregulated genes in cluster \\(X\\) should manifest as upregulated genes in other clusters when compared to \\(X\\). By using a summary like the min-rank, we guarantee that these genes will show up somewhere, i.e., as markers of other clusters. (Other summaries are less effective as the upregulation only applies to comparisons against \\(X\\) and may not cause a noticeable increase in the mean/median summary.) # Ordering by increasing Cohen&#39;s d. reversed.chosen.markers.10x &lt;- chosen.markers.10x[order(chosen.markers.10x$cohens.d.mean),] previewMarkers(reversed.chosen.markers.10x, pre.columns=&quot;Symbol&quot;, post.columns=&quot;cohens.d.mean&quot;) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc cohens.d.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000213741 RPS29 3.946430 1.000000 -1.062005 -2.31360 ## ENSG00000177954 RPS27 5.043689 1.000000 -0.830939 -2.26035 ## ENSG00000100316 RPL3 3.650995 0.997459 -0.953517 -1.94125 ## ENSG00000198242 RPL23A 3.480505 0.996188 -0.996558 -1.93683 ## ENSG00000168028 RPSA 1.923204 0.869123 -1.394956 -1.79849 ## ENSG00000149273 RPS3 3.569838 0.996188 -0.870643 -1.70630 ## ENSG00000227507 LTB 0.579215 0.428208 -1.448969 -1.67086 ## ENSG00000071082 RPL31 3.228407 0.987294 -0.893312 -1.57149 ## ENSG00000105372 RPS19 4.003734 0.998729 -0.727010 -1.53771 ## ENSG00000137154 RPS6 4.130116 0.997459 -0.699234 -1.48973 Occasionally, we are only interested in markers for a subset of the clusters. Imagine that we have a set of closely-related clusters and we want to identify the genes that distinguish these clusters from each other. The summary statistics generated from all clusters might not be satisfactory as they will not prioritize genes with weak upregulation between related clusters. (Except for min-rank, where these genes would at least show up near the top of the ranking. 
But they would be surrounded by many irrelevant genes from comparisons to other clusters.) Instead, we can just compute marker scores from the cells in the selected subset of clusters: # Let&#39;s pretend that clusters 1, 2 and 3 are of particular interest and # we want to find markers between them. subset.clusters.10x &lt;- sce.nn.10x$clusters %in% c(&quot;1&quot;, &quot;2&quot;, &quot;3&quot;) subset.markers.10x &lt;- scoreMarkers.se( sce.nn.10x[,subset.clusters.10x], sce.nn.10x$clusters[subset.clusters.10x], extra.columns=&quot;Symbol&quot; ) # Now let&#39;s have a look at the top markers for cluster 2. previewMarkers(subset.markers.10x[[&quot;2&quot;]], pre.columns=&quot;Symbol&quot;) ## DataFrame with 10 rows and 4 columns ## Symbol mean detected lfc ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000008517 IL32 3.02931 0.979466 1.977099 ## ENSG00000227507 LTB 3.37330 0.991786 1.778829 ## ENSG00000277734 TRAC 2.53751 0.973306 1.308343 ## ENSG00000168685 IL7R 1.73999 0.839836 1.205725 ## ENSG00000167286 CD3D 1.91705 0.942505 0.896106 ## ENSG00000166710 B2M 6.25505 1.000000 0.487200 ## ENSG00000213741 RPS29 5.40264 1.000000 0.632544 ## ENSG00000206503 HLA-A 3.74897 0.997947 0.836841 ## ENSG00000116824 CD2 1.21460 0.770021 0.801813 ## ENSG00000234745 HLA-B 4.70730 1.000000 0.618141 For more control on the selected markers, we can filter our data frame on the available statistics. For example, we might only consider genes as markers if they have detected proportions above 50%, a mean log-expression greater than 1, an average difference in the detected proportions above 50%, and an average log-fold change above 1. We tend to avoid a priori filtering as it is difficult to choose thresholds that are generally applicable. Nonetheless, it can be useful to refine the set of markers once we know what we’re interested in. 
filtered.markers.10x &lt;- chosen.markers.10x[ chosen.markers.10x$detected &gt;= 0.5 &amp; chosen.markers.10x$mean &gt;= 1 &amp; chosen.markers.10x$delta.detected.mean &gt;= 0.5 &amp; chosen.markers.10x$delta.mean.mean &gt;= 1, ] previewMarkers(filtered.markers.10x, pre.columns=&quot;Symbol&quot;, post.columns=c(&quot;delta.detected.mean&quot;)) ## DataFrame with 10 rows and 5 columns ## Symbol mean detected lfc delta.detected.mean ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000085265 FCN1 2.75854 0.970775 2.36554 0.718534 ## ENSG00000204482 LST1 2.90740 0.988564 2.05035 0.581743 ## ENSG00000121552 CSTA 2.30972 0.949174 1.97753 0.729536 ## ENSG00000204472 AIF1 2.83022 0.978399 2.09265 0.615761 ## ENSG00000257764 RP11-1143G9.4 2.86253 0.978399 2.54102 0.758067 ## ENSG00000163563 MNDA 2.50935 0.960610 2.13786 0.733194 ## ENSG00000163221 S100A12 2.59398 0.916137 2.48297 0.797889 ## ENSG00000038427 VCAN 1.95980 0.885642 1.84785 0.787284 ## ENSG00000100079 LGALS2 2.10228 0.885642 1.82007 0.705982 ## ENSG00000025708 TYMP 2.15622 0.932656 1.56581 0.567270 7.7 Invalidity of \\(p\\)-values In the old days, we used to report \\(p\\)-values along with the effect sizes for the detected markers. After all, Cohen’s \\(d\\) and the AUC are closely related to \\(t\\)-tests and the Wilcoxon rank sum test, respectively. Unfortunately, the statistical interpretation of \\(p\\)-values is compromised when identifying cluster-specific markers. The first issue is that of “data dredging” (also known as fishing or data snooping) when the DE analysis is performed on the same data used to define the clusters. We are more likely to get a positive result when we use a dataset to test a hypothesis generated from that data. Or more simply - clustering will separate cells by expression, so of course we will get low \\(p\\)-values when we compare between clusters! To illustrate, let’s simulate i.i.d.
normal values, perform \\(k\\)-means clustering and test for DE between clusters of cells with Wilcoxon rank sum tests. Our input data is random so we’d expect a uniform distribution of \\(p\\)-values under the null hypothesis, but instead it is skewed towards low values (Figure 7.7). This means that we can detect “significant” differences between clusters even in the absence of any real substructure in the data. set.seed(0) y &lt;- matrix(rnorm(1000000), ncol=200) clusters &lt;- kmeans(t(y), centers=2)$cluster out &lt;- apply(y, 1, FUN=function(x) { wilcox.test(x[clusters==1], x[clusters==2])$p.value }) hist(out, col=&quot;grey80&quot;, xlab=&quot;p-value&quot;, main=&quot;&quot;) Figure 7.7: Distribution of \\(p\\)-values from a DE analysis between two clusters in a simulation with no true subpopulation structure. Another problem is that many \\(p\\)-value calculations treat counts from different cells in the same cluster as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample, i.e., cell culture, animal or patient. DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (Lun and Marioni 2017). This is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be repeated. In other words, if the experiment involved a single biological sample, the sample size is actually just 1, regardless of how many individual cells were assayed. By treating cells as replicates, we overstate our sample size and obtain much lower \\(p\\)-values than would be appropriate. In short, the \\(p\\)-values for marker genes don’t make much sense from a statistical perspective. We can still use them for ranking, but at that point, we might as well make our life simpler and use the effect sizes directly. 
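To make the replication problem concrete, here is a minimal base R sketch. It is a toy simulation under our own assumptions (three biological samples, 100 cells per cluster in each sample, and a sample-specific difference between clusters with a true mean of zero), not something taken from the scrapper workflow. For each simulated null gene, we compare a test that treats every cell as a replicate against a test that treats each sample as a replicate.

```r
set.seed(100)
n.sim <- 500      # number of simulated null genes (assumed)
n.samples <- 3    # number of biological samples, e.g., donors (assumed)
n.cells <- 100    # cells per cluster in each sample (assumed)

p.cell <- p.sample <- numeric(n.sim)
for (i in seq_len(n.sim)) {
    # Sample-specific difference between clusters; its true mean is zero,
    # so there is no reproducible marker effect across samples.
    sample.shift <- rnorm(n.samples, mean=0, sd=0.5)

    # Each column holds the cells from one sample.
    a <- matrix(rnorm(n.samples * n.cells), nrow=n.cells)
    b <- matrix(rnorm(n.samples * n.cells), nrow=n.cells)
    b <- sweep(b, 2, sample.shift, '+')

    # Cells as replicates: pool all cells from each cluster.
    p.cell[i] <- t.test(as.vector(a), as.vector(b))$p.value

    # Samples as replicates: one cluster difference per sample.
    p.sample[i] <- t.test(colMeans(b) - colMeans(a))$p.value
}

# Proportion of null genes rejected at a nominal 5% threshold.
mean(p.cell < 0.05)
mean(p.sample < 0.05)
```

Under these assumptions, the cell-level test should reject far more often than 5% while the sample-level test stays close to the nominal rate, at the cost of having only three observations to work with.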
If we really want to determine whether some markers are “real”, the best approach is to perform a separate validation experiment with an independent replicate cell population. A typical strategy is to use different experimental techniques like FACS, FISH, qPCR or IHC to find a subpopulation that expresses the marker(s) of interest. This confirms that the subpopulation actually exists and is not an artifact of the scRNA-seq protocol or the computational analysis. 7.8 Gene set enrichment We can summarize the biological functions of our top-ranked marker genes with gene set enrichment analyses. Here, we extract predefined sets of genes for specific pathways or processes and check if any gene set is overrepresented among our set of top markers. This reduces some of the hassle of manually examining the annotation for each gene to assign biological meaning to each cluster. To illustrate, we’ll use the gene ontology (GO)’s biological process (BP) subcategory, which defines gene sets associated with known biological processes (Ashburner et al. 2000). We might also consider other useful gene set collections like KEGG and REACTOME - see the MSigDB overview for details. library(msigdbr) go.bp.df &lt;- msigdbr(species=&quot;Homo sapiens&quot;, collection=&quot;C5&quot;, subcollection=&quot;GO:BP&quot;) go.bp.sets &lt;- split(go.bp.df$ensembl_gene, go.bp.df$gs_name) We use the hypergeometric test to quantify enrichment of each gene set among the top markers for cluster 1. The \\(p\\)-value of each gene set is determined by the number of shared genes between each GO set and the top markers, relative to the size of the GO set. More strongly enriched sets will have lower \\(p\\)-values and should be prioritized for interpretation. Here, we have chosen the top 100 markers but different values can be used depending on how many markers are of interest. # Choosing the top 100 genes with positive Cohen&#39;s d values. 
top.chosen.10x &lt;- head(rownames(chosen.markers.10x)[chosen.markers.10x$cohens.d.mean &gt; 0], 100) library(scrapper) enrich.chosen.10x &lt;- testEnrichment(top.chosen.10x, go.bp.sets, universe=rownames(chosen.markers.10x)) # Prioritizing the most enriched gene sets for examination. enrich.chosen.10x &lt;- enrich.chosen.10x[order(enrich.chosen.10x$p.value),,drop=FALSE] head(enrich.chosen.10x) ## DataFrame with 6 rows and 3 columns ## overlap ## &lt;integer&gt; ## GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS 42 ## GOBP_INFLAMMATORY_RESPONSE 30 ## GOBP_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 31 ## GOBP_REGULATION_OF_DEFENSE_RESPONSE 28 ## GOBP_REGULATION_OF_IMMUNE_SYSTEM_PROCESS 35 ## GOBP_POSITIVE_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 25 ## size ## &lt;integer&gt; ## GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS 1863 ## GOBP_INFLAMMATORY_RESPONSE 934 ## GOBP_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 1146 ## GOBP_REGULATION_OF_DEFENSE_RESPONSE 890 ## GOBP_REGULATION_OF_IMMUNE_SYSTEM_PROCESS 1649 ## GOBP_POSITIVE_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 671 ## p.value ## &lt;numeric&gt; ## GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS 1.21353e-26 ## GOBP_INFLAMMATORY_RESPONSE 5.65548e-23 ## GOBP_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 1.42341e-21 ## GOBP_REGULATION_OF_DEFENSE_RESPONSE 3.51677e-21 ## GOBP_REGULATION_OF_IMMUNE_SYSTEM_PROCESS 4.79163e-21 ## GOBP_POSITIVE_REGULATION_OF_RESPONSE_TO_EXTERNAL_STIMULUS 1.16081e-20 # Focusing on some of the smaller, more specific gene sets. 
head(enrich.chosen.10x[enrich.chosen.10x$size &lt; 100,,drop=FALSE]) ## DataFrame with 6 rows and 3 columns ## overlap ## &lt;integer&gt; ## GOBP_RESPIRATORY_BURST 7 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN 8 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN 7 ## GOBP_RESPONSE_TO_FUNGUS 8 ## GOBP_NEUTROPHIL_CHEMOTAXIS 8 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_ANTIGEN 7 ## size ## &lt;integer&gt; ## GOBP_RESPIRATORY_BURST 40 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN 70 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN 43 ## GOBP_RESPONSE_TO_FUNGUS 80 ## GOBP_NEUTROPHIL_CHEMOTAXIS 83 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_ANTIGEN 52 ## p.value ## &lt;numeric&gt; ## GOBP_RESPIRATORY_BURST 2.81859e-11 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN 3.67052e-11 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN 4.83654e-11 ## GOBP_RESPONSE_TO_FUNGUS 1.10001e-10 ## GOBP_NEUTROPHIL_CHEMOTAXIS 1.48591e-10 ## GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_ANTIGEN 1.96477e-10 If we need more detail about a particular set, we can examine the behavior of its constituent genes. top.set.10x &lt;- rownames(enrich.chosen.10x)[1] overlaps.top.set.10x &lt;- intersect(go.bp.sets[[top.set.10x]], top.chosen.10x) previewMarkers( chosen.markers.10x[rownames(chosen.markers.10x) %in% overlaps.top.set.10x,], pre.column=&quot;Symbol&quot;, rows=NULL # List all of the top genes in this set. ) ## DataFrame with 42 rows and 4 columns ## Symbol mean detected lfc ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 5.80862 0.998729 4.47263 ## ENSG00000011600 TYROBP 4.05728 0.997459 2.65311 ## ENSG00000163220 S100A9 5.63461 0.996188 4.48435 ## ENSG00000143546 S100A8 5.37071 1.000000 4.50586 ## ENSG00000163131 CTSS 3.81179 0.997459 2.58808 ## ... ... ... ... ... 
## ENSG00000051523 CYBA 3.266806 0.993647 0.810757 ## ENSG00000135046 ANXA1 1.701281 0.836086 0.846713 ## ENSG00000116701 NCF2 0.778700 0.564168 0.594712 ## ENSG00000103490 PYCARD 1.388051 0.815756 0.733532 ## ENSG00000111729 CLEC4A 0.691981 0.519695 0.566143 Alternatively, we can aggregate the expression profiles of each set’s genes into a single per-cell score. Scores are defined as the column sums of a rank-1 approximation of the submatrix of the log-expression values corresponding to the genes in the set (Bueno et al. 2016). This effectively performs a PCA to collapse the submatrix into a single dimension, enriching for the biological signal associated with the set’s annotated function. The resulting scores are primarily useful for visualizing set activity (Figure 7.8). We tend not to use gene set scores for quantitative analyses as they are difficult to interpret. Should the score be higher in a cell that weakly upregulates many genes in the set, or a cell that strongly upregulates a few genes in the set? What if two cells have the same score but express different subsets of genes in the set? These complications can be minimized by operating on individual genes whenever possible - for example, instead of testing for differences in gene set scores between subpopulations, we could examine the distribution of effect sizes for the same comparison across all genes in the set, which is easier to interpret and more informative. top.set.score.10x &lt;- scoreGeneSet.se(sce.nn.10x, go.bp.sets[[top.set.10x]]) plotReducedDim(sce.nn.10x, &quot;TSNE&quot;, colour_by=data.frame(Score=top.set.score.10x$scores)) Figure 7.8: \\(t\\)-SNE plot of the cells in the PBMC dataset, colored by the activity of the GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION_BETWEEN_ORGANISMS gene set. Note that many Bioconductor packages implement methods for quantifying gene set enrichment, e.g., fgsea, goseq, limma, to name a few. 
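For readers who want to see what the hypergeometric test used earlier in this section boils down to, the following base R sketch may help. The helper name hyper.enrich and the toy gene identifiers are our own inventions for illustration; this is not the scrapper implementation and may differ from testEnrichment() in its details.

```r
# Hypothetical helper: upper-tail hypergeometric p-value for one gene set.
hyper.enrich <- function(top, set, universe) {
    top <- intersect(top, universe)
    set <- intersect(set, universe)
    overlap <- length(intersect(top, set))
    # Probability of seeing at least 'overlap' set members when drawing
    # length(top) genes from the universe without replacement.
    p <- phyper(
        overlap - 1,
        length(set),
        length(universe) - length(set),
        length(top),
        lower.tail=FALSE
    )
    c(overlap=overlap, size=length(set), p.value=p)
}

# Toy universe of 1000 genes where the first 50 are the top markers.
universe <- sprintf('gene%i', 1:1000)
top <- universe[1:50]
set.enriched <- universe[c(1:20, 501:530)]  # 20 of its 50 genes are top markers
set.null <- universe[c(1:3, 601:647)]       # 3 of its 50 genes are top markers

res.enriched <- hyper.enrich(top, set.enriched, universe)
res.null <- hyper.enrich(top, set.null, universe)
res.enriched
res.null
```

The same p-value can be obtained from a one-sided Fisher's exact test on the corresponding 2-by-2 table, which is a handy sanity check when wiring this up by hand.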
We like the hypergeometric test as it is simple and focuses on the top markers, but any function can be used as long as it can accept a ranking of genes. In all cases, we would recommend only using the enrichment \\(p\\)-values to rank the gene sets, not to make any statements about statistical significance. Many of these methods compute their \\(p\\)-values by assuming that genes are independent under the null hypothesis. In a biological system with highly coordinated pathways and processes, this is unlikely to be true, potentially inflating the type I error beyond the threshold for significance. Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] msigdbr_26.1.0 ensembldb_2.35.0 ## [3] AnnotationFilter_1.35.0 GenomicFeatures_1.63.2 ## [5] AnnotationDbi_1.73.1 scRNAseq_2.25.0 ## [7] scater_1.39.4 ggplot2_4.0.2 ## [9] scuttle_1.21.6 scrapper_1.5.17 ## [11] DropletUtils_1.31.1 SingleCellExperiment_1.33.2 ## [13] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [15] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [17] IRanges_2.45.0 S4Vectors_0.49.1 ## [19] BiocGenerics_0.57.0 generics_0.1.4 ## [21] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [23] DropletTestFiles_1.21.0 BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] 
RColorBrewer_1.1-3 jsonlite_2.0.0 ## [3] magrittr_2.0.5 gypsum_1.7.0 ## [5] ggbeeswarm_0.7.3 farver_2.1.2 ## [7] rmarkdown_2.31 BiocIO_1.21.0 ## [9] vctrs_0.7.3 memoise_2.0.1 ## [11] Rsamtools_2.27.2 DelayedMatrixStats_1.33.0 ## [13] RCurl_1.98-1.18 htmltools_0.5.9 ## [15] S4Arrays_1.11.1 AnnotationHub_4.1.0 ## [17] curl_7.0.0 BiocNeighbors_2.5.4 ## [19] Rhdf5lib_1.33.6 SparseArray_1.11.13 ## [21] rhdf5_2.55.16 alabaster.base_1.11.4 ## [23] sass_0.4.10 bslib_0.10.0 ## [25] alabaster.sce_1.11.0 httr2_1.2.2 ## [27] cachem_1.1.0 GenomicAlignments_1.47.0 ## [29] lifecycle_1.0.5 pkgconfig_2.0.3 ## [31] rsvd_1.0.5 Matrix_1.7-5 ## [33] R6_2.6.1 fastmap_1.2.0 ## [35] digest_0.6.39 dqrng_0.4.1 ## [37] irlba_2.3.7 ExperimentHub_3.1.0 ## [39] RSQLite_2.4.6 beachmat_2.27.5 ## [41] filelock_1.0.3 labeling_0.4.3 ## [43] httr_1.4.8 abind_1.4-8 ## [45] compiler_4.6.0 bit64_4.6.0-1 ## [47] withr_3.0.2 S7_0.2.1 ## [49] BiocParallel_1.45.0 viridis_0.6.5 ## [51] DBI_1.3.0 alabaster.ranges_1.11.0 ## [53] alabaster.schemas_1.11.0 HDF5Array_1.39.1 ## [55] R.utils_2.13.0 rappdirs_0.3.4 ## [57] DelayedArray_0.37.1 rjson_0.2.23 ## [59] tools_4.6.0 vipor_0.4.7 ## [61] otel_0.2.0 beeswarm_0.4.0 ## [63] R.oo_1.27.1 glue_1.8.0 ## [65] h5mread_1.3.3 restfulr_0.0.16 ## [67] rhdf5filters_1.23.3 grid_4.6.0 ## [69] gtable_0.3.6 R.methodsS3_1.8.2 ## [71] BiocSingular_1.27.1 ScaledMatrix_1.19.0 ## [73] XVector_0.51.0 ggrepel_0.9.8 ## [75] BiocVersion_3.23.1 pillar_1.11.1 ## [77] babelgene_22.9 limma_3.67.1 ## [79] dplyr_1.2.1 BiocFileCache_3.1.0 ## [81] lattice_0.22-9 rtracklayer_1.71.3 ## [83] bit_4.6.0 tidyselect_1.2.1 ## [85] locfit_1.5-9.12 Biostrings_2.79.5 ## [87] knitr_1.51 gridExtra_2.3 ## [89] bookdown_0.46 ProtGenerics_1.43.0 ## [91] edgeR_4.9.7 xfun_0.57 ## [93] statmod_1.5.1 pheatmap_1.0.13 ## [95] UCSC.utils_1.7.1 lazyeval_0.2.3 ## [97] yaml_2.3.12 cigarillo_1.1.0 ## [99] evaluate_1.0.5 codetools_0.2-20 ## [101] tibble_3.3.1 alabaster.matrix_1.11.0 ## [103] BiocManager_1.30.27 cli_3.6.6 
## [105] jquerylib_0.1.4 GenomeInfoDb_1.47.2 ## [107] dichromat_2.0-0.1 Rcpp_1.1.1 ## [109] dbplyr_2.5.2 png_0.1-9 ## [111] XML_3.99-0.23 parallel_4.6.0 ## [113] assertthat_0.2.1 blob_1.3.0 ## [115] sparseMatrixStats_1.23.0 bitops_1.0-9 ## [117] alabaster.se_1.11.0 viridisLite_0.4.3 ## [119] scales_1.4.0 purrr_1.2.2 ## [121] crayon_1.5.3 rlang_1.2.0 ## [123] cowplot_1.2.0 KEGGREST_1.51.1 References "],["batch-correction.html", "Chapter 8 Batch correction 8.1 Motivation 8.2 Using mutual nearest neighbors 8.3 What is a batch effect, anyway? 8.4 Using the corrected values 8.5 Multi-condition analyses 8.6 Some thoughts about replicates Session information", " Chapter 8 Batch correction 8.1 Motivation In large scRNA-seq projects, data generation is split across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., small changes in incubation times, differences in reagent concentration/quality. This may introduce systematic differences in the observed expression in cells from different batches, a.k.a., batch effects. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking relevant biological differences and complicating interpretation of the results. Batch correction aims to remove batch effects to simplify downstream procedures like clustering. The aim is to merge cells from different batches that represent the same biological subpopulation, ensuring that they are assigned to the same cluster for easier interpretation. Otherwise, cells may cluster by their batch of origin, which would be quite uninteresting. Note that this step is quite different from the blocking discussed in most of the previous chapters, as setting block= just instructs those functions to ignore the batch effect instead of actively removing it. Historically, we used linear regression for batch correction of RNA-seq data (Ritchie et al. 2015; Leek et al. 2012). 
(We can achieve the same effect with our PCA if we compute the components from the residuals, see Section 4.4.) However, this assumes either that the composition of cell subpopulations is known and can be used as a covariate in the model, or that the composition is the same across batches with a consistent batch effect in each subpopulation. Such assumptions are usually inappropriate for single-cell studies. Instead, we use bespoke methods for single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require these strong assumptions. 8.2 Using mutual nearest neighbors Mutual nearest neighbors (MNN) correction was one of the first batch correction methods dedicated to scRNA-seq data (Haghverdi et al. 2018). For each cell in batch \\(B_1\\), we search for the \\(k\\) nearest neighbors in batch \\(B_2\\) (for some small \\(k\\), e.g., 10 - 20). Similarly, for each cell in batch \\(B_2\\), we search for the \\(k\\) nearest neighbors in batch \\(B_1\\). We form an MNN pair between cells \\(x_1\\) in batch \\(B_1\\) and \\(x_2\\) in batch \\(B_2\\) if \\(x_1\\) is one of the \\(k\\) nearest neighbors of \\(x_2\\) and vice versa. The assumption is that MNN pairs will (mostly) only form between \\(x_1\\) and \\(x_2\\) from the same biological subpopulation. A subpopulation unique to batch \\(B_2\\) will (hopefully) not be able to form MNN pairs to \\(B_1\\), as each cell in \\(B_1\\) will be preoccupied with forming MNN pairs with cells from its matching subpopulation in \\(B_2\\). The difference between the cells in each MNN pair defines the direction and magnitude of the batch effect for its surrounding neighborhood, allowing us to correct, e.g., \\(B_2\\) to \\(B_1\\) by subtracting that difference from each cell in \\(B_2\\). To demonstrate, let’s use several PBMC datasets from 10X Genomics (Zheng et al. 
2017): library(TENxPBMCData) sce.pbmc3k &lt;- TENxPBMCData(&#39;pbmc3k&#39;) sce.pbmc4k &lt;- TENxPBMCData(&#39;pbmc4k&#39;) sce.pbmc8k &lt;- TENxPBMCData(&#39;pbmc8k&#39;) # Finding a common set of genes across all batches to allow us to combine # everything into a single object. This is only necessary if the different # batches were processed with different genome annotations. inter.pbmc &lt;- Reduce( intersect, list( rownames(sce.pbmc3k), rownames(sce.pbmc4k), rownames(sce.pbmc8k) ) ) sce.pbmc &lt;- combineCols( sce.pbmc3k[inter.pbmc,], sce.pbmc4k[inter.pbmc,], sce.pbmc8k[inter.pbmc,] ) sce.pbmc$batch &lt;- rep( c(&quot;3k&quot;, &quot;4k&quot;, &quot;8k&quot;), c(ncol(sce.pbmc3k), ncol(sce.pbmc4k), ncol(sce.pbmc8k)) ) # For each dataset, TENxPBMCData loads the count data into the R session as a # file-backed matrix, i.e., the &quot;matrix&quot; object is just a pointer to a file # containing the actual counts. For greater efficiency, we load the data into # memory as a sparse matrix so that we don&#39;t have to repeatedly read from disk. counts(sce.pbmc) &lt;- as(counts(sce.pbmc), &quot;dgCMatrix&quot;) # Quality control, blocking on the batch of origin for each cell. is.mito.pbmc &lt;- grep(&quot;MT&quot;, rowData(sce.pbmc)$Symbol) library(scrapper) sce.qc.pbmc &lt;- quickRnaQc.se( sce.pbmc, subsets=list(MT=is.mito.pbmc), block=sce.pbmc$batch ) # Subsetting the QC&#39;d object (not the original) so that the per-cell QC # metrics like &#39;sum&#39; are retained for use as size factors below. sce.qc.pbmc &lt;- sce.qc.pbmc[,sce.qc.pbmc$keep] # Normalization, blocking on the batch of origin for each cell. sce.norm.pbmc &lt;- normalizeRnaCounts.se( sce.qc.pbmc, size.factors=sce.qc.pbmc$sum, block=sce.qc.pbmc$batch ) # We now choose the top HVGs, with blocking. sce.var.pbmc &lt;- chooseRnaHvgs.se( sce.norm.pbmc, block=sce.norm.pbmc$batch ) # Running the PCA on the HVG submatrix, with blocking. 
sce.pca.pbmc &lt;- runPca.se( sce.var.pbmc, features=rowData(sce.var.pbmc)$hvg, number=25, block=sce.var.pbmc$batch ) If we examine the distribution of cells without any batch correction, we observe some batch-specific substructure (Figure 8.1). Such batch effects could have any number of causes - biological differences in the underlying cell population between donors, differences in the technology used for cell capture and/or sequencing, or changes in the computational pipelines for alignment and quantification. Regardless of their origins, we consider these differences to be uninteresting as all batches are assaying the same PBMC population and should be replicates of each other. sce.unc.tsne.pbmc &lt;- runTsne.se(sce.pca.pbmc) library(scater) plotReducedDim(sce.unc.tsne.pbmc, &quot;TSNE&quot;, colour_by=&quot;batch&quot;) + ggtitle(&quot;uncorrected&quot;) Figure 8.1: \\(t\\)-SNE plot of the cells from the PBMC dataset, without any batch correction. Each cell is colored according to its batch of origin. To remove the batch effects, we use correctMnn() to apply MNN correction to the PC scores for all cells. (As mentioned in the other chapters, we use PCs to leverage the compaction and denoising effects of the PCA on the HVGs.) This yields a set of corrected scores that can be used in place of the original PCs in downstream analyses. We observe greater intermingling between batches in Figure 8.2, indicating that we have successfully mitigated the batch effect. sce.mnn.pbmc &lt;- correctMnn.se(sce.pca.pbmc, sce.pca.pbmc$batch) sce.mnn.tsne.pbmc &lt;- runTsne.se(sce.mnn.pbmc, reddim.type=&quot;MNN&quot;) plotReducedDim(sce.mnn.tsne.pbmc, &quot;TSNE&quot;, colour_by=&quot;batch&quot;) + ggtitle(&quot;After correction&quot;) Figure 8.2: \\(t\\)-SNE plot of the cells from the PBMC dataset after MNN correction. Each cell is colored according to its batch of origin. 
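To make the idea of an MNN pair more tangible, here is a minimal base R sketch on simulated two-dimensional data. Everything in it is an assumption for illustration (the population centers, the batch shift, the brute-force distance matrix and the helper sim.pop); it only identifies the pairs and is not how correctMnn.se() is implemented, which also has to turn the pairs into a local correction. Batch 1 contains populations A and B, while batch 2 contains a shifted copy of A plus a unique population C.

```r
set.seed(42)

# Hypothetical helper to simulate a tight 2D population around a center.
sim.pop <- function(n, center) {
    cbind(rnorm(n, center[1], 0.3), rnorm(n, center[2], 0.3))
}

# Batch 1 has populations A and B.
b1 <- rbind(sim.pop(50, c(0, 0)), sim.pop(50, c(6, 0)))
lab1 <- rep(c('A', 'B'), each=50)

# Batch 2 has population A (shifted by the batch effect) and a unique C.
b2 <- rbind(sim.pop(50, c(0, 2)), sim.pop(50, c(-6, 2)))
lab2 <- rep(c('A', 'C'), each=50)

# Distances between every batch 1 cell (row) and batch 2 cell (column).
n1 <- nrow(b1)
n2 <- nrow(b2)
cross <- as.matrix(dist(rbind(b1, b2)))[seq_len(n1), n1 + seq_len(n2)]

k <- 5
# is.nn.12[i, j]: is j one of the k nearest batch 2 cells to batch 1 cell i?
is.nn.12 <- t(apply(cross, 1, function(d) rank(d, ties.method='first') <= k))
# is.nn.21[i, j]: is i one of the k nearest batch 1 cells to batch 2 cell j?
is.nn.21 <- apply(cross, 2, function(d) rank(d, ties.method='first') <= k)

# MNN pairs are those where the relationship holds in both directions.
mnn.pairs <- which(is.nn.12 & is.nn.21, arr.ind=TRUE)
table(batch1=lab1[mnn.pairs[,1]], batch2=lab2[mnn.pairs[,2]])
```

In this toy setup, the B and C cells do find nearest neighbors in the other batch, but those neighbors are all A cells that prefer each other, so only A-to-A pairs survive the mutual requirement.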
Clustering on the corrected PCs ensures that cells from the same underlying population are assigned to the same cluster. (We call this a “common clustering” as the definition of each cluster is the same in each batch.) This avoids the formation of multiple clusters that represent the same cell type/state but are only separated due to batch effects. Such redundant clusters are annoying to interpret as we have to (i) inspect more clusters to discover the same biology and (ii) match them up to each other for further analyses. In addition, the correction increases the number of cells in any subpopulations that are shared across batches. This provides some opportunities for improved resolution of rare subpopulations during the clustering step. sce.mnn.graph.pbmc &lt;- clusterGraph.se(sce.mnn.pbmc, reddim.type=&quot;MNN&quot;) Once we obtain a common clustering, a useful diagnostic measure is the distribution of cells across batches within each cluster. If a cluster has contributions from multiple batches, it probably represents a cell type/state that is shared across those batches. We expect our PBMC clusters to be more-or-less evenly distributed across batches as each batch is a replicate of the others. # This is a normalized matrix of cell counts for each group (row) and block # (column). We divide each column by the number of cells in each batch to # account for differences between batches. Then we divide by the row sums to # get the distribution of cells across batches in each cluster. 
cluster.batch.mnn.pbmc &lt;- countGroupsByBlock( sce.mnn.graph.pbmc$clusters, sce.mnn.graph.pbmc$batch, normalize.groups=TRUE, normalize.block=TRUE ) print(cluster.batch.mnn.pbmc, digits=2, zero.print=&quot;.&quot;) ## block ## groups 3k 4k 8k ## 1 0.51 0.23 0.26 ## 2 0.29 0.37 0.34 ## 3 0.34 0.34 0.33 ## 4 0.42 0.32 0.26 ## 5 0.32 0.37 0.31 ## 6 0.33 0.33 0.33 ## 7 0.56 0.23 0.21 ## 8 0.31 0.34 0.35 ## 9 0.22 0.20 0.58 ## 10 0.31 0.34 0.35 ## 11 0.22 0.39 0.39 ## 12 0.20 0.39 0.41 ## 13 0.21 0.33 0.46 ## 14 0.25 0.40 0.35 ## 15 0.17 0.43 0.39 If a cluster has no contribution from a batch, this either represents a unique subpopulation or it indicates that batch correction was not completely successful. Indeed, clustering on the uncorrected PCs yields some batch-specific clusters in the PBMC data. These are unlikely to represent unique types/states given that the batches should be replicates. sce.unc.graph.pbmc &lt;- clusterGraph.se(sce.pca.pbmc) cluster.batch.unc.pbmc &lt;- countGroupsByBlock( sce.unc.graph.pbmc$clusters, sce.unc.graph.pbmc$batch, normalize.groups=TRUE, normalize.block=TRUE ) print(cluster.batch.unc.pbmc, digits=2, zero.print=&quot;.&quot;) ## block ## groups 3k 4k 8k ## 1 0.4123 0.2784 0.3093 ## 2 0.3086 0.3473 0.3441 ## 3 1.0000 . . ## 4 0.4290 0.3133 0.2577 ## 5 0.2479 0.4019 0.3502 ## 6 0.3124 0.3459 0.3416 ## 7 0.9986 0.0014 . ## 8 0.2180 0.3949 0.3871 ## 9 0.0274 0.5128 0.4597 ## 10 0.1153 0.4206 0.4641 ## 11 0.1815 0.4346 0.3840 ## 12 0.1739 0.4320 0.3941 ## 13 0.1427 0.2422 0.6150 ## 14 0.0219 0.3974 0.5806 ## 15 . 0.5020 0.4980 Compared to linear regression, MNN correction does not assume that the population composition is the same or known beforehand. It effectively learns the shared population structure via identification of MNN pairs and uses this information to estimate a local batch effect for subpopulation-specific correction. 
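The two-step normalization behind the cluster-by-batch tables shown above can be reproduced with a few lines of base R. The sketch below uses made-up cluster and batch labels and is only meant to mirror the description in the code comments (divide by the batch size, then by the row sums); it is not the countGroupsByBlock() implementation.

```r
# Toy labels: batch 'x' has 100 cells and batch 'y' has 1000 cells.
batch <- rep(c('x', 'y'), c(100, 1000))
clusters <- c(
    rep(c('1', '2'), c(50, 50)),            # cells from batch x
    rep(c('1', '2', '3'), c(500, 400, 100)) # cells from batch y
)

# Raw number of cells for each cluster (row) and batch (column).
tab <- unclass(table(groups=clusters, block=batch))

# Step 1: divide each column by the total number of cells in that batch,
# so that large batches do not dominate every cluster.
by.block <- sweep(tab, 2, colSums(tab), '/')

# Step 2: divide each row by its sum to obtain the distribution of
# (batch-size-adjusted) cells across batches within each cluster.
by.group <- sweep(by.block, 1, rowSums(by.block), '/')
print(round(by.group, 2))
```

Cluster 1 ends up evenly split between the two batches despite the tenfold difference in raw cell numbers, while cluster 3 is flagged as being specific to batch y.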
However, MNN correction is not without its own assumptions: It requires some shared subpopulations between batches to encourage formation of the correct MNN pairs. Otherwise, if one batch contains B cells only and another batch contains T cells only, MNN pairs would form between the two cell types and the correction would merge them together. More generally, MNN correction becomes more robust with more shared subpopulations between batches. This implicitly reduces the risk of forming incorrect MNN pairs between unique subpopulations in each batch. Imagine we have one batch containing B cells and CD4+ T cells and another batch containing B cells and CD8+ T cells. MNN pairs would form correctly across batches for B cells, but they would also form between the CD4+ and CD8+ T cells as they are the closest available matches to each other. If our first batch also contained CD8+ T cells, they would match across batches and the CD4+ T cells would (correctly) not participate in any MNN pairs. Any shared subpopulations should have more than \\(k\\) cells in each batch to ensure that MNN pairs do not incorrectly form across different subpopulations. For example, let’s say we have one batch that contains only T cells and another batch that contains B cells and fewer than 10 T cells. If we used \\(k = 10\\), some MNN pairs would form between the T cells in the first batch and B cells in the second batch, which would be wrong. (That said, failure is not guaranteed for small populations - the example above would have worked out fine if B cells also existed in the first batch. It’s just that the risk of incorrect MNN pairs is much higher when the subpopulation size drops below \\(k\\).) For more subtle population structure, MNN correction assumes that the batch effect is orthogonal to the axes of biological variation. This is generally reasonable for batch effects caused by technical differences that are unrelated to biology - less so for biological differences. 
Say we’re studying some kind of continuous biological variation, e.g., differentiation, and we have two batches that are replicates of each other. We introduce a batch effect that is not orthogonal to the biological variation, e.g., because the second batch has higher baseline expression of the differentiation marker. MNN correction would be slightly incorrect as it preserves the non-orthogonal component of the batch effect (Figure 8.3). Figure 8.3: Diagram of MNN correction when the batch effect is confounded with biological variation. Violations of some of these assumptions might be tolerable, sometimes. For example, we wouldn’t lose too much sleep if monocytes and macrophages were merged together across batches… but then again, maybe we would, if we were really interested in studying differentiation in that particular lineage. In any case, it is best to treat batch-corrected data - and conclusions derived from it - with a grain of salt. The various merging decisions made by the algorithm may or may not be sensible depending on our scientific question. 8.3 What is a batch effect, anyway? In this chapter’s introduction, we defined batch effects in terms of technical differences that are obviously uninteresting. However, certain biological differences are also uninteresting and can be treated as batch effects. One example is the biological variability between replicate samples (e.g., donors, animals, cultures) from which the cells are extracted. We are generally uninterested in systematic differences between samples, which might cause cells of the same type to form separate clusters based on their sample of origin. In these replicated experiments, we might consider removing this sample-to-sample variability by treating each sample as a batch in correctMnn.se(). Similarly, we could apply batch correction to any uninteresting categorical factor in our dataset, e.g., sex, genotype, cell cycle phase. 
Admittedly, we’re misusing the word “batch” here, but we’re already halfway into this chapter so let’s just bear with it until the end. Now, what happens if different samples contain cells from different experimental conditions? Say we have two samples where one contains control cells and the other contains drug-treated cells. If we applied MNN correction to the samples, any treatment-induced differential expression would be treated as a batch effect and removed. This behavior is both expected and desirable - by merging cells from both conditions, we only need to characterize population heterogeneity once for all cells. For example, we can use the corrected coordinates to define a common set of clusters across both treated and control samples. This, in turn, allows us to test for differences in expression or abundance of the same cell type/state between conditions (Section 8.5). It may seem distressing to some folks that a (very interesting!) biological difference between conditions is deliberately removed by batch correction. However, this concern is largely misplaced as the corrected values are only ever used for defining common clusters and annotations. Any differences between conditions will still be preserved in the results of Section 8.5. The alternative strategy would be to cluster each condition separately and to attempt to identify matching clusters across conditions, which is much less convenient (though not an inherently bad idea, see Section 8.6). 8.4 Using the corrected values As previously mentioned, the batch-corrected values are typically used to quantify population heterogeneity in a common manner across batches. Cluster 1 in batch \\(B_1\\) is the same as cluster 1 in batch \\(B_2\\) when clustering is performed on the corrected data. We do not have to cluster each batch separately and then identify mappings between separate clusterings, which is time-consuming and might not even be possible when the clusters are not well-separated. 
The same reasoning applies to other cell-based analyses like trajectory reconstruction. For per-gene analyses, the corrected values are more difficult to interpret. The correction is not obliged to preserve relative differences in per-gene expression when aligning multiple batches. In fact, the opposite is true - the correction must distort the expression profiles to merge batches together, as any differences in expression between batches for the same subpopulation would be a batch effect. Let’s demonstrate using two pancreas datasets (Grun et al. 2016; Muraro et al. 2016) that we’ll consider as separate batches. library(scRNAseq) sce.grun &lt;- GrunPancreasData() sce.muraro &lt;- MuraroPancreasData() # Taking the intersection of features for both endogenous genes... inter.pancreas &lt;- intersect(rownames(sce.grun), rownames(sce.muraro)) sce.grun &lt;- sce.grun[inter.pancreas,] sce.muraro &lt;- sce.muraro[inter.pancreas,] # And spike-ins, for completeness... inter.ercc.pancreas &lt;- intersect(rownames(altExp(sce.grun, &quot;ERCC&quot;)), rownames(altExp(sce.muraro, &quot;ERCC&quot;))) altExp(sce.grun, &quot;ERCC&quot;) &lt;- altExp(sce.grun, &quot;ERCC&quot;)[inter.ercc.pancreas,] altExp(sce.muraro, &quot;ERCC&quot;) &lt;- altExp(sce.muraro, &quot;ERCC&quot;)[inter.ercc.pancreas,] # Before combining both datasets into a single SCE object. sce.pancreas &lt;- combineCols(sce.grun, sce.muraro) sce.pancreas$batch &lt;- rep(c(&quot;grun&quot;, &quot;muraro&quot;), c(ncol(sce.grun), ncol(sce.muraro))) # Quality control, blocking on the batch of origin for each cell. We don&#39;t # have mitochondrial genes here so we&#39;ll use the spike-ins instead. library(scrapper) sce.qc.pancreas &lt;- quickRnaQc.se( sce.pancreas, subsets=NULL, altexp.proportions=&quot;ERCC&quot;, block=sce.pancreas$batch ) sce.qc.pancreas &lt;- sce.qc.pancreas[,sce.qc.pancreas$keep] # Normalization, blocking on the batch of origin for each cell. 
sce.norm.pancreas &lt;- normalizeRnaCounts.se( sce.qc.pancreas, size.factors=sce.qc.pancreas$sum, block=sce.qc.pancreas$batch ) # We now choose the top HVGs, with blocking. sce.var.pancreas &lt;- chooseRnaHvgs.se(sce.norm.pancreas, block=sce.norm.pancreas$batch) # Running the PCA on the HVG submatrix, with blocking. sce.pca.pancreas &lt;- runPca.se( sce.var.pancreas, features=rowData(sce.var.pancreas)$hvg, block=sce.var.pancreas$batch ) We use correctMnn() to obtain MNN-corrected PCs for clustering and visualization. Both batches contribute to each cluster and are intermingled in Figure 8.4, which is expected given that both datasets are measuring the same pancreatic cell types. sce.mnn.pancreas &lt;- correctMnn.se(sce.pca.pancreas, sce.qc.pancreas$batch) sce.nn.mnn.pancreas &lt;- runAllNeighborSteps.se(sce.mnn.pancreas, reddim.type=&quot;MNN&quot;) cluster.batch.mnn.pancreas &lt;- countGroupsByBlock( sce.nn.mnn.pancreas$clusters, sce.nn.mnn.pancreas$batch, normalize.groups=TRUE, normalize.block=TRUE ) print(cluster.batch.mnn.pancreas, digits=2, zero.print=&quot;.&quot;) ## block ## groups grun muraro ## 1 0.66 0.34 ## 2 0.71 0.29 ## 3 0.86 0.14 ## 4 0.33 0.67 ## 5 0.51 0.49 ## 6 0.35 0.65 ## 7 0.24 0.76 ## 8 0.36 0.64 ## 9 0.71 0.29 ## 10 0.30 0.70 ## 11 0.38 0.62 ## 12 0.13 0.87 library(scater) plotReducedDim(sce.nn.mnn.pancreas, &quot;TSNE&quot;, colour_by=&quot;batch&quot;) Figure 8.4: \\(t\\)-SNE plot of the Grun and Muraro pancreas datasets after MNN correction. Each point is a cell, colored according to its assigned batch. We recover “corrected expression values” for any given gene by multiplying the corrected PCs with the corresponding row of the rotation matrix. This is effectively a low-rank approximation of our original log-expression matrix, but using the corrected coordinates for each cell. Of particular interest is the INS-IGF2 gene, where MNN correction forces the expression profiles to be consistent between batches (Figure 8.5). 
(As of time of writing, this involved eliminating the variability in INS-IGF2 across clusters in the Grun dataset to match the lack of expression in the Muraro dataset, though the opposite outcome is equally possible, i.e., introducing non-zero expression in the Muraro dataset to match that of the Grun dataset.) From the perspective of the correction algorithm, this effect is intended as these differences between batches are part of the batch effect and must be removed. However, if we relied on the corrected expression values, we would draw misleading conclusions about the behavior of INS-IGF2 across batches. For example, if one batch consisted of drug-treated patients and another batch was a control, we would not detect any treatment-induced differential expression from the corrected expression values. current.insigf2 &lt;- &quot;INS-IGF2__chr11&quot; rotation.insigf2 &lt;- metadata(sce.nn.mnn.pancreas)$PCA$rotation[current.insigf2,] lowrank.unc.insigf2 &lt;- reducedDim(sce.nn.mnn.pancreas, &quot;PCA&quot;) %*% rotation.insigf2 lowrank.mnn.insigf2 &lt;- reducedDim(sce.nn.mnn.pancreas, &quot;MNN&quot;) %*% rotation.insigf2 gridExtra::grid.arrange( plotExpression( sce.nn.mnn.pancreas, x=&quot;clusters&quot;, features=current.insigf2, colour_by=&quot;clusters&quot;, other_fields=&quot;batch&quot; ) + facet_grid(~batch) + ggtitle(&quot;original expression&quot;), plotXY( sce.nn.mnn.pancreas$clusters, lowrank.unc.insigf2, colour_by=sce.nn.mnn.pancreas$clusters, other_fields=list(batch=sce.nn.mnn.pancreas$batch) ) + facet_grid(~batch) + ggtitle(&quot;reconstruction without correction&quot;), plotXY( sce.nn.mnn.pancreas$clusters, lowrank.mnn.insigf2, colour_by=sce.nn.mnn.pancreas$clusters, other_fields=list(batch=sce.nn.mnn.pancreas$batch) ) + facet_grid(~batch) + ggtitle(&quot;reconstruction with correction&quot;), ncol=1 ) Figure 8.5: Expression of INS-IGF2 across clusters in the combined Grun/Muraro pancreas dataset.
Expression is quantified in terms of the log-normalized expression values (top panel), the reconstructed expression values with the uncorrected PCs (middle), and the reconstructed expression values with the MNN-corrected PCs (bottom). For gene-based analyses, we recommend using the original log-expression values as these are easier to interpret. Differences between batches should be handled by some other mechanism, e.g., blocking during marker detection (Figure 8.6, Section 7.5). In the past decade, we have - perhaps once or twice - used the corrected values for visualization, specifically to synchronize expression across all batches to the same color gradient in a \\(t\\)-SNE plot. This was done purely for aesthetics and was probably not worth the extra hassle, given that we had to check that the plot with the corrected values gave the same conclusions as the original expression values. markers.pancreas &lt;- scoreMarkers.se( sce.nn.mnn.pancreas, sce.nn.mnn.pancreas$clusters, block=sce.nn.mnn.pancreas$batch ) # Looking at the top markers for cluster 1.
chosen.markers.pancreas &lt;- markers.pancreas[[&quot;1&quot;]] previewMarkers(chosen.markers.pancreas) ## DataFrame with 10 rows and 3 columns ## mean detected lfc ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## REG1A__chr2 8.39445 1.000000 6.28158 ## SPINK1__chr5 6.69446 0.998291 5.60744 ## CTRB2__chr16 6.01565 0.996581 5.26939 ## PRSS1__chr7 6.04823 0.996581 5.35635 ## GSTA1__chr6 3.88348 0.979487 3.62438 ## CD24__chrY 5.10648 0.996581 3.12548 ## SERPINA3__chr14 5.19148 0.996581 4.25590 ## PRSS3P2__chr7 5.14508 0.989744 4.75444 ## CPA1__chr7 4.82004 0.982906 4.37088 ## GSTA2__chr6 3.24975 0.953846 3.11099 plotExpression( sce.nn.mnn.pancreas, x=&quot;clusters&quot;, features=rownames(chosen.markers.pancreas)[1], colour_by=&quot;clusters&quot;, other_fields=&quot;batch&quot; ) + facet_grid(~batch) Figure 8.6: Distribution of log-expression values across clusters for the top marker in cluster 1 of the merged Grun/Muraro pancreas dataset. Each point is a cell and each facet is a batch. 8.5 Multi-condition analyses 8.5.1 Differential expression The most interesting scRNA-seq datasets consist of multiple samples across different conditions, e.g., treated and untreated. Once we have a common definition of clusters across all our samples, we can test for differences between conditions for each cluster. In effect, we treat the scRNA-seq data as a kind of in silico “super-FACS” - FACS31 is used to experimentally isolate cell types of interest before bulk RNA-seq or quantifying abundance, and now we do the same with scRNA-seq but our putative cell types are defined by clustering instead. To illustrate, we’ll pull out some pancreas data generated from normal donors and patients with type II diabetes (Segerstolpe et al. 
2016): library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() table(sce.seger$individual, sce.seger$disease) ## ## normal type II diabetes mellitus ## H1 96 0 ## H2 352 0 ## H3 383 0 ## H4 383 0 ## H5 383 0 ## H6 383 0 ## T2D1 0 383 ## T2D2 0 383 ## T2D3 0 384 ## T2D4 0 384 Happily enough, the authors provided cell type labels so we’ll use those directly instead of going through the hassle of defining clusters ourselves. We compute “pseudo-bulk” expression profiles (Tung et al. 2017) by summing counts together for all cells with the same combination of cell type and sample. As their name suggests, these pseudo-bulk profiles are intended to mimic bulk RNA-seq data so that they can be analyzed with existing DE workflows, e.g., edgeR, voom(). We use the sum of counts for several reasons: Larger counts are more amenable to analysis workflows designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models. Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a bulk RNA-seq workflow would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra complexity, typically for little benefit - see Crowell et al. (2020).) Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DE genes that are not uniformly up- or down-regulated for all cells in all samples of one condition. 
Masking is generally desirable as DE genes - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicates but heterogeneous within each sample. pseudo.bulk.seger &lt;- aggregateAcrossCells.se(sce.seger, colData(sce.seger)[,c(&quot;individual&quot;,&quot;cell type&quot;)]) pseudo.bulk.seger ## class: SummarizedExperiment ## dim: 26179 119 ## metadata(1): aggregated ## assays(2): sums detected ## rownames(26179): SGIP1 AZIN2 ... BIVM-ERCC5 eGFP ## rowData names(2): refseq symbol ## colnames: NULL ## colData names(12): factor.individual factor.cell type ... submitted ## single cell quality cell type colData(pseudo.bulk.seger)[,c(&quot;factor.individual&quot;, &quot;factor.cell type&quot;, &quot;counts&quot;)] ## DataFrame with 119 rows and 3 columns ## factor.individual factor.cell type counts ## &lt;character&gt; &lt;character&gt; &lt;integer&gt; ## 1 H1 NA 23 ## 2 H1 MHC class II cell 1 ## 3 H1 PSC cell 1 ## 4 H1 acinar cell 4 ## 5 H1 alpha cell 28 ## ... ... ... ... ## 115 T2D4 delta cell 35 ## 116 T2D4 ductal cell 47 ## 117 T2D4 epsilon cell 1 ## 118 T2D4 gamma cell 34 ## 119 T2D4 mast cell 1 assay(pseudo.bulk.seger)[1:10,1:10] # sum of counts for each individual/cell-type combination ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## SGIP1 0 0 389 1 0 0 0 0 0 0 ## AZIN2 0 0 0 0 125 0 0 0 0 0 ## CLIC4 1 0 52 503 856 401 332 4 130 368 ## AGBL4 0 0 0 0 399 0 0 0 0 0 ## NECAP2 0 91 0 284 1835 491 28 3 354 0 ## SLC45A1 0 0 0 0 383 36 0 0 0 0 ## TGFBR3 0 0 0 341 1 232 262 541 0 0 ## DBT 0 0 0 113 274 12 164 333 56 0 ## RFWD2 0 0 0 162 340 91 9 6 0 0 ## C1orf21 0 0 138 27 2125 0 149 123 207 0 Once we have the pseudo-bulk count matrix, we test for differences between conditions (in this case, disease status) within each cell type. Any DE analysis method that works with bulk RNA-seq data can be used - here, we’ll be using voom() from the limma package (Law et al. 2014). 
We won’t go into too much detail here as there is plentiful documentation elsewhere, e.g., see limma::limmaUsersGuide(). Our only extra advice is to: Consider removing unreliable pseudo-bulk profiles with very few cells. The exact threshold depends on the dataset, the rarity of the cell type, the variance of the assay technology (e.g., UMIs versus reads), and whether the DE analysis supports downweighting of low-quality profiles. A good rule of thumb seems to be 10 cells (Crowell et al. 2020). Perform a separate analysis for each cell type instead of cramming all cell types into the same design matrix. This protects against differences in the mean-variance relationship across cell types. It also ensures that any odd behavior for one cell type does not affect inferences for the other cell types. Get used to higher variances and fewer DE genes compared to actual bulk RNA-seq data. The number of cells contributing to each pseudo-bulk profile is often orders of magnitude lower than that used in bulk RNA-seq, so the latter will be a more precise assay of the population transcriptome. We test for disease-associated DE genes in beta cells using voom() with additional weighting for sample quality. Perhaps unsurprisingly, the top DE gene is INS. pseudo.beta.seger &lt;- pseudo.bulk.seger[,which(pseudo.bulk.seger$`factor.cell type` == &quot;beta cell&quot;)] # We can have a look at the number of cells contributing to each profile, in # case we want to remove low-abundance profiles.
pseudo.beta.seger$counts ## [1] 12 48 32 34 10 35 10 14 11 64 library(edgeR) y.beta.seger &lt;- DGEList(assay(pseudo.beta.seger, &quot;sums&quot;), samples=as.data.frame(colData(pseudo.beta.seger))) keep.beta.seger &lt;- filterByExpr(y.beta.seger, group=y.beta.seger$samples$disease) y.beta.seger &lt;- y.beta.seger[keep.beta.seger,] y.beta.seger &lt;- normLibSizes(y.beta.seger) design.beta.seger &lt;- model.matrix(~disease, y.beta.seger$samples) v.beta.seger &lt;- voomWithQualityWeights(y.beta.seger, design.beta.seger) fit.beta.seger &lt;- lmFit(v.beta.seger) fit.beta.seger &lt;- eBayes(fit.beta.seger, robust=TRUE) res.beta.seger &lt;- topTable(fit.beta.seger, sort.by=&quot;p&quot;, n=Inf, coef=2) head(res.beta.seger) ## ID logFC AveExpr t P.Value adj.P.Val B ## 7287 INS -2.761129 16.655021 -7.500553 4.296911e-06 0.05020081 4.616178 ## 7689 FXYD2 -3.519464 5.676002 -6.864834 1.072013e-05 0.05020081 2.602945 ## 7688 FXYD2 -2.589501 7.284994 -6.795743 1.195542e-05 0.05020081 3.388534 ## 8349 ARL6IP4 -1.726072 7.811593 -5.660748 7.428443e-05 0.11330881 1.868303 ## 11413 HPN -1.797720 6.132052 -5.654938 7.501799e-05 0.11330881 1.664214 ## 11187 TRAPPC5 -2.121947 7.037605 -5.645027 7.628700e-05 0.11330881 1.768896 8.5.2 Differential abundance Another interesting analysis involves testing for differences in cell type abundance between conditions, i.e., differential abundance (DA). We all know how immunologists love to create FACS plots showing some change in the percentages between treatments (e.g., Figure 1A of Richard et al. (2018)) - now we can do the same kind of thing with scRNA-seq data. For our pancreas dataset, we create a count matrix of the number of cells assigned to each cell type in each sample. (If we didn’t already have annotated cell types, we could instead consider using a tool like miloR, which performs a DA analysis without requiring explicit assignment of each cell to clusters.)
ab.count.seger &lt;- countGroupsByBlock(colData(sce.seger)[,&quot;cell type&quot;], colData(sce.seger)$individual) ab.count.seger &lt;- unclass(ab.count.seger) # get rid of the weird table class. ab.count.seger ## block ## groups H1 H2 H3 H4 H5 H6 T2D1 T2D2 T2D3 T2D4 ## MHC class II cell 1 0 0 0 0 0 0 2 1 1 ## PSC cell 1 1 2 6 3 10 2 12 13 4 ## acinar cell 4 20 80 3 2 3 8 28 24 13 ## alpha cell 28 117 26 136 44 92 141 119 87 96 ## beta cell 12 48 32 34 10 35 10 14 11 64 ## co-expression cell 3 3 5 6 3 6 1 5 1 6 ## delta cell 7 21 2 7 10 12 9 6 5 35 ## ductal cell 4 19 67 8 23 14 3 76 125 47 ## endothelial cell 1 1 0 1 2 8 1 1 1 0 ## epsilon cell 0 1 1 0 0 3 1 0 0 1 ## gamma cell 7 19 15 2 1 31 70 8 10 34 ## mast cell 0 4 0 0 0 0 0 2 0 1 ## unclassified cell 0 0 0 0 0 1 1 0 0 0 ## unclassified endocrine cell 5 15 4 0 0 5 3 3 6 0 We then apply standard DA pipelines to see which cell types are affected by disease. In particular, testing for DA is bread-and-butter stuff in the microbiome field, so we’d recommend checking out some of their best practices. Right now, though, this book is hard enough to compile without adding extra dependencies, so we’ll just re-use edgeR’s statistical machinery to test for differences in the cell abundance matrix (Robinson, McCarthy, and Smyth 2010)32. y.ab.seger &lt;- DGEList(ab.count.seger) y.ab.seger$samples$disease &lt;- sce.seger$disease[match(colnames(y.ab.seger), sce.seger$individual)] keep.ab.seger &lt;- filterByExpr(y.ab.seger, group=y.ab.seger$samples$disease) y.ab.seger &lt;- y.ab.seger[keep.ab.seger,] # If we don&#39;t normalize, our results will be affected by composition bias. But # if we use TMM normalization, that would assume that most cell types do not # have any change in their abundance. Hard to tell which one&#39;s worse here. 
design.ab.seger &lt;- model.matrix(~disease, y.ab.seger$samples) fit.ab.seger &lt;- glmQLFit(y.ab.seger, design.ab.seger) res.ab.seger &lt;- glmQLFTest(fit.ab.seger, coef=2) topTags(res.ab.seger) ## Coefficient: diseasetype II diabetes mellitus ## logFC logCPM F PValue FDR ## ductal cell 0.82000345 17.40873 1.38453794 0.2451320 0.6327002 ## beta cell -0.82063001 16.98834 1.08166503 0.3035359 0.6327002 ## gamma cell 0.78555501 16.53608 1.02526643 0.3163501 0.6327002 ## acinar cell -0.45764795 16.45058 0.33776409 0.5638419 0.8457628 ## delta cell -0.29945991 15.88049 0.13536851 0.7145470 0.8574564 ## alpha cell -0.02224822 18.64197 0.00231903 0.9617915 0.9617915 It’s worth noting that DA and DE are two sides of the same coin as they are both based on the per-cell expression profiles. Consider a scRNA-seq experiment involving two biological conditions with several shared cell types. We focus on a cell type \\(X\\) that is present in both conditions but contains some DE genes between conditions. This leads to two possible outcomes: The DE between conditions is strong enough to split \\(X\\) into two separate clusters (say, \\(X_1\\) and \\(X_2\\)) in expression space. This manifests as DA where \\(X_1\\) is enriched in one condition and \\(X_2\\) is enriched in the other condition. The DE between conditions is not sufficient to split \\(X\\) into two separate clusters, e.g., because our batch correction algorithm identifies them as corresponding cell types and merges them together. Thus, the differences between conditions manifest as DE within the single cluster corresponding to \\(X\\). It is difficult to predict whether a difference between conditions will manifest as DE or DA. For example, we might see DE for coarser clusters but DA for finer clusters. We’d recommend performing both DE and DA analyses to ensure that we can catch either possibility.
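To make the inputs to the DE and DA analyses above a bit more concrete, here is a minimal base R sketch on simulated data. The count matrix, sample names and cell type labels are all made up for illustration, and we use rowsum() and table() as simple stand-ins for aggregateAcrossCells.se() and countGroupsByBlock(); we are not claiming that this is how scrapper does it internally.

```r
set.seed(100)

# Simulated data: 50 genes, 200 cells, 4 samples, 3 cell types (all hypothetical).
ngenes <- 50
ncells <- 200
sim.counts <- matrix(rpois(ngenes * ncells, lambda=2), nrow=ngenes)
rownames(sim.counts) <- paste0('gene', seq_len(ngenes))
sim.sample <- sample(c('H1', 'H2', 'T2D1', 'T2D2'), ncells, replace=TRUE)
sim.type <- sample(c('alpha', 'beta', 'delta'), ncells, replace=TRUE)

# Pseudo-bulk profiles: sum the counts across all cells with the same
# combination of sample and cell type. rowsum() sums rows by group, so
# we transpose to put the cells in the rows and then transpose back.
sim.combo <- paste(sim.sample, sim.type, sep='.')
sim.pseudo <- t(rowsum(t(sim.counts), group=sim.combo))

# Abundance matrix: the number of cells assigned to each cell type (rows)
# in each sample (columns).
sim.abundance <- unclass(table(sim.type, sim.sample))

dim(sim.pseudo)
sim.abundance
```

Each column of sim.pseudo would be treated like a bulk RNA-seq sample for one cell type in the DE analysis, while sim.abundance plays the same role as ab.count.seger in the DA analysis.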
8.6 Some thoughts about replicates Don’t put too much faith in the results of DE/DA analyses derived from a common clustering. These analyses do not capture the uncertainty in the clustering and its biological interpretation, which reduces confidence in the reproducibility of the results. Say we discover significant DE/DA for a cell type in our dataset. If an independent party were to repeat our experiment and analysis, would they be able to reach the same conclusion? More specifically, would they be able to partition an equivalent cluster and assign the same cell type identity? Weakly separated cell subtypes might not manifest as separate clusters in a new dataset, or the ranking of markers might change in a manner that causes the analyst to assign a different biological identity. We wouldn’t know; we can’t evaluate the reproducibility of our cell type annotations because we only did the clustering and interpretation once. That said, there is a way to model this uncertainty - rarely used and tedious, but it can be done. Consider a dataset that has multiple replicate samples for each of multiple conditions. The strategy is as follows: Analyze each sample independently, from quality control to identification of cell types/states from the clusters. If we were being very careful, we would blind and randomize samples across multiple analysts so that variances in human bias are also modelled during manual annotation of clusters33. Alternatively, we could use automated cell type annotation tools like SingleR; these do not require any clustering and can be applied to each sample independently, but assume that our cell types of interest exist in the reference annotation. Match corresponding cell types or states across samples. For manual annotation, we might consider using a controlled vocabulary of cell types/states to simplify this step, especially if multiple analysts are involved. 
A hierarchical cell type ontology is also useful, e.g., if we can’t match two closely related subtypes, we can at least agree that they both match to the parent type. This step yields a common set of cell type/state identities across all samples, replacing the common clustering derived from the corrected PCs. As we are forced to be explicit about how cell types are matched across samples, we don’t have to rely on the assumptions (and potential errors) of the correction algorithm. Create a pseudo-bulk or cell abundance count matrix based on the annotated cell types/states from all samples. Any variability in the per-sample analysis will manifest as greater variance across replicates in these count matrices. For example, if a cell subtype is weakly defined, we may not be able to identify it consistently across replicates, increasing the variance in the cell type abundances. Similarly, if a subtype is poorly separated from its relatives, its cluster may occasionally include cells from neighboring subtypes, increasing the variance of the pseudo-bulk profiles. The increased variance is important as it properly reflects our uncertainty about the existence of the cell subtype itself. In practice, this kind of analysis is pretty exhausting, especially for larger studies. We recall only a handful of instances over the years because it’s just too inconvenient. Besides, the incentives for reproducibility don’t exist in the current scientific environment. Why should we do more work to introduce more variance and reduce the number of significant hits34? We typically settle on a compromise between convenience and rigor, where we still use a common clustering from corrected PCs but invest the extra time and resources into independent validation experiments (see also suggestions in Section 7.7). As long as our conclusions can be validated, we can say that our preceding analyses were “exploratory” and give ourselves a pass for any statistical impropriety. 
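The per-sample strategy described above can be sketched in a few lines of base R. Everything here is hypothetical: the sample names, the cell type labels, the two-level ontology and the rule for falling back to the parent type are our own assumptions for illustration, and a real analysis would need something more careful.

```r
# Hypothetical per-sample annotations, as if each sample had been clustered
# and annotated independently with a controlled vocabulary.
labels.by.sample <- list(
    ctrl1=c('CD4 T', 'CD4 T', 'CD8 T', 'B', 'B'),
    ctrl2=c('T', 'T', 'T', 'B'),
    drug1=c('CD4 T', 'CD8 T', 'CD8 T', 'B', 'NK')
)

# Hypothetical two-level ontology that maps each label to its parent type.
parent.of <- c('CD4 T'='T', 'CD8 T'='T', 'T'='T', 'B'='B', 'NK'='NK')

# Assumed matching rule: if any sample could only be annotated at the level
# of the parent type, we collapse its subtypes to that parent in all samples.
parents.with.children <- unique(parent.of[names(parent.of) != parent.of])
unresolved <- intersect(parents.with.children, unlist(labels.by.sample))

harmonize <- function(x) {
    parent <- unname(parent.of[x])
    ifelse(parent %in% unresolved, parent, x)
}
harmonized <- lapply(labels.by.sample, harmonize)

# Cell abundance matrix based on the matched identities. Using common
# factor levels ensures that absent cell types are reported as zeros.
common.types <- sort(unique(unlist(harmonized)))
abundance <- vapply(
    harmonized,
    function(x) as.integer(table(factor(x, levels=common.types))),
    integer(length(common.types))
)
rownames(abundance) <- common.types
abundance
```

In this toy example, the second control sample could not resolve the T cell subtypes, so all samples are compared at the level of the parent T cell type. Any inconsistency in the per-sample annotation shows up directly as variability between the columns of the abundance matrix.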
Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] edgeR_4.9.7 limma_3.67.1 ## [3] scRNAseq_2.25.0 scater_1.39.4 ## [5] ggplot2_4.0.2 scuttle_1.21.6 ## [7] scrapper_1.5.17 TENxPBMCData_1.29.0 ## [9] HDF5Array_1.39.1 h5mread_1.3.3 ## [11] rhdf5_2.55.16 DelayedArray_0.37.1 ## [13] SparseArray_1.11.13 S4Arrays_1.11.1 ## [15] abind_1.4-8 Matrix_1.7-5 ## [17] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1 ## [19] Biobase_2.71.0 GenomicRanges_1.63.2 ## [21] Seqinfo_1.1.0 IRanges_2.45.0 ## [23] S4Vectors_0.49.1 BiocGenerics_0.57.0 ## [25] generics_0.1.4 MatrixGenerics_1.23.0 ## [27] matrixStats_1.5.0 BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.5 ## [4] ggbeeswarm_0.7.3 GenomicFeatures_1.63.2 gypsum_1.7.0 ## [7] farver_2.1.2 rmarkdown_2.31 BiocIO_1.21.0 ## [10] vctrs_0.7.3 memoise_2.0.1 Rsamtools_2.27.2 ## [13] RCurl_1.98-1.18 htmltools_0.5.9 AnnotationHub_4.1.0 ## [16] curl_7.0.0 BiocNeighbors_2.5.4 Rhdf5lib_1.33.6 ## [19] sass_0.4.10 alabaster.base_1.11.4 bslib_0.10.0 ## [22] alabaster.sce_1.11.0 httr2_1.2.2 cachem_1.1.0 ## [25] GenomicAlignments_1.47.0 lifecycle_1.0.5 pkgconfig_2.0.3 ## [28] rsvd_1.0.5 R6_2.6.1 fastmap_1.2.0 ## [31] digest_0.6.39 
AnnotationDbi_1.73.1 irlba_2.3.7 ## [34] ExperimentHub_3.1.0 RSQLite_2.4.6 beachmat_2.27.5 ## [37] filelock_1.0.3 labeling_0.4.3 httr_1.4.8 ## [40] compiler_4.6.0 bit64_4.6.0-1 withr_3.0.2 ## [43] S7_0.2.1 BiocParallel_1.45.0 viridis_0.6.5 ## [46] DBI_1.3.0 alabaster.ranges_1.11.0 alabaster.schemas_1.11.0 ## [49] rappdirs_0.3.4 rjson_0.2.23 tools_4.6.0 ## [52] vipor_0.4.7 otel_0.2.0 beeswarm_0.4.0 ## [55] glue_1.8.0 restfulr_0.0.16 rhdf5filters_1.23.3 ## [58] grid_4.6.0 gtable_0.3.6 ensembldb_2.35.0 ## [61] BiocSingular_1.27.1 ScaledMatrix_1.19.0 XVector_0.51.0 ## [64] ggrepel_0.9.8 BiocVersion_3.23.1 pillar_1.11.1 ## [67] dplyr_1.2.1 BiocFileCache_3.1.0 lattice_0.22-9 ## [70] rtracklayer_1.71.3 bit_4.6.0 tidyselect_1.2.1 ## [73] locfit_1.5-9.12 Biostrings_2.79.5 knitr_1.51 ## [76] gridExtra_2.3 bookdown_0.46 ProtGenerics_1.43.0 ## [79] xfun_0.57 statmod_1.5.1 UCSC.utils_1.7.1 ## [82] lazyeval_0.2.3 yaml_2.3.12 evaluate_1.0.5 ## [85] codetools_0.2-20 cigarillo_1.1.0 tibble_3.3.1 ## [88] alabaster.matrix_1.11.0 BiocManager_1.30.27 cli_3.6.6 ## [91] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.1.1 ## [94] GenomeInfoDb_1.47.2 dbplyr_2.5.2 png_0.1-9 ## [97] XML_3.99-0.23 parallel_4.6.0 blob_1.3.0 ## [100] AnnotationFilter_1.35.0 bitops_1.0-9 alabaster.se_1.11.0 ## [103] viridisLite_0.4.3 scales_1.4.0 purrr_1.2.2 ## [106] crayon_1.5.3 rlang_1.2.0 cowplot_1.2.0 ## [109] KEGGREST_1.51.1 References "],["protein-multiomics.html", "Chapter 9 Protein multiomics 9.1 Motivation 9.2 Quality control 9.3 Normalization 9.4 Feature selection and PCA 9.5 The rest of the analysis 9.6 Combining modalities Session information", " Chapter 9 Protein multiomics 9.1 Motivation Cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) simultaneously quantifies gene expression and surface protein abundance in each cell (Stoeckius et al. 2017). 
First, we create antibodies against the proteins of interest and conjugate them to synthetic RNA tags, i.e., antibody-derived tags (ADTs). Cells are labelled with these antibodies and processed with single-cell technologies like 10X Genomics. For each cell, both ADTs and endogenous transcripts are reverse-transcribed into cDNA and sequenced. This yields a set of counts for the ADTs, to quantify the abundance of each selected protein; and another set of counts for the genes, as in scRNA-seq. We can then examine aspects of the proteome (e.g., post-translational modifications) and other cellular features that would normally be overlooked in transcriptomic studies. To analyze CITE-seq data, we split the dataset into the RNA and ADT counts and apply the usual steps (quality control, normalization, etc.) to each modality. For the RNA modality, we can re-use the same functions from the previous chapters as if the data were generated from an scRNA-seq experiment. For ADTs, some tweaks are necessary to account for unique aspects of the ADT counts - specifically, fewer features are available as the proteins of interest were chosen by the researcher, and the coverage of each ADT is much deeper as the sequencing resources are concentrated into a smaller number of features. Once modality-specific processing is complete, we combine the ADT and RNA data so that information in both modalities is used in downstream steps like clustering. 9.2 Quality control As in the RNA-based analysis, we want to remove cells in which ADTs were not efficiently captured or sequenced. This involves similar QC metrics to those described in Chapter 1, specifically: The number of ADTs detected (i.e., with non-zero counts) in each cell. We expect non-zero counts for most ADTs in each cell, even if the corresponding protein target is not present on the cell surface.
This is due to deeper sequencing coverage that picks up free-floating antibodies in the ambient solution or antibodies that are non-specifically bound to the cell membrane. An unusually low number of detected features is indicative of a failure in library preparation or sequencing. The sum of counts for isotype control (IgG) antibodies. IgG controls lack a specific target in the cell but otherwise have similar properties to the primary antibodies against the proteins of interest. The coverage of these control ADTs serves as a measure of non-specific binding in each cell. A large sum for the controls is indicative of a problem with specificity, possibly even the formation of undesirable protein aggregates. We demonstrate using a PBMC dataset from 10X Genomics (Zheng et al. 2017) that contains quantified abundances for a number of interesting surface proteins. library(DropletTestFiles) path.pbmc &lt;- getTestFile(&quot;tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz&quot;) dir.pbmc &lt;- tempfile() untar(path.pbmc, exdir=dir.pbmc) # Loading it in as a SingleCellExperiment object. library(DropletUtils) sce.pbmc &lt;- read10xCounts(file.path(dir.pbmc, &quot;filtered_feature_bc_matrix&quot;)) # Splitting off the ADTs into an alternative experiment for separate # processing, otherwise they&#39;d be treated as genes. sce.pbmc &lt;- splitAltExps(sce.pbmc, rowData(sce.pbmc)$Type) sce.pbmc ## class: SingleCellExperiment ## dim: 33538 7865 ## metadata(1): Samples ## assays(1): counts ## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: Gene Expression ## altExpNames(1): Antibody Capture # Here, the &quot;main&quot; experiment contains the RNA data, while the alternative # experiment contains the antibody data.
mainExpName(sce.pbmc) ## [1] &quot;Gene Expression&quot; sce.adt.pbmc &lt;- altExp(sce.pbmc, &quot;Antibody Capture&quot;) sce.adt.pbmc ## class: SingleCellExperiment ## dim: 17 7865 ## metadata(1): Samples ## assays(1): counts ## rownames(17): CD3 CD4 ... IgG1 IgG2b ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(0): ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): # Taking a sneak peek at the ADT counts. counts(sce.adt.pbmc)[,1:10] ## 17 x 10 sparse Matrix of class &quot;dgCMatrix&quot; ## ## CD3 18 30 18 18 5 21 34 48 4522 2910 ## CD4 138 119 207 11 14 1014 324 1127 3479 2900 ## CD8a 13 19 10 17 14 29 27 43 38 28 ## CD14 491 472 1289 20 19 2428 1958 2189 55 41 ## CD15 61 102 128 124 156 204 607 128 111 130 ## CD16 17 155 72 1227 1873 148 676 75 44 37 ## CD56 17 248 26 491 458 29 29 29 30 15 ## CD19 3 3 8 5 4 7 15 4 6 6 ## CD25 9 5 15 15 16 52 85 17 13 18 ## CD45RA 110 125 5268 4743 4108 227 175 523 4044 1081 ## CD45RO 74 156 28 28 21 492 517 316 26 43 ## PD-1 9 9 20 25 28 16 26 16 28 16 ## TIGIT 4 9 11 59 76 11 12 12 9 8 ## CD127 7 8 12 16 17 15 11 10 231 179 ## IgG2a 5 4 12 12 7 9 6 3 19 14 ## IgG1 2 8 19 16 14 10 12 7 16 10 ## IgG2b 3 3 6 4 9 8 50 2 8 2 We compute each of the QC metrics described above from the ADT count matrix. We also compute the sum of counts across all ADTs for each cell, but this is strictly for informational purposes only as it is not an effective QC metric. Specifically, the presence of a targeted protein can lead to a several-fold increase in the total ADT count, given the binary nature of most surface markers. Removing cells with low total ADT counts could inadvertently eliminate cell types that do not express many - or indeed, any - of the selected protein targets. Similarly, we prefer to use the sum of IgG counts instead of the proportion as the latter relies on the total count and is more affected by the biology.
For example, a cell that does not express any of the targets would have a lower total and thus a higher IgG proportion, making it unfairly susceptible to removal. library(scrapper) is.igg.pbmc &lt;- grep(&quot;^IgG&quot;, rownames(sce.adt.pbmc)) sce.qc.adt.pbmc &lt;- quickAdtQc.se(sce.adt.pbmc, subsets=list(IgG=is.igg.pbmc)) summary(sce.qc.adt.pbmc$sum) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 3332 5816 6509 8166 147076 summary(sce.qc.adt.pbmc$detected) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 17.00 17.00 16.94 17.00 17.00 summary(sce.qc.adt.pbmc$subset.sum.IgG) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 18.00 23.00 27.41 30.00 2113.00 The quickAdtQc.se() function computes thresholds using the outlier-based strategy described in Section 1.3.1 (Figure 9.1). We use a log-transformation for the number of detected features and the IgG sum to avoid negative thresholds and improve normality. We also perform a minor adjustment to relax the threshold for the number of detected ADTs if the MAD is zero. qc.thresh.adt.pbmc &lt;- metadata(sce.qc.adt.pbmc)$qc$thresholds qc.thresh.adt.pbmc ## $detected ## [1] 15.3 ## ## $subset.sum ## IgG ## 74.98505 library(scater) gridExtra::grid.arrange( plotColData(sce.qc.adt.pbmc, y=&quot;detected&quot;) + geom_hline(yintercept=qc.thresh.adt.pbmc$detected, linetype=&quot;dashed&quot;, color=&quot;red&quot;) + ggtitle(&quot;Detected features&quot;), plotColData(sce.qc.adt.pbmc, y=&quot;subset.sum.IgG&quot;) + geom_hline(yintercept=qc.thresh.adt.pbmc$subset.sum[&quot;IgG&quot;], linetype=&quot;dashed&quot;, color=&quot;red&quot;) + scale_y_log10() + ggtitle(&quot;IgG sum&quot;), ncol=2 ) Figure 9.1: Distribution of ADT-based QC metrics in the PBMC dataset. Each point represents a cell, while dashed lines represent thresholds for each metric. We then apply these thresholds to our metrics to identify high-quality cells. 
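The logic behind these thresholds can be sketched in base R with simulated metrics. This is only our approximation of the outlier-based strategy, not the actual scrapper implementation: we assume a cutoff of 3 MADs from the median, and we assume that the adjustment for the number of detected ADTs requires at least a 10% drop from the median, which would be consistent with the threshold of 15.3 reported above for a median of 17.

```r
set.seed(42)

# Simulated QC metrics for 1000 cells (hypothetical values).
sim.igg <- rlnorm(1000, meanlog=log(25), sdlog=0.3)
sim.detected <- rep(17, 1000)
sim.detected[1:5] <- c(3, 8, 12, 16, 16)

# Upper threshold for the IgG sum: median plus 3 MADs on the log-scale,
# transformed back to the original scale. The log-transformation means
# that the threshold can never be negative.
num.mads <- 3
log.igg <- log(sim.igg)
igg.threshold <- exp(median(log.igg) + num.mads * mad(log.igg))

# Lower threshold for the number of detected ADTs, also on the log-scale.
log.det <- log(sim.detected)
det.threshold <- exp(median(log.det) - num.mads * mad(log.det))

# Here the MAD is zero, so the threshold is equal to the median and would
# discard every cell that is missing even a single ADT. We relax it by
# requiring at least a 10% drop from the median (an assumed value).
min.drop <- 0.1
det.threshold <- min(det.threshold, median(sim.detected) * (1 - min.drop))

# Cells are retained if they pass both thresholds.
keep <- sim.igg <= igg.threshold & sim.detected >= det.threshold
summary(keep)
```

With this relaxation, the simulated cells with 16 detected ADTs are retained while those with 12 or fewer are discarded.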
If we wanted to use custom thresholds, we could modify them in the same manner as described in Section 1.3.2. Similarly, if our dataset contained multiple experimental batches, we could use the same blocking approach as described in Section 1.5. summary(sce.qc.adt.pbmc$keep) ## Mode FALSE TRUE ## logical 158 7707 If we were only interested in the ADT data, we could subset our SingleCellExperiment with sce.qc.adt.pbmc$keep and proceed to the next step. However, the entire purpose of CITE-seq is to examine both protein abundance and gene expression for the same cell. Thus, we need to apply quality control to the RNA counts as described in Chapter 1. We only keep cells that are considered to be of high quality in both the ADT and RNA modalities. is.mito.pbmc &lt;- grep(&quot;^MT-&quot;, rowData(sce.pbmc)$Symbol) sce.qc.pbmc &lt;- quickRnaQc.se(sce.pbmc, subsets=list(MT=is.mito.pbmc)) # Seeing how many cells pass both, one or neither of the QC filters. table(RNA=sce.qc.pbmc$keep, ADT=sce.qc.adt.pbmc$keep) ## ADT ## RNA FALSE TRUE ## FALSE 41 296 ## TRUE 117 7411 # Only keeping cells that pass both filters. qc.keep.combined.pbmc &lt;- sce.qc.pbmc$keep &amp; sce.qc.adt.pbmc$keep sce.qc.pbmc &lt;- sce.qc.pbmc[,qc.keep.combined.pbmc] sce.qc.adt.pbmc &lt;- sce.qc.adt.pbmc[,qc.keep.combined.pbmc] ncol(sce.qc.pbmc) ## [1] 7411 9.3 Normalization As with RNA, we perform scaling normalization to remove cell-specific biases due to differences in library preparation and sequencing efficiency (Chapter 2). Unfortunately, we can’t just take the size factors for the RNA counts and re-use them for the ADTs. The two modalities will be subject to different biases due to differences in biophysical properties between endogenous transcripts and ADTs, e.g., length, sequence composition. Some aspects of the library preparation and sequencing are also unique to each modality, providing more opportunities for differences in the biases. 
So, instead, we need to compute ADT-specific size factors to normalize the ADT counts. The simplest choice of size factor is to use the total sum of ADT counts, i.e., the library size for the ADTs. Unfortunately, this is highly susceptible to composition biases caused by differences in protein abundance between cells. Composition biases are much more pronounced in ADT data compared to RNA due to (i) the binary nature of target protein abundances, where any increase in protein abundance manifests as a large increase to the total ADT count; and (ii) the a priori selection of interesting protein targets, which enriches for features that are more likely to be differentially abundant across the population. These composition biases are strong enough to interfere with interpretation of fold-changes in protein abundance between clusters. Instead, we use the geometric mean of all counts as the size factor for each cell (Stoeckius et al. 2017), which is based on the centered log-ratio (CLR) transformation for handling compositional data. The geometric mean is a reasonable estimator of the scaling biases for large counts, with the added benefit that it mitigates the effects of composition biases by dampening the impact of one or two highly abundant ADTs. scrapper implements a slightly more accurate variant of this approach named “CLRm1”, which accounts for the bias introduced by adding a pseudo-count during the calculation of the geometric mean. We center the size factors to ensure that the scaling normalization preserves the magnitude of the original counts, and we compute log-normalized abundance values for ADTs as described in Section 2.3.1. sce.norm.adt.pbmc &lt;- normalizeAdtCounts.se(sce.qc.adt.pbmc) summary(sce.norm.adt.pbmc$sizeFactor) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.2041 0.7042 0.9094 1.0000 1.1486 6.7457 We observe some deviation between the CLRm1 size factors and their library size-derived counterparts (Figure 9.2). 
This is consistent with the presence of strong composition biases in the latter that are dampened in the former. Of course, the geometric mean is not foolproof and will progressively become less accurate with more upregulated ADTs in each cell. It is also more sensitive to noise at low counts, though this should be less problematic for ADT data due to its deeper sequencing coverage compared to RNA. lib.sf.adt.pbmc &lt;- centerSizeFactors(sce.norm.adt.pbmc$sum) plot(sce.norm.adt.pbmc$sizeFactor, lib.sf.adt.pbmc, log=&quot;xy&quot;, pch=16, cex=0.5) Figure 9.2: Comparison between the CLRm1 size factors and the library size-derived factors for the ADT modality of the PBMC dataset. 9.4 Feature selection and PCA Feature selection for ADTs is generally unnecessary as it was already performed during the design of the antibody panel. The manual choice of target proteins means that all ADTs already correspond to “interesting” features. In addition, there is little scope for further filtering when the number of ADTs is low. Here, we have fewer than 20 ADTs, and even for the larger datasets, the panel will usually have fewer than 200 features. These are small numbers compared to our previous selections of 1000-5000 HVGs in Chapter 3. We might consider removing the IgG controls as we know that they will not be biologically interesting. This probably won’t make much difference as the controls are unlikely to exhibit strong variation that might interfere with downstream steps. But it probably won’t hurt either, so we might as well do it: selected.adt.pbmc &lt;- !grepl(&quot;^IgG&quot;, rownames(sce.norm.adt.pbmc)) rowData(sce.norm.adt.pbmc)$of.interest &lt;- selected.adt.pbmc summary(selected.adt.pbmc) ## Mode FALSE TRUE ## logical 3 14 We also perform a PCA on the ADT log-abundance matrix as described in Chapter 4. This is mostly useful for datasets with larger panels to compact the data from ~200 ADTs to 10-20 PCs. 
For smaller datasets, PCA is unnecessary as the number of ADTs is comparable to the typical number of PCs. Regardless, it doesn’t hurt to run a PCA in such cases - if the number of ADTs is lower than the requested number of PCs, the PC scores will simply be a rotation of the log-abundance data. sce.pca.adt.pbmc &lt;- runPca.se( sce.norm.adt.pbmc, features=selected.adt.pbmc, number=20 ) dim(reducedDim(sce.pca.adt.pbmc, &quot;PCA&quot;)) ## [1] 7411 14 If we don’t want to run a PCA, we could instead use the log-normalized abundance matrix directly in downstream analyses. # Transpose to make it look like a reducedDim entry, so that we could plug it # into downstream algorithms by just setting reddim.type=. norm.adt.pbmc &lt;- t(assay(sce.norm.adt.pbmc, &quot;logcounts&quot;)[selected.adt.pbmc,]) reducedDim(sce.pca.adt.pbmc, &quot;selected&quot;) &lt;- as.matrix(norm.adt.pbmc) # For example... sce.kmeans.adt.pbmc &lt;- clusterKmeans.se(sce.pca.adt.pbmc, k=10, reddim.type=&quot;selected&quot;) summary(sce.kmeans.adt.pbmc$clusters) ## 1 2 3 4 5 6 7 8 9 10 ## 916 1232 698 490 615 1249 160 549 682 820 9.5 The rest of the analysis Once we have the PCs, we can use them for clustering and visualization in the same manner as described in Chapters 5 and 6. This summarizes the heterogeneity specific to the ADT modality (Figure 9.3). sce.nn.adt.pbmc &lt;- runAllNeighborSteps.se(sce.pca.adt.pbmc) table(sce.nn.adt.pbmc$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 702 886 956 1017 208 640 570 416 409 312 292 482 135 82 160 144 library(scater) plotReducedDim(sce.nn.adt.pbmc, &quot;TSNE&quot;, colour_by=&quot;clusters&quot;) Figure 9.3: \\(t\\)-SNE plot generated from the log-normalized abundance of each ADT in the PBMC dataset. Each point is a cell and is colored according to its assigned cluster. We then identify markers from the log-abundance matrix, as described in Chapter 7. 
For the top ADTs, we usually observe very large effect sizes due to the binary nature of surface targets. However, there are also strong composition biases in this data so some caution is required when interpreting the smaller log-fold changes. markers.adt.pbmc &lt;- scoreMarkers.se(sce.nn.adt.pbmc, sce.nn.adt.pbmc$clusters) previewMarkers(markers.adt.pbmc[[&quot;1&quot;]]) # Looking at the top marker tags for cluster 1. ## DataFrame with 10 rows and 3 columns ## mean detected lfc ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## CD14 10.47980 1.000000 4.815979 ## CD4 8.93075 1.000000 1.263068 ## CD15 7.23453 1.000000 0.430545 ## CD56 5.42423 1.000000 0.474958 ## CD45RO 7.52737 1.000000 0.355057 ## IgG2a 3.28407 0.994302 0.209745 ## CD16 6.25177 1.000000 0.029543 ## CD25 4.45000 1.000000 0.131739 ## IgG1 3.67019 0.998575 0.123389 ## IgG2b 2.49592 0.961538 0.170127 We can also use the ADT-derived clusters to identify marker genes from the log-expression matrix for the RNA modality. This is analogous to performing FACS to isolate cell types before differential expression analyses with bulk RNA-seq. # Computing log-normalized expression values from the RNA counts. sce.norm.pbmc &lt;- normalizeRnaCounts.se(sce.qc.pbmc) # Computing markers for RNA data but using the ADT-derived clusters! markers.adt2rna.pbmc &lt;- scoreMarkers.se(sce.norm.pbmc, sce.nn.adt.pbmc$clusters, extra.columns=&quot;Symbol&quot;) # Now looking at the top marker genes for cluster 1. 
previewMarkers(markers.adt2rna.pbmc[[&quot;1&quot;]], pre.columns=&quot;Symbol&quot;) ## DataFrame with 10 rows and 4 columns ## Symbol mean detected lfc ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000090382 LYZ 6.09956 1.000000 5.04224 ## ENSG00000101439 CST3 3.88405 1.000000 3.14760 ## ENSG00000011600 TYROBP 3.76942 1.000000 2.72675 ## ENSG00000158869 FCER1G 3.16000 0.998575 2.36430 ## ENSG00000163220 S100A9 5.93046 1.000000 4.89871 ## ENSG00000085265 FCN1 3.07327 0.994302 2.55797 ## ENSG00000163563 MNDA 2.71558 0.997151 2.26325 ## ENSG00000163131 CTSS 3.62322 0.998575 2.64997 ## ENSG00000143546 S100A8 5.24883 0.994302 4.37216 ## ENSG00000025708 TYMP 2.66359 0.995726 2.12287 Conversely, we could derive clusters from the RNA data and test for differential abundance of ADTs between clusters. This is most relevant when the ADTs represent some kind of functional readout (e.g., binding activity) instead of cell type identity. 9.6 Combining modalities A more efficient use of our CITE-seq data would consider heterogeneity in both modalities simultaneously. In other words, the ADT and RNA data are combined in some manner prior to clustering and visualization. This ensures that any unique variation in either modality will be captured in the cluster definitions. For example, if the antibody panel captures transient post-translational modifications like phosphorylation, this will not show up in the RNA data; conversely, biological processes without a surface target will not be represented in the ADT data. To demonstrate, let’s continue the analysis of the RNA modality of our PBMC dataset: sce.var.pbmc &lt;- chooseRnaHvgs.se(sce.norm.pbmc) sce.pca.pbmc &lt;- runPca.se(sce.var.pbmc, features=rowData(sce.var.pbmc)$hvg, number=20) ncol(reducedDim(sce.pca.pbmc)) ## [1] 20 Possibly the simplest method to combine modalities involves literally combining the matrices of ADT- and RNA-derived PC scores. 
(Or if no PCA was performed for the ADTs, the log-abundance matrix can be used instead.) The combined matrix contains both sets of PCs, ensuring that heterogeneity from both modalities will be considered, e.g., when computing distances and finding neighbors. However, naively combining the two matrices is not ideal as the number of genes is typically several orders of magnitude greater than the number of ADTs. This would cause the RNA modality to dominate the variance in the combined matrix, effectively sidelining any contributions from the ADT modality. Instead, we scale the modalities to balance their contributions to the combined matrix with the scaleByNeighbors.se() function. For each modality, we compute the median distance from each cell to its \\(k\\)-th nearest neighbor, which we treat as a proxy for the uninteresting variation within subpopulations. Each matrix of PCs is then scaled according to its median distance, equalizing the magnitude of uninteresting variation across modalities. This ensures that high baseline variation in one modality will not drown out interesting biological variation in another modality in the combined matrix. We use the nearest neighbor distance to avoid capturing genuine biological differences between subpopulations - otherwise, if we scaled on total variance, we would penalize the most informative modalities with the strongest heterogeneity. # We put our ADT experiment back inside the parent object so that # scaleByNeighbors.se can see both sets of PCs at once. altExp(sce.pca.pbmc, &quot;Antibody Capture&quot;) &lt;- sce.pca.adt.pbmc sce.combined.pbmc &lt;- scaleByNeighbors.se( sce.pca.pbmc, main.reddims=&quot;PCA&quot;, altexp.reddims=c(`Antibody Capture`=&quot;PCA&quot;) ) dim(reducedDim(sce.combined.pbmc, &quot;combined&quot;)) ## [1] 7411 34 # Scaling applied to PCs from the main experiment, i.e., the RNA. 
metadata(sce.combined.pbmc)$combined$main.scaling ## PCA ## 1 # Scaling applied to PCs from the alternative experiment, i.e., the ADTs. metadata(sce.combined.pbmc)$combined$altexp.scaling ## $`Antibody Capture` ## PCA ## 2.195758 The combined matrix of PCs is convenient as it can be used in the same functions that accept a regular matrix of PCs. Now, we can easily accommodate multiple modalities in downstream steps like clustering and visualization (Figure 9.4). sce.nn.combined.pbmc &lt;- runAllNeighborSteps.se(sce.combined.pbmc, reddim.type=&quot;combined&quot;) table(sce.nn.combined.pbmc$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 1649 730 1005 763 585 510 460 375 461 208 230 91 73 158 79 34 plotReducedDim(sce.nn.combined.pbmc, &quot;TSNE&quot;, colour_by=&quot;clusters&quot;) Figure 9.4: \\(t\\)-SNE plot of the PBMC data generated from combined ADT and RNA PCs. Each point is a cell and is colored according to the assigned cluster. In practice, the RNA and ADT modalities are often strongly correlated when the antibody panel targets cell type-related proteins. Using a combined matrix does not offer much benefit in these cases - in fact, we would say that a well-designed panel is more than enough for cell type identification, without any help from gene expression at all. Combining modalities may even be detrimental if one of the modalities has little biological variation, e.g., if no antibodies are bound, the ADT matrix will only be contributing noise. So, what should we do? Well, our usual advice for single-cell analysis applies, a.k.a., see if we get interesting results and try something else if we don’t. 
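For readers who want to see the neighbor-based scaling spelled out, here is a minimal base R sketch on simulated PC matrices. It only illustrates the idea described in this section and is not the scaleByNeighbors.se() implementation. The helper median_knn_distance(), the choice of k = 20, and the use of the RNA modality as the reference (so that its scaling factor is 1, as in the output above) are assumptions for illustration.

```r
# Hypothetical helper (not part of scrapper): median distance from each
# cell to its k-th nearest neighbor, with cells in rows.
median_knn_distance <- function(x, k = 20) {
    d <- as.matrix(dist(x))  # all pairwise Euclidean distances
    # Position 1 of each sorted row is the cell itself (distance zero),
    # so the k-th neighbor sits at position k + 1.
    kth <- apply(d, 1, function(d.row) sort(d.row)[k + 1])
    median(kth)
}

# Simulated stand-ins for the two sets of PCs, with more noise in the RNA.
set.seed(42)
ncells <- 200
rna.pcs <- matrix(rnorm(ncells * 20, sd = 3), nrow = ncells)
adt.pcs <- matrix(rnorm(ncells * 14, sd = 1), nrow = ncells)

d.rna <- median_knn_distance(rna.pcs)
d.adt <- median_knn_distance(adt.pcs)

# Assumed convention: scale every modality to match the first one.
scale.rna <- 1
scale.adt <- d.rna / d.adt

# After scaling, both modalities have the same median neighbor distance,
# so neither dominates the distances computed from the combined matrix.
combined <- cbind(rna.pcs * scale.rna, adt.pcs * scale.adt)
dim(combined)
```

Because Euclidean distances scale linearly with the coordinates, multiplying the ADT matrix by scale.adt makes its median neighbor distance equal to that of the RNA matrix. The brute-force dist() call is only practical for small examples; a real dataset would need a proper nearest-neighbor search.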
Session information sessionInfo() ## R version 4.6.0 alpha (2026-04-05 r89794) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] scater_1.39.4 ggplot2_4.0.2 ## [3] scuttle_1.21.6 scrapper_1.5.17 ## [5] DropletUtils_1.31.1 SingleCellExperiment_1.33.2 ## [7] SummarizedExperiment_1.41.1 Biobase_2.71.0 ## [9] GenomicRanges_1.63.2 Seqinfo_1.1.0 ## [11] IRanges_2.45.0 S4Vectors_0.49.1 ## [13] BiocGenerics_0.57.0 generics_0.1.4 ## [15] MatrixGenerics_1.23.0 matrixStats_1.5.0 ## [17] DropletTestFiles_1.21.0 BiocStyle_2.39.0 ## ## loaded via a namespace (and not attached): ## [1] DBI_1.3.0 gridExtra_2.3 ## [3] httr2_1.2.2 rlang_1.2.0 ## [5] magrittr_2.0.5 otel_0.2.0 ## [7] compiler_4.6.0 RSQLite_2.4.6 ## [9] DelayedMatrixStats_1.33.0 png_0.1-9 ## [11] vctrs_0.7.3 pkgconfig_2.0.3 ## [13] crayon_1.5.3 fastmap_1.2.0 ## [15] dbplyr_2.5.2 XVector_0.51.0 ## [17] labeling_0.4.3 rmarkdown_2.31 ## [19] ggbeeswarm_0.7.3 purrr_1.2.2 ## [21] bit_4.6.0 xfun_0.57 ## [23] cachem_1.1.0 beachmat_2.27.5 ## [25] jsonlite_2.0.0 blob_1.3.0 ## [27] rhdf5filters_1.23.3 DelayedArray_0.37.1 ## [29] Rhdf5lib_1.33.6 BiocParallel_1.45.0 ## [31] irlba_2.3.7 parallel_4.6.0 ## [33] R6_2.6.1 bslib_0.10.0 ## [35] RColorBrewer_1.1-3 limma_3.67.1 ## [37] jquerylib_0.1.4 Rcpp_1.1.1 ## [39] bookdown_0.46 knitr_1.51 ## [41] R.utils_2.13.0 
Matrix_1.7-5 ## [43] tidyselect_1.2.1 viridis_0.6.5 ## [45] dichromat_2.0-0.1 abind_1.4-8 ## [47] yaml_2.3.12 codetools_0.2-20 ## [49] curl_7.0.0 lattice_0.22-9 ## [51] tibble_3.3.1 S7_0.2.1 ## [53] withr_3.0.2 KEGGREST_1.51.1 ## [55] evaluate_1.0.5 BiocFileCache_3.1.0 ## [57] ExperimentHub_3.1.0 Biostrings_2.79.5 ## [59] pillar_1.11.1 BiocManager_1.30.27 ## [61] filelock_1.0.3 BiocVersion_3.23.1 ## [63] sparseMatrixStats_1.23.0 scales_1.4.0 ## [65] glue_1.8.0 tools_4.6.0 ## [67] AnnotationHub_4.1.0 BiocNeighbors_2.5.4 ## [69] ScaledMatrix_1.19.0 locfit_1.5-9.12 ## [71] cowplot_1.2.0 rhdf5_2.55.16 ## [73] grid_4.6.0 AnnotationDbi_1.73.1 ## [75] edgeR_4.9.7 beeswarm_0.4.0 ## [77] BiocSingular_1.27.1 HDF5Array_1.39.1 ## [79] vipor_0.4.7 rsvd_1.0.5 ## [81] cli_3.6.6 rappdirs_0.3.4 ## [83] viridisLite_0.4.3 S4Arrays_1.11.1 ## [85] dplyr_1.2.1 gtable_0.3.6 ## [87] R.methodsS3_1.8.2 sass_0.4.10 ## [89] digest_0.6.39 ggrepel_0.9.8 ## [91] SparseArray_1.11.13 dqrng_0.4.1 ## [93] farver_2.1.2 memoise_2.0.1 ## [95] htmltools_0.5.9 R.oo_1.27.1 ## [97] lifecycle_1.0.5 h5mread_1.3.3 ## [99] httr_1.4.8 statmod_1.5.1 ## [101] bit64_4.6.0-1 References "],["closing-remarks.html", "Closing remarks", " Closing remarks Well, it’s over. Congratulations on making it to the end. Congratulations! Here’s a few more pieces of opinionated advice, drawn from bitter experience37: Don’t feel too bad about making subjective decisions in the choice of parameters. Much of scRNA-seq data analysis is exploratory, which is inherently guided by our own interests. We’re just generating new hypotheses at this point so we don’t need to be too rigorous. Because exploration is so open-ended, scRNA-seq data analysis can be quite time-consuming. For example, you might redo the analysis with different parameters, perform subclustering, etc. to examine the data from different perspectives. Make sure you get appropriate recognition for all this effort38. 
Always validate conclusions with independent replicates and a different (i.e., non-sequencing-based) assay technique. We had a fun time with all the subjective data exploration to generate new hypotheses, but at some point, we need to pay the piper and test all the stuff we made up. And hey, if all else fails, there’s nothing wrong with a bit of stamp collecting39. You’ve already been paid to generate and analyze the data, so you might as well get it published; perhaps it might help someone else down the line. As I often told my manager, “fools learn from experience, wise men learn from history.” In the end, I guess the distinction didn’t matter as we both got fired.↩︎ For a study with a major single-cell component, a joint first/corresponding authorship seems to be fair market price for the primary analyst.↩︎ For our younger readers: back in the day, when you wanted to send a message to someone, you would write your message on a piece of paper, put that paper in an envelope (a paper-based packaging device), and request your country’s postal service to physically deliver it to the recipient’s address. Payment for delivery would be denoted by purchasing an adhesive “stamp” and sticking it on the envelope. These stamps would often have interesting decorative features that made them desirable to collectors. In a scientific context, the infamous adage “all science is either physics or stamp collecting” is often attributed to Sir Ernest Rutherford, dismissing the importance of other research fields due to their inability to generate testable hypotheses. The comparison is particularly apt for much of single-cell genomics, though at least stamp collecting is fun and cheap.↩︎ "],["references.html", "References", " References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
