\documentclass[a4paper]{article} \begin{document} \title{Dropout-based Feature Selection with M3Drop and NBumi} \author{Tallulah Andrews} %\VignetteIndexEntry{Introduction to M3Drop} \maketitle \section*{Introduction} Feature selection is a key step in the analysis of single-cell RNASeq data. Feature selection aims to identify and remove genes whose expression is dominated by technical noise, thus reducing the effect of batch-effects and other technical counfounders while also reducing the curse of dimensionality. A common way to select features is to identify genes that follow a different expression distribution than a set of control genes. These control genes may be spike-ins but more commonly one assumes that the majority of genes don't differ from control genes and use all genes to approximate the expression pattern of expression of controls. Often the variance relative to the mean expression is used to characterize the expression distribution of a gene, and feature selection proceeds by identifying highly variable genes. In contrast, this package provides two methods which characterize the expression distribution of a gene using the dropout-rate and mean expression. This takes advantage of the fact than at least 50\% of entries in a single-cell RNASeq expression matrix are dropouts, and that dropout rate is less sensitive to sampling errors than variance. \section*{M3Drop} Single-cell RNA sequencing is able to quantify the whole transcriptome from the small amount of RNA present in individual cells. However, a consequence of reverse-transcribing and amplifying small quantities of RNA is a large number of dropouts, genes with zero expression in particular cells. The frequency of dropout events is strongly non-linearly related to the measured expression levels of the respective genes. M3Drop posits that these dropouts are due to failures of reverse transcription, a simple enzyme reaction, thus should be modelled using the Michaelis-Menten equation as follows: $$P_{i} = 1 - \frac{S_{i}}{S_{i}+K}$$ Where $P_{i}$ is the proportion of cells where gene $i$ dropouts out, $S_{i}$ is the mean expression of gene $i$ and $K$ is the Michaelis constant. This model fits observed single-cell RNASeq data which have been sequenced to near saturation (i.e. >100,000 reads per cell). Note: This model works well for Smartseq2 and other single-cell datasets that do not use unique molecular identifiers (UMIs). For 10X Chromium data or other UMI tagged data see the NBumi model below. We'll be using a portion of the Deng et al. (2014) dataset in this example. You can download the R-package containing this data (M3DExampleData) from Bioconductor using biocLite(). <>= library(M3Drop) library(M3DExampleData) @ \subsection*{Prepare Data} The first step is to perform quality control on cells. Any existing tools can be used for this. Here we will simply remove cells with fewer than 2000 detected genes. <<>>= counts <- Mmus_example_list$data labels <- Mmus_example_list$labels total_features <- colSums(counts >= 0) counts <- counts[,total_features >= 2000] labels <- labels[total_features >= 2000] @ The second is to extract and correctly format data for M3Drop. To support quality control tools in various other single-cell R packages, we have provided a function to extract appropriate data from objects used by scater, monocle, and Seurat. This function also removes undetected genes, and if necessary applies a counts-per-million normalization to raw count data. <<>>= norm <- M3DropConvertData(counts, is.counts=TRUE) @ M3Drop requires a normalized but not log-transformed expression matrix, thus the above function can optionally de-log a log2-normalized expression matrix from any normalization method. <<>>= norm <- M3DropConvertData(log2(norm+1), is.log=TRUE, pseudocount=1) @ Any normalization method which preserves dropouts (i.e zeros) is compatible with M3Drop. \subsection*{Feature Selection} Since the Michaelis-Menten equation is convex, averaging across a mixed population forces differentially expressed genes to be shifted to the right of the Michaelis-Menten curve (Figure 1). \begin{center} <>= K <- 49 S_sim <- 10^seq(from=-3, to=4, by=0.05) MM <- 1-S_sim/(K+S_sim) plot(S_sim, MM, type="l", lwd=3, xlab="Expression", ylab="Dropout Rate", xlim=c(1,1000)) S1 <- 10; P1 <- 1-S1/(K+S1); S2 <- 750; P2 <- 1-S2/(K+S2); points(c(S1,S2), c(P1,P2), pch=16, col="grey85", cex=3) lines(c(S1, S2), c(P1,P2), lwd=2.5, lty=2, col="grey35") mix <- 0.5; points(S1*mix+S2*(1-mix), P1*mix+P2*(1-mix), pch=16, col="grey35", cex=3) @ \\ \textbf{Figure 1}: Michaelis-Menten is convex which leads to DE genes being outliers to the right/above the curve. \end{center} Feature selection for DE genes are identified by comparing the local $K$ calculated for a specific gene to the global $K$ fitted across all genes using a Z-test followed by multiple-testing correction. Here we find 1,248 DE genes at 1\% FDR. \begin{center} <>= M3Drop_genes <- M3DropFeatureSelection(norm, mt_method="fdr", mt_threshold=0.01) @ \\ \textbf{Figure 2}: Differentially expressed genes at 1\% FDR (purple). \end{center} \section*{NBumi} The Michaelis-Menten equation fits full-transcript single-cell RNASeq data well, but often struggles to fit data tagged with unique molecular identifiers (UMIs). This is a result of UMI datasets typically not being sequenced to saturation, thus many dropouts are a result of low sequencing coverage rather than a failure of reverse transcription. To account for zeros resulting from insufficient sequencing depth, the M3Drop packaged includes a depth-adjusted negative binomial model (DANB). DANB models each observation as a negative binomial distribution with mean proportional to the mean expression of the respective gene and the relative sequencing depth of the respective cell. The dispersion parameter of the negative binomial is fit to the variance of each gene. The equations for calculating these parameters are: Observation specific means: $$\mu_{ij} = \frac{\sum\limits_{i} t_{ij} * \sum\limits_{j} t_{ij}}{\sum\limits_{ij} t_{ij}}$$ Gene specific dispersion (solved for $r_{i}$): $$\sum_{j}(t_{ij}-\mu_{ij})^2 = \sum_{j}(\mu_{ij}+r_{i}\mu_{ij}^2)$$ Where $\mu_{ij}$ is the observation specific mean for the negative binomial for the observed molecule counts, $t_{ij}$ of the $i$th gene in the $j$th of cell and $r_{i}$ is the dispersion parameter (the "size" paramter of R's nbinom functions) for gene $i$. Functions relating to the DANB model are tagged with the "NBumi" prefix. We will continue using the example data despite it not using UMIs for demonstration purposes. Similar to M3Drop we have provided a function to extract the relevant expression matrix from objects used by popular single-cell RNASeq analysis packages : scater, monocle, and Seurat. This is a separate function because DANB requires raw counts as opposed to the normalized expression matrix used by M3Drop. <<>>= count_mat <- NBumiConvertData(Mmus_example_list$data, is.counts=TRUE) @ This function can also de-log and de-normalize a provided expression matrix, but raw counts are preferable if they are available. Next we fit the DANB model to this count matrix, and check it fits the data: \begin{center} <>= DANB_fit <- NBumiFitModel(count_mat) # Smoothed gene-specific variances par(mfrow=c(1,2)) stats <- NBumiCheckFitFS(count_mat, DANB_fit) print(c(stats$gene_error,stats$cell_error)) @ \\ \textbf{Figure 3}: Fits of the depth-adjusted negative binomial. \end{center} Since this is a full-transcript dataset that we have converted from normalized values the model doesn't fit well as can be seen above. We will continue with this data for demonstration purposes only. We use a binomial test to evaluable the significance of features under the DANB model: \begin{center} <>= NBDropFS <- NBumiFeatureSelectionCombinedDrop(DANB_fit, method="fdr", qval.thres=0.01, suppress.plot=FALSE) @ \textbf{Figure 4}: Dropout-based feature selection using DANB \end{center} \subsection*{Pearson Residuals} Pearson residuals have recently been proposed as an alternative normalization approach for UMI tagged single-cell RNAseq data. (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02451-7) We have added two option for calculating Pearson residuals using the DANB model presented here: <<>>= pearson1 <- NBumiPearsonResiduals(count_mat, DANB_fit) pearson2 <- NBumiPearsonResidualsApprox(count_mat, DANB_fit) @ If you have not fit the DANB model to the data, you can run each of these functions directly on the count matrix alone, and the required model fitting will be performed as part of the function. In this case the NBumiPearsonResidualsApprox function will be much quicker since it does not fit the dispersion parameter of the DANB model, only the mean, and approximates the Pearson residuals as if the data was Poisson distributed: $$p_{approx} = \frac{x_{ij}-mu_{ij}}{\sqrt{mu_{ij}}}$$ Where $x_{ij}$ is the observed umi counts. Whereas the NBumiPearsonResiduals function calculates the Pearson residuals using the full DANB model as: $$p_{exact} = \frac{x_{ij}-mu_{ij}}{\sqrt{mu_{ij} + \frac{mu_{ij}^2}{r_i}}}$$ We do not perform any trimming or clipping of the residuals. To perform $\sqrt{n}$ clipping as per Hafemeister and Satija, simple run the following additional steps: <<>>= pearson1[pearson1 > sqrt(ncol(count_mat))] <- sqrt(ncol(count_mat)) pearson1[pearson1 < -sqrt(ncol(count_mat))] <- -1*sqrt(ncol(count_mat)) @ \section*{Other Feature Selection Methods} For comparison purposes we have provided functions for other feature selection methods, including: BrenneckeGetVariableGenes from Brennecke et al. (2013) modified to optionally use all genes to fit the function between CV2 and mean expression. giniFS based on GiniClust (REF) pcaFS based on monocle (REF) using irlba and sparse-matricies. corFS which uses gene-gene correlations. and ConsensusFS which uses all the available feature selection methods and takes the average rank of genes across all methods. Only the Brennecke highly variable gene method also provides a significance test and runs in linear time, thus is the only one we will demostrate here. \begin{center} <>= HVG <- BrenneckeGetVariableGenes(norm) @ \textbf{Figure 5}: Brennecke highly variable genes. \end{center} \section*{Examining Features and Identifying Subpopulations of Cells} To check that the identified genes are truly differentially expressed we can plot the normalised expression of the genes across cells. \begin{center} <>= heat_out <- M3DropExpressionHeatmap(M3Drop_genes$Gene, norm, cell_labels = labels) @ \\ \textbf{Figure 6}: Heatmap of expression of M3Drop features. \end{center} The heatmap (Figure 6) shows that the genes identified by M3Drop are differentially expressed across timepoints. Furthermore, it shows that the blastocysts cluster into two different groups based on the expression of these genes. We can extract these groups and identify marker genes for them as follows: <<>>= cell_populations <- M3DropGetHeatmapClusters(heat_out, k=4, type="cell") library("ROCR") marker_genes <- M3DropGetMarkers(norm, cell_populations) @ The first function cuts the dendrogram from the heatmap to produce k clusters of cells. These labels are stored in cell\_populations. The second function tests all genes as marker genes for the provided clusters. Marker genes are ranked by the area-under the ROC curve (AUC) for predicting the population with the highest expression of the gene from the other groups. Significance is calculated using a Wilcox-rank-sum test. Now we can examine the marker genes of the two clusters of blastocyst cells more closely. <<>>= head(marker_genes[marker_genes$Group==4,],20) marker_genes[rownames(marker_genes)=="Cdx2",] @ This shows that the inner cell mass (ICM) marker Sox2 is one of the top 20 markers for group 4 and that the trophectoderm (TE) marker Cdx2 is a marker for group 3, suggesting these two clusters coorespond to ICM and TE cells within each blastocyst. \section*{Comparing to Other Methods} The DANB features look similar to M3Drop though identifies more genes due to the simplified noise model it uses, which is appropriate for UMI-tagged data but too permissive for data with high amplifcation noise. \begin{center} <>= heat_out <- M3DropExpressionHeatmap(NBDropFS$Gene, norm, cell_labels = labels) @ \\ \textbf{Figure 7}: Heatmap of expression of DANB features. \end{center} The HVG method is more sensitive to lowly expressed genes. In addition, the quadratic model it uses frequenly over estimates the expected variability of highly expressed genes. This is in contrast with M3Drop which recognizes the low information available for lowly expressed genes. This difference can be seen by comparing the heatmaps for the respective genes. The highly variable genes contains many more genes exhibiting just noisy expression, whereas nearly all genes detected by M3Drop are clearly differentially expressed across the different cell populations. \begin{center} <>= heat_out <- M3DropExpressionHeatmap(HVG, norm, cell_labels = labels) @ \\ \textbf{Figure 8}: Heatmap of expression of highly variable genes across cells. \end{center} \section*{References} Tallulah Andrews, Martin Hemberg. Modelling dropouts allows for unbiased identification of marker genes in scRNASeq experiments. \textit{bioRxiv}, 2016. doi:10.1101/065094 Philip Brennecke, Simon Anders, Jong Kyoung Kim, Aleksandra A Kolodziejczyk, Xiuwei Zhang, Valentina Proserpio, Bianka Baying, Vladimir Benes, Sarah A Teichmann, John C Marioni and Marcus G Heisler. Accounting for technical noise in single-cell RNA-seq experiments. \textit{Nature Methods}, 10:1093-1095, 2013. doi:10.1038/nmeth.2645, PMID: 24056876 \end{document}