The package can be installed from Bioconductor:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scDDboost")
Issues can be reported at https://github.com/wiscstatman/scDDboost/issues
scDDboost scores the evidence that a gene is differentially distributed (DD) across two conditions in single-cell RNA-seq data. The higher resolution of single-cell data brings several challenges for analysis; in particular, the distribution of gene expression tends to have a high prevalence of zeros and multiple modes. To account for these characteristics, and drawing on biological intuition, we view the expression values as sampled from a pool of cells that is a mixture of distinct cellular subtypes, where the subtypes are defined blind to the condition label. Consequently, any distributional change is fully determined by the change in subtype proportions. One subtlety is that not every change in proportions leads to a distributional change. Because a gene may be equivalently expressed across several subtypes, the individual subtype proportions may differ between conditions, but as long as the proportions aggregated over those subtypes remain the same, the gene's distribution does not change. For example:
Suppose the proportions of subtypes 1 and 2 change between the two conditions. The gene is not DD if subtypes 1 and 2 have the same expression level.
If subtypes 1 and 2 have different expression levels, the gene's distribution differs between the conditions.
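The example above can be made concrete with a small base R sketch. The proportions, the expression-level labels, and the helper is_dd are all hypothetical and chosen only for illustration; none of them are part of scDDboost.

```r
# Hypothetical subtype proportions in the two conditions (three subtypes).
prop_cond1 <- c(0.2, 0.3, 0.5)
prop_cond2 <- c(0.4, 0.1, 0.5)

# Aggregate the proportions over subtypes that share an expression level,
# then report whether the aggregated proportions differ between conditions.
is_dd <- function(expr_level, p1, p2, tol = 1e-12) {
    agg1 <- tapply(p1, expr_level, sum)
    agg2 <- tapply(p2, expr_level, sum)
    any(abs(agg1 - agg2) > tol)
}

# Gene A: subtypes 1 and 2 are equivalently expressed.
gene_a <- c("low", "low", "high")
# Gene B: subtypes 1 and 2 have different expression levels.
gene_b <- c("low", "mid", "high")

is_dd(gene_a, prop_cond1, prop_cond2)  # FALSE: aggregated proportions match
is_dd(gene_b, prop_cond1, prop_cond2)  # TRUE: proportions differ per level
```

For gene A, the proportions of subtypes 1 and 2 sum to 0.5 in both conditions, so the shift between them is invisible at the level of the gene's distribution.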
pdd is the core function developed to quantify the
posterior probabilities of DD for input genes.
Let's look at an example.
suppressMessages(library(scDDboost))
Next, we load the toy simulated example object that we will use for identifying and classifying DD genes.
data(sim_dat)
Verify that this object is a member of the SingleCellExperiment class and that it contains 200 cells and 1000 genes. The colData slot (which contains a data frame of metadata for the cells) should have a column that contains the biological condition or grouping of interest. In this example data, that variable is the conditions column. Note that the input gene set needs to be a matrix of normalized counts.
We run the function pdd:
data_counts <- SummarizedExperiment::assays(sim_dat)$counts
conditions <- SummarizedExperiment::colData(sim_dat)$conditions
rownames(data_counts) <- seq_len(1000)
## here we use 2 cores to compute the distance matrix
bp <- BiocParallel::MulticoreParam(2)
D_c <- calD(data_counts, bp)
ProbDD <- pdd(data = data_counts, cd = conditions, bp = bp, D = D_c)
Four input parameters must be specified by the user: the dataset, the condition labels, the BiocParallel backend (which sets the number of CPU cores used for computation), and a distance matrix of cells. All other input parameters have default settings.
We provide a default method for computing the distance matrix, implemented by calD; in general, pdd accepts any valid distance matrix. The user can also supply cluster labels rather than a distance matrix for the argument D, but then the random distancing mechanism, which relies on a distance matrix, is disabled, and random should be set to FALSE.
For the number of subtypes, we provide a default function, detK, which selects the smallest number of subtypes such that the ratio of the within-cluster difference to the between-cluster difference falls below a threshold (the default is 1).
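The selection rule can be illustrated with a base R sketch. This is not the implementation of detK: it assumes that "difference" means mean pairwise distance, uses average-linkage hierarchical clustering, and introduces the hypothetical helpers within_between_ratio and choose_k along with simulated toy data.

```r
# Toy data: 30 hypothetical cells drawn from three well-separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2),
           matrix(rnorm(20, mean = 20), ncol = 2))
D <- as.matrix(dist(x))

# Ratio of the mean within-cluster distance to the mean between-cluster
# distance (an assumed definition, used here for illustration only).
within_between_ratio <- function(D, labels) {
    same <- outer(labels, labels, "==")
    off_diag <- row(D) != col(D)
    mean(D[same & off_diag]) / mean(D[!same])
}

# Smallest K whose ratio falls below the threshold (NA if none does).
choose_k <- function(D, threshold = 1, k_max = 10) {
    tree <- hclust(as.dist(D), method = "average")
    for (k in 2:k_max) {
        labels <- cutree(tree, k = k)
        if (within_between_ratio(D, labels) < threshold) return(k)
    }
    NA_integer_
}

choose_k(D)                   # permissive threshold
choose_k(D, threshold = 0.2)  # stricter threshold asks for tighter clusters
```

A lower threshold demands tighter clusters relative to their separation, so it can only keep or increase the selected number of subtypes.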
If the user has another way to determine \(K\), \(K\) should be specified in pdd.
## determine the number of subtypes
K <- detK(D_c)
If we set the threshold at 5%, the estimated DD genes are
EDD <- which(ProbDD > 0.95)
Note that one minus the posterior probability of DD is the local false discovery rate (lFDR), so thresholding it directly gives a conservative estimate of the DD genes. We can gain further power by controlling the false discovery rate instead. Index the genes by \(g = 1,2,...,G\), let \(p_g = 1 - P(DD_g | \text{data})\) be the lFDR of gene \(g\), and let \(p_{(1)} \leq ... \leq p_{(G)}\) be the lFDR values ranked from small to large. To control the false discovery rate at 5%, our positive set is those genes with the \(s^*\) smallest lFDR, where \[s^* = \max\left\{s : \frac{\sum_{i = 1}^s p_{(i)}}{s} \leq 0.05\right\}\]
EDD <- getDD(ProbDD, 0.05)
The function getDD extracts the estimated DD genes using the above transformation.
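To show how this rule differs from direct thresholding, here is a minimal base R sketch of the selection step. It takes the lFDR to be one minus the posterior probability of DD. The function select_dd and the six probabilities are hypothetical, and the sketch is not the source of getDD.

```r
# Hypothetical posterior probabilities of DD for six genes.
prob_dd <- c(g1 = 0.99, g2 = 0.97, g3 = 0.94, g4 = 0.92, g5 = 0.60, g6 = 0.10)

# Keep the s* genes with the smallest lFDR, where s* is the largest s whose
# running mean of ranked lFDR values stays at or below the target FDR.
select_dd <- function(prob_dd, fdr = 0.05) {
    lfdr <- 1 - prob_dd
    ord <- order(lfdr)
    running_mean <- cumsum(lfdr[ord]) / seq_along(ord)
    # The running mean of ascending values is non-decreasing, so counting
    # the entries at or below fdr gives the largest admissible s.
    s_star <- sum(running_mean <= fdr)
    sort(ord[seq_len(s_star)])
}

which(prob_dd > 0.95)  # direct thresholding keeps genes 1 and 2
select_dd(prob_dd)     # FDR control keeps genes 1 to 4
```

Genes 3 and 4 fall just short of the 0.95 cutoff, yet the average lFDR over the top four genes is 0.045, so including them still keeps the estimated false discovery rate under 5%.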
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] S4Vectors_0.40.0     IRanges_2.36.0       GenomicRanges_1.54.0
## [4] scDDboost_1.4.0      ggplot2_3.4.4        BiocStyle_2.30.0    
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.32.0 gtable_0.3.4               
##  [3] xfun_0.40                   bslib_0.5.1                
##  [5] caTools_1.18.2              Biobase_2.62.0             
##  [7] lattice_0.22-5              vctrs_0.6.4                
##  [9] tools_4.3.1                 bitops_1.0-7               
## [11] generics_0.1.3              stats4_4.3.1               
## [13] parallel_4.3.1              tibble_3.2.1               
## [15] fansi_1.0.5                 cluster_2.1.4              
## [17] blockmodeling_1.1.5         pkgconfig_2.0.3            
## [19] KernSmooth_2.23-22          Matrix_1.6-1.1             
## [21] EBSeq_2.0.0                 desc_1.4.2                 
## [23] lifecycle_1.0.3             GenomeInfoDbData_1.2.11    
## [25] compiler_4.3.1              farver_2.1.1               
## [27] brio_1.1.3                  gplots_3.1.3               
## [29] munsell_0.5.0               codetools_0.2-19           
## [31] GenomeInfoDb_1.38.0         htmltools_0.5.6.1          
## [33] sass_0.4.7                  RCurl_1.98-1.12            
## [35] yaml_2.3.7                  pillar_1.9.0               
## [37] crayon_1.5.2                jquerylib_0.1.4            
## [39] BiocParallel_1.36.0         SingleCellExperiment_1.24.0
## [41] cachem_1.0.8                DelayedArray_0.28.0        
## [43] magick_2.8.1                abind_1.4-5                
## [45] mclust_6.0.0                gtools_3.9.4               
## [47] tidyselect_1.2.0            digest_0.6.33              
## [49] dplyr_1.1.3                 bookdown_0.36              
## [51] labeling_0.4.3              rprojroot_2.0.3            
## [53] fastmap_1.1.1               grid_4.3.1                 
## [55] colorspace_2.1-0            cli_3.6.1                  
## [57] SparseArray_1.2.0           magrittr_2.0.3             
## [59] S4Arrays_1.2.0              utf8_1.2.4                 
## [61] withr_2.5.1                 scales_1.2.1               
## [63] rmarkdown_2.25              XVector_0.42.0             
## [65] matrixStats_1.0.0           RcppEigen_0.3.3.9.3        
## [67] evaluate_0.22               knitr_1.44                 
## [69] testthat_3.2.0              Oscope_1.32.0              
## [71] rlang_1.1.1                 Rcpp_1.0.11                
## [73] glue_1.6.2                  BiocManager_1.30.22        
## [75] BiocGenerics_0.48.0         pkgload_1.3.3              
## [77] jsonlite_1.8.7              R6_2.5.1                   
## [79] MatrixGenerics_1.14.0       zlibbioc_1.48.0