The waddR package provides an adaptation of the semi-parametric testing procedure based on the 2-Wasserstein distance which is specifically tailored to identify differential distributions in single-cell RNA-seqencing (scRNA-seq) data.
In particular, a two-stage (TS) approach has been implemented that takes account of the specific nature of scRNA-seq data by separately testing for differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test) between two conditions.
This an example on how to analyse the expression of single cell data in two conditions. As example data we will look at single cell data from decidua and blood samples analysed by Vento-Tormo et al. 2018 (https://doi.org/10.1038/s41586-018-0698-6). For this demonstration we use only one replicate of the two conditions that has been normalized, scaled and is available for download on github.
First, we will first load all required packages:
suppressPackageStartupMessages({
    library("waddR")
    library("SingleCellExperiment")
    library("BiocFileCache")
})Download the example data set
url.base <- "https://github.com/goncalves-lab/waddR-data/blob/master/data/"
sce.blood.url <- paste0(url.base, "sce_blood.RDa?raw=true")
sce.decidua.url <- paste0(url.base, "sce_decidua.RDa?raw=true")
getCachedPath <- function(url, rname){
    bfc <- BiocFileCache() # fire up cache
    res <- bfcquery(bfc, url, field="fpath", exact=TRUE)
    if (bfccount(res) == 0L)
        cachedFilePath <- bfcadd(bfc, rname=rname, fpath=url)
    else
        cachedFilePath <- bfcpath(bfc, res[["rid"]])
    cachedFilePath
}
load(getCachedPath(sce.blood.url, "sce.blood"))
load(getCachedPath(sce.decidua.url, "sce.decidua"))We have downloaded the two SingleCellExperiment objects sce.blood and sce.decidua. Next we will randomly select a subset of genes with which we will proceed, a whole analysis would be out of scope for this vignette.
set.seed(28)
randgenes <- sample(rownames(sce.blood), 2500, replace=FALSE)
sce.blood <- sce.blood[randgenes, ]
sce.decidua <- sce.decidua[randgenes, ]The input data to wasserstein.sc always discriminates two conditions. It can be given either as a data matrix and a vector explaining the condition of each column, or a SingleCellExperiment for each condition.
We will proceed to use the two SingleCellExperiment objects we loaded. The SingleCellExperiment objects passed to wasserstein.sc must contain a counts assay each.
assays(sce.blood)
#> List of length 1
#> names(1): counts
assays(sce.decidua)
#> List of length 1
#> names(1): countsBefore testing for differential expression in the two conditions, the counts should be normalized for cell or gene-specific biases. We can skip this here since the example data has already been normalized and an entire analysis would be out of scope for this vignette. The detailed processing steps performed on this data can be seen here.
wsres <- wasserstein.sc(sce.blood, sce.decidua, "OS")
head(wsres, n=10L)
#>                                   d.wass    d.wass^2    d.comp^2     d.comp
#> GNPTG-ENSG00000090581         0.20573251 0.042325867 0.042627776 0.20646495
#> HIVEP3-ENSG00000127124        0.20588614 0.042389102 0.043288184 0.20805813
#> C6orf47-AS1-ENSG00000227198   0.00000000 0.000000000 0.000000000 0.00000000
#> RP11-377D9.3-ENSG00000255621  0.00000000 0.000000000 0.000000000 0.00000000
#> BBOX1-ENSG00000129151         0.00000000 0.000000000 0.000000000 0.00000000
#> CEP290-ENSG00000198707        0.16215300 0.026293594 0.025579340 0.15993542
#> FOXRED2-ENSG00000100350       0.07682033 0.005901363 0.005922216 0.07695594
#> RP11-166D18.1-ENSG00000251471 0.00000000 0.000000000 0.000000000 0.00000000
#> BBX-ENSG00000114439           0.39587040 0.156713370 0.157161283 0.39643572
#> MESTIT1-ENSG00000272701       0.00000000 0.000000000 0.000000000 0.00000000
#>                                   location        size      shape       rho
#> GNPTG-ENSG00000090581         4.142331e-04 0.019841603 0.02237194 0.9344195
#> HIVEP3-ENSG00000127124        3.160493e-03 0.017597022 0.02253067 0.8807532
#> C6orf47-AS1-ENSG00000227198   0.000000e+00 0.000000000 0.00000000 0.0000000
#> RP11-377D9.3-ENSG00000255621  0.000000e+00 0.000000000 0.00000000 0.0000000
#> BBOX1-ENSG00000129151         0.000000e+00 0.000000000 0.00000000 0.0000000
#> CEP290-ENSG00000198707        9.538226e-04 0.004271303 0.02035421 0.7916142
#> FOXRED2-ENSG00000100350       4.165072e-05 0.005880565 0.00000000 0.0000000
#> RP11-166D18.1-ENSG00000251471 0.000000e+00 0.000000000 0.00000000 0.0000000
#> BBX-ENSG00000114439           2.463632e-02 0.002027704 0.13049726 0.8288606
#> MESTIT1-ENSG00000272701       0.000000e+00 0.000000000 0.00000000 0.0000000
#>                                     pval p.ad.gdp N.exc perc.loc perc.size
#> GNPTG-ENSG00000090581         0.20570000       NA    NA     0.97     46.55
#> HIVEP3-ENSG00000127124        0.08620000       NA    NA     7.30     40.65
#> C6orf47-AS1-ENSG00000227198   1.00000000       NA    NA      NaN       NaN
#> RP11-377D9.3-ENSG00000255621  1.00000000       NA    NA      NaN       NaN
#> BBOX1-ENSG00000129151         1.00000000       NA    NA      NaN       NaN
#> CEP290-ENSG00000198707        0.22640000       NA    NA     3.73     16.70
#> FOXRED2-ENSG00000100350       0.60040000       NA    NA     0.70     99.30
#> RP11-166D18.1-ENSG00000251471 1.00000000       NA    NA      NaN       NaN
#> BBX-ENSG00000114439           0.00079992       NA    NA    15.68      1.29
#> MESTIT1-ENSG00000272701       1.00000000       NA    NA      NaN       NaN
#>                               perc.shape decomp.error   pval.adj
#> GNPTG-ENSG00000090581              52.48  0.007132959 0.75513950
#> HIVEP3-ENSG00000127124             52.05  0.021210199 0.39541284
#> C6orf47-AS1-ENSG00000227198          NaN  0.000000000 1.00000000
#> RP11-377D9.3-ENSG00000255621         NaN  0.000000000 1.00000000
#> BBOX1-ENSG00000129151                NaN  0.000000000 1.00000000
#> CEP290-ENSG00000198707             79.57  0.027164565 0.79830748
#> FOXRED2-ENSG00000100350             0.00  0.003533602 1.00000000
#> RP11-166D18.1-ENSG00000251471        NaN  0.000000000 1.00000000
#> BBX-ENSG00000114439                83.03  0.002858166 0.01242112
#> MESTIT1-ENSG00000272701              NaN  0.000000000 1.00000000The 2-Wasserstein single-cell testing procedure is performed on all genes in the data set – one line of the output describes the result of testing one gene.
Note: The two-stage (TS) procedure includes a test for differential proportions of zero gene expression, considering the proportion of zeroes over all conditions and genes in a model. It may yield incorrect results, if it is applied on only a subset of genes is given.
Note: For reproducible p-values, see the “seed” argument.
| Output Column | Description | 
|---|---|
| d.wass | 2-Wasserstein distance between the two samples | 
| d.wass^2 | squared 2-Wasserstein distance between the two samples | 
| d.comp^2 | squared 2-Wasserstein distance (decomposition approx.) | 
| d.comp | 2-Wasserstein distance (decomposition approx.) | 
| location | location term in the squared 2-Wasserstein decomposition | 
| size | size term in the squared 2-Wasserstein decomposition | 
| shape | shape term in the squared 2-Wasserstein decomposition | 
| rho | correlation coefficient in the quantile-quantile plot | 
| pval | p-value of the semi-parametric test | 
| p.ad.gpd | p-value of the Anderson-Darling test to check whether the GPD actually fits the data well * | 
| N.exc | number of exceedances (starting with 250 and iteratively decreased by 10 if necessary) that are required to obtain a good GPD fit (i.e. p-value of Anderson-Darling test greater or eqaul to 0.05) * | 
| perc.loc | fraction (in %) of the location part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation | 
| perc.size | fraction (in %) of the size part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation | 
| perc.shape | fraction (in %) of the shape part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation | 
| decomp.error | relative error between the squared 2-Wasserstein distance computed by the quantile approximation and the squared 2-Wasserstein distance computed by the decomposition approximation | 
| p.zero | p-value of the test for differential proportions of zero expression (logistic regression model) ** | 
| p.combined | combined p-value of p.nonzero and p.zero obtained by Fisher’s method ** | 
| pval.adj | adjusted p-value of the semi-parametric 2-Wasserstein distance-based test according to the method of Benjamini-Hochberg | 
| p.adj.zero | adjusted p-value of the test for differential proportions of zero expression (logistic regression model) according to the method of Benjamini-Hochberg ** | 
| p.adj.combined | adjusted combined p-value of p.nonzero and p.zero obtained by Fisher’s method according to the method of Benjamini-Hochberg ** | 
* only if GDP fitting is performed (NA otherwise); ** only if the two-stage test is run
The waddR package
Fast and accurate calculation of the Wasserstein distance
Two-sample test to check for differences between two distributions
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] BiocFileCache_1.14.0        dbplyr_1.4.4               
#>  [3] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
#>  [5] Biobase_2.50.0              GenomicRanges_1.42.0       
#>  [7] GenomeInfoDb_1.26.0         IRanges_2.24.0             
#>  [9] S4Vectors_0.28.0            BiocGenerics_0.36.0        
#> [11] MatrixGenerics_1.2.0        matrixStats_0.57.0         
#> [13] waddR_1.4.0                
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.2             bit64_4.0.5            splines_4.0.3         
#>  [4] Formula_1.2-4          assertthat_0.2.1       statmod_1.4.35        
#>  [7] latticeExtra_0.6-29    blob_1.2.1             GenomeInfoDbData_1.2.4
#> [10] yaml_2.2.1             backports_1.1.10       pillar_1.4.6          
#> [13] RSQLite_2.2.1          lattice_0.20-41        glue_1.4.2            
#> [16] digest_0.6.27          RColorBrewer_1.1-2     XVector_0.30.0        
#> [19] checkmate_2.0.0        minqa_1.2.4            colorspace_1.4-1      
#> [22] htmltools_0.5.0        Matrix_1.2-18          pkgconfig_2.0.3       
#> [25] zlibbioc_1.36.0        purrr_0.3.4            scales_1.1.1          
#> [28] jpeg_0.1-8.1           lme4_1.1-25            BiocParallel_1.24.0   
#> [31] arm_1.11-2             tibble_3.0.4           htmlTable_2.1.0       
#> [34] generics_0.0.2         ggplot2_3.3.2          ellipsis_0.3.1        
#> [37] nnet_7.3-14            survival_3.2-7         magrittr_1.5          
#> [40] crayon_1.3.4           memoise_1.1.0          evaluate_0.14         
#> [43] nlme_3.1-150           MASS_7.3-53            foreign_0.8-80        
#> [46] data.table_1.13.2      tools_4.0.3            lifecycle_0.2.0       
#> [49] stringr_1.4.0          munsell_0.5.0          cluster_2.1.0         
#> [52] DelayedArray_0.16.0    compiler_4.0.3         rlang_0.4.8           
#> [55] nloptr_1.2.2.2         grid_4.0.3             RCurl_1.98-1.2        
#> [58] rstudioapi_0.11        htmlwidgets_1.5.2      rappdirs_0.3.1        
#> [61] bitops_1.0-6           base64enc_0.1-3        rmarkdown_2.5         
#> [64] boot_1.3-25            gtable_0.3.0           abind_1.4-5           
#> [67] DBI_1.1.0              curl_4.3               R6_2.4.1              
#> [70] gridExtra_2.3          knitr_1.30             dplyr_1.0.2           
#> [73] bit_4.0.4              Hmisc_4.4-1            stringi_1.5.3         
#> [76] Rcpp_1.0.5             vctrs_0.3.4            rpart_4.1-15          
#> [79] png_0.1-7              coda_0.19-4            tidyselect_1.1.0      
#> [82] xfun_0.18