`waddR` package

The `waddR` package offers statistical tests based on the 2-Wasserstein distance for detecting and characterizing differences between two distributions given in the form of samples. Functions for calculating the 2-Wasserstein distance and testing for differential distributions are provided, as well as a specifically tailored test for differential expression in single-cell RNA sequencing data.

`waddR` provides tools to address the following tasks, each described in a separate vignette:

Two-sample tests to check for differences between two distributions,

Detection of differential gene expression distributions in single-cell RNA sequencing (scRNAseq) data.

These are bundled into one package because they depend on one another: the procedure for detecting differential distributions in scRNAseq data is an adaptation of the general two-sample test, which itself uses the 2-Wasserstein distance to compare two distributions.

The 2-Wasserstein distance is a metric that describes the distance between two distributions, representing, e.g., two different conditions \(A\) and \(B\). The `waddR` package specifically considers the squared 2-Wasserstein distance, which can be decomposed into location, size, and shape terms, thus providing a characterization of potential differences.
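The decomposition can be illustrated in a few lines of base R. The sketch below is not the package's implementation: `wass2_decomp_sketch` and its `n_quantiles` argument are hypothetical names, and it assumes the squared distance is approximated by the mean squared difference between the two samples' empirical quantiles at equidistant probability levels. Under that assumption the value splits exactly into a location term (squared difference of means), a size term (squared difference of standard deviations), and a shape term (driven by the correlation between the two quantile vectors).

```r
# Hypothetical helper, not part of waddR: quantile-based approximation of the
# squared 2-Wasserstein distance and its location/size/shape decomposition.
wass2_decomp_sketch <- function(x, y, n_quantiles = 1000) {
  # Equidistant probability levels strictly inside (0, 1)
  probs <- (seq_len(n_quantiles) - 0.5) / n_quantiles
  qx <- quantile(x, probs, names = FALSE)
  qy <- quantile(y, probs, names = FALSE)

  # Population standard deviation (divides by K, not K - 1), so that the
  # three terms add up exactly to the mean squared quantile difference
  pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

  c(
    distance = mean((qx - qy)^2),
    location = (mean(qx) - mean(qy))^2,
    size     = (pop_sd(qx) - pop_sd(qy))^2,
    shape    = 2 * pop_sd(qx) * pop_sd(qy) * (1 - cor(qx, qy))
  )
}

set.seed(1)
a <- rnorm(500)                     # condition A
b <- rnorm(500, mean = 1, sd = 2)   # condition B: shifted and more spread out
res <- wass2_decomp_sketch(a, b)
print(res)
```

Dividing each term by `res["distance"]` gives the share of the difference attributable to location, size, and shape.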

The `waddR` package offers three functions to calculate the (squared) 2-Wasserstein distance, which are implemented in C++ and exported to R with Rcpp for faster computation. The function `wasserstein_metric` is a C++ reimplementation of the `wasserstein1d` function from the R package `transport`. The functions `squared_wass_approx` and `squared_wass_decomp` compute approximations of the squared 2-Wasserstein distance, with `squared_wass_decomp` also returning the decomposition terms for location, size, and shape.

See `?wasserstein_metric`, `?squared_wass_approx`, and `?squared_wass_decomp` for more details.
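A minimal usage sketch of the three functions on simulated data, assuming `waddR` is installed. The sample sizes and distribution parameters are arbitrary choices for illustration, and only the two sample vectors and the order `p` are passed; the help pages describe any further arguments and the exact structure of the return values.

```r
library(waddR)

set.seed(24)
x <- rnorm(100, mean = 0, sd = 1)   # sample from condition A
y <- rnorm(100, mean = 1, sd = 2)   # sample from condition B

# 2-Wasserstein distance (p = 2 selects the order of the distance)
d_metric <- wasserstein_metric(x, y, p = 2)

# Approximation of the squared 2-Wasserstein distance
d_sq <- squared_wass_approx(x, y)

# Same approximation, returned together with its location, size, and shape terms
d_decomp <- squared_wass_decomp(x, y)

print(d_metric^2)
print(d_sq)
print(d_decomp)
```

Squaring the result of `wasserstein_metric` puts it on the same scale as the two approximations, which makes the three values directly comparable.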

The `waddR` package provides two testing procedures using the 2-Wasserstein distance to test whether two distributions \(F_A\) and \(F_B\) given in the form of samples are different by testing the null hypothesis \(H_0: F_A = F_B\) against the alternative hypothesis \(H_1: F_A \neq F_B\).

The first, semi-parametric (SP), procedure uses a permutation-based test combined with a generalized Pareto distribution approximation to estimate small p-values accurately.

The second procedure uses a test based on asymptotic theory (ASY) which is valid only if the samples can be assumed to come from continuous distributions.

See `?wasserstein.test` for more details.
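The permutation idea behind the semi-parametric procedure can be sketched in base R. This is an illustration under stated assumptions, not the package's code: `sq_wass_sketch` and `perm_test_sketch` are hypothetical helpers, the test statistic is a simple quantile-based approximation of the squared 2-Wasserstein distance, and the generalized Pareto refinement for small p-values is left out entirely.

```r
# Hypothetical helper: quantile-based approximation of the squared
# 2-Wasserstein distance between two samples.
sq_wass_sketch <- function(x, y, n_quantiles = 1000) {
  probs <- (seq_len(n_quantiles) - 0.5) / n_quantiles
  mean((quantile(x, probs, names = FALSE) - quantile(y, probs, names = FALSE))^2)
}

# Hypothetical helper: plain permutation test of H0: F_A = F_B.
perm_test_sketch <- function(x, y, n_perm = 500) {
  observed <- sq_wass_sketch(x, y)
  pooled <- c(x, y)
  n_x <- length(x)

  # Under H0 the group labels are exchangeable, so reshuffle them
  perm_stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    sq_wass_sketch(pooled[idx], pooled[-idx])
  })

  # Add-one correction keeps the estimated p-value strictly above zero
  p_value <- (1 + sum(perm_stats >= observed)) / (1 + n_perm)
  list(statistic = observed, p_value = p_value)
}

set.seed(42)
same    <- perm_test_sketch(rnorm(100), rnorm(100))            # H0 holds
shifted <- perm_test_sketch(rnorm(100), rnorm(100, mean = 2))  # H0 violated
print(same$p_value)
print(shifted$p_value)
```

With `n_perm` permutations the smallest attainable p-value is `1 / (n_perm + 1)`. This resolution limit is what the generalized Pareto approximation in the semi-parametric procedure is meant to overcome.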

The `waddR` package provides an adaptation of the semi-parametric testing procedure based on the 2-Wasserstein distance which is specifically tailored to identify differential distributions in scRNAseq data. In particular, a two-stage (TS) approach is implemented that takes account of the specific nature of scRNAseq data by separately testing for differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test) between two conditions.

See `?wasserstein.sc` and `?testZeroes` for more details.
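The first stage of the two-stage approach can be sketched in base R for a single gene. This is an illustration only: `zero_prop_test_sketch` is a hypothetical helper that uses a plain `glm` fit with a likelihood-ratio test, the simulated data are invented, and the package's own `testZeroes` may fit and test the logistic model differently.

```r
# Hypothetical helper, not part of waddR: likelihood-ratio test for a
# difference in the proportion of zeros between two conditions.
zero_prop_test_sketch <- function(expr, condition) {
  is_zero <- as.integer(expr == 0)
  null_fit <- glm(is_zero ~ 1, family = binomial())
  full_fit <- glm(is_zero ~ condition, family = binomial())
  # Row 2 of the analysis-of-deviance table compares the full to the null model
  anova(null_fit, full_fit, test = "Chisq")[2, "Pr(>Chi)"]
}

set.seed(7)
# Simulated expression of one gene: 20% zeros in A, 70% zeros in B
expr_a <- c(rep(0, 20), rexp(80))
expr_b <- c(rep(0, 70), rexp(30))
expr <- c(expr_a, expr_b)
condition <- factor(rep(c("A", "B"), each = 100))

# Stage 1: differential proportions of zero expression
p_zero <- zero_prop_test_sketch(expr, condition)
print(p_zero)

# Stage 2 operates on the non-zero values only; these two vectors would be
# passed to the 2-Wasserstein distance-based test
nonzero_a <- expr_a[expr_a > 0]
nonzero_b <- expr_b[expr_b > 0]
```

Separating the two stages keeps a large point mass at zero from dominating the comparison of the non-zero expression distributions.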

To install `waddR` from Bioconductor, use `BiocManager` with the following commands:

```
# Install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("waddR")
```

Using `BiocManager`, the package can also be installed from GitHub directly:

The package `waddR` can then be used in R:

```
library(waddR)  # attach the package
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] waddR_1.6.1
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-152 bitops_1.0-7
#> [3] matrixStats_0.58.0 eva_0.2.6
#> [5] bit64_4.0.5 filelock_1.0.2
#> [7] RColorBrewer_1.1-2 httr_1.4.2
#> [9] GenomeInfoDb_1.28.0 tools_4.1.0
#> [11] backports_1.2.1 bslib_0.2.5.1
#> [13] utf8_1.2.1 R6_2.5.0
#> [15] rpart_4.1-15 Hmisc_4.5-0
#> [17] DBI_1.1.1 BiocGenerics_0.38.0
#> [19] colorspace_2.0-1 nnet_7.3-16
#> [21] withr_2.4.2 tidyselect_1.1.1
#> [23] gridExtra_2.3 bit_4.0.4
#> [25] curl_4.3.1 compiler_4.1.0
#> [27] Biobase_2.52.0 htmlTable_2.2.1
#> [29] DelayedArray_0.18.0 sass_0.4.0
#> [31] scales_1.1.1 checkmate_2.0.0
#> [33] rappdirs_0.3.3 stringr_1.4.0
#> [35] digest_0.6.27 minqa_1.2.4
#> [37] foreign_0.8-81 rmarkdown_2.8
#> [39] XVector_0.32.0 base64enc_0.1-3
#> [41] jpeg_0.1-8.1 pkgconfig_2.0.3
#> [43] htmltools_0.5.1.1 lme4_1.1-27
#> [45] MatrixGenerics_1.4.0 dbplyr_2.1.1
#> [47] fastmap_1.1.0 htmlwidgets_1.5.3
#> [49] rlang_0.4.11 rstudioapi_0.13
#> [51] RSQLite_2.2.7 jquerylib_0.1.4
#> [53] generics_0.1.0 jsonlite_1.7.2
#> [55] BiocParallel_1.26.0 dplyr_1.0.6
#> [57] RCurl_1.98-1.3 magrittr_2.0.1
#> [59] GenomeInfoDbData_1.2.6 Formula_1.2-4
#> [61] Matrix_1.3-3 Rcpp_1.0.6
#> [63] munsell_0.5.0 S4Vectors_0.30.0
#> [65] fansi_0.5.0 abind_1.4-5
#> [67] lifecycle_1.0.0 stringi_1.6.2
#> [69] yaml_2.2.1 MASS_7.3-54
#> [71] SummarizedExperiment_1.22.0 zlibbioc_1.38.0
#> [73] BiocFileCache_2.0.0 grid_4.1.0
#> [75] blob_1.2.1 parallel_4.1.0
#> [77] crayon_1.4.1 lattice_0.20-44
#> [79] splines_4.1.0 knitr_1.33
#> [81] pillar_1.6.1 GenomicRanges_1.44.0
#> [83] boot_1.3-28 stats4_4.1.0
#> [85] glue_1.4.2 evaluate_0.14
#> [87] latticeExtra_0.6-29 data.table_1.14.0
#> [89] nloptr_1.2.2.2 vctrs_0.3.8
#> [91] png_0.1-7 gtable_0.3.0
#> [93] purrr_0.3.4 assertthat_0.2.1
#> [95] cachem_1.0.5 ggplot2_3.3.3
#> [97] xfun_0.23 coda_0.19-4
#> [99] survival_3.2-11 SingleCellExperiment_1.14.1
#> [101] tibble_3.1.2 arm_1.11-2
#> [103] memoise_2.0.0 IRanges_2.26.0
#> [105] cluster_2.1.2 ellipsis_0.3.2
```