The SCArray package provides large-scale single-cell RNA-seq data manipulation using Genomic Data Structure (GDS) files. It combines dense/sparse matrices stored in GDS files and the Bioconductor infrastructure framework (SingleCellExperiment and DelayedArray) to provide out-of-memory data storage and manipulation using the R programming language. As shown in the figure, SCArray provides a SingleCellExperiment object for downstream data analyses. GDS is an alternative to HDF5. Unlike HDF5, GDS supports the direct storage of a sparse matrix without converting it to multiple vectors.
Figure 1: Workflow of SCArray
Requires R (>= v3.5.0), gdsfmt (>= v1.24.0)
Bioconductor repository
To install this package, start R and enter:
The SCArray package can convert a single-cell experiment object (SingleCellExperiment) to a GDS file using the function scConvGDS(). For example,
suppressPackageStartupMessages(library(SCArray))
suppressPackageStartupMessages(library(SingleCellExperiment))
# load a SingleCellExperiment object
fn <- system.file("extdata", "LaMannoBrainSub.rds", package="SCArray")
sce <- readRDS(fn)
# convert to a GDS file
scConvGDS(sce, "test.gds")## Output: test.gds
## Compression: LZMA_RA
## Dimension: 18219 x 25
## Assay List:
##     counts  |+ counts   { SparseReal32 18219x25 LZMA_ra(13.1%), 75.6K }
## rowData:
## colData:
##     CELL_ID
##     Cell_type
## Done.
## Clean up the fragments of GDS file:
##     open the file 'test.gds' (138.1K)
##     # of fragments: 29
##     save to 'test.gds.tmp'
##     rename 'test.gds.tmp' (137.9K, reduced: 168B)
##     # of fragments: 15## Object of class "SCArrayFileClass"
## File: /tmp/Rtmp8sTU3R/Rbuildfbabd7d2534c9/SCArray/vignettes/test.gds (137.9K)
## +    [  ] *
## |--+ feature.id   { Str8 18219 LZMA_ra(48.6%), 60.5K }
## |--+ sample.id   { Str8 25 LZMA_ra(40.0%), 157B }
## |--+ counts   { SparseReal32 18219x25 LZMA_ra(13.1%), 75.6K }
## |--+ feature.data   [  ]
## |--+ sample.data   [  ]
## |  |--+ CELL_ID   { Str8 25 LZMA_ra(40.0%), 157B }
## |  \--+ Cell_type   { Str8 25 LZMA_ra(64.3%), 133B }
## \--+ meta.data   [  ]The input of scConvGDS() can be a dense or sparse matrix for count data:
library(Matrix)
cnt <- matrix(0, nrow=4, ncol=8)
set.seed(100); cnt[sample.int(length(cnt), 8)] <- rpois(8, 4)
(cnt <- as(cnt, "dgCMatrix"))## 4 x 8 sparse Matrix of class "dgCMatrix"
##                     
## [1,] . . . . . . . 6
## [2,] 3 1 . . . 4 . .
## [3,] . . . . . 3 . 4
## [4,] 4 . 3 . . . . .## Output: test.gds
## Compression: LZMA_RA
## Dimension: 4 x 8
## Assay List:
##     counts  |+ counts   { SparseReal32 4x8 LZMA_ra(159.4%), 109B }
## Done.
## Clean up the fragments of GDS file:
##     open the file 'test.gds' (1.4K)
##     # of fragments: 17
##     save to 'test.gds.tmp'
##     rename 'test.gds.tmp' (1.3K, reduced: 84B)
##     # of fragments: 10When a single-cell GDS file is available, users can use scExperiment() to load a SingleCellExperiment object from the GDS file. The assay data in the SingleCellExperiment object are DelayedMatrix objects to avoid the memory limit.
# a GDS file in the SCArray package
(fn <- system.file("extdata", "LaMannoBrainData.gds", package="SCArray"))## [1] "/tmp/Rtmp8sTU3R/Rinstfbabd5d12ad1f/SCArray/extdata/LaMannoBrainData.gds"## class: SingleCellExperiment 
## dim: 12000 243 
## metadata(0):
## assays(1): counts
## rownames(12000): Rp1 Sox17 ... Efhd2 Fhad1
## rowData names(0):
## colnames(243): 1772072122_A04 1772072122_A05 ... 1772099011_H05
##   1772099012_E04
## colData names(2): CELL_ID Cell_type
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):## <12000 x 243> sparse matrix of class DelayedMatrix and type "double":
##         1772072122_A04 1772072122_A05 ... 1772099011_H05 1772099012_E04
##     Rp1              0              0   .              0              0
##   Sox17              0              0   .              0              0
##  Mrpl15              1              2   .              2              2
##  Lypla1              0              0   .              0              1
##   Tcea1              1              0   .              6              1
##     ...              .              .   .              .              .
##   Agmat              0              0   .              0              0
## Dnajc16              0              0   .              0              0
##   Casp9              0              0   .              0              0
##   Efhd2              0              0   .              1              1
##   Fhad1              0              0   .              1              0## DataFrame with 243 rows and 2 columns
##                       CELL_ID   Cell_type
##                   <character> <character>
## 1772072122_A04 1772072122_A04     DA-VTA4
## 1772072122_A05 1772072122_A05     DA-VTA2
## 1772072122_A06 1772072122_A06     DA-VTA3
## 1772072122_A07 1772072122_A07      DA-SNC
## 1772072122_A08 1772072122_A08     DA-VTA4
## ...                       ...         ...
## 1772099011_D01 1772099011_D01     DA-VTA1
## 1772099011_F04 1772099011_F04     DA-VTA2
## 1772099011_G07 1772099011_G07     DA-VTA4
## 1772099011_H05 1772099011_H05      DA-SNC
## 1772099012_E04 1772099012_E04     DA-VTA4## DataFrame with 12000 rows and 0 columnsSCArray provides a SingleCellExperiment object for downstream data analyses. At first, we create a log count matrix logcnt from the count matrix. Note that logcnt is also a DelayedMatrix without actually generating the whole matrix.
## <12000 x 243> sparse matrix of class DelayedMatrix and type "double":
##         1772072122_A04 1772072122_A05 ... 1772099011_H05 1772099012_E04
##     Rp1       0.000000       0.000000   .       0.000000       0.000000
##   Sox17       0.000000       0.000000   .       0.000000       0.000000
##  Mrpl15       1.000000       1.584963   .       1.584963       1.584963
##  Lypla1       0.000000       0.000000   .       0.000000       1.000000
##   Tcea1       1.000000       0.000000   .       2.807355       1.000000
##     ...              .              .   .              .              .
##   Agmat              0              0   .              0              0
## Dnajc16              0              0   .              0              0
##   Casp9              0              0   .              0              0
##   Efhd2              0              0   .              1              1
##   Fhad1              0              0   .              1              0The DelayedMatrixStats package provides functions operating on rows and columns of DelayedMatrix objects. For example, we can calculate the mean for each column or row of the log count matrix.
suppressPackageStartupMessages(library(DelayedMatrixStats))
col_mean <- DelayedMatrixStats::colMeans2(logcnt)
str(col_mean)##  num [1:243] 0.261 0.138 0.238 0.259 0.143 ...##  num [1:12000] 0 0.00652 0.81827 0.47055 1.33912 ...The scater package can perform the uniform manifold approximation and projection (UMAP) for the cell data, based on the data in a SingleCellExperiment object.
plotReducedDim() plots cell-level reduced dimension results (UMAP) stored in the SingleCellExperiment object:
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.20.0               ggplot2_3.3.3              
##  [3] scuttle_1.2.0               DelayedMatrixStats_1.14.0  
##  [5] SingleCellExperiment_1.14.0 SummarizedExperiment_1.22.0
##  [7] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [9] GenomeInfoDb_1.28.0         SCArray_1.0.0              
## [11] DelayedArray_0.18.0         IRanges_2.26.0             
## [13] S4Vectors_0.30.0            MatrixGenerics_1.4.0       
## [15] matrixStats_0.58.0          BiocGenerics_0.38.0        
## [17] Matrix_1.3-3                gdsfmt_1.28.0              
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.1           sass_0.4.0              BiocSingular_1.8.0     
##  [4] jsonlite_1.7.2          viridisLite_0.4.0       bslib_0.2.5.1          
##  [7] assertthat_0.2.1        highr_0.9               GenomeInfoDbData_1.2.6 
## [10] vipor_0.4.5             yaml_2.2.1              pillar_1.6.1           
## [13] lattice_0.20-44         glue_1.4.2              beachmat_2.8.0         
## [16] digest_0.6.27           XVector_0.32.0          colorspace_2.0-1       
## [19] cowplot_1.1.1           htmltools_0.5.1.1       pkgconfig_2.0.3        
## [22] zlibbioc_1.38.0         purrr_0.3.4             scales_1.1.1           
## [25] RSpectra_0.16-0         ScaledMatrix_1.0.0      BiocParallel_1.26.0    
## [28] tibble_3.1.2            farver_2.1.0            generics_0.1.0         
## [31] ellipsis_0.3.2          withr_2.4.2             magrittr_2.0.1         
## [34] crayon_1.4.1            evaluate_0.14           fansi_0.4.2            
## [37] beeswarm_0.3.1          FNN_1.1.3               tools_4.1.0            
## [40] lifecycle_1.0.0         stringr_1.4.0           munsell_0.5.0          
## [43] irlba_2.3.3             compiler_4.1.0          jquerylib_0.1.4        
## [46] rsvd_1.0.5              rlang_0.4.11            grid_4.1.0             
## [49] RCurl_1.98-1.3          BiocNeighbors_1.10.0    labeling_0.4.2         
## [52] bitops_1.0-7            rmarkdown_2.8           gtable_0.3.0           
## [55] DBI_1.1.1               R6_2.5.0                gridExtra_2.3          
## [58] knitr_1.33              dplyr_1.0.6             uwot_0.1.10            
## [61] utf8_1.2.1              stringi_1.6.2           ggbeeswarm_0.6.0       
## [64] Rcpp_1.0.6              vctrs_0.3.8             tidyselect_1.1.1       
## [67] xfun_0.23               sparseMatrixStats_1.4.0