As the genetic/genomic data are having increasingly larger profile,
the annotation file are also getting much bigger than expected. the
memory space in R has been an obstable for fast and efficient
data processing, because most available R or
Bioconductor packages are developed based on in-memory data
manipulation. With some newly developed data structure as HDF5 or GDS, and the R
interface of DelayedArray
to represent on-disk data structures with different back-end in
R-user-friendly array data structure (e.g., HDF5Array,GDSArray), the
high-throughput genetic/genomic data are now being able to easily loaded
and manipulated within R. However, the annotation files for the
samples and features inside the high-through data are also getting
unexpectedly larger than before. With an ordinary
data.frame or DataFrame, it is still getting
more and more challenging for any analysis to be done within R.
So here we have developed the DelayedDataFrame, which has
the very similar characteristics as data.frame and
DataFrame. But at the same time, all column data could be
optionally saved on-disk (e.g., in DelayedArray
structure with any back-end). Common operations like constructing,
subsetting, splitting, combining could be done in the same way as
DataFrame. This feature of DelayedDataFrame
could enable efficient on-disk reading and processing of the large-scale
annotation files, and at the same, signicantly saves memory space with
common DataFrame metaphor in R and
Bioconductor.
Download the package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DelayedDataFrame")The development version is also available to download through github:
Load the package into R session before using:
DelayedDataFrame extends the DataFrame data
structure, with an additional slot called lazyIndex, which
saves all the mapping indexes for each column of the data inside
DelayedDataFrame. It is similar to data.frame
in terms of construction, subsetting, splitting, combining… The
rownames are having same feature as DataFrame.
It will not be given automatically, but only by explicitly specify in
the constructor function DelayedDataFrame(, row.names=...)
or using the slot setter function rownames()<-.
Here we use the GDSArray data as
example to show the DelayedDataFrame characteristics. GDSArray is a
Bioconductor package that represents GDS files as objects
derived from the DelayedArray
package and DelayedArray class. It carries the on-disk data
path and represent the GDS nodes in a DelayedArray-derived
data structure.
The GDSArray() constructor takes 2 arguments: the file
path and the GDS node name inside the GDS file.
library(GDSArray)
## Loading required package: gdsfmt
file <- SeqArray::seqExampleFileName("gds")
gdsnodes(file)
## [1] "sample.id" "variant.id"
## [3] "position" "chromosome"
## [5] "allele" "genotype/data"
## [7] "genotype/~data" "genotype/extra.index"
## [9] "genotype/extra" "phase/data"
## [11] "phase/~data" "phase/extra.index"
## [13] "phase/extra" "annotation/id"
## [15] "annotation/qual" "annotation/filter"
## [17] "annotation/info/AA" "annotation/info/AC"
## [19] "annotation/info/AN" "annotation/info/DP"
## [21] "annotation/info/HM2" "annotation/info/HM3"
## [23] "annotation/info/OR" "annotation/info/GP"
## [25] "annotation/info/BN" "annotation/format/DP/data"
## [27] "annotation/format/DP/~data" "sample.annotation/family"
varid <- GDSArray(file, "annotation/id")
DP <- GDSArray(file, "annotation/info/DP")We use an ordinary character vector and the GDSArray
objects to construct a DelayedDataFrame object.
The slots of DelayedDataFrame could be accessed by
lazyIndex(), nrow(), rownames()
(if not NULL) functions. With a newly constructed
DelayedDataFrame object, the initial value of
lazyIndex slot will be NULL for all columns.
lazyIndex slotThe lazyIndex slot is in LazyIndex class,
which is defined in the DelayedDataFrame package and
extends the SimpleList class. The listData
slot saves unique indexes for all the columns, and the
index slots saves the position of index in
listData slot for each column in
DelayedDataFrame object. In the above example, with an
initial construction of DelayedDataFrame object, the index
for each column will all be NULL, and all 3 columns points the NULL
values which sits in the first position in listData slot of
lazyIndex.
Whenever an operation is done (e.g., subsetting), the
listData slot inside the DelayedDataFrame
stays the same, but the lazyIndex slot will be updated, so
that the show method, further statistical calculation will be applied to
the subsetting data set. For example, here we subset the
DelayedDataFrame object ddf to keep only the
first 5 rows, and see how the lazyIndex works. As shown in
below, after subsetting, the listData slot in
ddf1 stays the same as ddf. But the subsetting
operation was recorded in the lazyIndex slot, and the slots
of lazyIndex, nrows and rownames
(if not NULL) are all updated. So the subsetting operation is kind of
delayed.
ddf1 <- ddf[1:20,]
identical(ddf@listData, ddf1@listData)
## [1] TRUE
lazyIndex(ddf1)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
nrow(ddf1)
## [1] 20Only when functions like DataFrame(), or
as.list(), the lazyIndex will be realized and
DelayedDataFrame returned. We will show the realization in
the following coercion method section.
The common methods on data.frame or
DataFrame are also defined on DelayedDataFrame
class, so that they behave similarily on DelayedDataFrame
objects.
Coercion methods between DelayedDataFrame and other data
structures are defined. When coercing from ANY to
DelayedDataFrame, the lazyIndex slot will be
added automatically, with the initial NULL value of indexes for each
column.
as(letters, "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## X
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 zas(DataFrame(letters), "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## letters
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z(a <- as(list(a=1:5, b=6:10), "DelayedDataFrame"))
## DelayedDataFrame with 5 rows and 2 columns
## a b
## <integer> <integer>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1When coerce DelayedDataFrame into other data structure,
the lazyIndex slot will be realized and the new data
structure returned. For example, when DelayedDataFrame is
coerced into a DataFrame object, the listData
slot will be updated according to the lazyIndex slot.
df1 <- as(ddf1, "DataFrame")
df1@listData
## $varid
## <20> DelayedArray object of type "character":
## [1] [2] [3] . [19]
## "rs111751804" "rs114390380" "rs1320571" . "rs61751002"
## [20]
## "rs6691840"
##
## $DP
## <20> DelayedArray object of type "integer":
## [1] [2] [3] [4] . [17] [18] [19] [20]
## 3251 2676 7610 3383 . 6040 6589 5089 6871
dim(df1)
## [1] 20 2[two-dimensional [ subsetting on
DelayedDataFrame objects by integer, character, logical
values all work.
ddf[, 1, drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206ddf[, "DP", drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## DP
## <GDSArray>
## 1 3251
## 2 2676
## 3 7610
## ... ...
## 1346 823
## 1347 1257
## 1348 48ddf[, c(TRUE,FALSE), drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206When subsetting using [ on an already subsetted
DelayedDataFrame object, the lazyIndex,
nrows and rownames(if not NULL) slot will be
updated.
(a <- ddf1[1:10, 2, drop=FALSE])
## DelayedDataFrame with 10 rows and 1 column
## DP
## <GDSArray>
## 1 3251
## 2 2676
## 3 7610
## ... ...
## 8 9076
## 9 9605
## 10 9707
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## index of each column:
## [1] 1
nrow(a)
## [1] 10[[The [[ subsetting will take column subscripts for
integer or character values, and return corresponding columns in it’s
original data format.
ddf[[1]]
## <1348> GDSArray object of type "character":
## [1] [2] [3] . [1347]
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## [1348]
## "rs5771206"
ddf[["varid"]]
## <1348> GDSArray object of type "character":
## [1] [2] [3] . [1347]
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## [1348]
## "rs5771206"
identical(ddf[[1]], ddf[["varid"]])
## [1] TRUErbind/cbindWhen doing rbind, the lazyIndex of input
arguments will be realized and a new DelayedDataFrame with
NULL lazyIndex will be returned.
ddf2 <- ddf[21:40, ]
(ddfrb <- rbind(ddf1, ddf2))
## DelayedDataFrame with 40 rows and 2 columns
## varid DP
## <DelayedArray> <DelayedArray>
## 1 rs111751804 3251
## 2 rs114390380 2676
## 3 rs1320571 7610
## ... ... ...
## 38 rs1886116 3641
## 39 rs115917561 3089
## 40 rs61751016 4109
lazyIndex(ddfrb)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1cbind of DelayedDataFrame objects will keep
all existing lazyIndex of input arguments and carry into
the new DelayedDataFrame object.
(ddfcb <- cbind(varid = ddf1[,1, drop=FALSE], DP=ddf1[, 2, drop=FALSE]))
## DelayedDataFrame with 20 rows and 2 columns
## varid DP.DP
## <GDSArray> <GDSArray>
## 1 rs111751804 3251
## 2 rs114390380 2676
## 3 rs1320571 7610
## ... ... ...
## 18 rs115614983 6589
## 19 rs61751002 5089
## 20 rs6691840 6871
lazyIndex(ddfcb)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GDSArray_1.29.0 gdsfmt_1.47.0 DelayedDataFrame_1.26.0
## [4] DelayedArray_0.37.0 SparseArray_1.11.1 S4Arrays_1.11.0
## [7] abind_1.4-8 IRanges_2.45.0 MatrixGenerics_1.23.0
## [10] matrixStats_1.5.0 Matrix_1.7-4 S4Vectors_0.49.0
## [13] BiocGenerics_0.56.0 generics_0.1.4 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 crayon_1.5.3 compiler_4.5.2
## [4] BiocManager_1.30.26 Biostrings_2.79.2 GenomicRanges_1.63.0
## [7] parallel_4.5.2 jquerylib_0.1.4 Seqinfo_1.1.0
## [10] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-7
## [13] R6_2.6.1 XVector_0.51.0 knitr_1.50
## [16] maketools_1.3.2 bslib_0.9.0 rlang_1.1.6
## [19] cachem_1.1.0 xfun_0.54 sass_0.4.10
## [22] sys_3.4.3 cli_3.6.5 digest_0.6.37
## [25] grid_4.5.2 SeqArray_1.51.0 lifecycle_1.0.4
## [28] evaluate_1.0.5 buildtools_1.0.0 rmarkdown_2.30
## [31] tools_4.5.2 htmltools_0.5.8.1