--- title: "VariantExperiment: A RangedSummarizedExperiment Container for Large-Scale Variant Data with GDS Backend" valauthor: - name: Qian Liu affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "last edit: 04/10/2024" output: BiocStyle::html_document: toc: true toc_float: true package: VariantExperiment vignette: | %\VignetteIndexEntry{VariantExperiment-class} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r options, eval=TRUE, echo=FALSE} options(showHeadLines=3) options(showTailLines=3) ``` # Introduction With the rapid development of the biotechnologies, the sequencing (e.g., DNA, bulk/single-cell RNA, etc.) and other types of biological data are getting increasingly larger-profile. The memory space in R has been an obstable for fast and efficient data processing, because most available _R_ or _Bioconductor_ packages are developed based on in-memory data manipulation. [SingleCellExperiment][] has achieved efficient on-disk saving/reading of the large-scale count data as [HDF5Array][] objects. However, there was still no such light-weight containers available for high-throughput variant data (e.g., DNA-seq, genotyping, etc.). We have developed [VariantExperiment][], a _Bioconductor_ package to contain variant data into `RangedSummarizedExperiment` object. The package converts and represent VCF/GDS files using standard `SummarizedExperiment` metaphor. It is a container for high-through variant data with GDS back-end. In `VariantExperiment`, The high-throughput variant data is saved in [DelayedArray][] objects with GDS back-end. In addition to the light-weight `Assay` data, it also supports the on-disk saving of annotation data for both features and samples (corresponding to `rowData/colData` respectively) by implementing the [DelayedDataFrame][] data structure. The on-disk representation of both assay data and annotation data realizes on-disk reading and processing and saves _R_ memory space significantly. The interface of `RangedSummarizedExperiment` data format enables easy and common manipulations for high-throughput variant data with common [SummarizedExperiment][] metaphor in _R_ and _Bioconductor_. [VariantExperiment]: https://bioconductor.org/packages/VariantExperiment [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment [SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment [DelayedArray]: https://bioconductor.org/packages/DelayedArray [HDF5Array]: https://bioconductor.org/packages/HDF5Array [GDSArray]: https://bioconductor.org/packages/GDSArray [DelayedDataFrame]: http://bioconductor.org/packages/DelayedDataFrame # Installation 1. Download the package from Bioconductor: ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("VariantExperiment") ``` Or install the development version of the package from Github. ```{r, eval = FALSE} BiocManager::install("Bioconductor/VariantExperiment") ``` 2. Load the package into R session. ```{r Load, message=FALSE} library(VariantExperiment) ``` # Background ## GDSArray [GDSArray][] is a _Bioconductor_ package that represents `GDS` files as objects derived from the [DelayedArray][] package and `DelayedArray` class. It converts `GDS` nodes into a `DelayedArray`-derived data structure. The rich common methods and data operations defined on `GDSArray` makes it more _R_-user-friendly than working with the GDS file directly. The `GDSArray()` constructor takes 2 arguments: the file path and the GDS node name (which can be retrieved with the `gdsnodes()` function) inside the GDS file. ```{r, GDSArray} library(GDSArray) file <- GDSArray::gdsExampleFileName("seqgds") gdsnodes(file) GDSArray(file, "genotype/data") GDSArray(file, "sample.id") ``` More details about `GDS` or `GDSArray` format can be found in the vignettes of the [gdsfmt][], [SNPRelate][], [SeqArray][], [GDSArray][] and [DelayedArray][] packages. [gdsfmt]: https://bioconductor.org/packages/gdsfmt [SNPRelate]: https://bioconductor.org/packages/SNPRelate [SeqArray]: https://bioconductor.org/packages/SeqArray ## DelayedDataFrame [DelayedDataFrame][] is a _Bioconductor_ package that implements delayed operations on `DataFrame` objects using standard `DataFrame` metaphor. Each column of data inside `DelayedDataFrame` is represented as 1-dimensional `GDSArray` with on-disk GDS file. Methods like `show`,`validity check`, `[`, `[[` subsetting, `rbind`, `cbind` are implemented for `DelayedDataFrame`. The `DelayedDataFrame` stays lazy until an explicit realization call like `DataFrame()` constructor or `as.list()` triggered. More details about [DelayedDataFrame][] data structure could be found in the vignette of [DelayedDataFrame][] package. # `VariantExperiment` class ## `VariantExperiment` class `VariantExperiment` class is defined to extend `RangedSummarizedExperiment`. The difference would be that the assay data are saved as `DelayedArray`, and the annotation data are saved by default as `DelayedDataFrame` (with option to save as ordinary `DataFrame`), both of which are representing the data on-disk with `GDS` back-end. Conversion methods into `VariantExperiment` object are defined directly for `VCF` and `GDS` files. Here we show one simple example to convert a DNA-sequencing data in GDS format into `VariantExperiment` and some class-related operations. ```{r makeVariantExperimentFromGDS} ve <- makeVariantExperimentFromGDS(file) ve ``` In this example, the sequencing file in GDS format was converted into a `VariantExperiment` object, with all available assay data saved into the `assay` slot, all available feature annotation nodes into `rowRanges/rowData` slot, and all available sample annotation nodes into `colData` slot. The available values for each arguments in `makeVariantExperimentFromGDS()` function can be retrieved using the `showAvailable()` function. ```{r showAvailable} args(makeVariantExperimentFromGDS) showAvailable(file) ``` ## slot accessors Assay data are in `GDSArray` format, and could be retrieve by the `assays()/assay()` function. **NOTE** that when converted into a `VariantExperiment` object, the assay data will be checked and permuted, so that the first 2 dimensions always match to features (variants/snps) and samples respectively, no matter how are the dimensions are with the original GDSArray that can be constructed. ```{r makeVariantExperimentFromGDS2} assays(ve) assay(ve, 1) GDSArray(file, "genotype/data") ## original GDSArray from GDS file before permutation ``` In this example, the original `GDSArray` object from genotype data was `2 x 90 x 1348`. But it was permuted to `1348 x 90 x 2` when constructed into the `VariantExperiment` object. The `rowData()` of the `VariantExperiment` is by default saved in `DelayedDataFrame` format. We can use `rowRanges()` / `rowData()` to retrieve the feature-related annotation file, with/without a GenomicRange format. ```{r rrrd} rowRanges(ve) rowData(ve) ``` sample-related annotation is by default in `DelayedDataFrame` format, and could be retrieved by `colData()`. ```{r colData} colData(ve) ``` The `gdsfn()` will retrieve the gds file path associated with the `VariantExperiment` object. ```{r gdsfile} gdsfn(ve) ``` Some other getter function like `metadata()` will return any metadata that we have saved inside the `VariantExperiment` object. ```{r metaData} metadata(ve) ``` # Coercion methods To take advantage of the functions and methods that are defined on `SummarizedExperiment`, from which the `VariantExperiment` extends, we have defined coercion methods from `VCF` and `GDS` to `VariantExperiment`. ## From `VCF` to `VariantExperiment` The coercion function of `makeVariantExperimentFromVCF` could convert the `VCF` file directly into `VariantExperiment` object. To achieve the best storage efficiency, the assay data are saved in `DelayedArray` format, and the annotation data are saved in `DelayedDataFrame` format (with no option of ordinary `DataFrame`), which could be retrieved by `rowData()` for feature related annotations and `colData()` for sample related annotations (Only when `sample.info` argument is specified). ```{r makeVariantExperimentFromVCF} vcf <- SeqArray::seqExampleFileName("vcf") ve <- makeVariantExperimentFromVCF(vcf, out.dir = tempfile()) ve ``` Internally, the `VCF` file was converted into a on-disk `GDS` file, which could be retrieved by: ```{r retrieve GDS} gdsfn(ve) ``` assay data is in `DelayedArray` format: ```{r makeVariantExperimentFromVCF2} assay(ve, 1) ``` feature-related annotation is in `DelayedDataFrame` format: ```{r makeVariantExperimentFromVCF3} rowData(ve) ``` User could also have the opportunity to save the sample related annotation info directly into the `VariantExperiment` object, by providing the file path to the `sample.info` argument, and then retrieve by `colData()`. ```{r sampleInfo} sampleInfo <- system.file("extdata", "Example_sampleInfo.txt", package="VariantExperiment") vevcf <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo) colData(vevcf) ``` Arguments could be specified to take only certain info columns or format columns from the vcf file. ```{r makeVariantExperimentFromVCFArgs} vevcf1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP")) rowData(vevcf1) ``` In the above example, only 2 info entries ("OR" and "GP") are read into the `VariantExperiment` object. The `start` and `count` arguments could be used to specify the start position and number of variants to read into `Variantexperiment` object. ```{r makeVariantExperimentFromVCFArgs_startCount} vevcf2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000) vevcf2 ``` For the above example, only 1000 variants are read into the `VariantExperiment` object, starting from the position of 101. ## From `GDS` to `VariantExperiment` The coercion function of `makeVariantExperimentFromGDS` coerces `GDS` files into `VariantExperiment` objects directly, with the assay data saved as `DelayedArray`, and the `rowData()/colData()` in `DelayedDataFrame` by default (with the option of ordinary `DataFrame` object). ```{r makeVariantExperimentFromGDS_seq} gds <- SeqArray::seqExampleFileName("gds") ve <- makeVariantExperimentFromGDS(gds) ve ``` Arguments could be specified to take only certain annotation columns for features and samples. All available data entries for `makeVariantExperimentFromGDS` arguments could be retrieved by the `showAvailable()` function with the gds file name as input. ```{r showAvailable1} showAvailable(gds) ``` Note that the `infoColumns` from gds file will be saved as columns inside the `rowData()`, with the prefix of "info.". `rowDataOnDisk/colDataOnDisk` could be set as `FALSE` to save all annotation data in ordinary `DataFrame` format. ```{r makeVariantExperimentFromGDSArgs} ve3 <- makeVariantExperimentFromGDS(gds, rowDataColumns = c("allele", "annotation/id"), infoColumns = c("AC", "AN", "DP"), rowDataOnDisk = TRUE, colDataOnDisk = FALSE) rowData(ve3) ## DelayedDataFrame object colData(ve3) ## DataFrame object ``` ## customization for certain gds types For GDS formats of `SEQ_ARRAY` (defined in [SeqArray][] as `SeqVarGDSClass` class) and `SNP_ARRAY` (defined in [SNPRelate][] as `SNPGDSFileClass` class), we have made some customized transfer of certain nodes when reading into `VariantExperiment` object for users' convenience. The `allele` node in `SEQ_ARRAY` gds file is converted into 2 columns in `rowData()` asn `REF` and `ALT`. ```{r} veseq <- makeVariantExperimentFromGDS(file, rowDataColumns = c("allele"), infoColumns = character(0)) rowData(veseq) ``` The `snp.allele` node in `SNP_ARRAY` gds file was converted into 2 columns in `rowData()` as `snp.allele1` and `snp.allele2`. ```{r makeVariantExperimentFromGDS_snp} snpfile <- SNPRelate::snpgdsExampleFileName() vesnp <- makeVariantExperimentFromGDS(snpfile, rowDataColumns = c("snp.allele")) rowData(vesnp) ``` # Subsetting methods `VariantExperiment` supports basic subsetting operations using `[`, `[[`, `$`, and ranged-based subsetting operations using `subsetByOverlap`. ## two-dimensional subsetting ```{r, 2d} ve[1:10, 1:5] ``` ## `$` subsetting The `$` subsetting can be operated directly on `colData()` columns, for easy sample extraction. **NOTE** that the `colData/rowData` are (by default) in the `DelayedDataFrame` format, with each column saved as `GDSArray`. So when doing subsetting, we need to use `as.logical()` to convert the 1-dimensional `GDSArray` into ordinary vector. ```{r colDataExtraction} colData(ve) ve[, as.logical(ve$family == "1328")] ``` subsetting by `rowData()` columns. ```{r rowDataExtraction} rowData(ve) ve[as.logical(rowData(ve)$REF == "T"),] ``` ## Range-based operations `VariantExperiment` objects support all of the `findOverlaps()` methods and associated functions. This includes `subsetByOverlaps()`, which makes it easy to subset a `VariantExperiment` object by an interval. ```{r overlap} ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933")) ve1 ``` In this example, only 23 out of 1348 variants were retained with the `GRanges` subsetting. # Save / load `VariantExperiment` object Note that after the subsetting by `[`, `$` or `Ranged-based operations`, and you feel satisfied with the data for downstream analysis, you need to save that `VariantExperiment` object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned. 0 ## save `VariantExperiment` object Use the function `saveVariantExperiment` to synchronize the on-disk and in-memory representation. This function writes the processed data as `ve.gds`, and save the _R_ object (which lazily represent the backend data set) as `ve.rds` under the specified directory. It finally returns a new `VariantExperiment` object into current R session generated from the newly saved data. ```{r saveVE} a <- tempfile() ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE, chunk_size = 30) ``` ## load `VariantExperiment` object You can alternatively use `loadVariantExperiment` to load the synchronized data into R session, by providing only the file directory. It reads the `VariantExperiment` object saved as `ve.rds`, as lazy representation of the backend `ve.gds` file under the specific directory. ```{r loadVE} ve3 <- loadVariantExperiment(dir=a) gdsfn(ve3) all.equal(ve2, ve3) ``` # Session Info ```{r sessionInfo} sessionInfo() ```