Dr. Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)
Jun 25, 2016
Whole-genome sequencing (WGS) data is being generated at an unprecedented rate
CoreArray (C++ library)
Two R packages
SeqArray provides the same capabilities as VCF
Stores data in a binary and array-oriented manner
Genotypes are stored in a compressed manner
Parallel access
File: SeqArray/extdata/CEU_Exon.gds (387.3K)
|--+ description   [  ] *
|--+ sample.id   { Str8 90 ZIP_ra(30.8%), 222B }
|--+ variant.id   { Int32 1348 ZIP_ra(35.7%), 1.9K }
|--+ position   { Int32 1348 ZIP_ra(86.4%), 4.6K }
|--+ chromosome   { Str8 1348 ZIP_ra(2.66%), 91B }
|--+ allele   { Str8 1348 ZIP_ra(17.2%), 928B }
|--+ genotype   [  ] *
|  \--+ data   { Bit2 2x90x1348 ZIP_ra(28.4%), 16.8K } *
|--+ phase   [  ]
|  \--+ data   { Bit1 90x1348 ZIP_ra(0.36%), 55B } *
|--+ annotation   [  ]
|  |--+ id   { Str8 1348 ZIP_ra(41.0%), 5.8K }
|  |--+ qual   { Float32 1348 ZIP_ra(0.91%), 49B }
|  |--+ filter   { Int32,factor 1348 ZIP_ra(0.89%), 48B } *
|  |--+ info   [  ]
|  |  |--+ AA   { Str8 1348 ZIP_ra(24.2%), 653B } *
|  |  \--+ HM2   { Bit1 1348 ZIP_ra(117.2%), 198B } *
|  \--+ format   [  ]
|     \--+ DP   [  ] *
|        \--+ data   { Int32 90x1348 ZIP_ra(33.8%), 160.3K }
\--+ sample.annotation   [  ]
   \--+ family   { Str8 90 ZIP_ra(34.7%), 135B }Table 1: The key functions in the SeqArray package.
| Function | Description | 
|---|---|
| seqVCF2GDS | Reformats VCF files | 
| seqSetFilter | Defines a data subset of samples or variants | 
| seqGetData | Gets data from a SeqArray file with a defined filter | 
| seqApply | Applies a user-defined function over array margins | 
| seqParallel | Applies functions in a computing cluster | 
# load the R package
library(SeqArray)
# open the file
genofile <- seqOpen("1KG_chr1.gds")
# apply a user-defined function over variants
system.time(afreq <- seqApply(genofile, "genotype",
    FUN = function(x) { mean(x==0L, na.rm=TRUE) },
    as.is="double", margin="by.variant")
)10.8 minutes on Linux with Intel Xeon CPU @2GHz and 128GB RAM function(x) { mean(x==0L, na.rm=TRUE) } is a user-defined function, where x is an integer matrix:
0 – reference allele, 1 – the first alternative allele
seqParallel() splits genotypes into 4 non-overlapping parts according to different cores.
# load the R package
library(parallel)
# create a computing cluster with 4 cores
seqParallelSetup(4)
# run in parallel
system.time(afreq <- seqParallel(gdsfile=genofile,
    FUN = function(f) {
        seqApply(f, "genotype", as.is="double", margin="by.variant",
            FUN = function(x) mean(x==0L, na.rm=TRUE))
    }, split = "by.variant")
)3.1 minutes (vs. 10.8m in Test 1)
library(Rcpp)
# dynamically define an inline C/C++ function in R
cppFunction('double RefAlleleFreq(IntegerMatrix x) {
    int nrow = x.nrow(), ncol = x.ncol();
    int cnt=0, zero_cnt=0, g;
    for (int i = 0; i < nrow; i++) {
        for (int j = 0; j < ncol; j++) {
            if ((g = x(i, j)) != NA_INTEGER) {
                cnt ++;
                if (g == 0) zero_cnt ++;
            }
    }}
    return double(zero_cnt) / cnt;
}')
system.time(
    afreq <- seqApply(genofile, "genotype", RefAlleleFreq,
        as.is="double", margin="by.variant")
)1.5 minutes (significantly faster! vs. 10.8m in Test 1)
SeqArray is of great interest to
SeqVarTools (Bioconductor)
SNPRelate (Bioconductor)
https://gds-stat.s3.amazonaws.com/download/1000g/index.html
1000 Genomes Project Phase 3:
Department of Biostatistics at University of Washington – Seattle
Genetic Analysis Center: