Introduction to TVTB

Kévin Rue-Albrecht1*

1Department of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London, W12 0NN, UK

23 October 2016

Abstract

The VCF Tool Box (TVTB) provides S4 classes and methods to filter, summarise and visualise genetic variation data stored in VCF files pre-processed by the Ensembl Variant Effect Predictor (VEP). In particular, the package extends the FilterRules class (S4Vectors package) to define news classes of filter rules applicable to the various slots of VCF objects. A Shiny web-application, the Shiny Variant Explorer (tSVE), provides a convenient interface to demonstrate those functionalities integrated in a programming-free environment.

Package: TVTB 1.0.2

1 Introduction
2 Installation
3 Recurrent settings: TVTBparam
4 Data import
5 Adding allele frequencies
- 5.1 Adding overall frequencies
- 5.2 Adding frequencies within phenotype level(s)
6 Filtering variants
7 Summarising Ensembl VEP predictions
8 Acknowledgement
9 Session info
References

1 Introduction

The VCF Tool Box (TVTB) offers S4 classes and methods to filter, summarise and visualise genetic variation data stored in VCF files pre-processed by the Ensembl Variant Effect Predictor (VEP) (McLaren et al. 2010). An RStudio/Shiny web-application, the Shiny Variant Explorer (tSVE), provides a convenient interface to demonstrate those functionalities integrated in a programming-free environment.

Currently, major functionalities in the TVTB package include:

A class to store recurrent parameters of genetic analyses

List of reference homozygote, heterozygote, and alternate homozygote genotype encodings
Key of the INFO field where Ensembl VEP predictions are stored in the VCF file
Suffix of the INFO fields where calculated data will be stored
List of genomic ranges to analyse and visualise
Parameters for parallel calculations (using BiocParallel)

Genotype counts and allele frequencies

Calculated from the data of an ExpandedVCF objects (i.e. bi-allelic records)
Stored in INFO fields defined by the above suffixes
- overall counts and frequencies (i.e. across all samples)
- counts and frequencies within level(s) of phenotype(s)

Classes of VCF filter rules

Filter rules applicable to the fixed slot of an VCF object
Filter rules applicable to the info slot of an VCF object
Filter rules applicable to Ensembl VEP predictions stored in a given INFO field
A container for combinations of filter rules listed above
Subset VCF objects using the above filter rules

2 Installation

The VCF Tool Box can be installed using the following code:

# Currently:
devtools::install_github("kevinrue/TVTB")
# When hosted on Bioconductor:
source("http://bioconductor.org/biocLite.R")
biocLite("TVTB")

Once installed, the package can be loaded and attached as follows:

library(TVTB)

3 Recurrent settings: TVTBparam

Most functionalities in TVTB require recurrent information such as:

Genotype encoding (homozygote reference, heterozygote, homozygote alternate),
INFO key that contains the Ensembl VEP predictions in the VCF file,
List of genomic ranges within which data must be summarised or visualised
- affecting the genomic ranges to import from the VCF file
Suffixes of INFO keys where counts and frequencies of genotypes must be stored:
- Counts and frequencies may be calculated for individual levels of selected phenotypes, in which case the data will be stored under INFO keys formed as <phenotype>_<level>_<suffix>,
- Counts and frequencies across all samples are stored in INFO keys simply named <suffix>.
Settings for parallel calculations.

To reduce the burden of repetition during programming, and to facilitate analyses using consistent sets of parameters, TVTB implements the TVTBparam class. The TVTBparam class offer a container for parameters recurrently used across the package. A TVTBparam object may be initialised as follows:

tparam <- TVTBparam(Genotypes(
    ref = "0|0",
    het = c("0|1", "1|0", "0|2", "2|0", "1|2", "2|1"),
    alt = c("1|1", "2|2")),
    ranges = GenomicRanges::GRangesList(
        SLC24A5 = GenomicRanges::GRanges(
            seqnames = "15",
            IRanges::IRanges(
                start = 48413170, end = 48434757)
            )
        )
    )

TVTBparam objects have a convenient summary view and accessor methods:

tparam

## class: TVTBparam
## @genos: class: Genotypes
##     @ref (hom. ref.): "REF" {0|0}
##     @het (heter.): "HET" {0|1, 1|0, 0|2, 2|0, 1|2, 2|1}
##     @alt (hom. alt.): "ALT" {1|1, 2|2}
##   @ranges: 1 GRanges on 1 sequence(s)
##   @aaf (alt. allele freq.): "AAF"
##   @maf (minor allele freq.): "MAF"
##   @vep (Ensembl VEP key): "CSQ"
##   @svp: <ScanVcfParam object>
##   @bp: <SerialParam object>

In this example:

Genotypes
- Accessor: genos(x)
- Class: Genotypes
- Homozygote reference genotype is encoded "0|0".
- Counts of reference genotypes are stored in INFO keys suffixed with "REF".
- Heterozygote genotypes are encoded as "0|1", "1|0", "0|2", "2|0", "1|2", and "2|1".
- Counts of heterozygote genotypes are stored in INFO keys suffixed with "HET".
- Homozygote alternate genotype is encoded "1|1".
- Counts of alternate genotypes are stored in INFO keys suffixed with "ALT".
Genomic ranges
- Accessor: ranges(x)
- Class: GRangesList
- A gene-coding region on chromosome "15".
Alternate allele frequency
- Accessor: aaf(x)
- Calculated values will be stored under INFO keys suffixed with "AAF".
Minor allele frequency
- Accessor: maf(x)
- Calculated values will be stored under INFO keys suffixed with "MAF".
Ensembl VEP predictions
- Accessor: vep(x)
- Information will be imported from the INFO field "CSQ".
Parallel calculations
- Accessor: bp(x)
- Class: BiocParallelParam
- Serial evaluation (i.e. do not run parallel code)
VCF scan parameters
- Accessor: svp(x)
- Class: ScanVcfParam
- which slot automatically populated with reduce(unlist(ranges(x)))

Default values are provided for all slots except genotypes, as these may vary more frequently from one data set to another (e.g. phased, unphased, imputed).

4 Data import

4.1 Genetic variants

Functionalities in TVTB support CollapsedVCF and ExpandedVCF objects (both extending the virtual class VCF) of the VariantAnnotation package.

Typically, CollapsedVCF objects are produced by the VariantAnnotation readVcf method after parsing a VCF file, and ExpandedVCF objects result of the VariantAnnotation expand method applied to a CollapsedVCF object.

Any information that users deem relevant for the analysis may be imported from VCF files and stored in VCF objects passed to TVTB methods. However, to enable the key functionalities of the package, the slots of a VCF object should include at least the following information:

fixed(x)
- fields "REF" and "ALT".
info(x)
- field <vep>: where <vep> stands for the INFO key where the Ensembl VEP predictions are stored in the VCF object.
geno(x)
- GT: genotypes.
colData(x): phenotypes.

4.2 List of genomic ranges

In the near future, TVTB functionalities are expected to produce summary statistics and plots faceted by meta-features, each potentially composed of multiple genomic ranges.

For instance, burden tests may be performed on a set of transcripts, considering only variants in their respective sets of exons. The GenomicRanges GRangesList class is an ideal container in example, as each GRanges in the GRangesList would represent a transcript, and each element in the GRanges would represent an exon.

Furthermore, TVTBparam objects may be supplied as the param argument of the VariantAnnotation readVcf. In this case, the TVTBparam object is used to import only variants overlapping the relevant genomic regions. Moreover, the readVcf method also ensured that the vep slot of the TVTBparam object is present in the header of the VCF file.

svp <- as(tparam, "ScanVcfParam")
svp

## class: ScanVcfParam 
## vcfWhich: 1 elements
## vcfFixed: character() [All] 
## vcfInfo:  
## vcfGeno:  
## vcfSamples:

4.3 Phenotypes

Although VCF objects may be constructed without attached phenotype data, phenotype information is critical to interpret and compare genetic variants between groups of samples (e.g. burden of damaging variants in different phenotype levels).

VCF objects accept phenotype information (as S4Vectors DataFrame) in the colData slot. This practice has the key advantage of keeping phenotype and genetic information synchronised through operation such as subsetting and re-ordering, limiting workspace entropy and confusion.

4.4 Example

An ExpandedVCF object that contains the minimal data necessary for the rest of the vignette can be created as follows:

Step 1: Import phenotypes

phenoFile <- system.file(
    "extdata", "integrated_samples.txt", package = "TVTB")
phenotypes <- S4Vectors::DataFrame(
    read.table(file = phenoFile, header = TRUE, row.names = 1))

Step 2: Define the VCF file to parse

vcfFile <- system.file(
    "extdata", "chr15.phase3_integrated.vcf.gz", package = "TVTB")
tabixVcf <- Rsamtools::TabixFile(file = vcfFile)

Step 3: Define VCF import parameters

VariantAnnotation::vcfInfo(svp(tparam)) <- vep(tparam)
VariantAnnotation::vcfGeno(svp(tparam)) <- "GT"
svp(tparam)

## class: ScanVcfParam 
## vcfWhich: 1 elements
## vcfFixed: character() [All] 
## vcfInfo: CSQ 
## vcfGeno: GT 
## vcfSamples:

Of particular interest in the above chunk of code:

The TVTBparam constructor previously populated the which slot of svp with “reduced” (i.e. non-overlapping) genomic ranges defined in the ranges slot.
Only the INFO key defined in the vep slot will be imported
Only the matrix of called genotypes will be imported

Step 4: Import and pre-process variants

# Import variants as a CollapsedVCF object
vcf <- VariantAnnotation::readVcf(
    tabixVcf, param = tparam, colData = phenotypes)
# Expand into a ExpandedVCF object (bi-allelic records)
vcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)

Of particular interest in the above chunk of code, the readVcf method is given:

TVTBparam parameters, invoking the corresponding method signature
phenotypes
- The rownames of those phenotypes defines the sample identifiers that are queried from the VCF file.
- Those phenotypes are stored in the colData slot of the resulting VCF object.

The result is an ExpandedVCF object that includes variants in the targeted genomic range(s) and samples:

## class: ExpandedVCF 
## dim: 481 2504 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 1 column: CSQ
## info(header(vcf)):
##        Number Type   Description                                          
##    CSQ .      String Consequence annotations from Ensembl VEP. Format: ...
## geno(vcf):
##   SimpleList of length 1: GT
## geno(header(vcf)):
##       Number Type   Description
##    GT 1      String Genotype

5 Adding allele frequencies

Although interesting figures and summary tables may be obtained as soon as the first ExpandedVCF object is created (see section Summarising Ensembl VEP predictions), those methods may benefit from information added to additional INFO keys after data import, either manually by the user, or through various methods implemented in the TVTB package.

5.1 Adding overall frequencies

For instance, the method addOverallFrequencies uses the reference homozoygote (REF), heterozygote (HET), and homozygote alternate (ALT) genotypes defined in the TVTBparam object stored in the VCF metadata to obtain the count of each genotype in an ExpandedVCF object. Immediately thereafter, the method uses those counts to calculate alternate allele frequency (AAF) and minor allele frequency (MAF). Finally, the method stores the five calculated values (REF, HET, ALT, AAF, and MAF) in INFO keys defined by suffixes also declared in the TVTBparam object.

initialInfo <- colnames(info(vcf))
vcf <- addOverallFrequencies(vcf = vcf)
setdiff(colnames(info(vcf)), initialInfo)

## [1] "REF" "HET" "ALT" "AAF" "MAF"

Notably, the addOverallFrequencies method is synonym to the addFrequencies method missing the argument phenos:

vcf <- addFrequencies(vcf = vcf, force = TRUE)

5.2 Adding frequencies within phenotype level(s)

Similarly, the method addPhenoLevelFrequencies obtains the count of each genotype in samples associated with given level(s) of given phenotype(s), and stores the calculated values in INFO keys defined as <pheno>_<level>_<suffix>, with suffixes defined in the TVTBparam object stored in the VCF metadata.

initialInfo <- colnames(info(vcf))
vcf <- addPhenoLevelFrequencies(
    vcf = vcf, pheno = "super_pop", level = "AFR")
setdiff(colnames(info(vcf)), initialInfo)

## [1] "super_pop_AFR_REF" "super_pop_AFR_HET" "super_pop_AFR_ALT"
## [4] "super_pop_AFR_AAF" "super_pop_AFR_MAF"

Notably, the addPhenoLevelFrequencies method is synonym to the addFrequencies method called with the argument phenos given as a list where names are phenotypes, and values are character vectors of levels to process within each phenotype:

initialInfo <- colnames(info(vcf))
vcf <- addFrequencies(
    vcf = vcf, phenos = list(
        super_pop = c("EUR", "AFR"),
        pop = c("GBR", "FIN", "MSL")),
    force = TRUE)
setdiff(colnames(info(vcf)), initialInfo)

##  [1] "super_pop_EUR_REF" "super_pop_EUR_HET" "super_pop_EUR_ALT"
##  [4] "super_pop_EUR_AAF" "super_pop_EUR_MAF" "pop_GBR_REF"      
##  [7] "pop_GBR_HET"       "pop_GBR_ALT"       "pop_GBR_AAF"      
## [10] "pop_GBR_MAF"       "pop_FIN_REF"       "pop_FIN_HET"      
## [13] "pop_FIN_ALT"       "pop_FIN_AAF"       "pop_FIN_MAF"      
## [16] "pop_MSL_REF"       "pop_MSL_HET"       "pop_MSL_ALT"      
## [19] "pop_MSL_AAF"       "pop_MSL_MAF"

6 Filtering variants

Although VCF objects are straightforward to subset using either indices and row names (as they inherit from the SummarizedExperiment RangedSummarizedExperiment class), users may wish to identify variants that pass combinations of criteria based on information in their fixed slot, info slot, and Ensembl VEP predictions, a non-trivial task due to those pieces of information being stored in different slots of the VCF object, and the 1:N relationship between variants and EnsemblVEP predictions.

6.1 Definition of VCF filter rules

To facilitate the definition of VCF filter rules, and their application to VCF objects, TVTB extends the S4Vectors FilterRules class in four new classes of filter rules:

Motivation for each of the new classes extending `FilterRules` to define VCF filter rules.
Class	Motivation
`VcfFixedRules`	Filter rules applied to the `fixed` slot of a
	`VCF` object.

`VcfInfoRules`	Filter rules applied to the `info` slot of a
	`VCF` object.

`VcfVepRules`	Filter rules applied to the Ensembl VEP predictions
	stored in a given INFO key of a `VCF` object.

`VcfFilterRules`	Combination of `VcfFixedRules`, `VcfInfoRules`, and
	`VcfVepRules` applicable to a `VCF` object.

Note that FilterRules objects themselves are applicable to VCF objects, with two important difference from the above specialised classes:

Expressions must explicitely refer to the VCF slots
As a consequence, a single expression can refer to fields from different VCF slots, for instance:

fr <- S4Vectors::FilterRules(list(
    mixed = function(x){
        VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" &
            VariantAnnotation::info(x)[,"MAF"] >= 0.05
    }
))
fr

## FilterRules of length 1
## names(1): mixed

Instances of those classes may be initialised as follows:

VcfFixedRules

fixedR <- VcfFixedRules(list(
    pass = expression(FILTER == "PASS"),
    qual = expression(QUAL > 20)
    ))
fixedR

## VcfFixedRules of length 2
## names(2): pass qual

VcfInfoRules

infoR <- VcfInfoRules(exprs = list(
    common = expression(MAF > 0.1), # minor allele frequency
    present = expression(ALT + HET > 0) # count of non-REF homozygotes
    ),
    active = c(TRUE, FALSE))
infoR

## VcfInfoRules of length 2
## names(2): common present

FilterRules are initialised in an active state by default. The above chunk of code demonstrate how the active argument of their constructor may be used to initialise specific filter rules in an inactive state.

VcfVepRules

vepR <- VcfVepRules(exprs = list(
    missense = expression(Consequence %in% c("missense_variant")),
    CADD = expression(CADD_PHRED > 15)
    ))
vepR

## VcfVepRules of length 2
## names(2): missense CADD

VcfFilterRules

VcfFilterRules combine VCF filter rules of different types in a single object.

vcfRules <- VcfFilterRules(fixedR, infoR, vepR)
vcfRules

## VcfFilterRules of length 6
## names(6): pass qual common present missense CADD

This vignette offers only a brief peek into the utility and flexibility of VCF filter rules. More (complex) examples are given in a separate vignette, including filter rules using functions and pattern matching. The documentation of the S4Vectors package—where the parent class FilterRules is defined—can also be a source of inspiration.

6.2 Control of VCF filter rules

As the above classes of VCF filter rules inherit from the S4Vectors FilterRules class, they also benefit from its accessors and methods. For instance, VCF filter rules can easily be toggled between active and inactive states:

active(vcfRules)["CADD"] <- FALSE
active(vcfRules)

##     pass     qual   common  present missense     CADD 
##     TRUE     TRUE     TRUE    FALSE     TRUE    FALSE

A separate vignette describes in greater detail the use of classes that contain VCF filter rules.

6.3 Evaluation of VCF filter rules

Once defined, the above filter rules can be applied to ExpandedVCF objects, in the same way as FilterRules are evaluated in a given environment (see the S4Vectors documentation):

summary(eval(expr = infoR, envir = vcf))

##    Mode   FALSE    TRUE    NA's 
## logical     465      16       0

summary(eval(expr = vcfRules, envir = vcf))

##    Mode   FALSE    TRUE    NA's 
## logical     480       1       0

summary(evalSeparately(expr = vcfRules, envir = vcf))

##    pass           qual           common        present       
##  Mode:logical   Mode:logical   Mode :logical   Mode:logical  
##  TRUE:481       TRUE:481       FALSE:465       TRUE:481      
##  NA's:0         NA's:0         TRUE :16        NA's:0        
##                                NA's :0                       
##   missense         CADD        
##  Mode :logical   Mode:logical  
##  FALSE:454       TRUE:481      
##  TRUE :27        NA's:0        
##  NA's :0

7 Summarising Ensembl VEP predictions

As soon as genetic and phenotypic information are imported into an ExpandedVCF object, or after the object was extended with additional information, the scientific value of the data may be revealed by a variety of summary statistics and graphical representations. This section will soon present several ideas being implemented in TVTB, for instance:

Count of discrete Ensembl VEP predictions
- by phenotype level
- by genomic feature affected (i.e. transcript)
Distribution of continuous Ensembl VEP predictions
- by phenotype level
- by genomic feature affected (i.e. transcript)

8 Acknowledgement

Dr. Stefan Gräf and Mr. Matthias Haimel for advice on the VCF file format and the Ensembl VEP script. Prof. Martin Wilkins for his trust and support. Dr. Michael Lawrence for his helpful code review and suggestions.

Last but not least, the amazing collaborative effort of the rep("many",10) Bioconductor developers whose hard work appears through the dependencies of this package.

9 Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] TVTB_1.0.2      knitr_1.14      BiocStyle_2.2.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7                   plyr_1.8.4                   
##  [3] BiocInstaller_1.24.0          formatR_1.4                  
##  [5] GenomeInfoDb_1.10.0           XVector_0.14.0               
##  [7] AnnotationHub_2.6.0           GenomicFeatures_1.26.0       
##  [9] bitops_1.0-6                  tools_3.3.1                  
## [11] zlibbioc_1.20.0               biomaRt_2.30.0               
## [13] digest_0.6.10                 gtable_0.2.0                 
## [15] BSgenome_1.42.0               evaluate_0.10                
## [17] RSQLite_1.0.0                 tibble_1.2                   
## [19] lattice_0.20-34               Matrix_1.2-7.1               
## [21] shiny_0.14.1                  DBI_0.5-1                    
## [23] yaml_2.1.13                   parallel_3.3.1               
## [25] ensemblVEP_1.14.0             httr_1.2.1                   
## [27] rtracklayer_1.34.0            stringr_1.1.0                
## [29] Biostrings_2.42.0             S4Vectors_0.12.0             
## [31] IRanges_2.8.0                 stats4_3.3.1                 
## [33] grid_3.3.1                    Biobase_2.34.0               
## [35] R6_2.2.0                      AnnotationDbi_1.36.0         
## [37] XML_3.98-1.4                  BiocParallel_1.8.0           
## [39] rmarkdown_1.1                 reshape2_1.4.2               
## [41] ggplot2_2.1.0                 ensembldb_1.6.0              
## [43] magrittr_1.5                  scales_0.4.0                 
## [45] Rsamtools_1.26.1              htmltools_0.3.5              
## [47] BiocGenerics_0.20.0           GenomicRanges_1.26.1         
## [49] GenomicAlignments_1.10.0      assertthat_0.1               
## [51] SummarizedExperiment_1.4.0    colorspace_1.2-7             
## [53] xtable_1.8-2                  mime_0.5                     
## [55] interactiveDisplayBase_1.12.0 httpuv_1.3.3                 
## [57] stringi_1.1.2                 munsell_0.4.3                
## [59] RCurl_1.95-4.8                VariantAnnotation_1.20.0

References

McLaren, W., B. Pritchard, D. Rios, Y. Chen, P. Flicek, and F. Cunningham. 2010. “Deriving the Consequences of Genomic Variants with the Ensembl API and SNP Effect Predictor.” Journal Article. Bioinformatics 26 (16): 2069–70. doi:10.1093/bioinformatics/btq330.