--- title: Annotating Genomic Variants author: - name: Valerie Obenchain affiliation: Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., P.O. Box 19024, Seattle, WA, USA 98109-1024 date: 5 Sept 2017 vignette: > %\VignetteIndexEntry{Annotating Genomic Variants} %\VignetteEngine{knitr::rmarkdown} output: BiocStyle::html_document --- # Version Info ```{r, echo=FALSE, results="hide", warning=FALSE} suppressPackageStartupMessages({ library('variants') }) ```
**R version**: `r R.version.string`
**Bioconductor version**: `r BiocManager::version()`
**Package version**: `r packageVersion("variants")`
[ Back to top ]
## Convert gene symbols to gene ids Use the org.Hs.eg.db package to convert gene symbols to gene ids. ```{r} ## get entrez ids from gene symbols genesym <- c("TRPV1", "TRPV2", "TRPV3") geneid <- select(org.Hs.eg.db, keys=genesym, keytype="SYMBOL", columns="ENTREZID") geneid ```[ Back to top ]
## Create gene ranges We use the hg19 known gene track from UCSC to identify the TRPV gene ranges. These ranges will eventually be used to extract variants from a regions in the VCF file. Load the annotation package. ```{r} txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene txdb ``` Our VCF file was aligned to a genome from NCBI and the known gene track was from UCSC. These institutions have different naming conventions for the chromosomes. In order to use these two pieces of data in a matching or overlap operation the chromosome names (also called sesqlevels) need to match. We will modify the txdb to match the VCF file. ```{r} txdb <- renameSeqlevels(txdb, gsub("chr", "", seqlevels(txdb))) txdb <- keepSeqlevels(txdb, "17") ``` Create a list of transcripts by gene: ```{r} txbygene = transcriptsBy(txdb, "gene") ``` Create the gene ranges for the TRPV genes ```{r} gnrng <- unlist(range(txbygene[geneid$ENTREZID]), use.names=FALSE) names(gnrng) <- geneid$SYMBOL ```[ Back to top ]
## Extract variant subsets A ScanVcfParam object is used to retrieve data subsets. This object can specify genomic coordinates (ranges) or individual VCF elements. Extractions of ranges (vs fields) requires a tabix index. See ?indexTabix for details. ```{r} param <- ScanVcfParam(which = gnrng, info = "DP", geno = c("GT", "cPd")) param ## Extract the TRPV ranges from the VCF file vcf <- readVcf(file, "hg19", param) ## Inspect the VCF object with the 'fixed', 'info' and 'geno' accessors vcf head(fixed(vcf)) geno(vcf) ```[ Back to top ]
## Variant location in the gene model The locateVariants function identifies where a variant falls with respect to gene structure, e.g., exon, utr, splice site, etc. We use the gene model from the TxDb.Hsapiens.UCSC.hg19.knownGene package loaded eariler. ```{r, eval=FALSE} ## Use the 'region' argument to define the region ## of interest. See ?locateVariants for details. cds <- locateVariants(vcf, txdb, CodingVariants()) five <- locateVariants(vcf, txdb, FiveUTRVariants()) splice <- locateVariants(vcf, txdb, SpliceSiteVariants()) intron <- locateVariants(vcf, txdb, IntronVariants()) ``` ```{r} all <- locateVariants(vcf, txdb, AllVariants()) ``` Each row in cds represents a variant-transcript match so multiple rows per variant are possible. If we are interested in gene-centric questions the data can be summarized by gene regardless of transcript. ```{r} ## Did any variants match more than one gene? table(sapply(split(mcols(all)$GENEID, mcols(all)$QUERYID), function(x) length(unique(x)) > 1)) ## Summarize the number of variants by gene: idx <- sapply(split(mcols(all)$QUERYID, mcols(all)$GENEID), unique) sapply(idx, length) ## Summarize variant location by gene: sapply(names(idx), function(nm) { d <- all[mcols(all)$GENEID %in% nm, c("QUERYID", "LOCATION")] table(mcols(d)$LOCATION[duplicated(d) == FALSE]) }) ```[ Back to top ]
## Amino acid coding changes in non-synonymous variants Amino acid coding for non-synonymous variants can be computed with the function predictCoding. The BSgenome.Hsapiens.UCSC.hg19 package is used as the source of the reference alleles. Variant alleles are provided by the user. ```{r} seqlevelsStyle(vcf) <- "UCSC" seqlevelsStyle(txdb) <- "UCSC" aa <- predictCoding(vcf, txdb, Hsapiens) ``` predictCoding returns results for coding variants only. As with locateVariants, the output has one row per variant-transcript match so multiple rows per variant are possible. ```{r} ## Did any variants match more than one gene? table(sapply(split(mcols(aa)$GENEID, mcols(aa)$QUERYID), function(x) length(unique(x)) > 1)) ## Summarize the number of variants by gene: idx <- sapply(split(mcols(aa)$QUERYID, mcols(aa)$GENEID, drop=TRUE), unique) sapply(idx, length) ## Summarize variant consequence by gene: sapply(names(idx), function(nm) { d <- aa[mcols(aa)$GENEID %in% nm, c("QUERYID","CONSEQUENCE")] table(mcols(d)$CONSEQUENCE[duplicated(d) == FALSE]) }) ``` The variants 'not translated' are explained by the warnings thrown when predictCoding was called. Variants that have a missing varAllele or have an 'N' in the varAllele are not translated. If the varAllele substitution had resulted in a frameshift the consequence would be 'frameshift'. See ?predictCoding for details.[ Back to top ]
# Exploring Package Content Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like help(package="VariantAnnotation") ?predictCoding to obtain an overview of help on the `VariantAnnotation` package, and the `predictCoding` function. View the package vignette with ```{r eval=FALSE} browseVignettes(package="VariantAnnotation") ``` To view vignettes providing a more comprehensive introduction to package functionality use ```{r eval=FALSE} help.start() ```[ Back to top ]
# sessionInfo() ```{r} sessionInfo() ```[ Back to top ]