--- title: "Using AnVIL VRS Toolkit in R" author: - name: Marcel Ramos affiliation: > CUNY Graduate School of Public Health and Health Policy, New York, NY USA output: BiocStyle::html_document: toc: true toc_depth: 3 number_sections: true date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteIndexEntry{AnVILVRS Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} package: AnVILVRS --- # AnVILVRS ```{r prereqs,include=FALSE,error=0} library(BiocStyle) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = TRUE, out.width = "100%" ) has_env <- reticulate::virtualenv_exists("vrs_env") has_gcloud <- AnVILBase::has_avworkspace( strict = TRUE, platform = AnVILGCP::gcp() ) has_envs <- has_env && has_gcloud ``` # Introduction The AnVILVRS package provides an R interface to the [AnVIL VRS Toolkit](https://pypi.org/project/vrs-anvil-toolkit/), a Python library for working with the [Global Alliance for Genomics and Health (GA4GH) Variation Representation Specification (VRS)](https://vrs.ga4gh.org/en/stable/) standard. The package allows users to translate variant identifiers from various formats (e.g., gnomAD, SPDI, HGVS, Beacon) into GA4GH VRS Allele IDs and vice versa. Additionally, it provides functionality to retrieve allele frequency data from the 1000 Genomes Project based on VRS Allele IDs. # Installation To use the AnVILVRS package, you need to have Python installed on your system. The package requires Python 3.11, so ensure that you have this version installed. Install the AnVILVRS package from Bioconductor with: ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("AnVILVRS") ``` # Loading and Setup Load the `AnVILVRS` package and the `r CRANpkg("reticulate")` package for Python integration: ```{r load} library(AnVILVRS) library(reticulate) ``` After installing the package, you need to set up a Python virtual environment with the required dependencies. You can do this by running the following command in R: ```{r setup, eval=FALSE} install_AnVILVRS(envname = "vrs_env") ``` # Usage Once the virtual environment is set up, you can use the AnVILVRS package to translate variant identifiers and retrieve allele frequency data. First, load the package and activate the virtual environment: ```{r venv, eval=has_env} use_virtualenv("vrs_env", required = TRUE) ``` ## Translating Variant Identifiers You can translate variant identifiers from various formats into GA4GH VRS Allele IDs using the `get_vrs_id` function. Supported formats include "gnomad", "spdi", "hgvs", and "beacon". ```{r translate, eval=has_env} get_vrs_id("chr7-87509329-A-G", "gnomad") get_vrs_id("NC_000005.10:80656509:C:TT", "spdi") get_vrs_id("NC_000005.10:g.80656510delinsTT", "hgvs") get_vrs_id("5 : 80656489 C > T", "beacon") ``` ## Allele Object Retrieval You can also retrieve the VRS Allele object using the `get_vrs_allele` function: ```{r allele, eval=has_env} allele <- get_vrs_allele("5 : 80656489 C > T", "beacon") allele ``` ## Variant Retrieval from Allele Object You can convert a VRS Allele object back to a variant identifier in a specified format using the `get_variant_from_allele` function: ```{r variant_from_allele, eval=has_env} get_variant_from_allele(allele, "hgvs") ``` ## Calculating Cohort Allele Frequency ### Population Descriptor table download The `get_pop_descriptor` function downloads the population descriptor file from a known Google Storage URI to the `r Biocpkg("BiocFileCache")`. This file is used in the calculation of the Cohort Allele Frequency (CAF). The `get_caf` function makes use of the population descriptor `.tsv` file to dynamically provide a Cohort Allele Frequency based on the sub-population of interest. In our example, we will use the "USA" population code. ```{r download, eval=has_envs} library(readr) pop_desc <- get_pop_descriptor() read_tsv( file = pop_desc, col_types = cols( population_descriptor_id = col_character(), population_descriptor = col_character(), subject_id = col_character(), country_of_recruitment = col_character(), population_label = col_character() ) ) ``` ### SeqRepo reference data The `get_seqrepo` function downloads a SeqRepo archive from a specified URI to a local directory. This is useful for setting up the SeqRepo database required by the AnVIL VRS Toolkit. ```{r seqrepo, eval=FALSE} get_seqrepo(destdir = tempdir()) ``` ### Cohort Allele Frequency (CAF) calculation Finally, to calculate the Cohort Allele Frequency (CAF) using the 1000 Genomes Project based on a VRS Allele ID, use the `get_caf` function. A `.gz` zipped VCF file and its corresponding index file obtained from the 1000 Genomes Project is needed. It should include variants with allele frequency annotations. ```{r get_caf, eval=FALSE} use_virtualenv("vrs_env", required = TRUE) toolkit_dir <- setup_vrs_toolkit() vcf <- AnVILVRS:::.get_fixture_vcf() vcf_index <- build_vrs_index() variant_id <- "chr1-20094-TAA-T" vrs_id <- get_vrs_id(variant_id, "gnomad") pop_desc <- get_pop_descriptor() get_caf( vrs_id = vrs_id, vcf = vcf, vcf_index = vcf_index, phenotype = "USA", pop_desc_file = pop_desc, toolkit_dir = toolkit_dir ) ``` # Session Info {.unnumbered}
Click to expand session info ```{r session_info} sessionInfo() ```