--- title: "Using the NHLBI GRASP repository of GWAS test results with Bioconductor" author: "Vincent J. Carey" date: "October 9, 2014" output: BiocStyle::pdf_document: toc: yes BiocStyle::html_document: number_sections: yes toc: yes --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() suppressPackageStartupMessages(library(grasp2db)) ``` # Introduction GRASP (Genome-Wide Repository of Associations Between SNPs and Phenotypes) v2.0 was released in September 2014. The [primary GRASP web resource](http://apps.nhlbi.nih.gov/Grasp/Overview.aspx) includes links to a web-based query interface. This document describes a Bioconductor package that replicates information in the v2.0 [textual release](https://s3.amazonaws.com/NHLBI_public/GRASP/GraspFullDataset2.zip). The primary reference for version 2 is: Eicher JD, Landowski C, Stackhouse B, Sloan A, Chen W, Jensen N, Lien J-P, Leslie R, Johnson AD (2014) GRASP v 2.0: an update to the genome-wide repository of associations between SNPs and phenotypes. Nucl Acids Res, published online Nov 26, 2014 PMID 25428361 From the main web page: GRASP includes all available genetic association results from papers, their supplements and web-based content meeting the following guidelines: * All associations with P<0.05 from GWAS defined as >= 25,000 markers tested for 1 or more traits. * Study exclusion criteria: CNV-only studies, replication/follow-up studies testing <25K markers, non-human only studies, article not in English, gene-environment or gene-gene GWAS where single SNP main effects are not given, linkage only studies, aCGH/LOH only studies, heterozygosity/homozygosity (genome-wide or long run) studies, studies only presenting gene-based or pathway-based results, simulation-only studies, studies which we judge as redundant with prior studies since they do not provide significant inclusion of new samples or exposure of new results (e.g., many methodological papers on the WTCCC and FHS GWAS). * More detailed methods and resources used in constructing the catalog are described at [Methods & Resources](http://apps.nhlbi.nih.gov/Grasp/Resources.aspx). * Terms of Use for GRASP data: [http://apps.nhlbi.nih.gov/Grasp/Terms.aspx](http://apps.nhlbi.nih.gov/Grasp/Terms.aspx) * Medical disclaimer: [http://apps.nhlbi.nih.gov/Grasp/Overview.aspx] (http://apps.nhlbi.nih.gov/Grasp/Overview.aspx) # Installation Install the package in Bioconductor version 3.1 or later using `BiocInstaller::biocLite("grasp2db")`. The first time the grasp2 data base is referenced, via the `GRASP2()` function described below, a large (5.3Gb) file is downloaded to a local cache using `r Biocpkg("AnnotationHub")`. This may take a considerable length of time (10's of minutes, perhaps an hour or more depending on internet connections). Subsequent uses refer to the locally cached file, and are fast. # Demonstration ## Attachment and messaging Attach the package. ```{r loadup2} library(grasp2db) ``` Workflows start with a reference to the data base. ```{r grasp2} grasp2 <- GRASP2() grasp2 ``` There are three tables. The 'variant' table summarizes `r tbl(grasp2, "variant") %>% nrow` variants. The 'study' table contains `r tbl(grasp2, "study") %>% nrow` citations from which the data are extracted; the 'variant' and 'study' tables are related by PMID identifier. The 'count' table contains `r tbl(grasp2, "count") %>% nrow` records summarizing SNPs observed in Discovery and Replication samples from 12 distinct Populations; theh 'variant' and 'count' tables are related by NHLBIkey. ## Indexed p-value bins The database has been indexed on a number of fields, and an integer rounding of $-\log_{10}$ of the quantity recorded as the `Pvalue` of the association is available. ```{r lkp} variant <- tbl(grasp2, "variant") q1 = (variant %>% select(Pvalue, NegativeLog10PBin) %>% filter(NegativeLog10PBin > 8) %>% summarize(maxp = max(Pvalue), n=n())) q1 ``` This shows that the quantities in NegativeLog10PBin are upper bounds on the exponents of the p-values in the integer-labeled bins defined by this quantity. A useful utility from dplyr is the query explain method: ```{r lkex} explain(q1) ``` ## Tabulations This query select variants with large effect from the 'variant' table, and joins them to their published phenotypic effect in the 'study' table. ```{r lkp2} study <- tbl(grasp2, "study") large_effect <- variant %>% select(PMID, SNPid_dbSNP134, NegativeLog10PBin) %>% filter(NegativeLog10PBin > 5) phenotype <- left_join(large_effect, study %>% select(PMID, PaperPhenotypeDescription)) phenotype ``` ## Inspecting some relatively weak associations in asthma ```{r doasth} lkaw <- semi_join( variant %>% filter(NegativeLog10PBin <= 4) %>% select(PMID, chr_hg19, SNPid_dbSNP134, PolyPhen2), study %>% filter(PaperPhenotypeDescription == "Asthma")) ``` We materialize the filtered table into a data.frame and check how many PolyPhen2 notations including a substring of 'Damaging': ```{r dogre} lkaw %>% filter(PolyPhen2 %like% "%amaging%") ``` # Quick view of the basic interfaces ## dplyr-oriented ```{r lkbasic} grasp2 ``` ## RSQLite-oriented ```{r lkcon} gcon = grasp2$con library(RSQLite) gcon dbListTables(gcon) ``` Note that the package opens the SQLite data base in 'read-only' mode, but updates are possible (e.g., directly opening a connection to the data base without restricting access). There may be implicit control if the user does not have write access to the file. # Some QC (Consistency check): Are NHGRI GWAS catalog loci included? We have an image of the NHGRI GWAS catalog inheriting from GRanges. ```{r getgw} library(gwascat) data(gwrngs19) # hg19 addresses; NHGRI ships hg38 gwrngs19 ``` We would like to verify that the majority of the variants enumerated in the NHGRI catalog are also present in GRASP 2.0. We supply a function called checkAnti which obtains the anti-join between a chromosome-specific slice of the NHGRI catalogue and the slice of GRASP2 for the same chromosome. We compute for chromosome 22 the fraction of NHGRI records present in GRASP2. ```{r lkanti} gr22 = variant %>% filter(chr_hg19 == "22") abs22 = checkAnti("22") 1 - (abs22 %>% nrow())/(gr22 %>% nrow()) ``` The absent records can be seen to be relatively recent additions to the NHGRI catalog. ```{r lkabs} abs22 ```