Package: grasp2db
Author: Martin Morgan
Modification date: 2014-12-31
Compilation date: 2017-06-29
This document outlines steps taken to create Bioconductor’s version of the GRASP2 data base. GRASP (Genome-Wide Repository of Associations Between SNPs and Phenotypes) v2.0 was released in September 2014. The Bioconductor AnnotationHub resource is derived from the v 2.0.0.0 release.
The primary reference for version 2 is: Eicher JD, Landowski C, Stackhouse B, Sloan A, Chen W, Jensen N, Lien J-P, Leslie R, Johnson AD (2014) GRASP v 2.0: an update to the genome-wide repository of associations between SNPs and phenotypes. Nucl Acids Res, published online Nov 26, 2014 PMID 25428361.
Other vignettes in the grasp2db package contain details of the GRASP2 data base.
The script system.file(package="grasp2db", "scripts", "grasp2AnnotationHub.R") processes GRASP2 to the Bioconductor sqlite representation. The script downloads the ZIP file, uncompresses the contents to a single tab-delimited text file, performs some necessary data cleaning, and stores the data in a partially normalized sqlite data base. The sqlite data base is distributed using the Bioconductor AnnotationHub package.
Data cleaning and transformation to sqlite are performed by the grasp2db:::.db_create() function. The major steps include:
Standardizing column names
Standardizing some aspects of data representation
Output to 3 sqlite tables.
Column names are standardized using grasp2db:::.db_clean_colnames(). The following columns are renamed:
Original | Standardized |
---|---|
SNPid(dbSNP134) | SNPid_dbSNP134 |
chr(hg19) | chr_hg19 |
pos(hg19) | pos_hg19 |
SNPid(in paper) | SNPidInPaper |
InNHGRIcat(as of 3/31/12) | InNHGRIcat_3_31_12 |
Initial Sample Description | DiscoverySampleDescription |
LS SNP | LS_SNP |
All other column names were transformed to CamelCase by removing non-alphabetical characters and capitalizing the subsequent letter, e.g., Exclusively Male/Female becomes ExclusivelyMaleFemale.
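The renaming rules above can be sketched in base R as follows. This is an illustration only, not the body of grasp2db:::.db_clean_colnames(); the helper name clean_colnames and the regular expression are assumptions chosen to reproduce the documented examples.

```r
# Explicit renames copied from the table above.
special <- c(
    "SNPid(dbSNP134)" = "SNPid_dbSNP134",
    "chr(hg19)" = "chr_hg19",
    "pos(hg19)" = "pos_hg19",
    "SNPid(in paper)" = "SNPidInPaper",
    "InNHGRIcat(as of 3/31/12)" = "InNHGRIcat_3_31_12",
    "Initial Sample Description" = "DiscoverySampleDescription",
    "LS SNP" = "LS_SNP"
)

# Hypothetical helper (not part of grasp2db): apply the explicit renames,
# then CamelCase every remaining name.
clean_colnames <- function(x) {
    is_special <- x %in% names(special)
    x[is_special] <- unname(special[x[is_special]])
    # Drop each run of non-letters and upper-case the letter that follows it.
    x[!is_special] <- gsub("[^A-Za-z]+([A-Za-z]?)", "\\U\\1",
                           x[!is_special], perl = TRUE)
    x
}

cleaned <- clean_colnames(c("Exclusively Male/Female", "chr(hg19)", "LS SNP"))
```

Here cleaned holds ExclusivelyMaleFemale, chr_hg19, and LS_SNP. Checking the explicit renames first keeps names such as LS SNP from being collapsed by the generic rule.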
grasp2db:::.db_clean_chunk() standardized the data representation, as follows.
NHLBIkey is supposed to be a unique integer-valued identifier, but the GRASP2fullDataset file contains 47 rows with keys 2.36501E+14 or 2.29412E+14. These rows have been removed.
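One way to drop such rows is sketched below. It assumes the key column was read as character (for example with colClasses = "character"), so that the scientific-notation strings survive intact. The toy data frame and all key values other than the two corrupted ones are made up, and this is not necessarily how grasp2db:::.db_clean_chunk() performs the step.

```r
# Toy chunk; only the two E+14 keys come from the text above.
chunk <- data.frame(
    NHLBIkey = c("2086050316", "2.36501E+14", "1507151933", "2.29412E+14"),
    Pvalue = c(5e-8, 1e-5, 2e-12, 3e-6),
    stringsAsFactors = FALSE
)

# Keep only rows whose key consists entirely of digits.
valid <- grepl("^[0-9]+$", chunk$NHLBIkey)
chunk <- chunk[valid, , drop = FALSE]
```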
Columns TotalSamples(discovery+replication), TotalDiscoverySamples, and Total replication samples were removed (these values are easily calculated if desired).
A column NegativeLog10PBin was created to bin variants by order of magnitude of significance, computed as round(-log10(Pvalue)).
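As a small worked example of this binning, with made-up P-values:

```r
# Illustrative P-values, not taken from GRASP2.
Pvalue <- c(5e-8, 1e-5, 2e-12)

# Each bin is the nearest whole number of -log10 units.
NegativeLog10PBin <- round(-log10(Pvalue))
NegativeLog10PBin
# 7 5 12
```

An indexed integer-valued bin supports coarse filters such as NegativeLog10PBin >= 8 without scanning the floating-point Pvalue column.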
The CreationDate and LastCurationDate columns were standardized so that the dates 8/17/12 and 8/17/2012 are represented consistently as 8/17/2012.
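A minimal sketch of this year expansion, assuming every two-digit year belongs to the 2000s (an assumption that suits curation dates but is not stated in the source):

```r
# Illustrative dates in month/day/year form.
dates <- c("8/17/12", "8/17/2012", "12/1/13")

# Expand a trailing two-digit year; four-digit years do not match the
# pattern and are left unchanged.
standardized <- sub("/([0-9]{2})$", "/20\\1", dates)
standardized
# "8/17/2012" "8/17/2012" "12/1/2013"
```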
The HUBfield date formats referring to Jan2014 or 14-Jan were standardized to 1/1/2014.
The LocationWithinPaper entries written without a space, as in Table12, Figure12, or FullData, were replaced with the spaced equivalent, e.g., Table 12.
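The Table and Figure cases can be sketched with a single regular expression. The example values are invented, and the FullData case is left out because the text above does not show its target form.

```r
# Illustrative entries; only the Table12 / Figure12 forms come from the text.
location <- c("Table12", "Figure12", "Table 3", "Supplementary Table4")

# Insert a space between the word Table or Figure and the digit after it.
spaced <- gsub("\\b(Table|Figure)([0-9])", "\\1 \\2", location, perl = TRUE)
spaced
# "Table 12" "Figure 12" "Table 3" "Supplementary Table 4"
```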
In the dbSNPvalidation column, the values "", "NO", and "YES" were replaced with logical NA, FALSE, and TRUE, respectively.
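The recoding to a logical vector might look like the following sketch; the input values are illustrative.

```r
dbSNPvalidation <- c("YES", "", "NO", "YES")

# Start from NA so that "" (and any unexpected value) stays missing.
validated <- rep(NA, length(dbSNPvalidation))
validated[dbSNPvalidation == "YES"] <- TRUE
validated[dbSNPvalidation == "NO"] <- FALSE
validated
# TRUE NA FALSE TRUE
```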
The dbSNPClinStatus column entries were standardized to lower case.
The Phenotype column (and possibly others) contains strings that appear to use the CP1250 encoding, as well as variants differing only by character case. In R, on platforms supporting the CP1250 encoding, the offending vectors can be transformed to a portable and canonical representation using
P = iconv(Phenotype, "CP1250", "UTF-8")
p = tolower(P)
Phenotype = P[match(p, p)]
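The match(p, p) idiom maps every element to the first element that has the same lower-case form, so the first spelling encountered becomes the canonical one. A toy illustration with plain ASCII values (the iconv() step is omitted, and the phenotype strings are made up):

```r
P <- c("Asthma", "asthma", "ASTHMA", "Height", "height")
p <- tolower(P)

# For each element, the index of the first case-insensitive duplicate.
idx <- match(p, p)
idx
# 1 1 1 4 4

canonical <- P[idx]
canonical
# "Asthma" "Asthma" "Asthma" "Height" "Height"
```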
Data were partially normalized into 3 tables.
study contains information on each publication present in the data base, using PMID as a unique key. See grasp2db:::.db_accumulate_study().
count contains the number of samples each variant was found in, summarized by sample (Discovery or Replication) and population (e.g., European, Hispanic), using NHLBIkey as a unique key. See grasp2db:::.db_write_count().
variant contains information about each variant, and in particular NHLBIkey and PMID to relate this table to the study and count tables. See grasp2db:::.db_write_variant().
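The idea behind the study / variant split can be illustrated on a toy flat table using base R data frames. The Journal column and all values are illustrative rather than GRASP2 content, the count table is omitted for brevity, and the package itself writes to sqlite rather than to data frames.

```r
# Toy flat table: one row per variant, with publication details repeated.
flat <- data.frame(
    NHLBIkey = c(1L, 2L, 3L),
    PMID = c("111", "111", "222"),
    Journal = c("J1", "J1", "J2"),
    Phenotype = c("Height", "Asthma", "Height"),
    stringsAsFactors = FALSE
)

# One row per publication, keyed by PMID.
study <- unique(flat[, c("PMID", "Journal")])

# One row per variant; PMID is kept to relate each variant to its study.
variant <- flat[, c("NHLBIkey", "PMID", "Phenotype")]

# The flat view can be recovered by joining the two tables on PMID.
rejoined <- merge(variant, study, by = "PMID")
```

Storing publication-level fields once per PMID avoids repeating them for every variant reported by the same paper.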
Indexes were created on PMID (variant and study tables) and NHLBIkey (variant and count tables) fields, and on the Phenotype, dbSNPid, chromosome and position, and NegativeLog10PBin fields (variant table).
The database is available for use in this package as
library(grasp2db)
GRASP2() # dbplyr representation
or more directly as
library(AnnotationHub)
db <- AnnotationHub()[["AH21414"]]
In both cases, the (large) data base is downloaded to a local cache (see documentation in the AnnotationHub package); this can take several minutes the first time the data base is used.