If you use the MungeSumstats package, please cite
The MungeSumstats package is designed to facilitate the standardisation of GWAS summary statistics as utilised in our Nature Genetics paper1.
The package is designed to handle the lack of standardisation of output files by the GWAS community. There is a group who have now manually standardised many GWAS: R interface to the IEU GWAS database API • ieugwasr and gwasvcf but because a lot of GWAS remain closed access, these repositories are not all encompassing.
The GWAS-Download project has collated summary statistics from 200+ GWAS. This repository has been utilsed to identify the most common formats, all of which can be standardised with MungeSumstats.
Moreover, there is an emerging standard of VCF format for summary statistics files with multiple, useful, associated R packages such as vcfR. However, there is currently no method to convert VCF formats to a standardised format that matches older approaches.
The MungeSumstats package standardises both VCF and the most common summary statistic file formats to enable downstream integration and analysis.
MungeSumstats also offers a number of Quality Control (QC) steps which are important prerequisites for downstream analysis like Linkage disequilibrium score regression (LDSC) and MAGMA.
MungeSumstats will ensure that the all essential columns for analysis are present and syntactically correct. Generally, summary statistic files include (but are not limited to) the columns:
Tests run by MungeSumstats include:
If a test is failed, the user will be notified and if possible, the input will
be corrected. The QC steps from the checks above can also be adjusted to suit
the user’s analysis, see MungeSumstats::format_sumstats
.
The MungeSumstats package contains small subsets of GWAS summary statistics files. Firstly, on Educational Attainment by Okbay et al 2016: PMID: 27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b.
Secondly, a VCF file (VCFv4.2) relating to the GWAS Amyotrophic lateral sclerosis from ieu open GWAS project. Dataset: ebi-a-GCST005647: https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST005647/
These datasets will be used to showcase MungeSumstats functionality.
MungeSumstats is available on Bioconductor. To install the package on Bioconductor run the following lines of code:
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("MungeSumstats")
Once installed, load the package:
library(MungeSumstats)
To standardise the summary statistics’ file format, simply call
format_sumstats()
passing in the path to your summary statistics file.
You must also specify which genome build was used in the GWAS(GRCh37 or GRCh38).
The reference genome is used for multiple checks like deriving missing
data such SNP/BP/CHR/A1/A2 and for QC steps like removing non-biallelic
SNPs or strand-ambiguous SNPs. The path to the reformatted summary statistics
file is returned by the function call.
Note that for a number of the checks implored by MungeSumstats a reference
genome is used. If your GWAS summary statistics file of interest relates to
GRCh38, you will need to install SNPlocs.Hsapiens.dbSNP144.GRCh38
and
BSgenome.Hsapiens.NCBI.GRCh38
from Bioconductor as follows:
BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
If your GWAS summary statistics file of interest relates to GRCh37, you will
need to install SNPlocs.Hsapiens.dbSNP144.GRCh37
and
BSgenome.Hsapiens.1000genomes.hs37d5
from Bioconductor as follows:
BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
These may take some time to install and are not included in the package as some users may only need one of GRCh37/GRCh38.
The Educational Attainment by Okbay GWAS summary statistics file is saved as a text document in the package’s external data folder so we can just pass the file path to it straight to MungeSumstats.
eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
package="MungeSumstats")
reformatted <-
MungeSumstats::format_sumstats(path=eduAttainOkbayPth,ref_genome="GRCh37")
## First line of summary statistics file:
## MarkerName CHR POS A1 A2 EAF Beta SE Pval
## Succesfully finished preparing sumstats file, preview:
## SNP CHR BP A1 A2 FRQ BETA SE P
## rs12646808 4 3249828 T C 0.6418 0.016 0.003 4.002e-08
## rs301800 1 8490603 T C 0.1791 0.019 0.003 1.794e-08
The hyperparameters format_sumstats
in that control the level of QC conducted
by MungeSumstats are:
VCF files can also be standardised to the same format as other summary statistic files. A subset of the Amyotrophic lateral sclerosis GWAS from the ieu open GWAS project (a .vcf file) has been added to MungeSumstats to demonstrate this functionality.Simply pass the path to the file in the same manner you would for other summary statistic files:
#save ALS GWAS from the ieu open GWAS project to a temp directory
ALSvcfPth <- system.file("extdata","ALSvcf.vcf",package="MungeSumstats")
#set a low INFO filter so we get a return
reformatted_vcf <-
MungeSumstats::format_sumstats(path=ALSvcfPth, ref_genome="GRCh37",INFO_filter=0.1)
## VCF format detected, this will be converted to a standard summary statistics file format.
## Inputted VCF format has -log10 P-values, these will be converted to unadjusted p-values in the 'P' column.
## First line of summary statistics file:
## CHROM POS REF ALT INFO ES SE LP AF ID P
## 12 SNPs are not on the reference genome. These will be corrected from the reference genome.
## Reordering so first three column headers are SNP, CHR and BP in this order.
## 67 SNPs are below the INFO threshold of 0.1 and will be removed
## Succesfully finished preparing sumstats file, preview:
## SNP CHR BP A1 A2 INFO BETA SE LP AF P
## rs58108140 1 10583 G A 0.1589 0.0312 0.0393 0.369267 0.1589 0.427300105456596
## rs806731 1 30923 G T 0.7843 -0.0114 0.0353 0.126854 0.7843 0.746699739815279
The MungeSumstats package aims to be able to handle the most common summary statistic file formats including VCF. If your file can not be formatted by MungeSumstats feel free to report the bug on github: https://github.com/neurogenomics/MungeSumstats along with your summary statistic file header.
We also encourage people to edit the code to resolve their particular issues too and are happy to incorporate these through pull requests on github. If your summary statistic file headers are not recognised by MungeSumstats but correspond to one of:
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS,
SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ
feel free to update the MungeSumstats::sumstatsColHeaders
following the
approach in the data.R file and add your mapping. Then use a pull request on
github and we will incorporate this change into the package.
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MungeSumstats_1.0.1 BiocStyle_2.20.2
##
## loaded via a namespace (and not attached):
## [1] restfulr_0.0.13
## [2] bslib_0.2.5.1
## [3] compiler_4.1.0
## [4] BiocManager_1.30.16
## [5] jquerylib_0.1.4
## [6] GenomeInfoDb_1.28.0
## [7] XVector_0.32.0
## [8] MatrixGenerics_1.4.0
## [9] bitops_1.0-7
## [10] tools_4.1.0
## [11] zlibbioc_1.38.0
## [12] digest_0.6.27
## [13] lattice_0.20-44
## [14] jsonlite_1.7.2
## [15] evaluate_0.14
## [16] BSgenome_1.60.0
## [17] rlang_0.4.11
## [18] Matrix_1.3-4
## [19] DelayedArray_0.18.0
## [20] yaml_2.2.1
## [21] parallel_4.1.0
## [22] xfun_0.24
## [23] GenomeInfoDbData_1.2.6
## [24] rtracklayer_1.52.0
## [25] stringr_1.4.0
## [26] knitr_1.33
## [27] Biostrings_2.60.1
## [28] S4Vectors_0.30.0
## [29] sass_0.4.0
## [30] IRanges_2.26.0
## [31] grid_4.1.0
## [32] stats4_4.1.0
## [33] data.table_1.14.0
## [34] Biobase_2.52.0
## [35] R6_2.5.0
## [36] XML_3.99-0.6
## [37] BiocParallel_1.26.0
## [38] rmarkdown_2.9
## [39] bookdown_0.22
## [40] magrittr_2.0.1
## [41] GenomicAlignments_1.28.0
## [42] Rsamtools_2.8.0
## [43] matrixStats_0.59.0
## [44] htmltools_0.5.1.1
## [45] BiocGenerics_0.38.0
## [46] GenomicRanges_1.44.0
## [47] SummarizedExperiment_1.22.0
## [48] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
## [49] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20
## [50] stringi_1.6.2
## [51] RCurl_1.98-1.3
## [52] rjson_0.2.20
## [53] crayon_1.4.1
## [54] BiocIO_1.2.0
1. Nathan G. Skene, T. E. B., Julien Bryois. Genetic identification of brain cell types underlying schizophrenia. Nature Genetics (2018). doi:10.1038/s41588-018-0129-5