probeSNPffer is available at https://bioconductor.org and can be installed via:
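
A minimal sketch of the installation step, assuming the standard BiocManager workflow used for Bioconductor packages (the package name is taken from the session information at the end of this vignette):

```r
# Assumption: the usual Bioconductor installation route via BiocManager.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("probeSNPffer")

# Load the package once it is installed.
library(probeSNPffer)
```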
This vignette demonstrates how to identify CpG probes in a DNA methylation array dataset that overlap with known SNPs, and calculate important metrics related to these probes. Our functions do this in two steps:
First, extract.probe.regions() takes positional, strand, and technology type information about a set of CpG probes (e.g. derived from an Illumina DNA methylation array manifest) and returns the genomic coordinates of the entire probe region in BED format.
Second, flag.overlap() takes the BED file generated in the previous step, along with a BED file containing the SNPs to be queried, and returns their intersection.
Here, we show how these functions can be used for quality control, using examples from the results of Li et al. (2022) [1].
We start with a data frame containing the genomic position of our CpG sites of interest, as well as their strand and assay technology. This information is available in the manifest that accompanies the array platform used. R annotation packages that provide it also exist for the most common DNA methylation arrays.
Our functions require this data frame to have the following column names, in this exact order, to avoid ambiguity: CpG_id, chr, CpG_pos, strand, type.
Additional columns (6 and beyond) can also be included containing metadata, such as, in our case, the difference in ancestry-specific effect size (‘beta_diff’). CpG info contains information for the 135 CpG probes corresponding to the ancestry-specific meQTL identified by Li et al., and reported in their Supplementary Data 8. The largest effect size difference (beta_diff) was taken for probes associated with multiple ancestry-specific meQTL. This data frame is previewed below:
| | CpG_id | chr | CpG_pos | strand | type | beta_diff |
|---|---|---|---|---|---|---|
| 71 | cg00116315 | chr6 | 35466268 | + | II | 0.480 |
| 80 | cg00295418 | chr8 | 2021420 | - | II | 0.749 |
| 29 | cg00443946 | chr16 | 86370962 | - | I | 0.612 |
| 114 | cg00495681 | chr13 | 53174319 | + | II | 0.565 |
| 13 | cg00688297 | chr8 | 145752292 | + | I | 0.330 |
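
Because the column names and their order matter, it can help to check a data frame before passing it on. The sketch below rebuilds the five previewed rows and checks the layout of the first five columns. `check_cpg_columns` is a hypothetical helper written for illustration, not a probeSNPffer function, and the required names are read off the preview table above.

```r
# Rebuild the five previewed rows of the CpG information data frame.
cpg_preview <- data.frame(
  CpG_id    = c("cg00116315", "cg00295418", "cg00443946",
                "cg00495681", "cg00688297"),
  chr       = c("chr6", "chr8", "chr16", "chr13", "chr8"),
  CpG_pos   = c(35466268, 2021420, 86370962, 53174319, 145752292),
  strand    = c("+", "-", "-", "+", "+"),
  type      = c("II", "II", "I", "II", "I"),
  beta_diff = c(0.480, 0.749, 0.612, 0.565, 0.330),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of probeSNPffer): stop with an error unless
# the first five columns carry the expected names in the expected order.
check_cpg_columns <- function(df) {
  required <- c("CpG_id", "chr", "CpG_pos", "strand", "type")
  ok <- ncol(df) >= length(required) &&
    identical(names(df)[seq_along(required)], required)
  if (!ok) {
    stop("First five columns must be: ", paste(required, collapse = ", "))
  }
  invisible(TRUE)
}

check_cpg_columns(cpg_preview)
```

Metadata columns such as beta_diff sit after the five required columns, so the check ignores them.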
We can easily extract the genomic coordinates for the probes corresponding to these 135 CpG sites using the following code:
This results in a BED-formatted file that can be used for the next step:
| | chr | start | end | CpG_id | strand | type | CpG_pos | beta_diff |
|---|---|---|---|---|---|---|---|---|
| 71 | 6 | 35466268 | 35466318 | cg00116315 | + | 2 | 35466268 | 0.480 |
| 80 | 8 | 2021371 | 2021421 | cg00295418 | - | 2 | 2021420 | 0.749 |
| 29 | 16 | 86370914 | 86370964 | cg00443946 | - | 1 | 86370962 | 0.612 |
| 114 | 13 | 53174319 | 53174369 | cg00495681 | + | 2 | 53174319 | 0.565 |
| 13 | 8 | 145752291 | 145752341 | cg00688297 | + | 1 | 145752292 | 0.330 |
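
To make the relationship between the input and output tables concrete, the following base-R sketch reproduces the previewed start and end coordinates from CpG_pos, strand, and type. It is a hypothetical re-implementation for illustration only: `sketch_probe_regions` is not a package function, and the offsets are inferred from the five previewed rows rather than taken from the probeSNPffer source.

```r
# The five previewed CpG probes (required columns only).
example_cpg <- data.frame(
  CpG_id  = c("cg00116315", "cg00295418", "cg00443946",
              "cg00495681", "cg00688297"),
  chr     = c("chr6", "chr8", "chr16", "chr13", "chr8"),
  CpG_pos = c(35466268, 2021420, 86370962, 53174319, 145752292),
  strand  = c("+", "-", "-", "+", "+"),
  type    = c("II", "II", "I", "II", "I"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch, NOT the package's code. Offsets inferred from the
# preview tables: every region spans 50 bases, and its start relative to
# CpG_pos depends on the strand and probe type.
sketch_probe_regions <- function(cpg) {
  plus   <- cpg$strand == "+"
  type_1 <- cpg$type == "I"
  start_offset <- ifelse(plus,
                         ifelse(type_1, -1, 0),      # "+" strand
                         ifelse(type_1, -48, -49))   # "-" strand
  start <- cpg$CpG_pos + start_offset
  data.frame(
    chr     = sub("^chr", "", cpg$chr),  # output preview drops the "chr" prefix
    start   = start,
    end     = start + 50,
    CpG_id  = cpg$CpG_id,
    strand  = cpg$strand,
    type    = ifelse(type_1, 1, 2),      # output preview codes type as 1 or 2
    CpG_pos = cpg$CpG_pos,
    stringsAsFactors = FALSE
  )
}

sketch_probe_regions(example_cpg)
```

In the preview, Type I regions are shifted by one base relative to Type II regions on the same strand. This may correspond to the single base extension position, but the tables alone do not confirm that.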
Now, we want to identify SNPs that fall within the probe. For this vignette, we are particularly interested in SNPs that are extremely differentiated between populations of European and West African genetic ancestry. The object SNP_bed is a data frame containing BED-formatted information for an example selection of SNPs with Fst > 0.3 between African and European super-populations, as ascertained in the 1000 Genomes panel. This data frame looks like:
| | CHROM | POS | POS.1 | rsid | REF | ALT | WEIR_AND_COCKERHAM_FST | AFR_AF | EUR_AF |
|---|---|---|---|---|---|---|---|---|---|
| 43038962 | 2 | 183416696 | 183416697 | rs6711313 | A | G | 0.434935 | 0.2322 | 0.7594 |
| 68849666 | 7 | 110201110 | 110201111 | rs13243744 | G | A | 0.385094 | 0.0227 | 0.4056 |
| 17594097 | 15 | 53234107 | 53234108 | rs11070941 | T | G | 0.366440 | 0.0431 | 0.4304 |
| 69852397 | 7 | 147628417 | 147628418 | rs2263054 | T | C | 0.320888 | 0.6044 | 0.172 |
| 55252230 | 4 | 172519029 | 172519030 | rs4362785 | A | G | 0.302976 | 0.4251 | 0.0517 |
| 52696640 | 4 | 80038606 | 80038607 | rs28539890 | C | T | 0.328489 | 0.3578 | 0 |
| 11082008 | 12 | 122451186 | 122451187 | rs11043288 | A | G | 0.453011 | 0.0333 | 0.4911 |
| 71896340 | 8 | 49046361 | 49046362 | rs7000484 | G | T | 0.411004 | 0.7927 | 0.2883 |
| 63007068 | 6 | 75640275 | 75640276 | rs4085731 | A | G | 0.548159 | 0.5893 | 0.006 |
As you can see, this BED file contains the genomic locations of our SNP set, as well as rsID and population-specific allele frequency information. Only some of these columns are required: the SNP’s chromosome and BED-formatted start and end positions (which must be columns 1, 2, and 3) and the REF and ALT alleles (which must be columns 5 and 6).
Using this and the CpG probe BED file generated in the previous step, flag.overlap() extracts all instances of overlap between SNPs and CpG probes. The function also checks for color-channel-switching SNPs at the single base extension (SBE) position of Type I probes, marking each one “cc_switch” if the REF/ALT pair will bias measurements or “not_cc_switch” if it will not. A SNP that is not at the SBE position is marked “not_SBE”. SBE SNPs that do not switch color channel can be ignored, so the code below runs the overlap and then drops them from the intersection data frame.
CpG_SNP_intersection <- flag.overlap(probe_bed = CpG_probe_bed, SNP_bed = SNP_bed[, 1:6])
#> Calculating overlap between probe list and SNP list...
#> Annotating colour channel switching SNPs...
CpG_SNP_intersection <- CpG_SNP_intersection[CpG_SNP_intersection$col_chan_switching != "not_cc_switch", ]
| | chr | SNP_start | SNP_end | SNP_id | SNP_ref | SNP_alt | CpG_start | CpG_end | CpG_id | CpG_strand | CpG_type | CpG_pos | SNP_CpG_distance | col_chan_switching | fst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 14 | 106091981 | 106091982 | rs10136838 | G | A | 106091932 | 106091982 | cg14837792 | - | 2 | 106091981 | 1 | not_SBE | 0.416470 |
| 13 | 17 | 19361210 | 19361211 | rs10491097 | T | C | 19361181 | 19361231 | cg19949948 | - | 2 | 19361230 | 19 | not_SBE | 0.661177 |
| 2 | 10 | 134876495 | 134876496 | rs10857704 | G | A | 134876446 | 134876496 | cg04194432 | - | 2 | 134876495 | 1 | not_SBE | 0.352647 |
| 6 | 12 | 132537251 | 132537252 | rs10902488 | G | A | 132537203 | 132537253 | cg06813297 | - | 1 | 132537251 | 1 | not_SBE | 0.443998 |
| 32 | 2 | 242710953 | 242710954 | rs10933569 | A | G | 242710938 | 242710988 | cg01997813 | + | 1 | 242710939 | 15 | not_SBE | 0.450959 |
| 3 | 11 | 47213116 | 47213117 | rs11039122 | G | A | 47213067 | 47213117 | cg10938684 | - | 2 | 47213116 | 1 | not_SBE | 0.349096 |
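
Conceptually, the intersection is an overlap between half-open BED intervals on the same chromosome. The base-R sketch below illustrates the idea using rows copied from the tables above: two probes, two SNPs that overlap them, and one SNP from the SNP preview that does not. `sketch_overlap` is a hypothetical helper, not the package's implementation, and the distance formula (absolute difference between the SNP end and CpG_pos) is inferred from the previewed output rather than from the package source.

```r
# Two probe regions and three SNPs, copied from the tables in this vignette.
probes <- data.frame(
  chr     = c("14", "2"),
  start   = c(106091932, 242710938),
  end     = c(106091982, 242710988),
  CpG_id  = c("cg14837792", "cg01997813"),
  CpG_pos = c(106091981, 242710939),
  stringsAsFactors = FALSE
)
snps <- data.frame(
  chr   = c("14", "2", "2"),
  start = c(106091981, 242710953, 183416696),
  end   = c(106091982, 242710954, 183416697),
  rsid  = c("rs10136838", "rs10933569", "rs6711313"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch, NOT the package's code.
sketch_overlap <- function(probes, snps) {
  # Pair every SNP with every probe on the same chromosome.
  pairs <- merge(snps, probes, by = "chr", suffixes = c("_snp", "_probe"))
  # Half-open intervals [start, end) overlap when each one starts
  # before the other one ends.
  hit <- pairs$start_snp < pairs$end_probe & pairs$end_snp > pairs$start_probe
  pairs <- pairs[hit, , drop = FALSE]
  # Assumption: distance inferred from the previewed SNP_CpG_distance column.
  pairs$SNP_CpG_distance <- abs(pairs$end_snp - pairs$CpG_pos)
  pairs[, c("chr", "rsid", "CpG_id", "SNP_CpG_distance")]
}

sketch_overlap(probes, snps)
```

The per-chromosome merge builds every SNP-probe pair before filtering, which is fine for a small example but scales poorly. For genome-wide SNP sets, an interval-tree approach such as `findOverlaps()` from GenomicRanges (listed in the session information below) is likely a better fit.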
Now we have the data we need to evaluate the extent to which these SNPs in CpG probe sequences are biasing the results of the local ancestry-specific meQTL analysis in Li et al. Here we plot the impact of probe SNP distance on delta effect size for this subset of SNPs.
Here we also plot the impact of probe SNP Fst on delta effect size, using the highest Fst SNP for probes with multiple SNPs.
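
Keeping only the highest-Fst SNP for each probe can be sketched in base R as follows. The helper name `top_fst_per_probe` is hypothetical, and the column names `CpG_id` and `fst` are assumed to match the intersection table shown earlier. The example rows are copied from that table.

```r
# Hypothetical helper (not part of probeSNPffer): keep one row per probe,
# choosing the SNP with the largest Fst.
top_fst_per_probe <- function(intersection) {
  # Sort by probe, then by decreasing Fst, so that the first row
  # seen for each probe is its highest-Fst SNP.
  ord <- order(intersection$CpG_id, -intersection$fst)
  sorted <- intersection[ord, , drop = FALSE]
  sorted[!duplicated(sorted$CpG_id), , drop = FALSE]
}

# Three rows copied from the intersection preview.
preview <- data.frame(
  CpG_id = c("cg14837792", "cg19949948", "cg04194432"),
  SNP_id = c("rs10136838", "rs10491097", "rs10857704"),
  fst    = c(0.416470, 0.661177, 0.352647),
  stringsAsFactors = FALSE
)

top_fst_per_probe(preview)
```

Each previewed probe has a single SNP, so all three rows are kept here. The reduction only takes effect for probes that overlap several SNPs. The result could then be joined to the effect size differences with `merge()` on `CpG_id` before plotting.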
# Session Information
All of the output in this vignette was produced under the following conditions:
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 dplyr_1.1.4 tibble_3.2.1
#> [4] probeSNPffer_0.99.15 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [7] IRanges_2.41.1 S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [10] generics_0.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 jsonlite_1.8.9 compiler_4.5.0
#> [4] tidyselect_1.2.1 jquerylib_0.1.4 scales_1.3.0
#> [7] yaml_2.3.10 fastmap_1.2.0 R6_2.5.1
#> [10] XVector_0.47.0 labeling_0.4.3 knitr_1.49
#> [13] munsell_0.5.1 GenomeInfoDbData_1.2.13 bslib_0.8.0
#> [16] pillar_1.9.0 rlang_1.1.4 utf8_1.2.4
#> [19] cachem_1.1.0 xfun_0.49 sass_0.4.9
#> [22] cli_3.6.3 withr_3.0.2 magrittr_2.0.3
#> [25] zlibbioc_1.53.0 digest_0.6.37 grid_4.5.0
#> [28] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.1
#> [31] glue_1.8.0 farver_2.1.2 colorspace_2.1-1
#> [34] fansi_1.0.6 rmarkdown_2.29 httr_1.4.7
#> [37] tools_4.5.0 pkgconfig_2.0.3 htmltools_0.5.8.1
#> [40] UCSC.utils_1.3.0
[1] Li, B., Aouizerat, B. E., Cheng, Y., Anastos, K., Justice, A. C., Zhao, H. & Xu, K. Incorporating local ancestry improves identification of ancestry-associated methylation signatures and meQTLs in African Americans. Commun. Biol. 5, 401 (2022).