Identifying probe SNPs in DNA methylation array data

Gillian Meeks and Shyamalika Gopalan

2024-12-02

Installation

probeSNPffer is available at https://bioconductor.org and can be installed via:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("probeSNPffer")
library(probeSNPffer)

Introduction

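This vignette demonstrates how to identify CpG probes in a DNA methylation array dataset that overlap with known SNPs, and how to calculate important metrics related to these probes. Our functions do this in two steps:

  1. extract.probe.regions() extracts the genomic regions covered by the probes for a set of CpG sites
  2. flag.overlap() identifies the SNPs that fall within these probe regions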

Here, we show how these functions can be used for quality control using examples from the results of Li et al. 2022 [1].

Extracting genomic regions corresponding to CpG probes

We start with a data frame containing information about the genomic position of our CpG sites of interest, as well as their strand and assay technology. This information is available in the manifest that accompanies the array platform used. There are also R packages for the most common DNA methylation arrays.

Our functions require this data frame to have these exact column names, in this exact order, to avoid ambiguity:

  1. ‘CpG_id’ (e.g. ‘cg00916680’)
  2. ‘chr’ (chromosome, e.g. ‘chrX’ or ‘X’)
  3. ‘CpG_position’ (base pair position on the corresponding chromosome, e.g. 152529487)
  4. ‘strand’ (i.e. ‘+’ or ‘F’ in the case of the forward strand; ‘-’ or ‘R’ in case of the reverse strand)
  5. ‘type’ (i.e. ‘I’ or 1 in case of Type 1 probes; ‘II’ or 2 in the case of Type 2 probes)

Additional columns (6 and beyond) containing metadata can also be included, such as, in our case, the difference in ancestry-specific effect size (‘beta_diff’). The object CpG_info contains information for the 135 CpG probes corresponding to the ancestry-specific meQTL identified by Li et al. and reported in their Supplementary Data 8. For probes associated with multiple ancestry-specific meQTL, the largest effect size difference (beta_diff) was taken. This data frame is previewed below:

        CpG_id   chr   CpG_pos strand type beta_diff
71  cg00116315  chr6  35466268      +   II     0.480
80  cg00295418  chr8   2021420      -   II     0.749
29  cg00443946 chr16  86370962      -    I     0.612
114 cg00495681 chr13  53174319      +   II     0.565
13  cg00688297  chr8 145752292      +    I     0.330
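For readers assembling this input by hand rather than from a manifest, here is a minimal sketch that rebuilds the five previewed rows and checks the column layout. The names CpG_info_example and has_expected_layout are illustrative assumptions and are not part of probeSNPffer; the column names are taken from the preview above.

```r
# Illustrative only: the five previewed rows rebuilt as a data frame.
CpG_info_example <- data.frame(
  CpG_id    = c("cg00116315", "cg00295418", "cg00443946",
                "cg00495681", "cg00688297"),
  chr       = c("chr6", "chr8", "chr16", "chr13", "chr8"),
  CpG_pos   = c(35466268, 2021420, 86370962, 53174319, 145752292),
  strand    = c("+", "-", "-", "+", "+"),
  type      = c("II", "II", "I", "II", "I"),
  beta_diff = c(0.480, 0.749, 0.612, 0.565, 0.330),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a package function): TRUE when the leading
# columns of `df` match `expected` in both name and order.
has_expected_layout <- function(df, expected) {
  ncol(df) >= length(expected) &&
    identical(names(df)[seq_along(expected)], expected)
}

has_expected_layout(
  CpG_info_example,
  c("CpG_id", "chr", "CpG_pos", "strand", "type")
)
```

Running a check of this kind before calling the package functions can catch a reordered or renamed column early.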

We can easily extract the genomic coordinates for the probes corresponding to these 135 CpG sites using the following code:

CpG_probe_bed <- extract.probe.regions(manifest_anno_object = CpG_info)

This results in a BED-formatted data frame that can be used for the next step:

    chr     start       end     CpG_id strand type   CpG_pos beta_diff
71    6  35466268  35466318 cg00116315      +    2  35466268     0.480
80    8   2021371   2021421 cg00295418      -    2   2021420     0.749
29   16  86370914  86370964 cg00443946      -    1  86370962     0.612
114  13  53174319  53174369 cg00495681      +    2  53174319     0.565
13    8 145752291 145752341 cg00688297      +    1 145752292     0.330
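To make the relationship between CpG_pos and the start and end columns concrete, the sketch below reproduces the five rows above from a small offset table. The offsets are an assumption inferred from these five rows only, and the function sketch_probe_region is hypothetical; this is not a description of how extract.probe.regions is implemented. It expects probe type already coded as 1 or 2, as in the output above.

```r
# Offsets of BED start/end relative to CpG_pos, keyed by "strand.type".
# ASSUMPTION: inferred from the five example rows, not from the package source.
probe_offsets <- rbind(
  "+.2" = c(0, 50),
  "-.2" = c(-49, 1),
  "+.1" = c(-1, 49),
  "-.1" = c(-48, 2)
)

# Hypothetical helper: look up the offsets for each probe and apply them.
sketch_probe_region <- function(CpG_pos, strand, type) {
  key <- paste(strand, type, sep = ".")
  # Indexing by row name fails with an error for unknown strand/type pairs.
  offs <- probe_offsets[key, , drop = FALSE]
  data.frame(
    start = CpG_pos + unname(offs[, 1]),
    end   = CpG_pos + unname(offs[, 2])
  )
}

sketch_probe_region(
  CpG_pos = c(35466268, 2021420, 86370962, 53174319, 145752292),
  strand  = c("+", "-", "-", "+", "+"),
  type    = c(2, 2, 1, 2, 1)
)
```

Under these assumed offsets every region spans 50 bases, and the region is shifted by one base between Type 1 and Type 2 probes on the same strand.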

Identifying CpG probes that overlap with SNPs

Now, we want to identify SNPs that fall within the probe sequences. For this vignette, we are particularly interested in SNPs that are extremely differentiated between populations of European and West African genetic ancestry. The object SNP_bed is a data frame containing BED-formatted information for an example selection of SNPs with Fst > 0.3 between African and European super-populations, as ascertained in the 1000 Genomes panel. This data frame looks like:

         CHROM       POS     POS.1       rsid REF ALT WEIR_AND_COCKERHAM_FST AFR_AF EUR_AF
43038962     2 183416696 183416697  rs6711313   A   G               0.434935 0.2322 0.7594
68849666     7 110201110 110201111 rs13243744   G   A               0.385094 0.0227 0.4056
17594097    15  53234107  53234108 rs11070941   T   G               0.366440 0.0431 0.4304
69852397     7 147628417 147628418  rs2263054   T   C               0.320888 0.6044  0.172
55252230     4 172519029 172519030  rs4362785   A   G               0.302976 0.4251 0.0517
52696640     4  80038606  80038607 rs28539890   C   T               0.328489 0.3578      0
11082008    12 122451186 122451187 rs11043288   A   G               0.453011 0.0333 0.4911
71896340     8  49046361  49046362  rs7000484   G   T               0.411004 0.7927 0.2883
63007068     6  75640275  75640276  rs4085731   A   G               0.548159 0.5893  0.006

As you can see, this BED-formatted data frame contains the genomic locations of our SNP set, as well as rsID and population-specific allele frequency information. It only needs to contain the SNP’s chromosome and BED-formatted positions (which must be columns 1, 2 and 3) and the REF and ALT alleles (which must be columns 5 and 6). Using this and the CpG probe BED data frame we generated in the previous step, we can extract all instances of overlap between SNPs and CpG probes using the following line of code. This function also checks for color-channel-switching SNPs at the single base extension (SBE) position of Type 1 probes, marking them as “cc_switch” if the REF/ALT pair will bias measurements or “not_cc_switch” if it will not. If the SNP is not at the SBE position, the function marks it as “not_SBE”. SBE SNPs that do not switch color channels can be ignored, so we drop them from the intersection data frame.

CpG_SNP_intersection <- flag.overlap(probe_bed = CpG_probe_bed, SNP_bed = SNP_bed[,1:6])
#> Calculating overlap between probe list and SNP list...
#> Annotating colour channel switching SNPs...
CpG_SNP_intersection <- CpG_SNP_intersection[CpG_SNP_intersection$col_chan_switching != "not_cc_switch", ]

   chr SNP_start   SNP_end     SNP_id SNP_ref SNP_alt CpG_start   CpG_end     CpG_id CpG_strand CpG_type   CpG_pos SNP_CpG_distance col_chan_switching      fst
10  14 106091981 106091982 rs10136838       G       A 106091932 106091982 cg14837792          -        2 106091981                1            not_SBE 0.416470
13  17  19361210  19361211 rs10491097       T       C  19361181  19361231 cg19949948          -        2  19361230               19            not_SBE 0.661177
2   10 134876495 134876496 rs10857704       G       A 134876446 134876496 cg04194432          -        2 134876495                1            not_SBE 0.352647
6   12 132537251 132537252 rs10902488       G       A 132537203 132537253 cg06813297          -        1 132537251                1            not_SBE 0.443998
32   2 242710953 242710954 rs10933569       A       G 242710938 242710988 cg01997813          +        1 242710939               15            not_SBE 0.450959
3   11  47213116  47213117 rs11039122       G       A  47213067  47213117 cg10938684          -        2  47213116                1            not_SBE 0.349096
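The interval logic behind an overlap call can be sketched in base R. This is only an illustration of the half-open BED comparison, using two probes and three SNPs copied from the tables in this vignette; it is not a description of the internals of flag.overlap, and the object names are assumptions made for the example.

```r
# Two probe regions and three SNPs taken from the tables shown in this vignette.
probes <- data.frame(
  chr    = c("14", "2"),
  start  = c(106091932, 242710938),
  end    = c(106091982, 242710988),
  CpG_id = c("cg14837792", "cg01997813"),
  stringsAsFactors = FALSE
)
snps <- data.frame(
  chr   = c("14", "2", "2"),
  start = c(106091981, 242710953, 183416696),
  end   = c(106091982, 242710954, 183416697),
  rsid  = c("rs10136838", "rs10933569", "rs6711313"),
  stringsAsFactors = FALSE
)

# Pair every probe with every SNP on the same chromosome.
pairs <- merge(probes, snps, by = "chr", suffixes = c("_probe", "_snp"))

# Half-open intervals [start, end) overlap when each one starts
# before the other one ends.
hits <- pairs[pairs$start_snp < pairs$end_probe &
                pairs$end_snp > pairs$start_probe, ]
hits[, c("chr", "CpG_id", "rsid")]
```

The SNP rs6711313 shares a chromosome with cg01997813 but lies far outside the probe region, so it is excluded. The pairwise merge is fine for a small example; for array-scale data a dedicated interval library would be the more practical choice.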

Example Analysis of CpG probe-SNP overlap: Distance Effects

Now we have the data we need to evaluate the extent to which these SNPs in CpG probe sequences are biasing the results of the local ancestry-specific meQTL analysis in Li et al. Here we plot the impact of probe SNP distance on delta effect size for this subset of SNPs.
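A plot of this kind could be produced along the following lines. This is a hedged sketch in base R graphics, not the code used to generate the vignette's figure: the helpers add_beta_diff and plot_distance_effect are hypothetical, and the sketch assumes that the intersection data frame has the columns CpG_id and SNP_CpG_distance and that the CpG data frame has CpG_id and beta_diff, as in the previews above.

```r
# Hypothetical helper: attach beta_diff from the CpG table to each overlap row.
add_beta_diff <- function(intersection, cpg_info) {
  merge(intersection, cpg_info[, c("CpG_id", "beta_diff")], by = "CpG_id")
}

# Hypothetical helper: scatter plot of SNP-CpG distance against the
# difference in ancestry-specific effect size.
plot_distance_effect <- function(df) {
  plot(
    df$SNP_CpG_distance, df$beta_diff,
    xlab = "Distance between SNP and CpG",
    ylab = "Difference in effect size (beta_diff)",
    pch  = 19
  )
  invisible(df)
}

# Intended use with the objects from this vignette (not run here):
# plot_distance_effect(add_beta_diff(CpG_SNP_intersection, CpG_info))
```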

Fst effects

Here we also plot the impact of probe SNP Fst on delta effect size, using the highest Fst SNP for probes with multiple SNPs.
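Selecting the highest-Fst SNP for each probe can be sketched in base R as follows. The helper top_fst_per_probe is hypothetical and not part of probeSNPffer; it assumes the intersection data frame carries the columns CpG_id and fst, as in the preview above.

```r
# Hypothetical helper: keep one row per probe, choosing the SNP with the
# highest Fst when a probe overlaps several SNPs.
top_fst_per_probe <- function(intersection) {
  # Sort so that the highest-Fst SNP comes first within each probe...
  ordered <- intersection[order(intersection$fst, decreasing = TRUE), ]
  # ...then keep only the first occurrence of each CpG_id.
  ordered[!duplicated(ordered$CpG_id), ]
}

# Intended use with the object from this vignette (not run here):
# CpG_SNP_top <- top_fst_per_probe(CpG_SNP_intersection)
```

The resulting data frame has exactly one row per probe, so it can be joined to the effect size differences without duplicating probes.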

Session Information

All of the output in this vignette was produced under the following conditions:

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ggplot2_3.5.1        dplyr_1.1.4          tibble_3.2.1        
#>  [4] probeSNPffer_0.99.4  GenomicRanges_1.59.1 GenomeInfoDb_1.43.2 
#>  [7] IRanges_2.41.1       S4Vectors_0.45.2     BiocGenerics_0.53.3 
#> [10] generics_0.1.3      
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6            jsonlite_1.8.9          compiler_4.5.0         
#>  [4] tidyselect_1.2.1        jquerylib_0.1.4         scales_1.3.0           
#>  [7] yaml_2.3.10             fastmap_1.2.0           R6_2.5.1               
#> [10] XVector_0.47.0          labeling_0.4.3          knitr_1.49             
#> [13] munsell_0.5.1           GenomeInfoDbData_1.2.13 bslib_0.8.0            
#> [16] pillar_1.9.0            rlang_1.1.4             utf8_1.2.4             
#> [19] cachem_1.1.0            xfun_0.49               sass_0.4.9             
#> [22] cli_3.6.3               withr_3.0.2             magrittr_2.0.3         
#> [25] zlibbioc_1.53.0         digest_0.6.37           grid_4.5.0             
#> [28] lifecycle_1.0.4         vctrs_0.6.5             evaluate_1.0.1         
#> [31] glue_1.8.0              farver_2.1.2            colorspace_2.1-1       
#> [34] fansi_1.0.6             rmarkdown_2.29          httr_1.4.7             
#> [37] tools_4.5.0             pkgconfig_2.0.3         htmltools_0.5.8.1      
#> [40] UCSC.utils_1.3.0

  1. Li, B., Aouizerat, B. E., Cheng, Y., Anastos, K., Justice, A. C., Zhao, H. & Xu, K. Incorporating local ancestry improves identification of ancestry-associated methylation signatures and meQTLs in African Americans. Commun. Biol. 5, 401 (2022).