This vignette is intended to be a script that can easily be modified for your own COCOA analysis. It is unevaluated because COCOA normally involves large objects. For a description of the goals of the method and a practical example with only a few region sets, see the “Introduction to Coordinate Covariation Analysis” vignette. Visualization functions can also be found in that vignette.
To begin, let’s load the necessary packages:
library(LOLA)
library(COCOA)
For this vignette, we will demonstrate COCOA with single base pair resolution DNA methylation data.
First, we need the results from PCA of the DNA methylation data, obtained with prcomp()
. prcomp
could take a while for large datasets (eg longer than 30 minutes). We will also need the genomic coordinates for each cytosine in the DNA methylation data. As one way of doing this, we could store the methlylation level matrix and the coordinates as items in a named list called “methylData”. The first item in the list – “coordinates” – would be a GRanges object with coordinates for the cytosines (or alternatively a data.frame with “chr” and “start” columns). The second item in the list – “methylProp” – would be a matrix of the methylation levels, with cytosines as rows (same order as “coordinates”) and samples as columns. We will assume that structure for the following code.
# the genomic coordinates
signalCoord <- methylData$coordinates
# consider whether centering or scaling is needed
mPCA <- prcomp(x=t(methylData$methylProp), center=TRUE, scale.=FALSE)
After the PCA, you might want to look at plots of the first few PCs and consider removing extreme outliers and rerunning the PCA.
Besides the PCA results, we need region sets. Normally we should use COCOA with a large collection of region sets. The collection of region sets can come from anywhere but in this vignette, we will use the LOLA Core region set database which includes region sets from public sources such as the ENCODE, CODEX, and Cistrome Projects. Documentation on LOLA is here and you can check out LOLA region set databases here. The AnnotationHub package also contains region sets and may be of interest. In this vignette, we are only using the LOLA Core database. Let’s load the database (this may take a few minutes):
# reading in the region sets
# load LOLA database
lolaPath <- paste0("path/to/LOLACore/genomeVersion/")
regionSetDB <- loadRegionDB(lolaPath)
# metadata about the region sets
loRegionAnno <- regionSetDB$regionAnno
lolaCoreRegionAnno <- loRegionAnno
collections <- c("cistrome_cistrome", "cistrome_epigenome", "codex",
"encode_segmentation", "encode_tfbs", "ucsc_features")
collectionInd <- lolaCoreRegionAnno$collection %in% collections
lolaCoreRegionAnno <- lolaCoreRegionAnno[collectionInd, ]
regionSetName <- lolaCoreRegionAnno$filename
regionSetDescription <- lolaCoreRegionAnno$description
# the actual region sets
GRList <- GRangesList(regionSetDB$regionGRL[collectionInd])
# since we have what we need, we can delete this to free up memory
rm("regionSetDB")
Now we test how much each region set is associated with each principal component:
PCsToAnnotate <- paste0("PC", 1:6)
regionSetScores <- runCOCOA(loadingMat=mPCA$rotation,
signalCoord=signalCoord,
GRList=GRList,
PCsToAnnotate=PCsToAnnotate,
scoringMetric="regionMean")
regionSetScores$regionSetName <- regionSetName
regionSetScores$regionSetDescription <- regionSetDescription
We can sort the results to see which region sets are most associated with each PC.
View(regionSetScores[order(regionSetScores$PC1, decreasing=TRUE), ])
View(regionSetScores[order(regionSetScores$PC2, decreasing=TRUE), ])
# etc.
Then we can do further analysis and visualization of the top scoring region sets, as described in the vignette “Introduction to Coordinate Covariation Analysis”.
In the big picture, COCOA can help you understand sources of variation in your data. For COCOA with DNA methylation data, top region sets should have DNA methylation changes among samples along the PC axis. You can screen out region sets that have a low number of regions/cytosines covered in the DNA methylation data since these are more likely to be ranked highly by chance. Top region sets may capture some of the DNA methylation changes happening along the PC axes. However, they probably just capture a small window of the broader changes taking place and so it is likely that many other changes are taking place as well. Therefore, top region sets can be interpreted as reflecting broader changes but not necessarily defining a given PC. As further confirmation of your COCOA results, you can look at the DNA methylation in the regions of your top region sets that have the highest loading values. These regions should show changes in DNA methylation along the PC axis. By identifying region sets where DNA methylation varies along the PC axis, you can understand variation in the data better and this can inspire further directions of research.
When using COCOA with the “regionMean” scoring method (preferred), a database of 2000 region sets, and 450,000 cytosines in the DNA methylation data, it is estimated that COCOA will take between 30 minutes and 1.5 hours. The computer will need enough memory for the large objects used although some memory could be saved by splitting the region set database into parts, running separately on each part, and then combining the results.