Epigenomics 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 24 August, 2014
This case study takes a quick tour through the minfi Bioconductor package. The main goal is to become familiar with the use of Bioconductor objects and methods; see the vignette accompanying the minfi package for more background on analysis of Illumina arrays for methylation analysis.
Start by attaching the minfi
and minfiData
packages. Use
browseVignettes("minfi")
to access the vignette for additional
background information.
require(minfi)
require(minfiData)
browseVignettes("minfi")
The first step in any work flow is to read in the data. A sample data set is provided at the following location.
baseDir <- system.file("extdata", package = "minfiData")
baseDir
## [1] "/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata"
dir(baseDir)
## [1] "5723646052" "5723646053" "SampleSheet.csv"
dir(file.path(baseDir, "5723646052"))
## [1] "5723646052_R02C02_Grn.idat" "5723646052_R02C02_Red.idat"
## [3] "5723646052_R04C01_Grn.idat" "5723646052_R04C01_Red.idat"
## [5] "5723646052_R05C02_Grn.idat" "5723646052_R05C02_Red.idat"
Of course your own data would be at another location, and you might
enter the path (with 'tab completion') instead of using
system.file()
. A typical organization is that each 'slide'
(containing 12 arrays) is stored in a separate directory. The
top-level directory contains a .csv
file describing the samples;
inside each slide directory are IDAT files representing the output of
the Illumina scanner.
Next read in the sample sheet, and then the raw probe-level data. Take
a moment to use R to explore the sample sheet. Read the 'man' page for
read.450k.sheet
and read.450k.exp
to see what options are
available.
## 'pData'
targets <- read.450k.sheet(baseDir)
## [read.450k.sheet] Found the following CSV files:
## [1] "/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/SampleSheet.csv"
head(targets)
## Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID person age sex
## 1 GroupA_3 H5 NA GroupA NA id3 83 M
## 2 GroupA_2 D5 NA GroupA NA id2 58 F
## 3 GroupB_3 C6 NA GroupB NA id3 83 M
## 4 GroupB_1 F7 NA GroupB NA id1 75 F
## 5 GroupA_1 G7 NA GroupA NA id1 75 F
## 6 GroupB_2 H7 NA GroupB NA id2 58 F
## status Array Slide
## 1 normal R02C02 5.724e+09
## 2 normal R04C01 5.724e+09
## 3 cancer R05C02 5.724e+09
## 4 cancer R04C02 5.724e+09
## 5 normal R05C02 5.724e+09
## 6 cancer R06C02 5.724e+09
## Basename
## 1 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646052/5723646052_R02C02
## 2 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646052/5723646052_R04C01
## 3 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646052/5723646052_R05C02
## 4 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646053/5723646053_R04C02
## 5 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646053/5723646053_R05C02
## 6 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646053/5723646053_R06C02
## 'raw' probe-level data
RGset <- read.450k.exp(base = baseDir, targets = targets)
As a basic quality assessment, visualize the distribution of beta values across each array, coloring the density functions by sample. Are there any concerns about the data?
## Basic QA -- comparable densities across samples?
densityPlot(RGset, sampGroups = RGset$Sample_Group,
main = "Beta", xlab = "Beta")
A technical artifact is that probe intensities differ depending on
their sequence composition, so it is necessary to perform a
'background correction'. Also, the distribution of probe intensities
differ from one another as a consequence of sample preparation steps,
e.g., slightly different initial amounts of DNA from one sample
compared to another. Basic steps in microarray analysis (many of these
steps are shared by other high-throughput assays) are therefore
background correction and between-array normalization. Use the
preprocessIllumina()
command to perform these steps. Are there other
normalization strategies available in this package?
## background correction and normalization
## like Illumina Genome Studio (other approaches available)
MSet.norm <- preprocessIllumina(RGset, bg.correct = TRUE,
normalize = "controls", reference = 2)
Once data are background-corrected and normalized, it is possible to
compare the vector of methylation values of each sample. Use the
mdsPlot()
function to visualize the multidimensional relationship
between samples in reduced dimensions. Use arguments of mdsPlot()
to
name and highlight the different sample groups.
## How similar are the samples to one another?
mdsPlot(MSet.norm, numPositions = 1000, sampGroups = MSet.norm$Sample_Group,
sampNames = MSet.norm$Sample_Name)
Take a portion of the data (the first 100,000 probes), retrieve the
logit-transformed beta values, and then use dmpFinder()
CpGs
where methylation status is associated with sample group. From the
help page, references, and your own knowledge, any ideas about
shrinkVar
?
## Identify probes with methylation status differing between groups
mset <- MSet.norm[1:100000,]
## logit(beta)
M <- getM(mset, type = "beta", betaThreshold = 0.001)
dmp <- dmpFinder(M, pheno=mset$Sample_Group, type="categorical")
head(dmp)
## intercept f pval qval
## cg18207348 3.916 4499 2.960e-07 0.01382
## cg07485273 4.273 3630 4.545e-07 0.01382
## cg02025578 -6.526 2970 6.788e-07 0.01382
## cg16650529 -2.236 2739 7.977e-07 0.01382
## cg12019520 9.964 2011 1.479e-06 0.01936
## cg10805483 -9.964 1706 2.053e-06 0.01936
Visualize differential methylation using plotCpg()
plotCpg(mset, cpg=rownames(dmp)[1], pheno=mset$Sample_Group)
Probes interrogate genomic locations. Use mapToGenome()
to translate
the probe identifiers to genomic coordinates. This transforms mset
into an object of class SummarizedExperiment
. A
SummarizedExperiment
is similar to an expression set, bui with
GRanges
to annotate rows, rather than a data.frame
. Use
rowData()
to extract the GRanges
from the SummarizedExperiment
;
explore it. Add the annotations about differentially expressed probes
to the row data of the SummarizedExperiment
.
## Genomic locations
mset <- mset[rownames(dmp),]
mse <- mapToGenome(mset) # 'SummarizedExperiment'
rowData(mse)
## GRanges with 100000 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg13869341 chr1 [ 15865, 15865] *
## cg12045430 chr1 [ 29407, 29407] *
## cg20826792 chr1 [ 29425, 29425] *
## cg00381604 chr1 [ 29435, 29435] *
## cg17149495 chr1 [530959, 530959] *
## ... ... ... ...
## cg02050847 chrY [22918038, 22918038] *
## cg25427172 chrY [23566730, 23566730] *
## cg10267609 chrY [24453757, 24453757] *
## cg26983430 chrY [24549675, 24549675] *
## cg08265308 chrY [28555550, 28555550] *
## ---
## seqlengths:
## chr1 chr2 chr3 chr4 chr5 chr6 ... chr13 chr14 chr15 chrX chrY
## NA NA NA NA NA NA ... NA NA NA NA NA
mcols(rowData(mse)) <- cbind(mcols(rowData(mse)), dmp)