Package: similaRpeak
Authors: Astrid Deschenes [cre, aut], Elsa Bernatchez [aut],
Charles Joly Beauparlant [aut], Fabien Claude Lamaze [aut],
Rawane Samb [aut], Pascal Belleau [aut], Arnaud Droit [aut]
Version: 1.24.0
Compiled date: 2021-05-19
License: Artistic-2.0
Astrid Deschenes, Elsa Bernatchez, Charles Joly Beauparlant, Fabien Claude Lamaze, Rawane Samb, Pascal Belleau and Arnaud Droit.
This package and the underlying similaRpeak code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.
The similaRpeak package calculates metrics to estimate the level of similarity between two ChIP-Seq profiles.
The metrics are:
RATIO_AREA: The ratio between the profile areas.
The first profile is always divided by the second profile. NA
is returned
if minimal area threshold is not respected for at least one of the profiles.
DIFF_POS_MAX: The difference between the maximal peaks positions. The
difference is always the first profile value minus the second profile value.
NA
is returned if minimal peak value is not respected. A profile can have
more than one position with the maximum value. In that case, the median
position is used. A threshold argument can be set to consider all positions
within a certain range of the maximum value. A threshold argument can also be
set to ensure that the distance between two maximum values is not too wide.
When this distance is not respected, it is assumed that more than one peak
is present in the profile and NA
is returned.
RATIO_MAX_MAX: The ratio between the peaks values
in each profile. The first profile is always divided by the second profile.
NA
if minimal peak value threshold is not respected for at least one of the
profiles.
RATIO_INTERSECT: The ratio between the intersection area and the total
area of the two profiles. NA
if minimal area threshold is not respected for
the intersection area.
RATIO_NORMALIZED_INTERSECT: The ratio between the intersection area and
the total area of two normalized profiles. The profiles are normalized by
dividing them by their average value. NA
if minimal area threshold is not
respected for the intersection area.
SPEARMAN_CORRELATION: The Spearman’s rho statistic between profiles.
library(similaRpeak)
ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. For a specific region, the read count of aligned sequences at each position of the region is used to generate the ChIP-Seq profile for the region.
To estimate the level of similarity between two ChIP-Seq profiles for a
specific region, two vector
containing the profiles values, as reported in
reads per million (RPM) for each position of the selected region, are needed.
Both vector
should have the same length and should not contain any negative
value.
Aligned sequences are usually stored in BAM files. As example, a slimmed BAM
file (align1.bam) is selected as well as a specific region
(chr1:172081530-172083529). Using BAM file and
region information, represented by as a GRanges
object, the coverage for the
specified region is extracted using specialized Bioconductor packages.
suppressMessages(library(GenomicAlignments))
suppressMessages(library(rtracklayer))
suppressMessages(library(Rsamtools))
bamFile01 <- system.file("extdata/align1.bam", package = "similaRpeak")
region <- GRanges(Rle(c("chr1")), IRanges(start = 172081530, end = 172083529),
strand= Rle(strand(c("*"))))
param <- ScanBamParam(which = region)
alignments01 <- readGAlignments(bamFile01, param = param)
coveragesRegion01 <- coverage(alignments01)[region]
coveragesRegion01
## RleList of length 1
## $chr1
## integer-Rle of length 2000 with 92 runs
## Lengths: 297 30 2 4 14 18 19 36 ... 15 21 15 35 36 65 36 307
## Values : 0 1 2 3 2 1 0 1 ... 1 2 1 0 1 0 2 0
The coverages01
can
easily be transformed to a vector of numerical value to obtain the raw
ChIP-Seq profile for the selected region.
coveragesRegion01 <- as.numeric(coveragesRegion01$chr1)
length(coveragesRegion01)
## [1] 2000
summary(coveragesRegion01)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.907 1.000 6.000
To convert the raw coverage to reads per million (RPM), the total number of
reads present in the BAM file is needed to assign a weight at the coverage
function.
param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery=FALSE))
count01 <- countBam(bamFile01, param = param)$records
coveragesRPMRegion01 <- coverage(alignments01, weight=1000000/count01)[region]
coveragesRPMRegion01 <- as.numeric(coveragesRPMRegion01$chr1)
length(coveragesRPMRegion01)
## [1] 2000
summary(coveragesRPMRegion01)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 195.7 215.7 1294.5
The read per millions values are quite low for the coveragesRPMRegion01
because the original BAM file has been reduced in size to simplify the example.
Other examples are available on the worflows section of the Bioconductor website.
Finally, the metagene package , available on Bioconductor, can also be used to generate ChIP-Seq profiles. An example is available on metagene wiki.
Mathematically, a metric is considered as a function that quantifies the similarity between two objects.The function must return zero when the two objects are perfectly similar (identity of indiscernibles) and a non-negative value when are dissimilar.
The metrics present in the similaRpeak package do not strictly respect this standard but can all be translated to pseudometrics. A pseudometric is a function d which satisfies the axioms for a metric, except that instead of the identity of indiscernibles axiom, the metric must only return zero when it compare an object to itself.
By using the absolute value of the DIFF_POS_MAX metric, the definition of a pseudometric is formally respected. However, the respective position of the maximum peak of profiles is lost.
\[ |DIFF\_POS\_MAX| \]
By using the absolute value of the logarithm of the RATIO_AREA, RATIO_MAX_MAX, RATIO_INTERSECT and RATIO_NORMALIZED_INTERSECT metrics, the definition of a pseudometric is formally respected.
\[ |\log(RATIO)| \]
To ease comparison, the same ChIP-Seq profiles are used in each metric description section. Those are ChIP-Seq profiles of two histone post-transcriptional modifications linked to highly active enhancers H3K27ac (DCC accession: ENCFF000ASG) and H3K4me1 (DCC accession: ENCFF000ARY) from the Encyclopedia of DNA Elements (ENCODE) data (Dunham I et al. 2012).
Here is how to load the demoProfiles
data used in following sections. The
ChIP-Seq profiles of the enhancers H3K27ac and H3K4me1 for 4 specifics regions
are in reads per million (RPM).
data(demoProfiles)
str(demoProfiles)
## List of 4
## $ chr2.70360770.70361098 :List of 2
## ..$ H3K27ac: num [1:329] 13 12 12 11 11 11 9 9 7 7 ...
## ..$ H3K4me1: num [1:329] 6 7 7 7 7 8 8 8 8 9 ...
## $ chr3.73159773.73160145 :List of 2
## ..$ H3K27ac: num [1:373] 1 0 0 0 0 0 0 0 0 0 ...
## ..$ H3K4me1: num [1:373] 0 0 0 0 0 0 0 0 0 0 ...
## $ chr8.43092918.43093442 :List of 2
## ..$ H3K27ac: num [1:525] 82 82 81 81 80 80 79 78 78 76 ...
## ..$ H3K4me1: num [1:525] 1404 1398 1392 1383 1367 ...
## $ chr19.27739373.27739767:List of 2
## ..$ H3K27ac: num [1:395] 4 3 3 3 2 2 0 0 0 0 ...
## ..$ H3K4me1: num [1:395] 17 15 13 13 11 6 3 0 0 0 ...
The RATIO_AREA metric is the ratio between the profile areas. The first
profile (profile1
parameter) is always divided by the second profile
(profile2
parameter). NA
is returned if minimal area threshold
(ratioAreaThreshold
parameter, default = 1) is not respected for at least
one of the profiles.
The RATIO_AREA metric can be useful to detect regions with similar coverage even if the profiles are different.
Â
similarity(profile1, profile2, ratioAreaThreshold = 1)
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac,
demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_AREA
## [1] 1.025942
metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac,
demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_AREA
## [1] 0.05928535
metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac,
demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_AREA
## [1] 2.228563
metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac,
demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_AREA
## [1] 0.1245884
The DIFF_POS_MAX metric is the difference between the maximal peaks
positions. The difference is always the first profile value
(profile1
parameter) minus the second profile value
(profile2
parameter). NA
is returned if minimal peak value is not
respected. A profile can have more than one position with the maximum value.
In that case, the median position is used. A threshold (diffPosMaxTolerance
parameter) can be set to consider all positions within a certain range of the
maximum value. A threshold (diffPosMaxThresholdMaxDiff
parameter) can also
be set to ensure that the distance between two maximum values is not too wide.
When this distance is not respected, it is assumed that more than one peak is
present in the profile and NA
is returned. Finally, a threshold
(diffPosMaxThresholdMinValue
parameter) can be set to ensure that the
maximum value is egal or superior to a minimal value. When this minimum value
is not respected, it is assumed that no peak is present in the profile and
NA
is returned.
The DIFF_POS_MAX metric can be useful to detect regions with shifted peaks.