Dimensionality reduction is used to represent high dimensional data in a more tractable form. It is commonly used in RNA-seq analysis, where each sample is characterised by tens of thousands of gene expression values, to visualise samples in a 2D plane with distances between points representing similarity and dissimilarity. For RNA-seq the data used is generally gene counts, for methylation there are generally two relevant count matrices, the count of methylated bases, and the count of unmethylated bases. The information from these two matrices can be combined by taking log-methylation ratios as done in Chen et al. 2018.
It is assumed that users of this package have imported the data into the gzipped tabix format as described in the “Importing Data” vignette. From there, further processing is required to create the log-methylation-ratio matrix used in dimensionality reduction. Namely we go through the BSseq format as it is easily coerced into the desired matrix and is itself useful for various other analyses.
library(NanoMethViz)
# import example NanoMethResult object
nmr <- load_example_nanomethresult()
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
nmr
## An object of class "NanoMethResult"
## Slot "methy":
## [1] "/tmp/RtmptRv4fu/Rinst1931126bfd4d83/NanoMethViz/methy_subset.tsv.bgz"
##
## Slot "samples":
## # A tibble: 6 × 2
## sample group
## <chr> <fct>
## 1 B6Cast_Prom_1_bl6 bl6
## 2 B6Cast_Prom_1_cast cast
## 3 B6Cast_Prom_2_bl6 bl6
## 4 B6Cast_Prom_2_cast cast
## 5 B6Cast_Prom_3_bl6 bl6
## 6 B6Cast_Prom_3_cast cast
##
## Slot "exons":
## # A tibble: 303 × 7
## gene_id chr strand start end transcript_id symbol
## <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 12189 chr11 - 101551769 101551879 92234 Brca1
## 2 12189 chr11 - 101549015 101549113 92234 Brca1
## 3 12189 chr11 - 101539981 101540034 92234 Brca1
## 4 12189 chr11 - 101535528 101535605 92234 Brca1
## 5 12189 chr11 - 101533939 101534027 92234 Brca1
## 6 12189 chr11 - 101532020 101532159 92234 Brca1
## 7 12189 chr11 - 101530992 101531094 92234 Brca1
## 8 12189 chr11 - 101529791 101529836 92234 Brca1
## 9 12189 chr11 - 101528001 101528077 92234 Brca1
## 10 12189 chr11 - 101523325 101526639 92234 Brca1
## # ℹ 293 more rows
# convert to bsseq
bss <- methy_to_bsseq(nmr)
bss
## An object of type 'BSseq' with
## 4778 methylation loci
## 6 samples
## has not been smoothed
## All assays are in-memory
We can generate the log-methylation-ratio based on individual methylation sites or computed over genes, or other feature types. Aggregating over features will generally provide more stable and robust results, here we wil use genes.
# create gene annotation from exon annotation
gene_anno <- exons_to_genes(NanoMethViz::exons(nmr))
# create log-methylation-ratio matrix
lmr <- bsseq_to_log_methy_ratio(bss, regions = gene_anno)
NanoMethViz currently provides two options, a MDS plot based on the limma implementation of MDS, and a PCA plot using BiocSingular.
plot_mds(lmr) +
ggtitle("MDS Plot")
plot_pca(lmr) +
ggtitle("PCA Plot")
Additional coloring and labeling options can be provided via arguments to either function. Further customisations can be done using typical ggplot2 commands.
new_labels <- gsub("B6Cast_Prom_", "", colnames(lmr))
new_labels <- gsub("(\\d)_(.*)", "\\2 \\1", new_labels)
groups <- gsub(" \\d", "", new_labels)
plot_mds(lmr, labels = new_labels, groups = groups) +
ggtitle("MDS Plot") +
scale_colour_brewer(palette = "Set1")