CytoMethIC is a package that provides model data and functions for
applying machine learning models that use DNA methylome data to classify
cancer type and predict phenotype for a sample. The primary motivation
for developing this package is to abstract away the low-level,
accessibility-limiting code required to use machine learning models in R.
The package provides this abstraction for randomForest, e1071 support
vector machine, extreme gradient boosting, and TensorFlow models. It is
paired with an ExperimentHub component, which contains our lab's models
developed for epigenetic cancer classification and phenotype prediction.
These include CNS tumor classification, pan-cancer classification, race
prediction, cell-of-origin classification, and subtype classification
models.
The available models are listed below:
EHID | ModelID | PredictionLabel |
---|---|---|
NA | Age_HM450_20240504 | Age prediction (year) |
NA | Age_HM450_20240611 | Age prediction (year) |
NA | Age_MM285_20220101 | Age prediction (year) |
NA | Age_MM285_20230101 | Age prediction (year) |
NA | CellMethID_mouseBlood_MM285 | Deconvolution model for mouse blood components |
NA | LeukoFrac_HM27_20240614 | Leukocyte fraction prediction (%) |
NA | LeukoFrac_HM450_20240614 | Leukocyte fraction prediction (%) |
NA | MIR200C_EPIC_20240315 | Mesenchymal score based on Mir200C meth [0-1] |
NA | Race3_InfHum3_20240114 | Races (N=3) |
EH8421 | Race5_rfc | Races (N=5) |
NA | Race5_rfcTCGA_InfHum3 | Races (N=5) |
NA | RepliTali_EPIC_20240315 | Replication/mitotic age (scale-less) |
NA | Sex2_HM450_20240114 | Sex (N=2) |
NA | Sex2_MM285_20240114 | Sex (N=2) |
NA | TissueComp_EPIC_20240717 | Tissue composition (%) |
NA | TissueComp_EPICv2_20240717 | Tissue composition (%) |
NA | TissueComp_HM450_20240827 | Tissue composition (%) |
NA | TissueComp_MSA_20240717 | Tissue composition (%) |
NA | TissueType_EPIC_20240610 | Dominating tissue type |
NA | TissueType_EPIC_20240624 | Dominating tissue type |
NA | TissueType_EPICv2_20240716 | Dominating tissue type |
Models with an EHID in the table above can be accessed with ExperimentHub()[["EHID"]], replacing EHID with the model's identifier.
Models whose EHID is NA are available in the CytoMethIC_models GitHub repository (https://github.com/zhou-lab/CytoMethIC_models). You can download them directly and load them with readRDS(). Examples of both approaches are shown below.
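The download-and-load approach can be wrapped in a small helper so that a model is fetched only once. The following is a minimal sketch, not part of CytoMethIC: the function names cmi_model_url and cmi_load_cached are hypothetical, and it assumes every model file in the repository is named after its ModelID with an .rds extension, as in the links used in this document.

```r
# Base URL taken from the model links used in this document
cmi_base_url <- "https://github.com/zhou-lab/CytoMethIC_models/raw/refs/heads/main/models"

# Hypothetical helper: build the download URL for a ModelID from the table.
# Assumes the file name is "<ModelID>.rds".
cmi_model_url <- function(model_id) {
  paste0(cmi_base_url, "/", model_id, ".rds")
}

# Hypothetical helper: download a model once, then reuse the local copy
cmi_load_cached <- function(model_id, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(model_id, ".rds"))
  if (!file.exists(path)) {
    # mode = "wb" keeps the binary .rds file intact on all platforms
    download.file(cmi_model_url(model_id), path, mode = "wb")
  }
  readRDS(path)
}

# Usage (requires network access):
# model <- cmi_load_cached("Sex2_HM450_20240114")
```

Passing a persistent directory as cache_dir instead of tempdir() would keep the downloaded models across R sessions.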
library(sesame)
library(CytoMethIC)
betasHM450 = imputeBetas(sesameDataGet("HM450.1.TCGA.PAAD")$betas)
To make models work with data from an incompatible platform, you can try mLiftOver to convert the data first. The example below loads the HM450 sex model and applies it to the HM450 betas:
model = readRDS(url("https://github.com/zhou-lab/CytoMethIC_models/raw/refs/heads/main/models/Sex2_HM450_20240114.rds"))
cmi_predict(betasHM450, model)
## $score
## [1] 0.8132805
##
## $sex
## [1] "MALE"
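The platform compatibility issue mentioned above comes down to the probe set: a model expects the probe IDs of the array it was trained on. The sketch below illustrates that idea in base R only. It is not how mLiftOver is implemented; the helper align_to_probes, the probe IDs, and the beta values are all made up for illustration, and the mean-based imputation is a crude stand-in.

```r
# Hypothetical helper: align a named vector of beta values to the probe IDs
# a model expects. Probes absent from the input are set to `fill`.
align_to_probes <- function(betas, target_probes, fill = NA_real_) {
  out <- setNames(rep(fill, length(target_probes)), target_probes)
  shared <- intersect(names(betas), target_probes)
  out[shared] <- betas[shared]
  out
}

# Made-up beta values measured on a "source" platform
betas_src <- c(cg0001 = 0.10, cg0002 = 0.85, cg0004 = 0.40)
# Made-up probe set expected by a model
model_probes <- c("cg0001", "cg0002", "cg0003")

aligned <- align_to_probes(betas_src, model_probes)

# Crude imputation: replace missing probes with the mean of the observed ones
aligned_imputed <- aligned
aligned_imputed[is.na(aligned_imputed)] <- mean(aligned, na.rm = TRUE)
print(aligned_imputed)
```

Probes that exist only on the source platform (cg0004 here) are dropped, and probes that exist only on the target platform must be imputed before a model can use them.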
model = readRDS(url("https://github.com/zhou-lab/CytoMethIC_models/raw/refs/heads/main/models/Age_HM450_20240504.rds"))
cmi_predict(betasHM450, model)
## $age
## [1] 84.13913
##
## $x
## [1] 3.054244
The snippet below demonstrates the cmi_predict function predicting the race of the patient.
model = ExperimentHub()[["EH8421"]] # the same as "https://github.com/zhou-lab/CytoMethIC_models/raw/refs/heads/main/models/Race5_rfcTCGA_InfHum3.rds"
cmi_predict(betasHM450, model)
## $response
## [1] "WHITE"
##
## $prob
## WHITE
## 0.886
## leukocyte fractions
model = readRDS(url("https://github.com/zhou-lab/CytoMethIC_models/raw/refs/heads/main/models/LeukoFrac_HM450_20240614.rds"))
cmi_predict(betasHM450, model)
## $leukoFrac
## [1] 0.1960776
Cell-type deconvolution using Loyfer et al. references:
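library(tibble)   # enframe()
library(dplyr)    # filter()
library(ggplot2)  # ggplot() and related functions
model = readRDS(url("https://github.com/zhou-lab/CytoMethIC_models/raw/refs/heads/main/models/TissueComp_HM450_20240827.rds"))
cell_comps = cmi_predict(betasHM450, model)
## convert the predicted fractions to a two-column table
cell_comps = enframe(cell_comps$frac, name="cell_type", value="frac")
## keep only cell types with a non-zero fraction
cell_comps = cell_comps |> filter(frac>0)
## pie chart of the estimated composition
ggplot(cell_comps, aes(x="", y=frac, fill=cell_type)) +
    geom_bar(stat="identity", width=1) +
    coord_polar(theta="y") +
    theme_void() + labs(fill = "Cell Type") +
    theme(plot.title = element_text(hjust = 0.5))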
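The reshaping steps before the plot can also be done without tibble or dplyr. The sketch below assumes, as the enframe() call above suggests, that the frac element is a named numeric vector. The cell type names and values are made up and are not actual model output.

```r
# Made-up fractions standing in for the frac element of the prediction
frac <- c(Pancreas = 0.62, Blood = 0.20, Endothelium = 0.18, Neuron = 0)

# Base R counterpart of enframe(): one row per cell type
cell_comps <- data.frame(cell_type = names(frac), frac = unname(frac))
# Base R counterpart of filter(frac > 0): drop absent cell types
cell_comps <- cell_comps[cell_comps$frac > 0, ]
# Sort by decreasing fraction for a quick text summary
cell_comps <- cell_comps[order(-cell_comps$frac), ]
print(cell_comps)
```

A sorted table like this is a convenient check that the fractions sum to roughly 1 before plotting.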
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] sesame_1.25.2 sesameData_1.25.0 CytoMethIC_1.3.2
## [4] ExperimentHub_2.15.0 AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [7] dbplyr_2.5.0 BiocGenerics_0.53.3 generics_0.1.3
## [10] knitr_1.49
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] blob_1.2.4 filelock_1.0.3
## [5] Biostrings_2.75.3 fastmap_1.2.0
## [7] digest_0.6.37 mime_0.12
## [9] lifecycle_1.0.4 KEGGREST_1.47.0
## [11] RSQLite_2.3.9 magrittr_2.0.3
## [13] compiler_4.5.0 rlang_1.1.4
## [15] sass_0.4.9 tools_4.5.0
## [17] yaml_2.3.10 S4Arrays_1.7.1
## [19] bit_4.5.0.1 curl_6.0.1
## [21] DelayedArray_0.33.3 plyr_1.8.9
## [23] RColorBrewer_1.1-3 abind_1.4-8
## [25] BiocParallel_1.41.0 withr_3.0.2
## [27] purrr_1.0.2 grid_4.5.0
## [29] stats4_4.5.0 preprocessCore_1.69.0
## [31] wheatmap_0.2.0 colorspace_2.1-1
## [33] ggplot2_3.5.1 scales_1.3.0
## [35] SummarizedExperiment_1.37.0 cli_3.6.3
## [37] rmarkdown_2.29 crayon_1.5.3
## [39] reshape2_1.4.4 httr_1.4.7
## [41] tzdb_0.4.0 DBI_1.2.3
## [43] cachem_1.1.0 stringr_1.5.1
## [45] parallel_4.5.0 AnnotationDbi_1.69.0
## [47] BiocManager_1.30.25 XVector_0.47.1
## [49] matrixStats_1.4.1 vctrs_0.6.5
## [51] Matrix_1.7-1 jsonlite_1.8.9
## [53] IRanges_2.41.2 hms_1.1.3
## [55] S4Vectors_0.45.2 bit64_4.5.2
## [57] jquerylib_0.1.4 glue_1.8.0
## [59] codetools_0.2-20 stringi_1.8.4
## [61] gtable_0.3.6 BiocVersion_3.21.1
## [63] GenomeInfoDb_1.43.2 GenomicRanges_1.59.1
## [65] UCSC.utils_1.3.0 munsell_0.5.1
## [67] tibble_3.2.1 pillar_1.10.0
## [69] rappdirs_0.3.3 htmltools_0.5.8.1
## [71] randomForest_4.7-1.2 GenomeInfoDbData_1.2.13
## [73] R6_2.5.1 evaluate_1.0.1
## [75] Biobase_2.67.0 lattice_0.22-6
## [77] readr_2.1.5 png_0.1-8
## [79] memoise_2.0.1 BiocStyle_2.35.0
## [81] bslib_0.8.0 Rcpp_1.0.13-1
## [83] SparseArray_1.7.2 xfun_0.49
## [85] MatrixGenerics_1.19.0 pkgconfig_2.0.3