To use CytoMethIC, install the package from Bioconductor. If the BiocManager package is not yet installed, install it first.
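The installation steps described above can be sketched as follows. This assumes the standard BiocManager workflow applies and that the package is published under the name `CytoMethIC`; adjust if your Bioconductor setup differs.

```r
# Install BiocManager from CRAN only if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install CytoMethIC from Bioconductor.
BiocManager::install("CytoMethIC")
```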
CytoMethIC is a package that provides model data and functions for applying machine learning models that classify cancer type and phenotype from a sample's DNA methylome. The primary motivation for developing this package is to abstract away the granular, accessibility-limiting code required to use machine learning models in R. The package provides this abstraction for randomForest, e1071 support vector machine, Extreme Gradient Boosting, and TensorFlow models. It is paired with an ExperimentHub component, which contains our lab's models for epigenetic cancer classification and phenotype prediction. These include CNS tumor classification, pan-cancer classification, race prediction, cell of origin classification, and subtype classification models.
For these examples, we’ll be using models from ExperimentHub and a sample from sesameData.
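As a rough sketch of how the ExperimentHub models might be located before running the examples, the snippet below searches the hub for CytoMethIC records. The search term `"CytoMethIC"` and the variable names are assumptions made for illustration; `query()` is the generic search function provided by AnnotationHub, which ExperimentHub builds on. The first call needs network access.

```r
library(ExperimentHub)

# Connect to the hub (metadata is downloaded and cached on first use).
eh <- ExperimentHub()

# Search the hub metadata; the search term is an assumption.
cmi_records <- query(eh, "CytoMethIC")

# Printing the result lists matching records with their EH identifiers.
cmi_records
```

An individual record can then be retrieved by its identifier with `eh[["..."]]`, which is the form used in the examples that follow.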
ModelID | PredictionLabelDescription |
---|---|
rfc_cancertype_TCGA33 | TCGA cancer types (N=33) |
svm_cancertype_TCGA33 | TCGA cancer types (N=33) |
xgb_cancertype_TCGA33 | TCGA cancer types (N=33) |
mlp_cancertype_TCGA33 | TCGA cancer types (N=33) |
rfc_cancertype_CNS66 | CNS Tumor Class (N=66) |
svm_cancertype_CNS66 | CNS Tumor Class (N=66) |
xgb_cancertype_CNS66 | CNS Tumor Class (N=66) |
mlp_cancertype_CNS66 | CNS Tumor Class (N=66) |
The snippet below demonstrates the model abstraction on random forest and support vector models from the CytoMethIC collection on ExperimentHub.
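library(CytoMethIC); library(ExperimentHub)
library(sesame); library(sesameData)
## impute missing beta values before prediction
betas = imputeBetas(sesameDataGet("HM450.1.TCGA.PAAD")$betas)
## fetch the model from ExperimentHub by ID and predict
cmi_predict(betas, ExperimentHub()[["EH8395"]])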
## $response
## [1] "PAAD"
##
## $prob
## PAAD
## 0.852
## $response
## [1] "PAAD"
##
## $prob
## betas[, attr(model$terms, "term.labels")]
## 0.9864795
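In the second result above, `$prob` is labeled with an internal expression rather than the predicted class. A minimal base-R sketch of one way to tidy this is shown below. The result is rebuilt by hand from the printed output, so no model is loaded, and the helper `label_prob` is hypothetical and not part of CytoMethIC.

```r
# Result rebuilt by hand from the second printed output above.
svm_result <- list(
  response = "PAAD",
  prob = c("betas[, attr(model$terms, \"term.labels\")]" = 0.9864795)
)

# Hypothetical helper: label the probability with the predicted class.
label_prob <- function(result) {
  names(result$prob) <- result$response
  result
}

tidy_result <- label_prob(svm_result)
tidy_result$prob
```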
The cmi_predict function can also predict the subtype of the cancer:
## $response
## [1] "GI.CIN"
##
## $prob
## GI.CIN
## 0.462
The cmi_predict function can likewise predict the race of the patient:
## $response
## [1] "WHITE"
##
## $prob
## WHITE
## 0.886
Finally, the cmi_predict function can predict the cell of origin of the cancer:
## $response
## [1] "C20:Mixed (Stromal/Immune)"
##
## $prob
## C20:Mixed (Stromal/Immune)
## 0.768
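Because every prediction shares the same `$response` and `$prob` layout, results from several models can be collected into one table. The sketch below does this in base R for the three outputs above, with values copied by hand from the printed results. The task names and the 0.5 cutoff for flagging low-confidence calls are illustrative assumptions, not CytoMethIC defaults.

```r
# Results copied by hand from the printed outputs above.
results <- list(
  subtype        = list(response = "GI.CIN", prob = c(GI.CIN = 0.462)),
  race           = list(response = "WHITE", prob = c(WHITE = 0.886)),
  cell_of_origin = list(response = "C20:Mixed (Stromal/Immune)",
                        prob = c("C20:Mixed (Stromal/Immune)" = 0.768))
)

# One row per task: predicted label and its reported probability.
summary_table <- data.frame(
  task      = names(results),
  response  = vapply(results, function(r) r$response, character(1)),
  prob      = vapply(results, function(r) unname(r$prob), numeric(1)),
  row.names = NULL
)

# Hypothetical cutoff for flagging low-confidence calls.
summary_table$low_confidence <- summary_table$prob < 0.5
summary_table
```

With these values only the subtype call falls below the cutoff, which is the kind of case where a second look at the sample may be worthwhile.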
In addition to ExperimentHub models, this package also supports models loaded from GitHub URLs. Note that https://github.com/zhou-lab/CytoMethIC_models will be the most frequently updated public repository of our lab's classifiers.
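## load a model directly from the GitHub repository
base_url = "https://github.com/zhou-lab/CytoMethIC_models/raw/main/models"
cmi_model = readRDS(url(sprintf("%s/Race3_rfcTCGA_InfHum3.rds", base_url)))
## compute beta values for an EPICv2 sample, without probe masking
betas = openSesame(sesameDataGet("EPICv2.8.SigDF")[[1]], mask=FALSE)
## lift the EPICv2 probes over to the HM450 platform
betas = mLiftOver(betas, "HM450")
cmi_predict(betas, cmi_model)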
## $response
## [1] "WHITE"
##
## $prob
## WHITE
## 0.69
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.49 sesame_1.25.1 sesameData_1.25.0
## [4] CytoMethIC_1.3.1 ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [7] BiocFileCache_2.15.0 dbplyr_2.5.0 BiocGenerics_0.53.3
## [10] generics_0.1.3
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] blob_1.2.4 filelock_1.0.3
## [5] Biostrings_2.75.3 fastmap_1.2.0
## [7] digest_0.6.37 mime_0.12
## [9] lifecycle_1.0.4 KEGGREST_1.47.0
## [11] RSQLite_2.3.9 magrittr_2.0.3
## [13] compiler_4.5.0 rlang_1.1.4
## [15] sass_0.4.9 tools_4.5.0
## [17] yaml_2.3.10 S4Arrays_1.7.1
## [19] bit_4.5.0.1 curl_6.0.1
## [21] DelayedArray_0.33.3 plyr_1.8.9
## [23] RColorBrewer_1.1-3 abind_1.4-8
## [25] BiocParallel_1.41.0 withr_3.0.2
## [27] purrr_1.0.2 grid_4.5.0
## [29] stats4_4.5.0 preprocessCore_1.69.0
## [31] wheatmap_0.2.0 e1071_1.7-16
## [33] colorspace_2.1-1 ggplot2_3.5.1
## [35] MASS_7.3-61 scales_1.3.0
## [37] SummarizedExperiment_1.37.0 cli_3.6.3
## [39] rmarkdown_2.29 crayon_1.5.3
## [41] reshape2_1.4.4 httr_1.4.7
## [43] tzdb_0.4.0 proxy_0.4-27
## [45] DBI_1.2.3 cachem_1.1.0
## [47] stringr_1.5.1 zlibbioc_1.53.0
## [49] parallel_4.5.0 AnnotationDbi_1.69.0
## [51] BiocManager_1.30.25 XVector_0.47.0
## [53] matrixStats_1.4.1 vctrs_0.6.5
## [55] Matrix_1.7-1 jsonlite_1.8.9
## [57] IRanges_2.41.2 hms_1.1.3
## [59] S4Vectors_0.45.2 bit64_4.5.2
## [61] jquerylib_0.1.4 glue_1.8.0
## [63] codetools_0.2-20 stringi_1.8.4
## [65] gtable_0.3.6 BiocVersion_3.21.1
## [67] GenomeInfoDb_1.43.2 GenomicRanges_1.59.1
## [69] UCSC.utils_1.3.0 munsell_0.5.1
## [71] tibble_3.2.1 pillar_1.10.0
## [73] rappdirs_0.3.3 htmltools_0.5.8.1
## [75] randomForest_4.7-1.2 GenomeInfoDbData_1.2.13
## [77] R6_2.5.1 evaluate_1.0.1
## [79] Biobase_2.67.0 lattice_0.22-6
## [81] readr_2.1.5 png_0.1-8
## [83] memoise_2.0.1 BiocStyle_2.35.0
## [85] bslib_0.8.0 class_7.3-22
## [87] Rcpp_1.0.13-1 SparseArray_1.7.2
## [89] xfun_0.49 MatrixGenerics_1.19.0
## [91] pkgconfig_2.0.3