Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. One critical unmet challenge is that molecular disease subtypes characterized by relevant clinical differences, such as survival, are difficult to differentiate. With the advancement of multi-omics technologies, subtyping methods have shifted toward data integration in order to differentiate among subtypes from a holistic perspective that takes into consideration phenomena at multiple levels. However, these integrative methods are still limited by their statistical assumption and their sensitivity to noise. In addition, they are unable to predict the risk scores of patients using multi-omics data.
To address this problem, we introduce Subtyping via Consensus Factor Analysis (SCFA), a novel method for cancer subtyping and risk prediction using consensus factor analysis. SCFA follows a three-stage hierarchical process to ensure the robustness of the discovered subtypes. First, the method uses an autoencoder to filter out genes with an insignificant contribution in characterizing each patient. Second, it applies a modified factor analysis to generate a collection of factor representations of the high-dimensional multi-omics data. Finally, it utilizes a consensus ensemble to find subtypes that are shared across all factor representations.
To install SCFA
, you need to install the R pacakge from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("SCFA")
After that, install tensorflow
and keras
in python using keras
R package. You may also need to install Anaconda if you does not have it. For more information about installation of keras, please visit https://keras.rstudio.com/
if(is(try(reticulate::conda_version()), "try-error")) reticulate::install_miniconda(force = T)
keras::install_keras(method = "conda", tensorflow = "1.10.0")
Load the example data GBM
. GBM is the Glioblastoma cancer dataset.
#Load required library
library(SCFA)
library(survival)
# Load example data (GBM dataset), for other dataset, download the rds file from the Data folder at https://bioinformatics.cse.unr.edu/software/scfa/Data/ and load the rds object
data("GBM")
# List of one matrix of microRNA data, other examples would have 3 matrices of 3 data types
dataList <- GBM$data
# Survival information
survival <- GBM$survival
We can use the main funtion SCFA
to generate subtypes from multi-omics data. The input of this function is a list of matrices from different data types. Each matrix has rows as samples and cells as features. The output of this function is subtype assignment for each patient. We can perform survival analysis to determine the significance in survival differences between discovered subtypes.
# Generating subtyping result
set.seed(1)
subtype <- SCFA(dataList, seed = 1, ncores = 4L)
## The number of clusters before voting is: 4
## The optimal number of clusters for ensemble clustering is: 4
## The number of clusters before voting is: 4
## The optimal number of clusters for ensemble clustering is: 4
# Perform survival analysis on the result
coxFit <- coxph(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = survival, ties="exact")
coxP <- round(summary(coxFit)$sctest[3],digits = 20)
print(coxP)
## pvalue
## 0.01235006
We can use the function SCFA.class
to predict risk score of patients using available survival information from training data. We need to provide the function with training data with survival information, and testing data. The output is the risk score of each patient. Patient with higher risk scores have higher probablity to experience event before the other patient. Concordance index is use to confirm the correlation between predicted risk scores and survival information.
# Split data to train and test
set.seed(1)
idx <- sample.int(nrow(dataList[[1]]), round(nrow(dataList[[1]])/2) )
survival$Survival <- survival$Survival - min(survival$Survival) + 1 # Survival time must be positive
trainList <- lapply(dataList, function(x) x[idx, ] )
trainSurvival <- Surv(time = survival[idx,]$Survival, event = survival[idx,]$Death)
testList <- lapply(dataList, function(x) x[-idx, ] )
testSurvival <- Surv(time = survival[-idx,]$Survival, event = survival[-idx,]$Death)
# Perform risk prediction
result <- SCFA.class(trainList, trainSurvival, testList, seed = 1, ncores = 4L)
# Validation using concordance index
c.index <- survival::concordance(coxph(testSurvival ~ result))$concordance
print(c.index)
## [1] 0.5783241