--- title: "ClustAll User's Guide" author: "Asier Ortega-Legarreta and Sara Palomino-Echeverria" package: ClustAll output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{ClustALL User's Guide} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis', include=FALSE} BiocStyle::markdown() knitr::opts_chunk$set( collapse = TRUE, cache = FALSE, comment = "#>" ) ``` # Introduction ClustAll is an R package designed for patient stratification in complex diseases. It addresses common challenges encountered in clinical data analysis and provides a versatile framework for identifying patient subgroups. Patient stratification is essential in biomedical research for understanding disease heterogeneity, identifying prognostic factors, and guiding personalized treatment strategies. The ClustAll underlying concept is that a robust stratification should be reproducible through various clustering methods. ClustAll employs diverse distance metrics (Correlation-based distance and Gower distance) and clustering methods (K-Means, K-Medoids, and H-Clust). ## ClustAll key features: - **Handles Diverse Data Types**, including missing values, mixed data, and correlated variables. - **Provides Multiple Stratification Solutions**, enabling exploration of different clustering algorithms and parameters. - **Robustness Analysis**, to identify stable and reproducible clusters. - **Validation** , for assessing the reliability of clustering results using clinical phenotypes (ground truth) if available. - **Visualization** functions for interpreting clustering results and comparing different stratifications. ## Interpreting ClustAll Stratification Output The names of ClustAll stratification outputs consist of a letter followed by a number, such as *cuts_a\_9*. The letter denotes the combination of distance metric and clustering method utilized to generate the particular stratification, while the number corresponds to the embedding derived from the depth at which the dendrogram with grouped variables was cut. ```{r, echo=FALSE} table <- data.frame(Nomenclature=c("a","b","c","d"), `Distance Metric`=rep(c("Correlation", "Gower"), each=2), `Clustering Method`=c("K-means", "Hierarchical Clustering", "K-medoids", "Hierarchical-Clustering")) ``` ```{r, echo=FALSE} knitr::kable(table, align = c("l", "l", "l"), caption = "ClustAll Stratification Output Interpretation") ``` ![Schematic representation of the ClustAll pipeline](./img/workflow.jpg) # Installation ClustAll is developed using S4 object-oriented programming, and requires R (\>=4.2.0). It utilizes other R packages that are currently available from CRAN and Bioconductor. You can find the package repository on GitHub, [ClustAll](https://github.com/translationalBioinformaticsUnit/ClustAll). The ClustAll package can be downloaded and installed by running the following code within R: ```{r installation, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager ") BiocManager::install("ClustAll") ``` After installation, load the ClustAll package: ```{r load package} library(ClustAll) ``` # ClustAll Application Example We will use the data provided in the data package [ClustAll](https://github.com/translationalBioinformaticsUnit/ClustAll) to demonstrate how to stratify a population using clinical data. **Breast Cancer Wisconsin (Diagnostic)** ClustAll includes a real dataset of breast cancer, described at [doi: 10.24432/C5DW2B](https://doi.org/10.24432/C5DW2B). This dataset comprises two types of features ---categorical and numerical--- derived from a digitized image of a fine needle aspirate (FNA) of a breast mass from 659 patients. Each patient is characterized by 30 features (10x3) and belongs to one of two target classes: 'malignant' or 'benign'. To showcase ClustAll's when dealing with missing data, a modification with random missing values was applied to the dataset, demonstrating the package's resilience in handling missing data. The breast cancer dataset includes the following features: a) **radius:** Mean of distances from the center to points on the perimeter. b) **texture:** Standard deviation of gray-scale values. c) **perimeter:** Perimeter of the breast mass affected by the cancer. d) **area:** Area of the breast mass affected by the cancer e) **smoothness:** Local variation in radius lengths. f) **compactness:** (Perimeter\^2 / Area) - 1.0. g) **concavity:** Severity of concave portions of the contour. h) **concave points:** Number of concave portions of the contour. i) **symmetry:** Degree of symmetry in the shape and structure of the breast mass, with higher values indicating greater symmetry and lower values indicating asymmetry. j) **fractal dimension:** "Coastline approximation" - 1. The dataset also includes the patient ID and diagnosis (M = malignant, B = benign). We denote the data set as BreastCancerWisconsin (wdbc). ## Get data from example We load the breast cancer dataset, which is available in [Kaggle](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data). The data set can be loaded with the following code: ```{r load data} data("BreastCancerWisconsin", package = "ClustAll") data_use <- subset(wdbc, select=-ID) ``` An initial exploration reveals the absence of missing values. The dataset comprises 30 numerical features and one categorical feature (the ground truth). As the initial data does not contain missing values we will apply the ClustAll workflow accordingly. ```{r check missing values} sum(is.na(data_use)) dim(data_use) ``` ## Create the ClustAll object First, the ClustAllObject is created and stored. In this step, we indicate if there is a feature that contains the ground truth in the argument `colValidation`. This feature is not consider to compute the stratification. In this specific case, it corresponds to "Diagnosis". ```{r create ClustAllObject} obj_noNA <- createClustAll(data = data_use, nImputation = NULL, dataImputed = NULL, colValidation = "Diagnosis") ``` ## Execute the ClustAll algorithm Next, we apply the ClustAll algorithm. The output is stored in a ClustAllObject, which contains the clustering results. ```{r run ClustAll, results=FALSE} obj_noNA1 <- runClustAll(Object = obj_noNA, threads = 2, simplify = FALSE) ``` We show the object: ```{r check result} obj_noNA1 ``` ## Represent the Jaccard Distance between population-robust stratifications To display population-robust stratifications (\>85% bootstrapping stability), we call `plotJaccard`, using the ClustAllObject as input. In addition, we specify the threshold to consider similar a pair of stratifications in the `stratification_similarity` argument. In this specific case, a similarity of 0.9 reveals three different groups of alternatives for stratifying the population, indicated by the the red rectangles: ```{r plot1, fig.height=12, fig.width=12, warning=FALSE, fig.cap = "Correlation matrix heatmap. It depcits the similarity between population-robust stratifications. The discontinuous red rectangles highlight alternative stratifications solutions based on those stratifications that exhibit certain level of similarity. The heatmap row annotation describes the combination of parameters —distance metric, clustering method, and embedding number— from which each stratification is derived."} plotJACCARD(Object = obj_noNA1, stratification_similarity = 0.9, paint = TRUE) ``` ## Retrieve stratification representatives We can displayed the centroids (a representative) from each group of alternative stratification solutions (highlighted in red squares in the previous step) using `resStratification`. Each representative stratification illustrates the number of clusters and the patients belonging to each cluster. In this case, the alternative stratifications have been computed with the following specifications: - cuts_a\_28: This stratification was generated using Embedding 28 with the correlation distance metric and the kmeans clustering algorithm. It consists of two clusters, with 183 and 386 patients, respectively. - cuts_c\_9: This stratification was generated using Embedding 9 with the gower distance metric and the kmedoids clustering algorithm. It consists of two clusters, with 197 and 372 patients, respectively. - cuts_c\_4: This stratification was generated using Embedding 4 with the Gower distance metric and the kmedoids algorithm. It consists of two clusters, with 199 and 370 patients, respectively. ```{r explore representative clusters} resStratification(Object = obj_noNA1, population = 0.05, stratification_similarity = 0.9, all = FALSE) ``` ## Generate Sankey diagrams comparing pairs of stratifications, or a stratification with the ground truth In order to compare two sets of representative stratifications, `plotSankey` was implemented. The ClustAllObject is used as input. In addition, we specify the pairs of stratifications we want to compare in the `clusters` argument. In this case, the first Sankey plot illustrates patient transitions between two sets of representative stratifications (cuts_c\_9 and cuts_a\_28), revealing the flow and distribution of patients across the clusters. The second Sankey plot illustrates patient transitions between a representative stratifications (cuts_a\_28) and the ground truth, revealing the flow and distribution of patients across the clusters. ```{r plot2, fig.cap = "Flow and distribution of patients across clusters. Patient transitions between representative stratifications cuts_c_3 and cuts_a_9.", echo = FALSE} plotSANKEY(Object = obj_noNA1, clusters = c("cuts_c_9","cuts_a_28"), validationData = FALSE) ``` ```{r plot3, fig.cap = "Flow and distribution of patients across clusters. Patient transitions between representative stratifications cuts_c_9 and the ground truth.", echo = FALSE} plotSANKEY(Object = obj_noNA1, clusters = c("cuts_c_9"), validationData = TRUE) ``` ## Retrieve the original dataset with the selected ClustAll stratification(s) The stratification representatives can be added to the initial dataset to facilitate further exploration. In this case, we add the three stratification representatives to the dataset. For simplicity, we show the two top rows of the dataset: ```{r export stratifications} df <- cluster2data(Object = obj_noNA1, stratificationName = c("cuts_c_9","cuts_a_28","cuts_c_4")) head(df,3) ``` ## Assess the results the sensitivity and specifity of the selected ClustAll stratifications against ground truth (if available) To evaluate the performance of the selected ClustAll stratifications against ground truth, sensitivity and specificity can be assessed using `validateStratification`. Higher values indicate greater precision in the stratification process. In this particular case, our method retrieves three stratification representatives with a sensitivity and specificity exceeding 80% and 90%, respectively, despite being computed using different methods. These results underscore the notion that a robust stratification should be consistent across diverse clustering methods. ```{r validate stratifications} # STRATIFICATION 1 validateStratification(obj_noNA1, "cuts_a_28") # STRATIFICATION 2 validateStratification(obj_noNA1, "cuts_c_9") # STRATIFICATION 3 validateStratification(obj_noNA1, "cuts_c_4") ``` # Session Info ```{r session info} sessionInfo() ```