--- title: "Comparing samples with SETA" date: "`r BiocStyle::doc_date()`" author: - name: Kyle Kimler affiliation: - &CDN Cell Discovery Network - &BCH Boston Children's Hospital - &UVIC-UCC University of Vic - Central University of Catalonia email: kkimler@broadinstitute.org - name: Marc Elosua-Bayes affiliation: - &CDN Cell Discovery Network - &BCH Boston Children's Hospital email: marc.elosuabayes@childrens.harvard.edu package: "`r BiocStyle::pkg_ver('SETA')`" vignette: > %\VignetteIndexEntry{Comparing samples with SETA} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} output: BiocStyle::html_document editor_options: markdown: wrap: 80 chunk_output_type: console --- ```{=html} ``` ## Introduction With compositional data analysis (CoDA), we can make sample-level comparisons of scRNAseq data based on the relative abundance celltypes that compose them. In this vignette we show how to: - **Calculate compositional distances** between samples - **Perform statistical tests** with a focus on rank-based methods, which are amenable to relative data - **Correlate celltype compositions and metadata** - **Create predictive models** using popular statistical modeling methods This example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included. *This vignette is modular; users can update or replace the dataset-specific settings* *(e.g., metadata column names, reference cell type) as appropriate for their own data.* ## Installation ```{r installation, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SETA") ``` ## Load libraries ```{r load libraries, message=FALSE, warning=FALSE, echo=TRUE} library(SingleCellExperiment) library(SETA) library(ggplot2) library(dplyr) library(tidyr) library(corrplot) library(caret) have_ml <- all(vapply(c("caret","glmnet"), requireNamespace, logical(1), quietly = TRUE)) if (!have_ml) { warning("Some machine learning packages are not installed. ML chunks will be skipped.") } library(TabulaMurisSenisData) ``` ## Load and prepare data ```{r load data, message=FALSE, warning=FALSE, echo = FALSE} sce <- TabulaMurisSenisDroplet(tissues = "Lung")$Lung sce <- sce[, colData(sce)$subtissue != "immune-endo-depleted"] ``` Quick data exploration: ```{r tabular data exploration} table(sce$free_annotation, sce$age) ``` ```{r palettes} # Set up a color palette for plots # 3-group distinct categoricals age_palette <- c( "1m" = "#90EE90", "3m" = "#4CBB17", "18m" = "#228B22", "21m" = "#355E3B", "30m" = "#023020") # continuous palette of similar look c_palette <- colorRampPalette(c("#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"))(100) ``` ## Extracting the Taxonomic Counts Matrix The `setaCounts` function extracts a cell-type counts matrix given that the object contains cell-level metadata. For this dataset, we want the cell-type annotations to come from Celltype and the sample identifiers from donor_id. If necessary, we reassign these columns before extraction. ```{r setaCounts} df <- data.frame(colData(sce)) df$mouse.id <- gsub("/", "", df$mouse.id) taxa_counts <- setaCounts( df, cell_type_col = "free_annotation", sample_col = "mouse.id", bc_col = "cell") taxa_counts[1:5, 1:5] ``` ### Prepare metadata Extract sample-level metadata from a dataframe. It ensures that each metadata column contains unique values per sample. If a metadata column contains non-unique values within any sample, that column is excluded from the output, and the user is notified via a warning. ```{r setaMetadata} meta_df <- setaMetadata( df, sample_col = "mouse.id", meta_cols = c("age", "sex")) meta_df[1:5, ] ``` ## Calculate distances between samples Normalize the data using Centered Log Ratio transformation (CLR) ```{r setaTransform} clr_transformed <- setaTransform(taxa_counts, method = "CLR") ``` Compute distance in the CLR space In compositional analysis, the Euclidean distance is the preferred metric once the bounding problem is solved, in other words, once we transform the data. John Aitchison, who wrote "The Statistical Analysis of Compositional Data" preferred the Euclidean distance of CLR transformed data, and this distance is now referred to as "Aitchison Distance" ```{r setaDistances} dist_df <- setaDistances(clr_transformed) # Merge metadata merged_dist <- dist_df |> left_join(meta_df, by = c("from" = "sample_id")) |> left_join(meta_df, by = c("to" = "sample_id"), suffix = c(".from", ".to")) # Create a age-age category for comparison merged_dist$age_pair <- paste(merged_dist$age.from, merged_dist$age.to, sep = "-") ``` Visualize distances ```{r viz distances, fig.width = 9, fig.height = 6} ggplot(merged_dist, aes(x = age_pair, y = distance)) + geom_boxplot(fill = "grey90") + geom_jitter(width = 0.2, color = "black") + labs(title = "Aitchison Distances Between Age Groups", x = "Age Pair", y = "Aitchison Distance") + theme_minimal(base_size = 16) + theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) ``` ## Perform wilcoxon rank-sum tests on CoDA transformed data ```{r preprocess for comparisons} clr_long <- as.data.frame(clr_transformed$counts) colnames(clr_long) <- c("sample", "Celltype", "CLR") clr_long <- clr_long |> left_join(meta_df, by = c("sample" = "sample_id")) ``` ```{r viz pairwise distances, fig.width = 15, fig.height = 10} # Apply pairwise Wilcoxon tests and plot using ggpubr ggplot(clr_long, aes(x = age, y = CLR, fill = age, color = age)) + geom_boxplot(position = position_dodge(0.8), alpha = 0.7) + geom_jitter(size = 1.5, shape = 21) + # stat_compare_means(method = "wilcox.test", # label = "p.signif", # comparisons = list(c("normal", "influenza"), # c("normal", "COVID-19"), # c("influenza", "COVID-19")), # position = position_dodge(0.8)) + facet_wrap(~ Celltype) + theme_minimal(base_size = 12) + scale_fill_manual(values = age_palette) + scale_color_manual(values = age_palette) + theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 1)) + labs(title = "CLR by Celltype and Age", x = "Age", y = "CLR-transformed Composition") ``` ## Correlate celltype compositions with metadata ```{r metadata corr} clr_df <- clr_transformed$counts |> data.frame() |> # as.data.frame converts it to long form pivot_wider(names_from='Var2', values_from = "Freq") |> rename(`mouse.id` = Var1) clr_metadata <- clr_df |> left_join(meta_df, by = c("mouse.id" = "sample_id")) clr_data <- clr_metadata |> select(where(is.numeric)) # One-hot encode 'age' and clean column names oh <- model.matrix(~age - 1, data = clr_metadata) colnames(oh) <- sub("^age", "", colnames(oh)) rownames(clr_data) <- clr_metadata$mouse.id # Combine CLR data and metadata combined <- cbind(clr_data, oh) # Compute full correlation matrix full_cor_mat <- cor(combined, method = "pearson") p_mat <- cor.mtest(full_cor_mat)$p ``` Visualize correlations ```{r corr viz, fig.width = 12, fig.height = 12} corrplot(full_cor_mat, method = "circle", type = "full", addrect = 4, col = c_palette, p.mat = p_mat, sig.level = c(.001, .01, .05), insig = "label_sig", tl.cex = 0.8, pch.cex = 1.5, tl.col = "black", order = "hclust", diag = FALSE) ``` ## Use SETA transformed data to create predictive models with caret ```{r caret, warning=FALSE, eval = have_ml} set.seed(687) train_df <- clr_metadata |> select(-`mouse.id`) |> mutate(age = factor(age)) train_control <- trainControl(method = "cv", number = 5) model <- train(age ~ ., data = subset(train_df), method = "glmnet", trControl = train_control) importance <- varImp(model) ``` ```{r plot caret, fig.width = 12, fig.height = 12, eval = have_ml} plot(importance, main = "Cross-Validated Variable Importance", sub = "Caret GLMnet model" ) ``` ## Conclusion Compositional data provides an intuitive basis for making sample-level comparisons in scRNAseq data, as it provides a single sample x feature space within which to make comparisons. Let us know if you have more ideas for using CoDA! ## Session Info ```{r packages used} sessionInfo() ```