--- title: "Introduction to SETA ecological transforms and sample-level latent spaces" date: "`r BiocStyle::doc_date()`" author: - name: Kyle Kimler affiliation: - &CDN Cell Discovery Network - &BCH Boston Children's Hospital - &UVIC-UCC University of Vic - Central University of Catalonia email: kkimler@broadinstitute.org - name: Marc Elosua-Bayes affiliation: - &CDN Cell Discovery Network - &BCH Boston Children's Hospital email: marc.elosuabayes@childrens.harvard.edu package: "`r BiocStyle::pkg_ver('SETA')`" vignette: > %\VignetteIndexEntry{Introduction to SETA ecological transforms and sample-level latent spaces} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} output: BiocStyle::html_document editor_options: markdown: wrap: 80 --- ```{=html} ``` ## Introduction ### Why use SETA? `SETA` (Single Cell Ecological Taxonomic Analysis) performs compositional analysis of single-cell data by combining established compositional data analysis (CoDA) methods with ecological principles. While other excellent Bioconductor packages address cell-type compositions, `SETA` is uniquely designed to teach and apply a variety of CoDA methods. ### Compositional Analysis of Single-Cell RNA-seq Data Single-cell RNA-seq experiments produce data where cell-type abundances are inherently compositional - the proportions of different cell types sum to 1 (or 100%). This compositional constraint means that changes in one cell type affect all others, and standard statistical methods can produce misleading results. `SETA` applies well-established compositional data analysis (CoDA) methods from ecology to properly handle these constraints. ### What SETA Does Well Several Bioconductor packages already address single-cell compositional analysis, each with strong statistical background, including `DCATS`, `speckle`, `muscat` and `sccomp`. While these packages, especially sccomp, provides the statistical rigor for complex analyses, `SETA` aims to open up the entire compositional analysis workflow to users who want to understand and apply these methods step-by-step. `SETA` focuses on simpler models that respect the compositional constraint. **SETA's unique strengths:** - **Explicit distance calculations**: Aitchison distances and compositional geometry - **Ecological metrics**: Balance transforms, diversity indices, and ecological framing - **Comprehensive transforms**: CLR, ALR, ILR, and user-defined balance transforms - **Educational focus**: separating each step of the workflow into separate functions, `SETA` aims to make the workflow transparent and easier to explain. ### Why These Steps Should Be Executed The workflow presented in this vignette is essential because: 1. **Proper Data Handling**: Cell-type counts must be treated as compositional data to avoid spurious correlations and misleading statistical results 2. **Appropriate Transforms**: Log-ratio transforms (CLR, ALR, ILR) maintain compositional properties while enabling standard statistical analysis 3. **Ecological Insights**: Applying ecological methods provides novel perspectives on cell-type community structure and diversity 4. **Educational Value**: `SETA` opens up the entire compositional analysis workflow to users, teaching the mathematical foundations rather than hiding them in black-box functions 5. **Unique Capabilities**: `SETA` provides explicit distance calculations, ecological metrics, and comprehensive compositional transforms in a single, educational framework ### Package Overview `SETA` provides a set of functions for compositional analysis of single-cell RNA-seq data. In this vignette we show how to: - **Extract a taxonomic counts matrix** from various objects (e.g. a `Seurat` object) - **Apply compositional transforms** such as `CLR` and `ALR` - **Compute a latent space** using `PCA` (with options for `PCoA` or `NMDS`) This example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included. *This vignette is modular; users can update or replace the dataset-specific settings* *(e.g., metadata column names, reference cell type) as appropriate for their own data.* ## Installation ```{r installation, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SETA") ``` ## Loading the Data ```{r setup, message=FALSE, warning=FALSE, echo=TRUE} library(SingleCellExperiment) library(SETA) library(ggplot2) library(dplyr) library(TabulaMurisSenisData) library(reshape2) # library(tidytext) ``` ## Load and prepare data ```{r load data, message=FALSE, warning=FALSE, echo=TRUE} sce <- TabulaMurisSenisDroplet(tissues = "Lung")$Lung # Two samples in this dataset have had certain celltype lineages depleted. # For compositional comparisons, we remove these for simplicity. sce <- sce[, colData(sce)$subtissue != "immune-endo-depleted"] ``` Quick data exploration: ```{r tabular data exploration} table(sce$free_annotation, sce$age) table(sce$free_annotation, sce$sex) ``` ```{r palettes} # Set up a color palette for plots # 3-group distinct categoricals age_palette <- c( "1m" = "#90EE90", "3m" = "#4CBB17", "18m" = "#228B22", "21m" = "#355E3B", "30m" = "#023020") ``` ## Extracting the Taxonomic Counts Matrix The first step in compositional analysis is to extract a cell-type counts matrix from single-cell objects. The `setaCounts()` function converts cell-level metadata into a sample-by-cell-type count matrix, where each row represents a sample and each column represents a cell type. This transformation is crucial because compositional analysis requires sample-level data, not individual cell data. For this dataset, we want the cell-type annotations to come from the `free_annotation` column and the sample identifiers from `mouse.id`. The function aggregates individual cells by sample and cell type, creating a counts matrix suitable for compositional analysis. If necessary, we reassign these columns before extraction to match the expected column names. ```{r data wrangling, message=FALSE, warning=FALSE, echo=TRUE} # Extract the taxonomic counts matrix using custom metadata column names # (Users can specify these column names according to their dataset.) df <- data.frame(colData(sce)) # Fix special characters in mouse.id before working with the data df$mouse.id <- gsub("/", "", df$mouse.id) taxa_counts <- setaCounts( df, cell_type_col = "free_annotation", sample_col = "mouse.id", bc_col = "cell") head(taxa_counts) ``` ## Applying Compositional Transforms Once we have our cell-type counts matrix, we must apply appropriate compositional transforms before statistical analysis. Raw count data cannot be used directly because it violates the compositional constraint (proportions must sum to 1). Log-ratio transforms solve this problem by converting the data to a space where standard statistical methods can be applied. Below we demonstrate four different compositional transforms, each with specific advantages: - **Centered Log-Ratio (CLR)**: Centers log-transformed data around the geometric mean, useful when no reference cell type is available - **Additive Log-Ratio (ALR)**: Uses a specific cell type as reference, appropriate when one cell type remains stable across conditions - **Isometric Log-Ratio (ILR)**: Projects data onto orthonormal basis, reducing dimensionality by one while maintaining all information - **Simple transforms**: Percentages and log-CPM for comparison and visualization CLR Transformation ```{r transformation, echo=TRUE} # CLR Transformation with a pseudocount of 1 (default) clr_out <- setaCLR(taxa_counts, pseudocount = 1) print(clr_out$counts[1:5, 1:5]) ``` For the ALR transform we need to choose a reference cell type whose numbers remain stable across conditions pre-sampling. In an ideal scenario this could a spike-in of a known number of cells. In this example, we assume that one of the reference cell types are "NK cells". If it does not exist in your dataset, change the reference accordingly or use CLR. ```{r ALR} # ALR Transformation: using "NK cell" as the reference cell type if ("Natural Killer" %in% colnames(taxa_counts)) { alr_out <- setaALR(taxa_counts, ref = "Natural Killer", pseudocount = 1) print(alr_out$counts[1:5, 1:5]) } else { message("Reference 'NK cell' not found in taxa_counts. Please choose an available cell type.") } ``` ILR Transform using Helmert basis (boxcox_p = 0 for standard ILR) This is the method used by Cacoa (Viktor Petukhov & Kharchenko Lab) ```{r ILR} ilr_out <- setaILR(taxa_counts, boxcox_p = 0, pseudocount = 1) print(ilr_out$counts[1:5, 1:5]) ``` Simple Percentage Transform ```{r percent} pct_out <- setaPercent(taxa_counts) print(pct_out$counts[1:5, 1:5]) ``` LogCPM (counts per 10k) Transform ```{r logCPM} lcpm_out <- setaLogCPM(taxa_counts, pseudocount = 1) print(lcpm_out$counts[1:5, 1:5]) ``` ## Latent Space Analysis After applying compositional transforms, we can perform dimensionality reduction to visualize and analyze the relationships between samples. The transformed data can be analyzed using standard multivariate methods, but it's important to remember that we're now working in log-ratio space, not the original count space. We demonstrate three different latent space methods, each with specific advantages: - **Principal Component Analysis (PCA)**: Linear dimensionality reduction that maximizes variance, provides loadings for interpretation - **Principal Coordinate Analysis (PCoA)**: Works with distance matrices, useful when relationships between samples are more important than individual features - **Non-metric Multidimensional Scaling (NMDS)**: Non-linear method that preserves rank-order relationships, robust to outliers ```{r latent spaces, echo=TRUE} latent_pca <- setaLatent(clr_out, method = "PCA", dims = 5) # PCA Latent Space Coordinates: head(latent_pca$latentSpace) # Variance Explained: latent_pca$varExplained ``` ## Visualization Below we show how to create basic visualizations of SETA latent spaces ### Variance Explained Plot Plot the variance explained by the first few principal components. ```{r variance explained, message=FALSE, warning=FALSE, echo=TRUE} ve_df <- data.frame( PC = factor(seq_along(latent_pca$varExplained), levels = seq_along(latent_pca$varExplained)), VarianceExplained = cumsum(latent_pca$varExplained) ) ``` ```{r variance plot, fig.width = 6, fig.height = 4} ggplot(ve_df, aes(x = PC, y = VarianceExplained)) + geom_line(color = "#2ca1db", size = 1.5, group = 1) + geom_point(color = "#f44323", size = 3) + labs(title = "Cumulative Variance Explained by Principal Components", x = "Principal Component", y = "Cumulative Variance Explained (%)") + theme_minimal() + ylim(c(0, 1)) + theme(text = element_text(size = 12)) ``` ### PCA Scatter plot We overlay the PCA latent space coordinates (using the `CLR` transform) with metadata from the `Seurat` object. In this example, points are colored by age. ```{r PCA scatter} # Extract latent space coordinates from PCA result. pca_coords <- latent_pca$latentSpace # Extract metadata meta_df <- setaMetadata( df, sample_col = "mouse.id", meta_cols = c("age", "sex")) # Merge latent coordinates with metadata pca_plot_df <- cbind(pca_coords, meta_df) ``` ```{r PCA scatter plot, fig.width = 6, fig.height = 4} # Create the scatter plot ggplot(pca_plot_df, aes(x = PC1, y = PC2, color = age)) + geom_text(aes(label = sample_id), size = 5) + scale_color_manual( values = age_palette ) + labs(title = "PCA Scatter Plot", x = "PC1", y = "PC2", color = "Age") + theme_linedraw() + xlab(sprintf("PC1 (%s%%)", signif(latent_pca$varExplained[1], 4) * 100)) + ylab(sprintf("PC2 (%s%%)", signif(latent_pca$varExplained[2], 4) * 100)) ``` ### Loadings Plot Plot the variance explained by interesting principal components. ```{r loadings, message=FALSE, warning=FALSE, echo=TRUE} loadings_df <- as.data.frame(latent_pca$loadings) loadings_df$Celltype <- rownames(loadings_df) # Melt the loadings into long format for `ggplot2` loadings_long <- melt(loadings_df, id.vars = "Celltype", variable.name = "PC", value.name = "Loading") ``` Visualize the PC loadings for each cell ```{r loadings viz, fig.width = 12, fig.height = 9} # Subset rows where PC is "PC1" or "PC2" loading_df <- loadings_long[loadings_long$PC %in% c("PC1", "PC2"), ] # Append PC to Celltype to mimic tidytext::reorder_within() loading_df$Celltype_PC <- paste(loading_df$Celltype, loading_df$PC, sep = "___") # Reorder by Loading *within each PC* loading_df$Celltype_PC <- with(loading_df, reorder(Celltype_PC, Loading)) ggplot(loading_df, aes(x = Loading, y = Celltype, label = Celltype, fill = Loading)) + geom_col() + facet_wrap(~PC, scales = "free") + # scale_y_reordered() + labs(title = "PC Loadings", x = "Loading", y = "Cell Type") + theme_linedraw() + scale_fill_viridis_c() + expand_limits(x = c(-0.2, 0.4)) ``` ## Conclusion This vignette illustrated a basic workflow using the SETA package: 1. Extracting a taxonomic counts matrix from a `Seurat` Object with `setaCounts()` 2. Applying compositional transforms (`CLR`, `ALR`, `ILR`, and simple transforms) 3. Performing latent space analysis (`PCA`) 4. Visualizing sample-level latent spaces Each step is modular, so you can easily swap in your own data, update metadata column names, or choose different parameters (e.g., pseudocount, reference cell type) to suit your analysis needs. For further details, see the package documentation and relevant literature on compositional data analysis (e.g., Aitchison, 1982). ## Session Info ```{r packages used} sessionInfo() ```