--- title: "SpaceTrooper utilities overview" author: - name: "Benedetta Banzi" - name: "Dario Righelli" date: "`r BiocStyle::doc_date()`" output: BiocStyle::html_document: toc: true toc_float: true vignette: > %\VignetteIndexEntry{Imaging-based Spatial Transcriptomics Data Quality Control with SpaceTrooper} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} # Set chunk options: suppress echo, messages, and warnings in code output knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) ``` ## Introduction `SpaceTrooper` is an `R/Bioconductor` package for Quality Control (QC) of imaging-based spatial transcriptomics and proteomics data. It provides multi-platform data harmonization, cell-level QC, and visualization utilities. The package leverages [SpatialExperiment](https://www.bioconductor.org/packages/release/bioc/html/SpatialExperiment.html) objects to support data from **CosMx**, **Xenium**, and **MERFISH** technologies. This vignette describes additional SpaceTrooper utilities that extend the standard workflow and are intended to further improve the user experience. Please, be aware that this is an appendix vignette and does not present an exhaustive analytical workflow. Therefore, users should consult [RNA](https://bioconductor.org/packages/devel/bioc/vignettes/SpaceTrooper/inst/doc/RNA_tutorial_vignette.html) and [Protein](https://bioconductor.org/packages/devel/bioc/vignettes/SpaceTrooper/inst/doc/Protein_tutorial_vignette.html) tutorials before proceeding. In addition, this vignette provides practical tips for making the most effective use of the package. It is organized into three main sections, corresponding to the typical analysis pipeline, with relevant insights for each step: - Data loading - Quality Control (QC) - Visualization ## Data loading ### SpatialExperiment object creation This section explains how to specify the inputs required by each data-reading function. Both `readCosmxSPE` and `readXeniumSPE` are wrapper functions built on the [SpatialExperimentIO](https://bioconductor.org/packages/release/bioc/html/SpatialExperimentIO.html) infrastructure. Further details on their usage are available in the corresponding function documentation, e.g. `help(readCosmxSPE)`. **COSMX** ```{r load-cosmx} library(SpaceTrooper) library(ggplot2) cosmxFolder <- system.file(file.path("extdata", "CosMx_DBKero_Tiny"), package="SpaceTrooper") spe <- readCosmxSPE(dirName=cosmxFolder, sampleName="DBKero_CosMx", coordNames=c("CenterX_global_px", "CenterY_global_px"), countMatFPattern="exprMat_file.csv", metadataFPattern="metadata_file.csv", polygonsFPattern="polygons.csv", fovPosFPattern="fov_positions_file.csv", fovdims=c(xdim = 4256, ydim = 4256) ) ``` Here are summarized the default values of the input parameters. The `dirName` parameter specifies the path to the directory containing the flat files produced by the CosMx platform: - expression matrix, (e.g. `exprMat_file.csv`) - cell metadata, (e.g. `metadata_file.csv`) - Fields of View positions, (e.g. `fov_positions_file.csv`) - cell polygons, (e.g. `polygons.csv`) These files represent the minimum set of inputs required to run the data-reading function. The `sampleName` parameter defines the dataset label used as the title in plotting functions. The `coordNames` parameter specifies the column names in the cell metadata corresponding to centroid coordinates. These are stored in the `spatialCoords` slot of the `SpatialExperiment` and can be assessed via `spatialCoords(spe)`. They are also the columns used by `plotCentroid` function. Additional details on working with `SpatialExperiment` objects are available in the [SpatialExperiment](https://www.bioconductor.org/packages/release/bioc/html/SpatialExperiment.html) documentation. The `countMatFPattern`, `metadataFPattern`, `polygonsFPattern` and `fovPosFPattern` define file name patterns used to identify the expression matrix, cell metadata, Fields of view (FoV) positions, and polygon files, respectively. As file naming conventions may change over time with evolving technologies, these parameters allow users to adapt to different file names or extensions. Currently, CSV files are supported, including compressed `.csv.gz.` formats. **IMPORTANT**: the cell polygon file is the only input that is not imported directly into the `SpatialExperiment` object by the reading function. However, because polygons can be added later using the `readAndAddPolygonsToSPE` function (see the [Load polygons](#load-polygons) subsection), the path to the cell polygon file is stored in the `metadata` slot and accessed via `metadata(spe)$polygons`. The `fovdims` parameter specifies the FoV dimensions as a vector containing the x and y sizes. While FoV size and shape have changed over time, e.g. from rectangular to square, in recent CosMx versions they have stabilized at 4,256 pixels per side. **COSMX PROTEIN** ```{r load-cosmx-protein} protfolder <- system.file( "extdata", "S01_prot", package="SpaceTrooper") prot <- readCosmxProteinSPE(dirName=protfolder, sampleName="CosMx_Protein_Tonsil", coordNames=c("CenterX_global_px", "CenterY_global_px"), countMatFPattern="exprMat_file.csv", metadataFPattern="metadata_file.csv", polygonsFPattern="polygons.csv", fovPosFPattern="fov_positions_file.csv", fovdims=c(xdim = 4256, ydim = 4256) ) # code line for shift correction metadata(prot)$fov_positions$y_global_px <- metadata(prot)$fov_positions$y_global_px - 4256 ``` `readCosmxProteinSPE` is a wrapper around `readCosmxSPE`, thus the required input files and parameters are identical. Input files may be provided in compressed `csv.gz` file format, and the `countMatFPattern`, `metadataFPattern`, `polygonsFPattern` and `fovPosFPattern` parameters are used to specify the corresponding file name patterns. **XENIUM** ```{r load-xenium, message = TRUE, warning = TRUE} xeniumFolder <- system.file( "extdata", "Xenium_small", package="SpaceTrooper") xen <- readXeniumSPE(dirName=xeniumFolder, sampleName="Xenium_small", type=c("HDF5", "sparse"), coordNames=c("x_centroid", "y_centroid"), boundariesType=c("parquet", "csv"), computeMissingMetrics=TRUE, keepPolygons=FALSE, countsFilePattern="cell_feature_matrix", metadataFPattern="cells.csv.gz", polygonsFPattern="cell_boundaries", polygonsCol="polygons", txPattern="transcripts", addFOVs=FALSE ) ``` `readXeniumSPE` function requires the path to the directory containing the Xenium output files. This directory is typically generated automatically by the Xenium pipeline and usually includes the pattern `outs` in its name. If the directory exists but this pattern is missing from the provided path, a warning is issued to allow the path to be corrected. Xenium outputs are often provided redundantly in multiple formats. The minimum set of files required to run the `SpaceTrooper` pipeline includes: - expression matrix - cell metadata - cell polygons Optionally: - transcript file The `dirName`, `sampleName` and `coordNames` parameters behave in the same way as in `readCosmxSPE`. The remaining parameters are described below: `type` specifies the format of the expression matrix to be loaded. Currently only the following two formats are supported: MEX for sparse matrices (Cell Ranger output) and HDF5. By default, the HDF5 file is used. If only the sparse MEX format is available, you should provide a vector containing only that format. `boundariesType` specifies the format of the polygon file and must be either Parquet or CSV. By default, the function uses Parquet format. If only the CSV file is available, you should provide a vector containing only that format. `computeMissingMetrics` is set to TRUE by default and enables the computation of `Area_um` and `AspectRatio` cell metrics from boundary polygons. **IMPORTANT**: it is strongly recommended to keep this parameter set to `TRUE` as both metrics are required to run `computeQCScore()`, even if they are not used in the final formula. `keepPolygons` controls whether polygons are retained in the `SpatialExperiment` object and is **ONLY** effective when `computeMissingMetrics=TRUE`. While missing metrics can be computed without keeping polygons, polygons cannot be retained without computing these metrics. However, polygons may also be loaded at a later stage (see the [Load polygons](#load-polygons) subsection). `countsFilePattern` and `polygonsFPattern` define file name patterns (excluding the file extension) for the expression matrix and polygon files, correspondingly, since their formats are already specified via `type` and `boundariesType`. `metadataFPattern` requires the full file name (including extension) of the cell metadata file. Only `.csv.gz` and `.parquet` files are supported. `polygonsCol` defines the name of the column in `colData` that stores polygon geometries. Optionally, FoV information can be retrieved for each cell from the transcript file. The `txPattern` parameter specifies the file name pattern for the transcript file (Parquet format only), while `addFOVs` controls whether FoV information is extracted. This option is not mandatory and is therefore set to `FALSE` by default. **OF NOTE**, because `readXeniumSPE` is a wrapper function around `SpatialExperimentIO::readXeniumSXE`, and multiple probe identifier sets could be available, Gene Symbols are selected and used as row names. **MERFISH** ```{r load-merfish, message = TRUE, warning= TRUE} merfishFolder <- system.file("extdata", "Merfish_Tiny", package="SpaceTrooper") mer <- readMerfishSPE(dirName=merfishFolder, sampleName="Merfish_Tiny", computeMissingMetrics=TRUE, keepPolygons=FALSE, boundariesType=c("parquet"), countmatFPattern="cell_by_gene.csv", metadataFPattern="cell_metadata.csv", polygonsFPattern="cell_boundaries.parquet", coordNames=c("center_x", "center_y"), polygonsCol="polygons") ``` As with the other data-reading functions, `readMerfishSPE` requires the path to the directory containing the MERFISH output files. The minimum set of inputs required to run the `SpaceTrooper` pipeline includes: - expression matrix - cell metadata - cell polygons The expression matrix and cell metadata are typically provided in CSV format. Polygon data can be supplied either as a single Parquet file (recommended when available) or as multiple HDF5 files. In MERFISH/MERSCOPE experiments, HDF5 polygon files are generated per FoV and may easily scale to thousands of files, making this option computationally intensive and time-consuming. Fortunately, in more recent datasets, a Parquet file is usually available. For this reason, it is recommended to override the default setting and provide a vector containing only `Parquet` to the `boundariesType` parameter. **IMPORTANT**: please, pay attention to warnings. `computeMissingMetrics` parameter must be set to `TRUE` whenever it is desired to compute Quality Score (QS). In MERFISH experiments, transcript hybridization signals are captured also across the z dimension, and cell size is therefore represented by cell volume in the provided metadata. In contrast, cell area can also be computed by `SpaceTrooper` from polygon boundaries obtained from the MERFISH experiment but from a single z plane only (by running `computeMissingMetricsMerfish()` with `useVolume=FALSE`). That said, cell volume appears to be a more appropriate measure of cell size for QS computation with MERFISH data. In the current version of `SpaceTrooper`, when `computeMissingMetrics=TRUE`, the `volume` is renamed as `Area_um` for compatibility with the `computeQCScore` function, whereas cell area is not computed. ### Load polygons For all supported technologies, polygons can be loaded after `SpatialExperiment` object creation. The function `readAndAddPolygonsToSPE` performs the following steps: - reads the cell polygon file; - standardizes polygon formats and converts them into an `sf` object; - validates polygon geometries (e.g. checks that polygons are properly closed and contain at least four vertices etc.); - removes cells from the `SpatialExperiment` object if no valid polygon is associated with them; - stores the resulting `sf` object in a `colData` column whose name can be specified via the `polygonsCol` argument; ```{r load-poly, message = TRUE} # polygon loading spe <- readAndAddPolygonsToSPE(spe, polygonsCol="polygons", keepMultiPol=TRUE, boundariesType=c("csv", "HDF5", "parquet") ) spe$polygons ``` In some technologies, segmentation may produce multi-polygons (indicated as `TRUE` in the `is_multi` column of the `sf` object shown here), meaning that more than one polygon is associated with a single cell. This can cause issues when computing certain derived metrics, particularly the`AspectRatio`. By default, the `keepMultiPol` parameter is set to `TRUE`. Under this setting, cells associated with multi-polygons are retained in the dataset; however, their `AspectRatio` is `NA`. ```{r load-poly-xen, message = TRUE} # xenium polygon loading xen <- readAndAddPolygonsToSPE(xen, boundariesType=c("parquet")) ``` ```{r load-poly-mer, message = TRUE} # merfish polygon loading mer <- readAndAddPolygonsToSPE(mer, boundariesType=c("parquet")) ``` ```{r load-poly-prot, message = TRUE} # protein polygon loading prot <- readAndAddPolygonsToSPE(prot, boundariesType=c("csv")) ``` Multiple geometries files can be loaded into a `SpatialExperiment` object using the `readAndAddPolygonsToSPE` function. The `polygonsCol` parameter allows users to specify distinct column names under which these geometries are stored in `colData`. ## Quality Control ### Compute QC metrics This section describes the cell metadata and quality metrics computed by the `spatialPerCellQC` function. When inspecting `colData`, additional metadata and metrics provided natively by each technology may also be present; these are technology-specific and are generally not shared across all platforms. For details on these additional fields, please refer to the corresponding technology reference documentation. ```{r analysis-qc-1} spe <- spatialPerCellQC(spe, rmZeros=TRUE, negProbList=c("NegPrb", "Negative", "SystemControl")) xen <- spatialPerCellQC(xen, rmZeros=TRUE) mer <- spatialPerCellQC(mer, rmZeros=TRUE) prot <- spatialPerCellQC(prot, rmZeros=TRUE) ``` ```{r analysis-qc-2} # check added columns to colData colnames(colData(spe)) ``` This function is a wrapper around `scuttle::addPerCellQC()`, hence a subset of the reported metrics is derived from that function: - `sample_id`: character string specifying the dataset sample name. This matches the value provided to the `sampleName` argument in the reading function; - `sum`/`total`: numeric value representing total probe counts (RNA) or total protein intensity per cell. In this context, `sum` and `total` are equivalent; - `subsets_controlProbeName_sum`: numeric value giving sum of the counts (intensities) obtained from each type of control probe (control protein). For each one, a separate column is created by replacing "controlProbeName" with the corresponding pattern from `negProbList` , e.g. negative control probes (RNA), negative codewords (RNA), negative control antibodies (protein), etc.; - `subsets_controlProbeName_detected`: numeric value indicating the number of uniquely detected control probes (proteins) per each type; - `subsets_controlProbeName_percent`: numeric value representing the percentage of counts (intensities) from each type of control probe (protein) relative to the total probe counts (protein intensities) per cell; - `control_sum`: numeric value indicating total control probe counts (protein intensities) per cell. It is the sum of counts (intensities) of all types of controls; - `control_detected`: numeric value representing the number of uniquely detected control probes (proteins) per cell; - `target_sum`: numeric value giving target probe counts (protein intensities) per cell, obtained as `sum` - `control_sum`; - `target_detected`: numeric value indicating number of uniquely detected target probes (proteins) per cell; - `ctrl_total_ratio`: numeric value representing the ratio between total control probe counts (protein intensities) and total probe counts (protein intensities) per cell; - `log2Ctrl_total_ratio`: numeric value indicating log2-transformed `ctrl_total_ratio` - `Area_um`: numeric value giving cell area in μm² (for MERFISH technology it corresponds to `volume` in μm³); - `dist_border_x`: numeric value representing the distance in pixels from the cell to the closest vertical border of its FoV (only available for CosMx); - `dist_border_y`: numeric value representing the distance in pixels from the cell to the closest horizontal border of its FoV (only available for CosMx); - `dist_border`: numeric value representing the distance from the cell to the nearest border of its FoV (only available for CosMx), computed as the minimum of `dist_border_x` and `dist_border_y`; - `log2AspectRatio`: numeric value indicating log2-transformed ratio between cell maximum length along the x dimension and cell maximum length along the y dimension in pixels; - `SignalDensity`: numeric value indicating the ratio between total probe counts (protein intensities) per cell and cell area in µm² (or volume in µm³ for MERFISH technology); - `log2SignalDensity`: log2-transformed `SignalDensity`. ### Compute Quality Score (QS) This section provides further details about the core component of the method: the Quality Score (QS). As already mentioned in [RNA](https://bioconductor.org/packages/devel/bioc/vignettes/SpaceTrooper/inst/doc/RNA_tutorial_vignette.html) and [Protein](https://bioconductor.org/packages/devel/bioc/vignettes/SpaceTrooper/inst/doc/Protein_tutorial_vignette.html) tutorial vignettes, a random seed should be set to ensure reproducible results, as the method involves underlying stochastic processes. **IMPORTANT**: when working with recent MERSCOPE datasets, or with any dataset that uses numeric cell identifiers, it is strongly recommended to convert cell IDs to character strings before calling the `computeQCScore` function. This prevents errors during the selection of training examples when estimating the coefficients of the QS formula. ```{r compute-QS, message = TRUE} # MERSCOPE cell_id conversion # spe$cell_id <- as.character(spe$cell_id) set.seed(1713) spe <- computeQCScore(spe, verbose=TRUE) ``` When `verbose=TRUE`, additional information regarding the underlying steps is reported during the estimation of the QS formula coefficients. For this purpose, bad and good example cells are selected to train a model that has to assign an appropriate QS to each cell. Bad example cells are identified separately for each metric according to several criteria, based on outlier detection through a statistical test. The specific test applied depends on the distribution of the metric: for asymmetric distributions the Medcouple method is used, whereas for symmetric distributions the Median Absolute Deviation (MAD) is applied, as implemented in `scuttle::isOutlier()` with default parameters. For a detailed description of the selection criteria used for each metric, please refer to the Methods section of the [paper](https://www.biorxiv.org/content/10.64898/2025.12.24.696336v1). When inspecting the printed output, the following information is reported: - The first section summarizes the number of outliers identified for each metric. The three reported values correspond to outliers detected in the right tail, the left tail of the distribution, and the remaining non-outlier cells, respectively. Outlier detection is performed on the full dataset after excluding cells with zero counts, if these were retained when running the `spatialPerCellQC` function. This step is carried out for all metrics, even those that are not ultimately included in the final QS formula; however, outliers from excluded metrics are not used for model training. - The next section reports the number of selected bad example cells, followed by the number of good example cells, which matches the former. Because the criteria for selecting good examples are intentionally permissive, these cells are sampled at random. - The final section displays the QS formula used for model training and score computation, along with the estimated coefficients for each retained term and all pairwise interactions. **IMPORTANT**: at present, QS computation requires `log2SignalDensity` as a mandatory metric in the QS formula. If an insufficient number of outliers is detected for this metric (fewer than 0.1% of the dataset after excluding zero-count cells), QS computation cannot proceed using the remaining metrics. In such cases, the provided code will still add a`QC_score` column to `colData`. This column is populated if the minimum requirement is met and contains `NA` values otherwise. ```{r compute-QS-safe-run, message = TRUE} # safe run function safe_run <- function(expr) { tryCatch( list(result = expr, error = NULL), error = function(e) list(result = NULL, error = e) ) } out <- safe_run(computeQCScore(spe)) if (!is.null(out$error)) { message("Failed: ", out$error$message) colData(spe)$QC_score <- NA } else { spe <- out$result } ``` ### Single metric flagging The `SpaceTrooper` package allows the identification of numerically aberrant cells through outlier detection for each quantitative metric, using the same procedures applied internally by the `computeQCScore` function. This enables metric-specific cell flagging in a manner analogous to standard scRNA-seq quality control workflows. ```{r compute-outliers-1, message = TRUE, warning = TRUE} # detecting outliers for cell area in Xenium dataset with both Medcouple # and MAD xen <- computeSpatialOutlier(xen, computeBy="Area_um", method = "both") ``` Pay attention to warnings: setting the method to `both`, applies both the Medcouple and MAD approaches to identify outliers. Outlier results are stored in two separate `colData` columns, with names formed by appending `_outlier_mc` and `_outlier_sc` to the variable name. If a warning indicates that the variable is symmetric, only the `_outlier_sc` column should be used for downstream analyses. ```{r compute-outliers-3, message = TRUE} spe <- computeSpatialOutlier(spe, computeBy="Mean.CD68", method = "both") table(spe$Mean.CD68_outlier_mc) ``` The three numbers correspond to outliers in the right tail (`HIGH`), in the left tail (`LOW`) of the variable distribution, and remaining non-outlier cells (`NO`). ## Visualization Variable distributions can be visualized using the `plotMetricsHist` function, as demonstrated in the [RNA](https://bioconductor.org/packages/devel/bioc/vignettes/SpaceTrooper/inst/doc/RNA_tutorial_vignette.html) and [Protein](https://bioconductor.org/packages/devel/bioc/vignettes/SpaceTrooper/inst/doc/Protein_tutorial_vignette.html) tutorial vignette. If the `computeSpatialOutlier` function has already been executed, the resulting columns can be provided as input to `useFences` parameter to display lines corresponding to the upper and lower thresholds, used to define `HIGH` and `LOW` outliers, respectively. ```{r plot-hist-1} spe <- computeSpatialOutlier(spe, computeBy="log2SignalDensity", method = "both") plotMetricHist(spe, metric="log2SignalDensity", useFences="log2SignalDensity_outlier_mc") ``` The threshold values are indicated in the plot legend. The `plotZoomFovsMap` function provides a zoomed-in view of selected FoVs, with cell polygons colored according to the metric specified in the `colourBy` parameter. In addition, it includes an overview map of all FoVs, analogous to that produced by `plotCellsFovs` function. ```{r plot-zoom-fov-1} # plotting FoVs map and zoom in of selected FoV colored by QS plotZoomFovsMap(spe, fovs=16, colourBy="QC_score", csize=3, scaleBars=TRUE) ``` For this plot, `scaleBars` parameter controls whether scale bars are displayed in both plots simultaneously. To show the scale bar in only one of the two plots, set either `scaleBarMap=FALSE` or `scaleBarPol=FALSE`. ```{r plot-zoom-fov-2} plotZoomFovsMap(prot, fovs = 344, colourBy="Area_um", csize=3, scaleBarMap=TRUE, scaleBarPol=FALSE) ``` For the other plots, the `scaleBar` parameter can be set to `FALSE` to hide the scale bar. ```{r plot-centroids} plotCentroids(spe, colourBy="QC_score", size=3, scaleBar=FALSE) + scale_fill_viridis_c(option = "plasma") + scale_color_viridis_c(option = "plasma") ``` ```{r plot-polygons} plotPolygons(prot, colourBy="log2SignalDensity", scaleBar=FALSE) + scale_fill_viridis_c(option="mako") ``` ## Session Information ```{r sessionInfo} sessionInfo() ```