---
title: "Performing a spatial analysis of multiplexed tissue imaging data."
params:
  test: FALSE
author:
  - name: Alexander Nicholls
    affiliation:
      - &WIMR Westmead Institute for Medical Research, University of Sydney, Australia
  - name: Nicholas Robertson
    affiliation:
      - School of Mathematics and Statistics, University of Sydney, Australia
  - name: Nicholas Canete
    affiliation:
      - &WIMR Westmead Institute for Medical Research, University of Sydney, Australia
  - name: Elijah Willie
    affiliation:
      - &WIMR Westmead Institute for Medical Research, University of Sydney, Australia
      - School of Mathematics and Statistics, University of Sydney, Australia
  - name: Ellis Patrick
    affiliation:
      - &WIMR Westmead Institute for Medical Research, University of Sydney, Australia
      - School of Mathematics and Statistics, University of Sydney, Australia
date: 27 July, 2022
vignette: >
  %\VignetteIndexEntry{"Introduction to a spicy workflow"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

```{r setup, include=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(BiocStyle)
```

# Version Info

**R version**: `r R.version.string`
**Bioconductor version**: `r BiocManager::version()`

# Introduction

Understanding the interplay between different types of cells and their immediate environment is critical for understanding the mechanisms of cells themselves and their function in the context of human diseases. Recent advances in high dimensional in situ cytometry technologies have fundamentally revolutionised our ability to observe these complex cellular relationships, providing an unprecedented characterisation of cellular heterogeneity in a tissue environment.

## Motivation for submitting to Bioconductor

We have developed an analytical framework for analysing data from high dimensional in situ cytometry assays including CODEX, CycIF, IMC and High Definition Spatial Transcriptomics. Implemented in R, this framework makes use of functionality from our Bioconductor packages spicyR, lisaClust, treekoR, FuseSOM, simpleSeg and ClassifyR. Below we provide an overview of the key steps needed to interrogate the comprehensive spatial information generated by these exciting new technologies, including cell segmentation, feature normalisation, cell type identification, micro-environment characterisation, spatial hypothesis testing and patient classification. Ultimately, our modular analysis framework provides a cohesive and accessible entry point into spatially resolved single cell data analysis for any R-based bioinformatician.

# Installation

To install the current release of spicyWorkflow, run the following code.

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("spicyWorkflow")
```

# Loading R packages

```{r load libraries, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
  library(cytomapper)
  library(dplyr)
  library(ggplot2)
  library(simpleSeg)
  library(FuseSOM)
  library(ggpubr)
  library(scater)
  library(spicyR)
  library(ClassifyR)
  library(lisaClust)
})
```

```{r, eval=FALSE}
library(cytomapper)
library(dplyr)
library(ggplot2)
library(simpleSeg)
library(FuseSOM)
library(ggpubr)
library(scater)
library(spicyR)
library(ClassifyR)
library(lisaClust)
```

# Global parameters

It is convenient to set the number of cores for running code in parallel. Please choose a number that is appropriate for your resources. Set the `use_mc` flag to `TRUE` to use all but one of the available cores for the rest of the vignette; otherwise 2 cores are used. A minimum of 2 cores is suggested since running this workflow is rather computationally intensive.

```{r set parameters}
# Choose the number of cores used for parallel processing.
use_mc <- FALSE
if (use_mc) {
  nCores <- max(parallel::detectCores() - 1, 1)
} else {
  nCores <- 2
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)

# Use a classic theme for all ggplot2 figures.
theme_set(theme_classic())
```

# Context

In the following we will re-analyse some MIBI-TOF data [(Risom et al, 2022)](https://www.sciencedirect.com/science/article/pii/S0092867421014860?via%3Dihub#!) profiling the spatial landscape of ductal carcinoma in situ (DCIS), which is a pre-invasive lesion that is thought to be a precursor to invasive breast cancer (IBC). The key conclusion of this manuscript (amongst others) is that spatial information about cells can be used to predict disease progression in patients. We will use our spicy workflow to reach a similar conclusion.

The R code for this analysis is available on GitHub [https://github.com/SydneyBioX/spicyWorkflow](https://github.com/SydneyBioX/spicyWorkflow).
A mildly [processed](https://github.com/SydneyBioX/spicyWorkflow/blob/master/organisePublishedData.R) version of the data used in the manuscript is available in this repository.

# Read in images

The images are stored in the `images` folder within the `data` folder. Here we use `loadImages()` from the `cytomapper` package to load all the tiff images into a `CytoImageList` object and store the images as h5 files on disk.

```{r load images}
pathToImages <- system.file("extdata/images", package = "spicyWorkflow")

# Store images in a CytoImageList on_disk as h5 files to save memory.
images <- cytomapper::loadImages(
  pathToImages,
  single_channel = TRUE,
  on_disk = TRUE,
  h5FilesPath = HDF5Array::getHDF5DumpDir(),
  BPPARAM = BPPARAM
)
gc()
```

# Load the clinical data

To associate features in our images with disease progression, we need to read in information that links image identifiers to their progression status. We will do this here, making sure that our `imageID` values match.

## Read the clinical data

```{r load clincal data}
# Read in clinical data, manipulate imageID and select columns
clinical <- read.csv(
  system.file(
    "extdata/1-s2.0-S0092867421014860-mmc1.csv",
    package = "spicyWorkflow"
  )
)

# Build an imageID that matches the names of the images.
clinical <- clinical |>
  mutate(imageID = paste0(
    "Point", PointNumber, "_pt", Patient_ID, "_", TMAD_Patient
  ))

# Images of normal tissue carry a "_Normal" suffix.
image_idx <- grep("normal", clinical$Tissue_Type)
clinical$imageID[image_idx] <- paste0(clinical$imageID[image_idx], "_Normal")

clinicalVariables <- c(
  "imageID", "Patient_ID", "Status", "Age", "SUBTYPE", "PAM50",
  "Treatment", "DCIS_grade", "Necrosis"
)
rownames(clinical) <- clinical$imageID
```

## Put the clinical data into the mcols of the CytoImageList

We can then store the clinical information in the `mcols` of the `CytoImageList`.

```{r add clinical data}
# Add the clinical data to mcols of images.
mcols(images) <- clinical[names(images), clinicalVariables]
```

# SimpleSeg: Segment the cells in the images

Our simpleSeg R package on [https://github.com/SydneyBioX/simpleSeg](https://github.com/SydneyBioX/simpleSeg) provides a series of functions to generate simple segmentation masks of images. These functions leverage the functionality of the [EBImage](https://bioconductor.org/packages/release/bioc/vignettes/EBImage/inst/doc/EBImage-introduction.html) package on Bioconductor. For more flexibility when performing your segmentation in R we recommend learning to use the EBImage package. A key strength of the simpleSeg package is that we have coded multiple ways to perform some simple segmentation operations as well as incorporating multiple automatic procedures to optimise some key parameters when these aren't specified.

## Run simpleSeg

If your images are stored in a `list` or `CytoImageList` they can be segmented with a simple call to `simpleSeg()`. Here we have asked `simpleSeg` to do multiple things. First, we would like to use a combination of principal component analysis of all channels guided by the HH3 channel to summarise the nuclei signal in the images. Secondly, to estimate the cell body of the cells we will simply dilate out from the nuclei by 2 pixels. We have also requested that the channels be square root transformed and that a minimum cell size of 40 pixels be used as a size selection step.

```{r segment}
# Generate segmentation masks
masks <- simpleSeg(
  images,
  nucleus = c("HH3"),
  cellBody = "dilate",
  transform = "sqrt",
  sizeSelection = 40,
  discSize = 2,
  pca = TRUE,
  cores = nCores
)
```

## Visualise separation

The `display` and `colorLabels` functions in `EBImage` make it very easy to examine the performance of the cell segmentation. The great thing about `display` is that, if used in an interactive session, it is very easy to zoom in and out of the image.

```{r visualise segmentation}
# Visualise segmentation performance one way.
EBImage::display(colorLabels(masks[[1]]))
```

## Visualise outlines

The `plotPixels` function in `cytomapper` makes it easy to overlay the masks on top of the intensities of 5 markers. Here we can see that the segmentation appears to be performing reasonably.

```{r}
# Visualise segmentation performance another way.
cytomapper::plotPixels(
  image = images[1],
  mask = masks[1],
  img_id = "imageID",
  colour_by = c("PanKRT", "GLUT1", "HH3", "CD3", "CD20"),
  display = "single",
  colour = list(
    HH3 = c("black", "blue"),
    CD3 = c("black", "purple"),
    CD20 = c("black", "green"),
    GLUT1 = c("black", "red"),
    PanKRT = c("black", "yellow")
  ),
  bcg = list(
    HH3 = c(0, 1, 1.5),
    CD3 = c(0, 1, 1.5),
    CD20 = c(0, 1, 1.5),
    GLUT1 = c(0, 1, 1.5),
    PanKRT = c(0, 1, 1.5)
  ),
  legend = NULL
)
```

# Summarise cell features.

In order to characterise the phenotypes of each of the segmented cells, `measureObjects` from `cytomapper` will calculate the average intensity of each channel within each cell as well as a few morphological features. The channel intensities will be stored in the `counts` assay in a `SingleCellExperiment`. Information on the spatial location of each cell is stored in `colData` in the `m.cx` and `m.cy` columns. In addition to this, it will propagate the information we have stored in the `mcols` of our `CytoImageList` to the `colData` of the resulting `SingleCellExperiment`.

```{r}
# Summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(
  masks,
  images,
  img_id = "imageID",
  BPPARAM = BPPARAM
)
```

# Normalise data

We should check to see if the marker intensities of each cell require some form of transformation or normalisation. Here we extract the intensities from the `counts` assay. Looking at CK7, which should be expressed in the majority of the tumour cells, the intensities are clearly very skewed.

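
Skewness of this kind can also be quantified numerically. The following is a minimal sketch on simulated, gamma-distributed values, which are an assumption standing in for marker intensities rather than the MIBI-TOF data; the `skewness` helper is hypothetical and defined here, not taken from any package in this workflow.

```r
# Simulated right-skewed "intensities" (assumption: toy gamma-distributed
# values, not measurements from the images in this vignette).
set.seed(1)
toy <- rgamma(5000, shape = 0.5, scale = 2)

# Hypothetical helper: sample skewness, i.e. the third central moment
# divided by the cubed standard deviation.
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Compare the raw values with two common variance-stabilising transforms.
c(
  raw = skewness(toy),
  sqrt = skewness(sqrt(toy)),
  asinh = skewness(asinh(toy))
)
```

A value near zero indicates a roughly symmetric distribution, so a clear drop after transformation suggests the transform is helping.
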
```{r, fig.width=5, fig.height=5}
# Extract marker data and bind with information about images
df <- as.data.frame(cbind(colData(cells), t(assay(cells, "counts"))))

# Plots densities of CK7 for each image.
ggplot(df, aes(x = CK7, colour = imageID)) +
  geom_density() +
  theme(legend.position = "none")
```

We can transform and normalise our data using the `normalizeCells` function. Here we have taken the intensities from the `counts` assay, performed an asinh transform, then applied the `trim99`, `minMax` and `PC1` methods in turn: for each image the intensities are trimmed at the 99th percentile and min-max scaled to 0-1, followed by the `PC1` adjustment. This modified data is then stored in the `norm` assay by default. We can see that the normalised data appears more bimodal; not perfectly so, but likely to a sufficient degree for clustering.

```{r, fig.width=5, fig.height=5}
# Transform and normalise the marker expression of each cell type.
# Use an asinh transform, then trim at the 99th percentile,
# min-max scale and apply the PC1 method.
cells <- normalizeCells(cells,
  transformation = "asinh",
  method = c("trim99", "minMax", "PC1"),
  assayIn = "counts",
  cores = nCores
)

# Extract normalised marker information.
norm_df <- as.data.frame(cbind(colData(cells), t(assay(cells, "norm"))))

# Plots densities of normalised CK7 for each image.
ggplot(norm_df, aes(x = CK7, colour = imageID)) +
  geom_density() +
  theme(legend.position = "none")
```

# FuseSOM: Cluster cells into cell types

Our FuseSOM R package on [https://github.com/ecool50/FuseSOM](https://github.com/ecool50/FuseSOM) provides a pipeline for the clustering of highly multiplexed in situ imaging cytometry assays. This pipeline uses the Self Organising Map architecture coupled with Multiview hierarchical clustering and provides functions for the estimation of the number of clusters.

Here we cluster using the `runFuseSOM` function. We have chosen to specify the same subset of markers used in the original manuscript for gating cell types. We have also specified the number of clusters to identify to be `numClusters = 24`.
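
The two-stage idea behind this kind of pipeline (compress the cells to a small set of prototypes, then hierarchically cluster the prototypes) can be illustrated with base R. This is only a rough sketch under stated assumptions, not the FuseSOM implementation: the data are simulated, k-means centres stand in for self organising map nodes, and plain average-linkage clustering stands in for the multiview hierarchical clustering.

```r
set.seed(51773)

# Toy data: 600 "cells" x 3 "markers" drawn from three well-separated groups
# (assumption: simulated values, not the normalised MIBI-TOF intensities).
centres <- rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
truth <- rep(1:3, each = 200)
toy <- centres[truth, ] + matrix(rnorm(600 * 3, sd = 0.5), ncol = 3)

# Stage 1: compress the cells to 25 prototypes. k-means centres are used
# here as a stand-in for the nodes of a self organising map.
prototypes <- kmeans(toy, centers = 25, nstart = 5, iter.max = 50)

# Stage 2: hierarchically cluster the prototypes and cut the tree into the
# requested number of clusters (3 here, playing the role of numClusters).
tree <- hclust(dist(prototypes$centers), method = "average")
protoCluster <- cutree(tree, k = 3)

# Each cell inherits the cluster label of its prototype.
toyClusters <- protoCluster[prototypes$cluster]

# Cross-tabulate the simulated groups against the recovered clusters.
table(truth, toyClusters)
```

Clustering prototypes rather than individual cells keeps the hierarchical step cheap, since the distance matrix is computed on 25 rows instead of one row per cell.
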
In addition to this, FuseSOM can automatically estimate a grid size for the self organising map.

## Perform the clustering

```{r FuseSOM}
# The markers used in the original publication to gate cell types.
useMarkers <- c(
  "PanKRT", "ECAD", "CK7", "VIM", "FAP", "CD31", "CK5", "SMA",
  "CD45", "CD4", "CD3", "CD8", "CD20", "CD68", "CD14", "CD11c",
  "HLADRDPDQ", "MPO", "Tryptase"
)

# Set seed.
set.seed(51773)

# Generate SOM and cluster cells into 24 groups.
cells <- runFuseSOM(
  cells,
  markers = useMarkers,
  assay = "norm",
  numClusters = 24
)
```

## Attempt to interpret the phenotype of each cluster

We can begin the process of understanding what each of these cell clusters is by using the `plotGroupedHeatmap` function from `scater`. At the least, here we can see that we capture all the major immune populations that we expect to see.

```{r}
# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(
  cells,
  features = useMarkers,
  group = "clusters",
  exprs_values = "norm",
  center = TRUE,
  scale = TRUE,
  zlim = c(-3, 3),
  cluster_rows = FALSE
)
```

## Check how many clusters should be used.

We can check how reasonable our choice of 24 clusters is using the `estimateNumCluster` and `optiPlot` functions. Here we examine the Gap method; others such as Silhouette and Within Cluster Distance are also available. As can be seen below, we chose the second elbow point as the optimal number of clusters.

```{r}
# Generate metrics for estimating the number of clusters.
# As I've already run runFuseSOM I don't need to run generateSOM().
cells <- estimateNumCluster(cells, kSeq = 2:30)
optiPlot(cells, method = "gap")
```

## Check cluster frequencies

We always find it useful to check the number of cells in each cluster. Here we can see that cluster 4 contains lots of (most likely tumour) cells and cluster 16 contains very few cells.

```{r}
# Check cluster frequencies.
colData(cells)$clusters |>
  table() |>
  sort()
```

## Dimension reduction

As our data is stored in a `SingleCellExperiment` we can also use `scater` to perform dimension reduction and visualise our data in a lower dimension to look for cluster differences.

```{r}
set.seed(51773)

# Perform dimension reduction using UMAP.
cells <- scater::runUMAP(
  cells,
  subset_row = useMarkers,
  exprs_values = "norm"
)

# Select a subset of images to plot.
someImages <- unique(colData(cells)$imageID)[c(1, 10, 20, 40, 50, 60)]

# UMAP by cell type cluster.
scater::plotReducedDim(
  cells[, colData(cells)$imageID %in% someImages],
  dimred = "UMAP",
  colour_by = "clusters"
)
```

# Test for association between the proportion of each cell type and progression status

We recommend using a package such as `diffcyt` for testing for changes in abundance of cell types. However, the `colTest` function allows us to quickly test for associations between the proportions of the cell types and progression status using either Wilcoxon rank sum tests or t-tests. Here we see a p-value less than 0.05, but this does not equate to a small FDR.

```{r}
# Select cells which belong to individuals with a known progression status.
cellsToUse <- cells$Status %in% c("nonprogressor", "progressor")

# Perform simple Wilcoxon rank sum tests on the columns of the proportion matrix.
testProp <- colTest(cells[, cellsToUse],
  condition = "Status",
  feature = "clusters"
)
testProp
```

```{r}
# Images with a known progression status.
imagesToUse <- rownames(clinical)[clinical[, "Status"] %in% c("nonprogressor", "progressor")]

# Proportion of each cluster in each image.
prop <- getProp(cells, feature = "clusters")

# Plot the proportions of the first cluster listed in testProp by status.
clusterToUse <- rownames(testProp)[1]
boxplot(prop[imagesToUse, clusterToUse] ~ clinical[imagesToUse, "Status"])
```

# spicyR: test spatial relationships

Our spicyR package [https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html](https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html) provides a series of functions to aid in the analysis of both immunofluorescence and mass cytometry imaging data as well as other assays that can deeply phenotype individual cells and their spatial location. Here we use the `spicy` function to test for changes in the spatial relationships between pair-wise combinations of cells. We quantify spatial relationships using a combination of three radii `Rs = c(20, 50, 100)` and mildly account for some global tissue structure using `sigma = 50`.

```{r}
# Test for changes in pair-wise spatial relationships between cell types.
spicyTest <- spicy(
  cells[, cellsToUse],
  condition = "Status",
  cellType = "clusters",
  imageID = "imageID",
  spatialCoords = c("m.cx", "m.cy"),
  Rs = c(20, 50, 100),
  sigma = 50,
  BPPARAM = BPPARAM
)

topPairs(spicyTest, n = 10)
```

We can visualise these tests using `signifPlot`, where we observe that cell type pairs appear to become less attractive (or avoid each other more) in the progressor samples.

```{r}
# Visualise which relationships are changing the most.
signifPlot(
  spicyTest,
  breaks = c(-1.5, 3, 0.5)
)
```

# lisaClust: Find cellular neighbourhoods

Our lisaClust package [https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html](https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html) provides a series of functions to identify and visualise regions of tissue where spatial associations between cell types are similar.
This package can be used to provide a high-level summary of cell type co-localisation in multiplexed imaging data that has been segmented at a single-cell resolution. Here we use the `lisaClust` function to cluster cells into 5 regions with distinct spatial ordering.

```{r}
set.seed(51773)

# Cluster cells into spatial regions with similar composition.
cells <- lisaClust(
  cells,
  k = 5,
  Rs = c(20, 50, 100),
  sigma = 50,
  spatialCoords = c("m.cx", "m.cy"),
  cellType = "clusters",
  BPPARAM = BPPARAM
)
```

## Region - cell type enrichment heatmap

We can try to interpret which spatial orderings the regions are quantifying using the `regionMap` function. This plots the frequency of each cell type in a region relative to what you would expect by chance.

```{r, fig.height=5, fig.width=5}
# Visualise the enrichment of each cell type in each region
regionMap(cells, cellType = "clusters", limit = c(0.2, 5))
```

## Visualise regions

By default, these identified regions are stored in the `region` column in the `colData` of our object. We can quickly examine the spatial arrangement of these regions using `ggplot`.

```{r}
# Extract cell information and filter to specific image.
df <- colData(cells) |>
  as.data.frame() |>
  filter(imageID == "Point2206_pt1116_31620")

# Colour cells by their region.
ggplot(df, aes(x = m.cx, y = m.cy, colour = region)) +
  geom_point()
```

While much slower, we have also implemented a function for overlaying the region information as a hatching pattern so that the information can be viewed simultaneously with the cell type calls.

```{r eval = FALSE}
# Use hatching to visualise regions and cell types.
hatchingPlot(
  cells,
  useImages = "Point2206_pt1116_31620",
  cellType = "clusters",
  spatialCoords = c("m.cx", "m.cy")
)
```

This plot is a ggplot object and so the scale can be modified with `scale_region_manual`.

```{r}
# Use hatching to visualise regions and cell types.
# Relabel the hatching of the regions.
hatchingPlot(
  cells,
  useImages = "Point2206_pt1116_31620",
  cellType = "clusters",
  spatialCoords = c("m.cx", "m.cy"),
  window = "square",
  nbp = 300,
  line.spacing = 41
) +
  scale_region_manual(values = c(
    region_1 = 2,
    region_2 = 1,
    region_3 = 5,
    region_4 = 4,
    region_5 = 3
  )) +
  guides(colour = guide_legend(ncol = 2))
```

## Test for association with progression

If needed, we can again quickly use the `colTest` function to test for associations between the proportions of the cells in each region and progression status using either Wilcoxon rank sum tests or t-tests. Here we see an adjusted p-value less than 0.05.

```{r}
# Test if the proportion of each region is associated
# with progression status.
testRegion <- colTest(
  cells[, cellsToUse],
  feature = "region",
  condition = "Status"
)

testRegion
```

# ClassifyR: Classification

Our ClassifyR package, [https://github.com/SydneyBioX/ClassifyR](https://github.com/SydneyBioX/ClassifyR), formalises a convenient framework for evaluating classification in R. We provide functionality to easily include four key modelling stages (data transformation, feature selection, classifier training and prediction) in a cross-validation loop. Here we use the `crossValidate` function to perform 100 repeats of 5-fold cross-validation to evaluate the performance of an elastic net model applied to three quantifications of our MIBI-TOF data: cell type proportions, pair-wise spatial associations between cell types and region proportions.

```{r message=FALSE, warning=FALSE}
# Create list to store data.frames
data <- list()

# Add proportions of each cell type in each image
data[["props"]] <- getProp(cells, "clusters")

# Add pair-wise associations
data[["dist"]] <- getPairwise(
  cells,
  spatialCoords = c("m.cx", "m.cy"),
  cellType = "clusters",
  Rs = c(20, 50, 100),
  sigma = 50,
  BPPARAM = BPPARAM
)
data[["dist"]] <- as.data.frame(data[["dist"]])

# Add proportions of each region in each image
# to the list of dataframes.
data[["regions"]] <- getProp(cells, "region")

# Subset the data to images with a progressor or nonprogressor status.
measurements <- lapply(data, function(x) x[imagesToUse, ])

# Set seed
set.seed(51773)

# Perform cross-validation of an elastic net model
# with 100 repeats of 5-fold cross-validation.
cv <- crossValidate(
  measurements = measurements,
  outcome = clinical[imagesToUse, "Status"],
  classifier = "GLM",
  nFolds = 5,
  nRepeats = 100,
  nCores = nCores
)
```

## Visualise cross-validated prediction performance

Here we use the `performancePlot` function to assess the AUC from each repeat of the 5-fold cross-validation. We see that the lisaClust regions appear to capture information which is predictive of the progression status of the patients.

```{r}
# Calculate AUC for each cross-validation repeat and plot.
performancePlot(
  cv,
  metric = "AUC",
  characteristicsList = list(x = "Assay Name")
)
```

# Summary

Here we have used a pipeline of our spatial analysis R packages to demonstrate an easy way to segment, normalise, cluster, quantify and classify high dimensional in situ cytometry data, all within R.

# sessionInfo

```{r}
sessionInfo()
```