--- title: "Functional Data Analysis of Spatial Metrics" author: - name: "Martin Emons" affiliation: - &DMLS Department of Molecular Life Sciences, University of Zurich, Switzerland - &SIB SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland email: "martin.emons@uzh.ch" - name: Mark D. Robinson affiliation: - *DMLS - *SIB package: "`r BiocStyle::Biocpkg('spatialFDA')`" output: BiocStyle::html_document abstract: > A package to calculate spatial statistics metrics, explore them with functional principal component analysis and compare them with functional additive mixed models vignette: > %\VignetteIndexEntry{Functional Data Analysis of Spatial Metrics} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} bibliography: spatialFDA.bib editor_options: chunk_output_type: console --- ```{r v1, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE ) ``` # Introduction This vignette demonstrates how to use `r BiocStyle::Biocpkg('spatialFDA')` to perform functional data analysis on spatial statistics metrics. The main aim of this package is to detect differential spatial arrangements within and between celltypes given several samples/conditions. It does so by calculating spatial statistics metrics via the `r BiocStyle::CRANpkg('spatstat')` package and comparing differences using functional additive mixed models as implemented in the `r BiocStyle::CRANpkg('refund')` package [@spatstat2005; @refund2024]. The use case is a dataset from the `r BiocStyle::Biocpkg('imcdatasets')` package, which contains images from 12 human donors [@damondMapHumanType2019]. This package is similar to other packages in `python` and `R`. The following table shows the main differences in terms of functionality [@ali2024graphcompass; @canete2022spicyr; @wrobel2024mxfda]. |Package name | Foundation | Testing procedure | | ------------------------------------ | --------------- | ------------------- | | `r BiocStyle::Biocpkg('spicyR')` | $L$-function | Scalar comparison | | `GraphCompass` | Graph-based | Graph and scalar comparison | | `r BiocStyle::CRANpkg('mxfda')` | K-, G- and L- function | Function as input for survival modelling | | `r BiocStyle::Biocpkg('spatialFDA')` | most `r BiocStyle::CRANpkg('spatstat')` functions | Functional comparison over domain | # Installation `r BiocStyle::Biocpkg('spatialFDA')` can be installed and loaded from Bioconductor as follows ```{r installation, include = TRUE, eval = FALSE} if (!requireNamespace("BiocManager")) { install.packages("BiocManager") } BiocManager::install("spatialFDA") ``` ```{r setup, warning = FALSE, message = FALSE} library("spatialFDA") library("dplyr") library("ggplot2") library("tidyr") library("stringr") library("dplyr") library("patchwork") ``` # Getting started In this vignette we will analyse a diabetes dataset acquired by imaging mass cytometry (IMC) as developed by Damond et al. [@damondMapHumanType2019]. The dataset contains images from 12 human donors, 4 healthy and 8 with type 1 diabetes (T1D). With IMC, 35 markers were measured at single cell resolution [@damondMapHumanType2019]. ## Loading the data The Damond et al. [@damondMapHumanType2019] dataset is easily loaded via the `r BiocStyle::Biocpkg('imcdatasets')` package. The entire dataset can be loaded by setting `full_dataset = TRUE`. For computational reasons, one can reduce to three patients as well by setting this flag to `FALSE`. We will use the small version to show the functionality of the package even though the statistical power of results is greatly reduced by considering only 100 out of 845 images. For a more interesting use case we encourage readers to load the full dataset for a multi-condition/multi-sample example. The package offers multiple datatypes, we will use the `r BiocStyle::Biocpkg('SpatialExperiment')` (SPE) object. ```{r loading, warning = FALSE, message = FALSE} # load the dataset as SpatialExperiment object spe <- imcdatasets::Damond_2019_Pancreas("spe", full_dataset = TRUE) # subset the example for computational reasons spe <- subset(spe, ,patient_id %in% c(6089,6180,6126,6134,6228,6362)) ``` ## Visualising the raw data We can look at the fields of view (FOVs) of the diabetes dataset. To do so we extract the spatial coordinates, store them as a dataframe and add the colData from the SPE to this. We will look only at the first four FOVs of the healthy sample. We plot both the cell categories of all cells and then the cell types of secretory cells ($\alpha, \beta$ and $\delta$ cells) and T-cells (CD8+ and CD4+ T-cells). ```{r plotting fovs, warning = FALSE, fig.width=8, fig.height=15} df <- data.frame(spatialCoords(spe), colData(spe)) dfSub <- df %>% subset(image_name %in% c("E02", "E03", "E04", "E05")) p <- ggplot(dfSub, aes(x = cell_x, y = cell_y, color = cell_category)) + geom_point(size= 0.5) + facet_wrap(~image_name) + theme(legend.title.size = 20, legend.text.size = 20) + xlab("x") + ylab("y") + labs(color = "cell category")+ coord_equal() + theme_light() dfSub <- dfSub %>% subset(cell_type %in% c("alpha", "beta", "delta", "Th", "Tc")) q <- ggplot(dfSub, aes(x = cell_x, y = cell_y, color = cell_type)) + geom_point(size= 0.5) + facet_wrap(~image_name) + theme(legend.title.size = 20, legend.text.size = 20) + xlab("x") + ylab("y") + labs(color = "cell type") + coord_equal() + theme_light() wrap_plots(list(p,q), widths = c(1,1), heights = c(1,1), nrow = 2, ncol = 1) ``` # Calculating Spatial Statistics Metrics In a first step, we calculate a spatial statistic curve as implemented by `r BiocStyle::CRANpkg('spatstat')`. One can choose from a range of metrics for discrete marks and calculate these within a mark or between two marks. Common metrics are: - Ripley's $K$ function and its variance stabilised form, Besag's $L$ - Pair correlation function $g$ - Nearest-neighbour function $G$ - Empty space function $F$ Note that all of these functions have different implementations to correct for inhomogeneity and for comparison between two marks (cross functions) [@baddeleySpatialPointPatterns]. ## Correlation With correlation metrics, we assess the *distances* of all points to one another while normalising for density effects and of the window size $|W|$. Furthermore, spatial metrics are corrected for edge effects, due to the fact that points at the border of a FOV do not have a fully-observed neighborhood [@baddeleySpatialPointPatterns, pp. 203 ff.]. A well-known metric is Ripley's $K$ function or its variance-stabilised transformation, the $L$ function. We can calculate a variant of the $L$ function with the function `calcMetricPerFov` between e.g $\alpha$ and cytotoxic T cells. The output is a dataframe with the following most important columns: - `r`: the radius at which the spatial metric is evaluated - `theo`: the theoretical value of a homogeneous (Poisson) realisation of a point process - `iso`: an isotropic edge corrected value of the $L$ function ```{r Lfunction, warning = FALSE} metricRes <- calcMetricPerFov(spe = spe, selection = c("alpha", "Tc"), subsetby = "image_number", fun = "Lcross", marks = "cell_type", rSeq = seq(0, 50, length.out = 50), by = c("patient_stage", "patient_id", "image_number"), ncores = 1) metricRes %>% head(3) ``` We can visualise this metric with `plotMetricPerFov` function. Here, we need to specify which border correction we want to plot and what the x-axis is. Both can vary from function to function. ```{r plotLfunction, warning = FALSE, fig.width=8, fig.height=8} # create a unique plotting ID metricRes$ID <- paste0( metricRes$patient_stage, "|", metricRes$patient_id ) # change levels for plotting metricRes$ID <- factor(metricRes$ID, levels = c("Non-diabetic|6126", "Non-diabetic|6134", "Onset|6228","Onset|6362", "Long-duration|6089", "Long-duration|6180")) # plot metrics plotMetricPerFov(metricRes, correction = "iso", x = "r", imageId = "image_number", ID = "ID", ncol = 2) ``` By eye, we see no visible difference between the conditions in terms of correlation of $\alpha$ and cytotoxic T cells. ## Spacing Another important aspect of spatial analysis is spacing. Here, the shortest distances or empty space to the next neighbor is calculated. This quantifies a different aspect of a point pattern than correlation or intensity of points. Two well-known functions are [@baddeleySpatialPointPatterns, pp. 255-266]: - nearest-neighbor distance distribution $G$ - empty space function $F$ For spacing metrics, we get different border corrections but otherwise the output stays the same: ```{r Gfunction, warning = FALSE} metricRes <- calcMetricPerFov(spe = spe, selection = c("alpha", "Tc"), subsetby = "image_number", fun = "Gcross", marks = "cell_type", rSeq = seq(0, 50, length.out = 50), by = c("patient_stage", "patient_id", "image_number"), ncores = 1) metricRes %>% head(3) ``` ```{r plotGfunction, warning = FALSE, fig.width=8, fig.height=8} # create a unique plotting ID metricRes$ID <- paste0( metricRes$patient_stage, "|", metricRes$patient_id ) # change levels for plotting metricRes$ID <- factor(metricRes$ID, levels = c("Non-diabetic|6126", "Non-diabetic|6134", "Onset|6228","Onset|6362", "Long-duration|6089", "Long-duration|6180")) # plot metrics plotMetricPerFov(metricRes, correction = "km", x = "r", imageId = "image_number", ID = "ID", ncol = 2) ``` In the nearest-neighbor distance function, we see a strong difference between onset T1D, long-duration T1D and non-diabetic controls in terms of spacing of $\alpha$ and cytotoxic T cells. # Functional boxplot Looking at raw spatial statistics curves can be challenging. In order to summarise this information, we can plot functional boxplots by aggregating the curves into boxplots via a user-defined variable `aggregate_by`. We use the `fbplot` function from the `r BiocStyle::CRANpkg('fda')` package [@sun2011functional; @ramsay2024fda]. ```{r, funcBoxPlot, warning = FALSE, results='hide'} # create a unique ID per row in the dataframe metricRes$ID <- paste0( metricRes$patient_stage, "x", metricRes$patient_id, "x", metricRes$image_number ) #removing field of views that have as a curve only zeros - these are cases where #there is no cells of one type metricRes <- metricRes %>% group_by(ID) %>% filter(sum(km) >= 1) collector <- plotFbPlot(metricRes, "r", "km", "patient_stage") ``` The functional boxplot shows that onset $G$-curves are more variable than the corresponding long-duration and non-diabetic curves. We note as well, that the variability is heteroscedastic along the domain (i.e., variance increases with radius), which is undesirable for our statistical modelling. Therefore, we can e.g. apply a variance stabilising transformation to our data or model this variance in the statistical model. # Functional principal component analysis Another analysis that can be performed is functional principal componentent analysis (fPCA). This is a method to capture the main modes of variation in functional data [@ramsayPrincipalComponentsAnalysis2005]. We use the `r BiocStyle::CRANpkg('refund')` implementation of fPCA. ```{r fPCA, warning = FALSE} # filter out all rows that have a constant zero part - all r<10 metricRes <- metricRes %>% filter(r > 10) # prepare dataframe from calcMetricRes to be in the correct format for pffr dat <- prepData(metricRes, "r", "km") # create meta info of the IDs splitData <- dat$ID %>% str_replace("-","_") %>% str_split_fixed("x", 3) %>% data.frame(stringsAsFactors = TRUE) %>% setNames(c("condition", "patient_id", "imageId")) %>% mutate(condition = relevel(condition,"Non_diabetic")) dat <- cbind(dat, splitData) # drop rows with NA dat <- dat |> drop_na() # calculate the fPCA pca <- functionalPCA(dat = dat, r = metricRes$r |> unique()) # plot the mean curve and the two first eigenfunctions p_mu <- ggplot(data.frame(r = unique(metricRes$r), mu = pca$mu), aes(x = r, y = mu)) + geom_line() + theme_light() + xlab("r [µm]") p_efunction1 <- ggplot(data.frame(r = unique(metricRes$r), phi1 = pca$efunctions[,1]), aes(x = r, y = phi1)) + geom_line() + theme_light() + ylim(-0.3,0.3) + xlab("r [µm]") p_efunction2 <- ggplot(data.frame(r = unique(metricRes$r), phi2 = pca$efunctions[,2]), aes(x = r, y = phi2)) + geom_line() + theme_light() + ylim(-0.3,0.3) + xlab("r [µm]") wrap_plots(list(p_mu, p_efunction1, p_efunction2), ncol = 3) # plot the biplot of the first two PCs plotFpca(dat = dat, res = pca, colourby = "condition") ``` In the biplot above we get a very basic differentiation of the $G$ curves. Onset T1D shows most variability along the first fPC. The second fPC describes less variation. # Functional additive mixed models The $L$ function above showed no clear difference between the three conditions whereas the $G$ function showed a strong difference between onset T1D and the two other conditions. In order to test these differences we will use generalised functional additive mixed models. These are generalisations of standard additive mixed models to compare functions over their entire domain. The package that we use is the `r BiocStyle::CRANpkg('refund')` package [@scheiplFunctionalAdditiveMixed2015; @scheiplGeneralizedFunctionalAdditive2016]. The model implemented here is of the form: $$ \mathbb{E}[y_i(r)] = g(\alpha(r) + \beta_{0,g(i)}(r) + \sum_{j=1}^J f_j(X_{ji},r) + \epsilon_i(r)) $$ With the following terms: - $y_i(r)$ the functional response, here the `r BiocStyle::CRANpkg('spatstat')` curves - $g$ optional link function - $\alpha(r)$ a global functional intercept varying over the domain $r$ - $\beta_{0,g(i)}(r)$ a random functional intercept varying over the domain $r$ per grouping variable $g(i)$. - $f_j(X_{ji},r)$ the additive predictors - $\epsilon_i(r)$ residual zero-mean Gaussian errors For the family we will use a quasi-likelihood distribution where the variance is modeled as quadratic to the mean. We do this to account for the heteroscedastic variance along the domain. In this context we need to specify a design matrix and contrasts. ```{r funcGamG, fig.height=10, warning = FALSE} # create a design matrix mm <- model.matrix(~condition, data = dat) colnames(mm)[1] <- "Intercept" r <- metricRes$r |> unique() # fit the model mdl <- functionalGam( data = dat, x = r, designmat = mm, weights = dat$npoints, formula = formula(Y ~ conditionLong_duration + conditionOnset + (s(patient_id, bs = "re"))), family = quasi(link = "identity", variance = "mu^2"), algorithm = "gam" ) summary(mdl) plotLs <- lapply(colnames(mm), plotMdl, mdl = mdl, shift = mdl$coefficients[["(Intercept)"]]) wrap_plots(plotLs, nrow = 3, axes = 'collect') ``` We note that there is a small difference in the $G$ function between non-diabetic and long-duration T1D samples, but a strong difference between non-diabetic and onset T1D according to the model summary. The point wise confidence bands are a limitation of this method and could be improved with either bootstrapping or continuous confidence bands [@liebl2023fast]. Thus, we see not only that a spatial difference in co-localisation of $\alpha$ and cytotoxic T cells is statistically significant but also at which spatial scale this difference occurs. ## Model evaluation One open problem is the implementation of confidence bands that reflect the non-independently and non-identically distributed residuals. To visualise how much of a problem this is, we can plot the contours of the correlation/covariance and look at some model diagnostics. ```{r contour, warning = FALSE} resid(mdl) |> cor() |> filled.contour() resid(mdl) |> cov() |> filled.contour() try(refund::pffr.check(mdl)) ``` In these model diagnostics, we note that there is still some variability in the residuals that is not considered by the model. The QQ plot indicates a good model fit. The residuals show a considerable structure that is in line with the structure in the auto-covariance / correlation plots. In the functional additive mixed model, we have a specified global intercept varying over the domain $r$ as well as functional random intercepts varying over the domain $r$ per grouping variable `patient_id`. We can plot these smooth estimates of the random intercepts. ```{r intercept, warning = FALSE} # look at the smooth random intercepts per patient data <- coef(mdl)$smterms$`s(patient_id)`$coef data <- data %>% left_join(splitData %>% select(patient_id, condition) %>% unique) p <- ggplot(data, aes(x.vec, value, colour = patient_id)) + geom_point(aes(shape=factor(condition))) + theme_light() + geom_smooth(aes(group = 1), col = 'black') + xlab("r [µm]") p ``` We note that there is still some structure in these intercepts, which is most likely due to identifiability problems between the global and random intercepts. ```{r sessionInfo} sessionInfo() ```