BayesSpace

Edward Zhao, Matt Stone, Xing Ren, and Raphael Gottardo

2022-04-26

library(SingleCellExperiment)
library(ggplot2)
library(BayesSpace)

Preparing your experiment for BayesSpace

Loading data

BayesSpace supports three ways of loading a SingleCellExperiment for analysis.

Visium datasets processed with Space Ranger can be loaded directly via the readVisium() function. This function takes only the path to the Space Ranger output directory (containing the spatial/ and filtered_feature_bc_matrix/ subdirectories) and returns a SingleCellExperiment.

Second, all datasets analyzed for the BayesSpace manuscript are readily accessible via the getRDS() function. This function takes two arguments - the name of the dataset, and the name of the sample in the dataset.

Finally, SingleCellExperiment objects can be constructed manually from a counts matrix and tables of row and column data. BayesSpace only requires that spot array coordinates be provided as columns named row and col in colData. (Note that enhancement of Visium datasets additionally requires the pixel coordinates of each spot in the tissue image, but in this case the dataset should be loaded with readVisium(), which loads these data automatically.)

We’ll continue with the melanoma sample from the 2018 Spatial Transcriptomics paper for the remaining examples in this vignette.

Pre-processing data

BayesSpace requires minimal data pre-processing, but we provide a helper function to automate it.

spatialPreprocess() log-normalizes the count matrix and performs PCA on the top n.HVGs highly variable genes, keeping the top n.PCs principal components. Additionally, the spatial sequencing platform is added as metadata in the SingleCellExperiment for downstream analyses. If you do not wish to rerun PCA, running spatialPreprocess() with the flag skip.PCA=TRUE will only add the metadata BayesSpace requires.

Here, we omit log-normalization as all datasets available through getRDS() already include log-normalized counts.

Clustering

Selecting the number of clusters

We can use the qTune() and qPlot() functions to help choose q, the number of clusters to use in our analysis.

Clustering with BayesSpace

The spatialCluster() function clusters the spots, and adds the predicted cluster labels to the SingleCellExperiment. Typically, as we did for the analyses in the paper, we suggest running with at least 10,000 iterations (nrep=10000), but we use 1,000 iteration in this demonstration for the sake of runtime. (Note that a random seed must be set in order for the results to be reproducible.)

Both the mclust initialization (cluster.init) and the BayesSpace cluster assignments (spatial.cluster) are now available in the SingleCellExperiment’s colData.

Visualizing spatial clusters

We can plot the cluster assignments over the spatial locations of the spots with clusterPlot().

As clusterPlot() returns a ggplot object, it can be customized by composing with familiar ggplot2 functions. Additionally, the argument palette sets the colors used for each cluster, and clusterPlot() takes additional arguments to geom_polygon() such as size or color to control the aesthetics of the spot borders.

Enhanced resolution

Clustering at enhanced resolution

The spatialEnhance() function will enhance the resolution of the principal components, and add these PCs as well as predicted cluster labels at subspot resolution to a new SingleCellExperiment. As with our demonstration of spatialCluster() above, we are using fewer iterations for the purpose of this example (nrep=1000) than we recommend in practice (nrep=100000 or greater). Note that the jitter_scale parameter should be tuned so that proposals for updating subspot-level expression are accepted around 30% of the time. This can be evaluated using mcmcChain(melanoma.enhanced, "Ychange"), where the chain should stabilize to 0.25-0.40. Typically 1000-2500 iterations are sufficient to evaluate if jitter_scale should be increased if acceptance is too high or decreased if acceptance is too low. After tuning, proceed to a full run of spatialEnhance with more iterations.

The enhanced SingleCellExperiment includes an index to the parent spot in the original sce (spot.idx), along with an index to the subspot. It adds the offsets to the original spot coordinates, and provides the enhanced cluster label (spatial.cluster).

We can plot the enhanced cluster assignments as above.

Enhancing the resolution of gene expression

BayesSpace operates on the principal components of the gene expression matrix, and spatialEnhance() therefore computes enhanced resolution PC vectors. Enhanced gene expression is not computed directly, and is instead imputed using a regression algorithm. For each gene, a model using the PC vectors of each spot is trained to predict the spot-level gene expression, and the fitted model is used to predict subspot expression from the subspot PCs.

Gene expression enhancement is implemented in the enhanceFeatures() function. BayesSpace predicts expression with xgboost by default, but linear and Dirichlet regression are also available via the model argument. When using xgboost, we suggest automatically tuning the nrounds parameter by setting it to 0, although this comes at the cost of increased runtime (~4x slower than a pre-specified nrounds in practice).

enhanceFeatures() can be used to impute subspot-level expression for all genes, or for a subset of genes of interest. Here, we’ll demonstrate by enhancing the expression of four marker genes: PMEL (melanoma), CD2 (T-cells), CD19 (B-cells), and COL1A1 (fibroblasts).

By default, log-normalized expression (logcounts(sce)) is imputed, although other assays or arbitrary feature matrices can be specified.

Diagnostic measures from each predictive model, such as rmse when using xgboost, are added to the rowData of the enhanced dataset.

Visualizing enhanced gene expression

Spatial gene expression is visualized with featurePlot().

Here, we compare the spatial expression of the imputed marker genes.

And we can compare to the spot-level expression.

Accessing Markov chains

If save.chain is set to TRUE in either spatialCluster() or spatialEnhance(), the chain associated with the respective MCMC run is preserved to disk as an HDF5 file. The path to this file is stored in the SingleCellExperiment’s metadata at metadata(sce)$h5.chain, and can be read directly using mcmcChain().

The chain is provided as a coda::mcmc object, which can be analyzed with TidyBayes or as a matrix. The object has one row per iteration, with the values of the parameters concatenated across the row. Columns are named with the parameter name and index (if any).

chain <- mcmcChain(melanoma)
chain[1:5, 1:5]
#>      lambda[1,1]  lambda[1,2] lambda[1,3] lambda[1,4]  lambda[1,5]
#> [1,]  0.01000000  0.000000000  0.00000000  0.00000000  0.000000000
#> [2,]  0.08757829 -0.005798093  0.04594761  0.05789615 -0.019023472
#> [3,]  0.10063323 -0.008922653  0.05061634  0.05269334 -0.004028245
#> [4,]  0.13708014 -0.016090349  0.06626887  0.02779234 -0.022475194
#> [5,]  0.12768109 -0.006919337  0.06831921  0.03047754 -0.027188932

To remove the HDF5 file from disk and remove its path from the metadata, use removeChain().