--- title: "BayesSpace" author: "Edward Zhao, Matt Stone, Xing Ren, and Raphael Gottardo" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{BayesSpace} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width="100%", fig.width=7, fig.height=5, dpi=300, fig.path="figures/BayesSpace-", message=FALSE, warning=FALSE, error=FALSE ) ``` ```{r setup} library(SingleCellExperiment) library(ggplot2) library(BayesSpace) ``` ## Preparing your experiment for BayesSpace ### Loading data BayesSpace supports three ways of loading a `SingleCellExperiment` for analysis. Visium datasets processed with [Space Ranger](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger) can be loaded directly via the `readVisium()` function. This function takes only the path to the Space Ranger output directory (containing the `spatial/` and `filtered_feature_bc_matrix/` subdirectories) and returns a `SingleCellExperiment`. ```{r readVisium, eval=FALSE} sce <- readVisium("path/to/spaceranger/outs/") ``` Second, all datasets analyzed for the BayesSpace manuscript are readily accessible via the `getRDS()` function. This function takes two arguments - the name of the dataset, and the name of the sample in the dataset. ```{r download} melanoma <- getRDS(dataset="2018_thrane_melanoma", sample="ST_mel1_rep2") ``` Finally, `SingleCellExperiment` objects can be constructed manually from a counts matrix and tables of row and column data. BayesSpace only requires that spot array coordinates be provided as columns named `array_row` and `array_col` in `colData`. (Note that enhancement of Visium datasets additionally requires the pixel coordinates of each spot in the tissue image, but in this case the dataset should be loaded with `readVisium()`, which loads these data automatically.) ```{r manual.sce, eval=FALSE} library(Matrix) rowData <- read.csv("path/to/rowData.csv", stringsAsFactors=FALSE) colData <- read.csv("path/to/colData.csv", stringsAsFactors=FALSE, row.names=1) counts <- read.csv("path/to/counts.csv.gz", row.names=1, check.names=F, stringsAsFactors=FALSE)) sce <- SingleCellExperiment(assays=list(counts=as(counts, "dgCMatrix")), rowData=rowData, colData=colData) ``` We'll continue with the melanoma sample from the 2018 Spatial Transcriptomics paper for the remaining examples in this vignette. ### Pre-processing data BayesSpace requires minimal data pre-processing, but we provide a helper function to automate it. `spatialPreprocess()` log-normalizes the count matrix and performs PCA on the top `n.HVGs` highly variable genes, keeping the top `n.PCs` principal components. Additionally, the spatial sequencing platform is added as metadata in the `SingleCellExperiment` for downstream analyses. If you do not wish to rerun PCA, running `spatialPreprocess()` with the flag `skip.PCA=TRUE` will only add the metadata BayesSpace requires. Here, we omit log-normalization as all datasets available through `getRDS()` already include log-normalized counts. ```{r preprocess} set.seed(102) melanoma <- spatialPreprocess(melanoma, platform="ST", n.PCs=7, n.HVGs=2000, log.normalize=FALSE) ``` ## Clustering ### Selecting the number of clusters We can use the `qTune()` and `qPlot()` functions to help choose `q`, the number of clusters to use in our analysis. * `qTune()` runs the BayesSpace clustering algorithm for multiple specified values of `q` (by default, 3 through 7) and computes their average pseudo-log-likelihood. It accepts any arguments to `spatialCluster()`. * `qPlot()` plots the pseudo-log-likelihood as a function of `q`; we suggest choosing a `q` around the elbow of this plot. ```{r tuning_q} melanoma <- qTune(melanoma, qs=seq(2, 10), platform="ST", d=7, cores=2) qPlot(melanoma) ``` ### Clustering with BayesSpace The `spatialCluster()` function clusters the spots, and adds the predicted cluster labels to the `SingleCellExperiment`. Typically, as we did for the analyses in the paper, we suggest running with at least 10,000 iterations (`nrep=10000`), but we use 1,000 iteration in this demonstration for the sake of runtime. (Note that a random seed must be set in order for the results to be reproducible.) ```{r cluster} set.seed(149) melanoma <- spatialCluster(melanoma, q=4, platform="ST", d=7, init.method="mclust", model="t", gamma=2, nrep=1000, burn.in=100, save.chain=TRUE) ``` Both the mclust initialization (`cluster.init`) and the BayesSpace cluster assignments (`spatial.cluster`) are now available in the SingleCellExperiment's `colData`. ```{r cluster.results} head(colData(melanoma)) ``` ### Visualizing spatial clusters We can plot the cluster assignments over the spatial locations of the spots with `clusterPlot()`. ```{r cluster.plot, fig.width=7, fig.height=5} clusterPlot(melanoma) ``` As `clusterPlot()` returns a `ggplot` object, it can be customized by composing with familiar `ggplot2` functions. Additionally, the argument `palette` sets the colors used for each cluster, and `clusterPlot()` takes additional arguments to `geom_polygon()` such as `size` or `color` to control the aesthetics of the spot borders. ```{r cluster.plot.customize, fig.width=7, fig.height=5} clusterPlot(melanoma, palette=c("purple", "red", "blue", "yellow"), color="black") + theme_bw() + xlab("Column") + ylab("Row") + labs(fill="BayesSpace\ncluster", title="Spatial clustering of ST_mel1_rep2") ``` ## Enhanced resolution ### Clustering at enhanced resolution The `spatialEnhance()` function will enhance the resolution of the principal components, and add these PCs as well as predicted cluster labels at subspot resolution to a new `SingleCellExperiment`. As with our demonstration of `spatialCluster()` above, we are using fewer iterations for the purpose of this example (`nrep=1000`) than we recommend in practice (`nrep=100000` or greater). Note that the `jitter_scale` parameter should be tuned so that proposals for updating subspot-level expression are accepted around 30% of the time. This can be evaluated using `mcmcChain(melanoma.enhanced, "Ychange")`, where the chain should stabilize to 0.25-0.40. Typically 1000-2500 iterations are sufficient to evaluate if `jitter_scale` should be increased if acceptance is too high or decreased if acceptance is too low. After tuning, proceed to a full run of `spatialEnhance` with more iterations. ```{r enhance, eval=TRUE} melanoma.enhanced <- spatialEnhance(melanoma, q=4, platform="ST", d=7, model="t", gamma=2, jitter.prior=0.3, jitter.scale=3.5, nrep=1000, burn.in=100, save.chain=TRUE, cores = 1) ``` The enhanced `SingleCellExperiment` includes an index to the parent spot in the original `sce` (`spot.idx`), along with an index to the subspot. It adds the offsets to the original spot coordinates, and provides the enhanced cluster label (`spatial.cluster`). ```{r enhance.results} head(colData(melanoma.enhanced)) ``` We can plot the enhanced cluster assignments as above. ```{r enhance.plot, eval=TRUE, fig.width=7, fig.height=5} clusterPlot(melanoma.enhanced) ``` ### Enhancing the resolution of gene expression BayesSpace operates on the principal components of the gene expression matrix, and `spatialEnhance()` therefore computes enhanced resolution PC vectors. Enhanced gene expression is not computed directly, and is instead imputed using a regression algorithm. For each gene, a model using the PC vectors of each spot is trained to predict the spot-level gene expression, and the fitted model is used to predict subspot expression from the subspot PCs. Gene expression enhancement is implemented in the `enhanceFeatures()` function. BayesSpace predicts expression with [`xgboost`](https://xgboost.readthedocs.io/en/latest/) by default, but linear and Dirichlet regression are also available via the `model` argument. When using `xgboost`, we suggest automatically tuning the `nrounds` parameter by setting it to 0, although this comes at the cost of increased runtime (~4x slower than a pre-specified `nrounds` in practice). `enhanceFeatures()` can be used to impute subspot-level expression for all genes, or for a subset of genes of interest. Here, we'll demonstrate by enhancing the expression of four marker genes: PMEL (melanoma), CD2 (T-cells), CD19 (B-cells), and COL1A1 (fibroblasts). ```{r enhanceFeatures} markers <- c("PMEL", "CD2", "CD19", "COL1A1") melanoma.enhanced <- enhanceFeatures(melanoma.enhanced, melanoma, feature_names=markers, nrounds=0) ``` By default, log-normalized expression (`logcounts(sce)`) is imputed, although other assays or arbitrary feature matrices can be specified. ```{r enhanced.logcount} logcounts(melanoma.enhanced)[markers, 1:5] ``` Diagnostic measures from each predictive model, such as `rmse` when using `xgboost`, are added to the `rowData` of the enhanced dataset. ```{r enhanced.rmse} rowData(melanoma.enhanced)[markers, ] ``` ### Visualizing enhanced gene expression Spatial gene expression is visualized with `featurePlot()`. ```{r enhanced.featurePlot} featurePlot(melanoma.enhanced, "PMEL") ``` Here, we compare the spatial expression of the imputed marker genes. ```{r enhanced.markers, fig.width=12, fig.height=8} enhanced.plots <- purrr::map(markers, function(x) featurePlot(melanoma.enhanced, x)) patchwork::wrap_plots(enhanced.plots, ncol=2) ``` And we can compare to the spot-level expression. ```{r compare.resolution, fig.width=16, fig.height=8} spot.plots <- purrr::map(markers, function(x) featurePlot(melanoma, x)) patchwork::wrap_plots(c(enhanced.plots, spot.plots), ncol=4) ``` ## Accessing Markov chains If `save.chain` is set to `TRUE` in either `spatialCluster()` or `spatialEnhance()`, the chain associated with the respective MCMC run is preserved to disk as an HDF5 file. The path to this file is stored in the SingleCellExperiment's metadata at `metadata(sce)$h5.chain`, and can be read directly using `mcmcChain()`. The chain is provided as a `coda::mcmc` object, which can be analyzed with [TidyBayes](https://mjskay.github.io/tidybayes/) or as a matrix. The object has one row per iteration, with the values of the parameters concatenated across the row. Columns are named with the parameter name and index (if any). ```{r mcmcChain, eval=TRUE} chain <- mcmcChain(melanoma) chain[1:5, 1:5] ``` To remove the HDF5 file from disk and remove its path from the metadata, use `removeChain()`.