--- title: "An introduction to the standR package" author: "Ning Liu, Dharmesh Bhuva, Ahmed Mohamed, Chin Wee Tan, Melissa Davis" date: "`r Sys.Date()`" output: html_document: toc: true number_sections: true theme: cosmo highlight: tango code_folding: show vignette: > %\VignetteIndexEntry{standR_introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r message=FALSE, warning=FALSE, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) ``` # Introduction The `standR` package is short for **S**patial **t**ranscriptomics **a**nalyzes a**n**d **d**ecoding in **R**, it aims at providing good practice pipeline and useful functions for users to analyze Nanostring's GeoMx DSP data, including data with GeoMX protein (~30-60 markers), CTA (cancer transcriptome atlas, ~1800 genes) and WTA (whole transcriptome atlas, ~18000 genes) panels. The data analysis workflow consists of the following steps: ![Workflow Steps](../man/figures/workflow.jpg "Workflow Steps") The purpose of this vignette is to demonstrate how to use standR to analyze Nanostring GeoMx DSP data. # Installation The development version of `standR` can be installed from GitHub: ```{r eval=FALSE} devtools::install_github("DavisLaboratory/standR") ``` # Load packages and data We will load the `standR` package. We designed the package and the analysis workflow to be consistent with the BioConductor’s Orchestrating Spatially Resolved Transcriptomics Analysis (OSTA) framework, namely the `SpatialExperiment`. ```{r message = FALSE, warning = FALSE} library(standR) library(SpatialExperiment) library(limma) library(ExperimentHub) ``` In this vignette, we will use a publicly available WTA dataset of diabetic kidney disease (DKD) as the case study. ## Import from paths Data we are using here is a publicly available whole transcriptome atlas (WTA) dataset of diabetic kidney disease (DKD) shared by Nanostring. This is the background for the data: NanoString GeoMx DSP dataset of diabetic kidney disease (DKD) vs healthy kidney tissue. **Seven slides** were analyzed, **4 DKD and 3 healthy**. Regions of Interest (ROI) were focused two different parts of a kidney’s structure: **tubules or glomeruli**. Individual glomeruli were identified by a pathologist as either **relatively healthy or diseased** regardless if the tissue was DKD or healthy. Tubule ROIs were segmented into **distal (PanCK) and proximal (neg) tubules**. While both distal and proximal tubules are called tubules, they perform very different functions in the kidney. In order to import the data into a spatial experiment object. We use the `readGeoMx` function. ```{r message = FALSE, warning = FALSE} eh <- ExperimentHub() query(eh, "standR") countFile <- eh[["EH7364"]] sampleAnnoFile <- eh[["EH7365"]] featureAnnoFile <- eh[["EH7366"]] spe <- readGeoMx(countFile, sampleAnnoFile, featureAnnoFile = featureAnnoFile, hasNegProbe = TRUE) ``` Next, we inspect the SpatialExperiment object. For more details, see [SpatialExperiment](http://www.bioconductor.org/packages/release/bioc/html/SpatialExperiment.html). We can see the dataset has 18503 genes and 231 ROIs (regions of interest). ```{r} # check out object spe ``` Here we can see two types of count matrix, raw count in the `count` assay and logCPM count in the `logcount` assay. ```{r} SummarizedExperiment::assayNames(spe) ``` ## Import from DGEList object Alternatively, `standR` provided a function to generate a spatial experiment object from a DGEList object (`edgeR` package). 
## Import from DGEList object

Alternatively, `standR` provides a function to generate a SpatialExperiment object from a `DGEList` object (from the `edgeR` package).

```{r}
dge <- edgeR::SE2DGEList(spe)

spe2 <- readGeoMxFromDGE(dge)

spe2
```

# Quality control

The quality control (QC) procedure consists of three major steps:

1. **Inspection of the sample metadata**: the sample metadata can be viewed in a tabular format using the `colData` function; here, however, we aim to visualise the relations among important sample metadata, such as which slides the ROIs came from, which are the control and treatment groups, and what the pre-defined tissue types are. This step gives us a general idea of important questions such as how the experiment was designed, whether we should look out for batch effects, and which comparisons of interest can be established.

2. **Gene-level QC**: at the gene level, by default we aim to remove genes that are not expressed in more than 90% of the ROIs, and to identify any ROIs in which only a few genes are expressed. This is a similar process to `edgeR::filterByExpr`: genes with consistently low counts are unlikely to be identified as DE genes, and keeping only genes with sufficiently large counts in the DE analysis increases statistical power while reducing the multiple testing burden.

3. **ROI-level QC**: at the ROI level, we aim to identify low-quality ROIs with relatively small library sizes and low cell counts. Problematic ROIs that are not removed can show up as separate clusters in PCA/UMAP plots and further affect the comparisons in the DE analysis.

## Visualise metadata

To visualise the sample metadata, we can use the `plotSampleInfo` function. In this dataset, slide, disease status, tissue region, and the different subtypes of the tissues are the important features that we can pass to the function.

Based on the description of the data, we know that all glomeruli are classified as abnormal or healthy, and tubules are classified as neg or PanCK. We therefore merge the region-related annotations to avoid collinearity, which can affect the batch correction step.

```{r}
library(ggalluvial)

colData(spe)$regions <- paste0(colData(spe)$region,"_",colData(spe)$SegmentLabel) |>
  (\(.) gsub("_Geometric Segment","",.))() |>
  paste0("_",colData(spe)$pathology) |>
  (\(.) gsub("_NA","_ns",.))()

plotSampleInfo(spe, column2plot = c("SlideName","disease_status","regions"))
```

## Gene level QC

Using the `addPerROIQC` function, we can add key QC statistics to the `colData` of the object.

In this case, the argument `rm_genes` is set to TRUE, with the default settings of `min_count = 5` and `sample_fraction = 0.9`, meaning that we first calculate an expression threshold on the logCPM scale (to account for library size differences), and genes whose expression is below the threshold in more than 90% of the ROIs are then removed. The count matrix of the removed genes is stored in the `metadata` of the object, along with the calculated expression threshold.

```{r}
spe <- addPerROIQC(spe, rm_genes = TRUE)
```

Here we can see that 121 genes are removed.

```{r}
dim(spe)

metadata(spe) |> names()
```
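
If needed, the removed genes and the calculated expression threshold mentioned above can be pulled out of the object's `metadata` for closer inspection. A minimal sketch is shown below; the element names used (`lcpm_threshold`, `genes_rm_rawCount`) are assumptions, so confirm them against the `metadata(spe) |> names()` output above.

```{r eval=FALSE}
# inspect the QC information stored by addPerROIQC()
# NB: the element names below are assumptions -- confirm with metadata(spe) |> names()
metadata(spe)$lcpm_threshold
dim(metadata(spe)$genes_rm_rawCount)
```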
Using the `plotGeneQC` function, we can see which genes were removed and the overall distribution of the percentage of non-expressed genes across all ROIs. By default, the top 9 genes are plotted (arranged by mean expression); users can increase the number of plotted genes by changing the `top_n` parameter. Moreover, users can order the samples (using the parameter `ordannots`) and colour/shape the dots with annotations to find out whether these genes are specifically expressed in certain samples (e.g. in some tissue types or in some treatment group), in which case we may need to retain them.

```{r}
plotGeneQC(spe, ordannots = "regions", col = regions, point_size = 2)
```

In this case we don't see any specific biological pattern in the samples expressing these genes (figure above).

## ROI level QC

Using the `plotROIQC` function, we can visualise some ROI-level QC statistics, such as library size and cell count (AOINucleiCount), which are the default settings for this function.

In ROI-level QC, we first aim to identify ROIs (if any) that have relatively low library sizes and low cell counts, because these are considered low-quality samples due to insufficient sequencing depth or a lack of RNA in the chosen region.

In this case, looking at the distribution plots of library size and nuclei count, we don't see any particular spike at the low end; rather, the distributions are relatively smooth. Looking at the dot plot, library size is mostly positively correlated with nuclei count, although some data points have a relatively low library size despite a reasonable nuclei count. We can therefore try to draw a filtering threshold at the low end of the library size, in this case 50000. By colouring the dots by slide name, we find that the ROIs below the threshold are all from slide disease1B, suggesting that this might be due to some technical issue with slide disease1B.

```{r}
plotROIQC(spe, y_threshold = 50000, col = SlideName)
```

Since a library size of 50000 seems to be a reasonable threshold, here we subset the SpatialExperiment object based on the library size in `colData`.

```{r}
spe <- spe[,rownames(colData(spe))[colData(spe)$lib_size > 50000]]
```

## Inspection of variations on ROI level

### RLE

We can use the function `plotRLExpr` to visualise the relative log expression (RLE) of the data and inspect its technical variation by looking at the distance from the median of the RLE (the boxplot dot) to zero. By default, we plot the RLE of the raw count, where most of the variation comes from library size differences.

```{r eval=FALSE}
plotRLExpr(spe)
```

Using `assay = 2` to run RLE on the logCPM data, we can see that most of the technical variation from library sizes is removed. We can also sort the data by annotations in the object by specifying `ordannots`, with colour or shape mapping parameters as we would usually write in ggplot, so that we can see which factors contribute heavily to the technical variation. Here we can see obvious variation from slide to slide, and small variations are also observed within each slide.

```{r}
plotRLExpr(spe, ordannots = "SlideName", assay = 2, col = SlideName)
```

### PCA

Using the `drawPCA` function, we can perform principal component analysis (PCA) on the data. The PCA plot helps to visualise the variation (both biological and technical) in the data and to find out which are the main factors contributing to it.

Here we colour the PCA by slide and shape by region (tissue). We can see that PC1 mainly spreads out the regions, especially glomerulus and tubule, and grouping by slide within each tissue is observed. The subtypes in tubule are clearly separated, but the different subtypes of glomerulus still group together. Moreover, diseased and control tissues are mixed as well (disease slides and normal slides).

```{r}
drawPCA(spe, assay = 2, col = SlideName, shape = regions)
```
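
If you prefer to work with the PCA coordinates directly (for example, to build a customised ggplot), one option is to compute the PCA with `scater` and extract the coordinates from the `reducedDim` slot. This is a minimal sketch, assuming the log-transformed assay is named "logcounts" (the `scater::runPCA` default); adjust the assay name if needed.

```{r eval=FALSE}
# a minimal sketch: compute PCA with scater and extract the coordinates
spe_pca <- scater::runPCA(spe)
head(reducedDim(spe_pca, "PCA")[, 1:2])
```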
### MDS

`plotMDS` can be used to visualise the multidimensional scaling (MDS) of the data.

```{r}
standR::plotMDS(spe, assay = 2, col = SlideName, shape = regions)
```

### UMAP

Furthermore, to be consistent with the `scater` package, which is commonly used in single-cell and Visium data analysis, we provide the function `plotDR` to visualise any kind of dimension reduction result generated by `scater::run*`, such as UMAP, TSNE and NMF, by specifying `dimred`. Here we plot the UMAP of our data. Similar variation can be observed as in the PCA and MDS above.

```{r}
spe <- scater::runUMAP(spe)

plotDR(spe, dimred = "UMAP", col = regions)
```

### Other PCA-related visualizations

The `standR` package also provides other functions to visualise the PCA, including the PCA scree plot, the pair-wise PCA plot and the PCA bi-plot.

```{r}
plotScreePCA(spe, assay = 2, dims = 10)
```

```{r}
plotPairPCA(spe, col = regions, shape = disease_status, assay = 2)
```

```{r}
plotPCAbiplot(spe, n_loadings = 10, assay = 2, col = regions)
```

# Normalization

As we have observed technical variation in both the RLE and PCA plots, it is necessary to normalize the data. The `standR` package offers normalization options including TMM, RPKM, TPM, CPM, upperquartile and sizefactor. Among them, RPKM and TPM require gene length information (add a `genelength` column to the `rowData` of the object). For TMM, upperquartile and sizefactor, the normalization factors are stored in the `metadata` of the object.

Here we use TMM to normalize the data.

```{r}
colData(spe)$biology <- paste0(colData(spe)$disease_status, "_", colData(spe)$regions)

spe_tmm <- geomxNorm(spe, method = "TMM")
```

We then use RLE and PCA plots to assess the normalized count. The RLE plot shows that most of the RLE medians are close to zero, indicating that much of the technical variation has been removed.

```{r}
plotRLExpr(spe_tmm, assay = 2, color = SlideName) + ggtitle("TMM")
```

However, a batch effect from the different slides is still observed, confounding the separation between disease and normal.

```{r}
plotPairPCA(spe_tmm, assay = 2, color = disease_status, shape = regions)
```

# Batch correction

In NanoString's GeoMx DSP protocol, because one slide is only big enough for a handful of tissue segments (ROIs), it is common to see DSP data confounded by the batch effect introduced by different slides. In order to establish fair comparisons between ROIs later on, it is necessary to remove this batch effect from the data. The `standR` package provides two approaches to removing batch effects: RUV4 and Limma.

## Correction method: RUV4

To run RUV4 batch correction, we need to provide a list of negative control genes (NCGs). The function `findNCGs` identifies the NCGs from the data. In this case, since the batch effect is mostly introduced by slide, we want to identify NCGs across all slides, so here we set `batch_name` to "SlideName" and select the top 500 least variable genes across the different slides as NCGs.

```{r}
spe <- findNCGs(spe, batch_name = "SlideName", top_n = 500)

metadata(spe) |> names()
```
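
Before running the correction, it can be helpful to glance at the genes that were selected as NCGs; they are stored in the object's `metadata` and passed to `geomxBatchCorrection` below.

```{r eval=FALSE}
# peek at the negative control genes selected by findNCGs()
head(metadata(spe)$NCGs)
length(metadata(spe)$NCGs)
```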
Now we can run RUV4 on the data using the function `geomxBatchCorrection`. By default this function uses RUV4 to correct the data. For RUV4 correction, the function requires 3 parameters in addition to the input object: `factors`, the factor(s) of interest, i.e. the biological variation we plan to keep; `NCGs`, the list of negative control genes detected using the function `findNCGs`; and `k`, the number of unwanted factors to use. The RUV documentation suggests using the smallest k at which no technical variation is observed in the data.

Here we can find the best k by using the function `findBestK` to examine the silhouette scores of the RUV4-corrected count data clustered by batch for k = 1 to 10, where a score closer to zero means we are more likely to have successfully removed the batch effect from the data. In this case, k = 5 is satisfactory.

```{r}
findBestK(spe, maxK = 10, factor_of_int = "biology", NCGs = metadata(spe)$NCGs, factor_batch = "SlideName")
```

We therefore choose k = 5 for this dataset.

```{r}
spe_ruv <- geomxBatchCorrection(spe, factors = "biology",
                                NCGs = metadata(spe)$NCGs, k = 5)
```

We can then inspect the PCA of the corrected data with annotations, to check that the batch effect has been removed and the biological factors retained.

```{r}
plotPairPCA(spe_ruv, assay = 2, color = disease_status, shape = regions, title = "RUV4")
```

## Correction method: Limma

Another option is to set the parameter `method` to "Limma", which uses the batch removal method from `limma`. In this mode, the function requires 2 parameters: `batch`, a vector indicating the batch of each sample; and `design`, a design matrix generated by `model.matrix`, in which all biologically relevant factors should be included.

```{r}
spe_lrb <- geomxBatchCorrection(spe,
                                batch = colData(spe)$SlideName, method = "Limma",
                                design = model.matrix(~ 0 + disease_status + regions, data = colData(spe)))
```

Again, we use PCA to inspect the batch correction.

```{r}
plotPairPCA(spe_lrb, assay = 2, color = disease_status, shape = regions, title = "Limma removeBatch")
```

## Evaluation

### Summary statistics

Besides looking at PCA plots of the corrected counts, another way to evaluate the batch correction is to summarise statistics such as the adjusted Rand index, the Jaccard similarity coefficient, the silhouette coefficient, etc. We can do this using the `plotClusterEvalStats` function. In the biology section, a higher score is better, while in the batch section, lower is better.

We can see that, when it comes to stratifying by the biological factors (disease status and tissue regions) and measuring batch clustering for this dataset, RUV4 outperforms Limma in most statistics.

```{r}
spe_list <- list(spe, spe_ruv, spe_lrb)

plotClusterEvalStats(spe_list = spe_list,
                     bio_feature_name = "regions",
                     batch_feature_name = "SlideName",
                     data_names = c("Raw","RUV4","Limma"))
```

### RLE plots

Moreover, we can also look at the RLE plots of the corrected counts to determine which batch correction method performs better for this dataset. We can clearly see that the RUV4-corrected counts have an overall closer-to-zero median RLE compared to the Limma-corrected data. Therefore, in the downstream differential expression analysis, we would suggest using RUV4 as the batch correction method for this specific dataset.

```{r}
plotRLExpr(spe_ruv, assay = 2, color = SlideName) + ggtitle("RUV4")

plotRLExpr(spe_lrb, assay = 2, color = SlideName) + ggtitle("Limma removeBatch")
```

# SessionInfo

```{r}
sessionInfo()
```