---
title: "Guide to Multi-Gene Plots"
author:
  - name: Nicholas J. Eagles
    affiliation:
    - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
    email: nickeagles77@gmail.com
  - name: Leonardo Collado-Torres
    affiliation:
    - *libd
    - &biostats Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
    email: lcolladotor@gmail.com
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('spatialLIBD')`"
vignette: >
  %\VignetteIndexEntry{Guide to Multi-Gene Plots}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## For links
library("BiocStyle")

## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[3],
    MatrixGenerics = citation("MatrixGenerics")[1],
    RColorBrewer = citation("RColorBrewer")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    SpatialExperiment = citation("SpatialExperiment")[1],
    spatialLIBD = citation("spatialLIBD")[1],
    HumanPilot = citation("spatialLIBD")[2],
    spatialDLPFC = citation("spatialLIBD")[3],
    tran2021 = RefManageR::BibEntry(
        bibtype = "Article",
        key = "tran2021",
        author = "Tran, Matthew N. and Maynard, Kristen R. and Spangler, Abby and Huuki, Louise A. and Montgomery, Kelsey D. and Sadashivaiah, Vijay and Tippani, Madhavi and Barry, Brianna K. and Hancock, Dana B. and Hicks, Stephanie C. and Kleinman, Joel E. and Hyde, Thomas M. and Collado-Torres, Leonardo and Jaffe, Andrew E. and Martinowich, Keri",
        title = "Single-nucleus transcriptome analysis reveals cell-type-specific molecular signatures across reward circuitry in the human brain",
        year = 2021,
        doi = "10.1016/j.neuron.2021.09.001",
        journal = "Neuron"
    )
)
```

One of the goals of `spatialLIBD` is to provide options for visualizing Visium data from 10x Genomics. In particular, `vis_gene()` and `vis_clus()` allow plotting of individual continuous or discrete quantities belonging to each Visium spot, in a spatially accurate manner and optionally atop histology images.

This vignette explores a more complex capability of `vis_gene()`: visualizing a summary metric of several continuous variables simultaneously. We'll start with a basic one-gene use case for `vis_gene()` before moving to more advanced cases.

First, let's load some example data to work on. This data is a subset from a recent publication with Visium data from the dorsolateral prefrontal cortex (DLPFC) `r Citep(bib[['spatialDLPFC']])`.

```{r "setup", message = FALSE, warning = FALSE}
library("spatialLIBD")
spe <- fetch_data(type = "spatialDLPFC_Visium_example_subset")
spe
```

Next, let's define several genes known to be markers for white matter `r Citep(bib[['tran2021']])`.

```{r "white_matter_genes"}
white_matter_genes <- c("GFAP", "AQP4", "MBP", "PLP1")
white_matter_genes <- rowData(spe)$gene_search[
    rowData(spe)$gene_name %in% white_matter_genes
]

## Our list of white matter genes
white_matter_genes
```

# Plotting One Gene

A typical use of `vis_gene()` involves plotting the spatial distribution of a single gene or continuous variable of interest. For example, let's plot just the expression of *GFAP*.

```{r "single_gene"}
vis_gene(
    spe,
    geneid = white_matter_genes[1],
    point_size = 1.5
)
```

We can see a small **V**-shaped region with higher expression of this gene, which seems to mark the location of layer 1. The bottom right corner seems to mark the location of white matter.
```{r "histology_only"}
plot(imgRaster(spe))
```

This particular gene is known to have high expression in both layer 1 and white matter in the dorsolateral prefrontal cortex, as can be seen below `r Citep(bib[['HumanPilot']])`. It's the 386th highest ranked white matter marker gene based on the enrichment test.

```{r "GFAP_boxplot"}
modeling_results <- fetch_data(type = "modeling_results")
sce_layer <- fetch_data(type = "sce_layer")
sig_genes <- sig_genes_extract_all(
    n = 400,
    modeling_results = modeling_results,
    sce_layer = sce_layer
)
i_gfap <- subset(sig_genes, gene == "GFAP" &
    test == "WM")$top
i_gfap
set.seed(20200206)
layer_boxplot(
    i = i_gfap,
    sig_genes = sig_genes,
    sce_layer = sce_layer
)
```

# Plotting Multiple Genes

As of version 1.15.2, the `geneid` parameter to `vis_gene()` may also take a vector of genes or continuous variables in `colData(spe)`. In this way, the expression of multiple continuous variables can be summarized into a single value for each spot, displayed just as a single input for `geneid` would be. `spatialLIBD` provides three methods for merging the information from multiple continuous variables, which may be specified through the `multi_gene_method` parameter to `vis_gene()`.

## Averaging Z-scores

The default is `multi_gene_method = "z_score"`. Each continuous variable (a gene or a spot-level covariate; the two can be mixed) is converted to a Z-score by centering and scaling. A Z-score of `1` for a particular continuous variable indicates that the spot's value is one standard deviation above the mean across all spots for that variable. Next, for each spot, Z-scores are averaged across continuous variables. Compared to simply averaging raw gene expression across genes, the `"z_score"` method is insensitive to absolute expression levels (highly expressed genes don't dominate plots), and instead focuses on how each gene varies spatially, weighting each gene equally.
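The centering, scaling, and averaging steps described above can be sketched in base R on a toy matrix. This is only an illustration of the idea, not the code `vis_gene()` runs internally: the matrix `expr`, its values, and the use of `scale()` are assumptions made for this example.

```r
## Toy data (hypothetical values): 5 spots (rows) x 2 genes (columns).
## gene_b is expressed at a much higher absolute level than gene_a.
expr <- cbind(
    gene_a = c(0, 0, 1, 2, 2),
    gene_b = c(100, 120, 300, 500, 480)
)

## Center and scale each gene (column) across spots to obtain Z-scores
z <- scale(expr, center = TRUE, scale = TRUE)

## One summary value per spot: the average Z-score across genes
z_mean <- rowMeans(z)

## For comparison, a naive average of the raw values, which is
## driven almost entirely by the highly expressed gene_b
raw_mean <- rowMeans(expr)

z_mean
raw_mean
```

After scaling, each column has mean 0 and standard deviation 1, so both genes contribute on the same scale to `z_mean`, whereas `raw_mean` mostly tracks `gene_b`.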
Let's plot all four white matter genes using this method.

```{r "multi_gene_z"}
vis_gene(
    spe,
    geneid = white_matter_genes,
    multi_gene_method = "z_score",
    point_size = 1.5
)
```

Now the bottom right corner where the white matter is located stands out more, though the mixed layer 1 and white matter signal provided by *GFAP* is still noticeable (the **V** shape).

## Summarizing with PCA

Another option is `multi_gene_method = "pca"`. A matrix is formed, where genes or continuous features are columns, and spots are rows. PCA is performed, and the first principal component (PC) is extracted. The idea is that the first PC captures the dominant spatial signature of the feature set. Next, its direction is reversed if the majority of coefficients (from the "rotation matrix") across features are negative, and the result is plotted spatially. When the features are genes whose expression is highly correlated (like our white-matter-gene example!), this reversal encourages higher values in the plot to represent areas of higher expression of the features. For our case, this leads to the intuitive result that "expression" is higher in white matter for white-matter genes, which is not otherwise guaranteed (the "sign" of PCs is arbitrary)!

```{r "multi_gene_pca"}
vis_gene(
    spe,
    geneid = white_matter_genes,
    multi_gene_method = "pca",
    point_size = 1.5
)
```

## Plotting Sparsity of Expression

The final option is `multi_gene_method = "sparsity"`. For each spot, the proportion of features with positive expression is plotted. This method is typically only meaningful when features are raw gene counts that are expected to be quite sparse (have zero counts) in certain regions of the tissue and not others. It also performs better with a larger number of genes; with our example of four white-matter genes, the proportion can only take the values 0, 0.25, 0.5, 0.75, and 1, which is not visually informative. The white-matter example is thus a poor fit due to its lack of sparsity and low number of genes, as you can see below.
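Before looking at the plot for the white-matter genes, the per-spot proportion described above can be sketched in base R. The `counts` matrix below is a hypothetical toy example, not data from `spe`, and the snippet illustrates the calculation rather than the internals of `vis_gene()`.

```r
## Toy raw counts (hypothetical values): 4 spots (rows) x 4 genes (columns)
counts <- rbind(
    spot_1 = c(0, 0, 0, 0),
    spot_2 = c(3, 0, 0, 0),
    spot_3 = c(1, 2, 0, 5),
    spot_4 = c(4, 1, 7, 2)
)
colnames(counts) <- paste0("gene_", 1:4)

## Proportion of genes with a positive count in each spot.
## With 4 genes, the only possible values are 0, 0.25, 0.5, 0.75, and 1.
prop_expressed <- rowMeans(counts > 0)
prop_expressed
```

With only four genes the result is coarse by construction, which is why a larger gene set tends to give a more informative plot.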
```{r "multi_gene_sparsity"}
vis_gene(
    spe,
    geneid = white_matter_genes,
    multi_gene_method = "sparsity",
    point_size = 1.5
)
```

# With more marker genes

Below we can plot via `multi_gene_method = "z_score"` the top 25 or top 50 white matter marker genes identified via the enrichment test in a previous dataset `r Citep(bib[['HumanPilot']])`.

```{r "multi_gene_z_score_top_enriched"}
vis_gene(
    spe,
    geneid = subset(sig_genes, test == "WM")$ensembl[seq_len(25)],
    multi_gene_method = "z_score",
    point_size = 1.5
)

vis_gene(
    spe,
    geneid = subset(sig_genes, test == "WM")$ensembl[seq_len(50)],
    multi_gene_method = "z_score",
    point_size = 1.5
)
```

We can repeat this process for `multi_gene_method = "pca"`.

```{r "multi_gene_pca_top_enriched"}
vis_gene(
    spe,
    geneid = subset(sig_genes, test == "WM")$ensembl[seq_len(25)],
    multi_gene_method = "pca",
    point_size = 1.5
)

vis_gene(
    spe,
    geneid = subset(sig_genes, test == "WM")$ensembl[seq_len(50)],
    multi_gene_method = "pca",
    point_size = 1.5
)
```

And finally, let's look at the results of `multi_gene_method = "sparsity"`.

```{r "multi_gene_sparsity_top_enriched"}
vis_gene(
    spe,
    geneid = subset(sig_genes, test == "WM")$ensembl[seq_len(25)],
    multi_gene_method = "sparsity",
    point_size = 1.5
)

vis_gene(
    spe,
    geneid = subset(sig_genes, test == "WM")$ensembl[seq_len(50)],
    multi_gene_method = "sparsity",
    point_size = 1.5
)
```

In this case, it seems that for both the top 25 and top 50 marker genes, `z_score` and `pca` provided cleaner visualizations than `sparsity`. Give them a try on your own datasets!

# Visualizing non-gene continuous variables

So far, we have only visualized multiple genes. But these methods can be applied to several continuous variables stored in `colData(spe)` as shown below.

```{r "colData_example"}
vis_gene(
    spe,
    geneid = c("sum_gene", "sum_umi"),
    multi_gene_method = "z_score",
    point_size = 1.5
)
```

We can also combine continuous variables from `colData(spe)` along with actual genes.
For example, we can combine the expression of *GFAP*, which is a known astrocyte marker gene, with the spot deconvolution results for astrocytes computed using Tangram `r Citep(bib[['spatialDLPFC']])`.

```{r "colData_plus_gene"}
vis_gene(
    spe,
    geneid = c("broad_tangram_astro"),
    point_size = 1.5
)

vis_gene(
    spe,
    geneid = c("broad_tangram_astro", white_matter_genes[1]),
    multi_gene_method = "pca",
    point_size = 1.5
)
```

These tools enable you to further explore your data in new ways. Have fun using them!

# Reproducibility

Code for creating the vignette.

```{r createVignette, eval=FALSE}
## Create the vignette
library("rmarkdown")
system.time(render("multi_gene_plots.Rmd"))

## Extract the R code
library("knitr")
knit("multi_gene_plots.Rmd", tangle = TRUE)
```

Date the vignette was generated.

```{r reproduce1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```

Wallclock time spent generating the vignette.

```{r reproduce2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])`, `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` running behind the scenes.

Citations made with `r CRANpkg('RefManageR')` `r Citep(bib[['RefManageR']])`.

```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```