--- title: "Using spatialLIBD with 10x Genomics public datasets" author: - name: Abby Spangler affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus email: aspangle@gmail.com - name: Brenda Pardo affiliation: - *libd - &lcg Licenciatura de Ciencias Genómicas, Escuela Nacional de Estudios Superiores Unidad Juriquilla, Universidad Nacional Autónoma de México email: bpardo@lcgej.unam.mx - name: Leonardo Collado-Torres affiliation: - *libd - &ccb Center for Computational Biology, Johns Hopkins University email: lcolladotor@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('spatialLIBD')`" vignette: > %\VignetteIndexEntry{Using spatialLIBD with 10x Genomics public datasets} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocFileCache = citation("BiocFileCache")[1], BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], pryr = citation("pryr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], rtracklayer = citation("rtracklayer")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], SpatialExperiment = citation("SpatialExperiment")[1], spatialLIBD = citation("spatialLIBD")[1], spatialLIBDpaper = citation("spatialLIBD")[2] ) ``` # Basics ## Install `spatialLIBD` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg("spatialLIBD")` `r Citep(bib[['spatialLIBD']])` is an `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg("spatialLIBD")` by using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("spatialLIBD") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge Please first check the _Introduction to spatialLIBD_ vignette available through [GitHub](http://research.libd.org/spatialLIBD/articles/spatialLIBD.html) or [Bioconductor](https://bioconductor.org/packages/spatialLIBD). ## Citing `spatialLIBD` We hope that `r Biocpkg("spatialLIBD")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r "citation"} ## Citation info citation("spatialLIBD") ``` # Download data from 10x Genomics In this vignette we'll show you how you can use `r Biocpkg("spatialLIBD")` `r Citep(bib[['spatialLIBD']])` for exploring spatially resolved transcriptomics data from the [Visium platform by 10x Genomics](https://www.10xgenomics.com/products/spatial-gene-expression). That is, you will learn how to use `r Biocpkg("spatialLIBD")` for data beyond the one it was initially developed for `r Citep(bib[['spatialLIBDpaper']])`. To illustrate these steps, we will use data that 10x Genomics made publicly available at https://support.10xgenomics.com/spatial-gene-expression/datasets. We will use files from the human lymph node example publicly available at https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Lymph_Node. ## Load packages To get started, lets load the different packages we'll need for this vignette. Here's a brief summary of why we need these packages: * `r Biocpkg("BiocFileCache")`: for downloading and saving a local cache of the data * `r Biocpkg("SpatialExperiment")`: for reading the `spaceranger` files provided by 10x Genomics * `r Biocpkg("rtracklayer")`: for importing a gene annotation GTF file * `r CRANpkg("pryr")`: for checking how much memory our object is using * `r Biocpkg("spatialLIBD")`: for launching an interactive website to explore the data ```{r "start", message=FALSE} ## Load packages in the order we'll use them next library("BiocFileCache") library("SpatialExperiment") library("rtracklayer") library("pryr") library("spatialLIBD") ``` ## Download spaceranger output files Next we download data from 10x Genomics available from the human lymph node example, available at https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Lymph_Node. We don't need to download all the files listed there since `r Biocpkg("SpatialExperiment")` doesn't need all of them for importing the data into R. These files are part of the output that gets generated by `spaceranger` which is the processing pipeline provided by 10x Genomics for Visium data. We'll use `r Biocpkg("BiocFileCache")` to keep the data in a local cache in case we want to run this example again and don't want to re-download the data from the web. ```{r "download_10x_data"} ## Download and save a local cache of the data provided by 10x Genomics bfc <- BiocFileCache::BiocFileCache() lymph.url <- paste0( "https://cf.10xgenomics.com/samples/spatial-exp/", "1.1.0/V1_Human_Lymph_Node/", c( "V1_Human_Lymph_Node_filtered_feature_bc_matrix.tar.gz", "V1_Human_Lymph_Node_spatial.tar.gz" ) ) lymph.data <- sapply(lymph.url, BiocFileCache::bfcrpath, x = bfc) ``` 10x Genomics provides the files in compressed tarballs (`.tar.gz` file extension). Which is why we'll need to use `utils::untar()` to decompress the files. This will create new directories and we will use `list.files()` to see what files these directories contain. ```{r "extract_files"} ## Extract the files to a temporary location ## (they'll be deleted once you close your R session) sapply(lymph.data, utils::untar, exdir = tempdir()) ## List the files we downloaded and extracted ## These files are typically SpaceRanger outputs lymph.dirs <- file.path( tempdir(), c("filtered_feature_bc_matrix", "spatial", "raw_feature_bc_matrix") ) list.files(lymph.dirs) ``` Now that we have the files that we need, we can import the data into R using `read10xVisium()` from `r Biocpkg("SpatialExperiment")`. We'll import the low resolution histology images produced by `spaceranger` using the `images = "lowres"` and `load = TRUE` arguments. We'll also load the filtered gene expression data using the `data = "filtered"` argument. The count matrix can still be quite large, which is why we'll use the `type = "sparse"` argument to load the data into an R object that is memory-efficient for sparse data. ```{r "import_to_r"} ## Import the data as a SpatialExperiment object spe <- SpatialExperiment::read10xVisium( samples = tempdir(), sample_id = "lymph", type = "sparse", data = "filtered", images = "lowres", load = TRUE ) ## Inspect the R object we just created: class, memory, and how it looks in ## general class(spe) pryr::object_size(spe) spe ## The counts are saved in a sparse matrix R object class(counts(spe)) ``` # Modify spe for spatialLIBD Now that we have an `SpatialExperiment` R object (`spe`) with the data from 10x Genomics for the human lymph node example, we need to add a few features to the R object in order to create the interactive website using `spatialLIBD::run_app()`. These additional elements power features in the interactive website that you might be interested in. First we start with adding a few variables to the sample information table (`colData()`) of our `spe` object. We add: * `key`: this labels each spot with a unique identifier. We combine the sample ID with the spot barcode ID to create this unique identifier. * `sum_umi`: this continuous variable contains the total number of counts for each sample prior to filtering any genes. * `sum_gene`: this continuous variable contains the number of genes that have at least 1 count. ```{r "add_key"} ## Add some information used by spatialLIBD spe$key <- paste0(spe$sample_id, "_", colnames(spe)) spe$sum_umi <- colSums(counts(spe)) spe$sum_gene <- colSums(counts(spe) > 0) ``` ## Add gene annotation information The files `SpatialExperiment::read10xVisium()` uses to read in the `spaceranger` outputs into R do not include much information about the genes, such as their chromosomes, coordinates, and other gene annotation information. We thus recommend that you read in this information from a gene annotation file: typically a `gtf` file. For a real case scenario, you'll mostly likely have access to the GTF file provided by 10x Genomics. However, we cannot download that file without downloading other files for this example. Thus we'll show you the code you would use if you had access to the GTF file from 10x Genomics and also show a second approach that works for this vignette. ```{r "initial_gene_info"} ## Initially we don't have much information about the genes rowRanges(spe) ``` ### From 10x Depending on the version of `spaceranger` you used, you might have used different GTF files 10x Genomics has made available at https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest and described at https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build. These files are too big though and we won't download them in this example. For instance, _References - 2020-A (July 7, 2020)_ for _Human reference (GRCh38)_ is 11 GB in size and contains files we do not need for this vignette. If you did have the file locally, you could use the following code to read in the GTF file prepared by 10x Genomics and add the information into your `spe` object that `SpatialExperiment::read10xVisium()` does not include. For example, in our computing cluster this GTF file is located at the following path and is 1.4 GB in size: ```{bash, eval = FALSE} $ cd /dcl02/lieber/ajaffe/SpatialTranscriptomics/ $ du -sh --apparent-size refdata-gex-GRCh38-2020-A/genes/genes.gtf 1.4G /dcl02/lieber/ajaffe/SpatialTranscriptomics/refdata-gex-GRCh38-2020-A/genes/genes.gtf ``` If you have the GTF file from 10x Genomics, we show next how you can read the information into R, match it appropriately with the information in the `spe` object and add it back into the `spe` object. ```{r "use_10x_gtf", eval = FALSE} ## You could: ## * download the 11 GB file from ## https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz ## * decompress it ## Read in the gene information from the annotation GTF file provided by 10x gtf <- rtracklayer::import( "/path/to/refdata-gex-GRCh38-2020-A/genes/genes.gtf" ) ## Subject to genes only gtf <- gtf[gtf$type == "gene"] ## Set the names to be the gene IDs names(gtf) <- gtf$gene_id ## Match the genes match_genes <- match(rownames(spe), gtf$gene_id) ## They should all be present if you are using the correct GTF file from 10x stopifnot(all(!is.na(match_genes))) ## Keep only some columns from the gtf (you could keep all of them if you want) mcols(gtf) <- mcols(gtf)[, c( "source", "type", "gene_id", "gene_version", "gene_name", "gene_type" )] ## Add the gene info to our SPE object rowRanges(spe) <- gtf[match_genes] ## Inspect the gene annotation data we added rowRanges(spe) ``` ### From Gencode In this vignette, we'll use the GTF file from Gencode v32. That's because the build notes from _References - 2020-A (July 7, 2020)_ and _Human reference, GRCh38 (GENCODE v32/Ensembl 98)_ at https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38_2020A show that 10x Genomics used Gencode v32. They also used Ensembl version 98 which is why a few genes we have in our object are going to be missing. We show next how you can read the information into R, match it appropriately with the information in the `spe` object and add it back into the `spe` object. ```{r "use_gencode_gtf"} ## Download the Gencode v32 GTF file and cache it gtf_cache <- BiocFileCache::bfcrpath( bfc, paste0( "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/", "release_32/gencode.v32.annotation.gtf.gz" ) ) ## Show the GTF cache location gtf_cache ## Import into R (takes ~1 min) gtf <- rtracklayer::import(gtf_cache) ## Subset to genes only gtf <- gtf[gtf$type == "gene"] ## Remove the .x part of the gene IDs gtf$gene_id <- gsub("\\..*", "", gtf$gene_id) ## Set the names to be the gene IDs names(gtf) <- gtf$gene_id ## Match the genes match_genes <- match(rownames(spe), gtf$gene_id) table(is.na(match_genes)) ## Drop the few genes for which we don't have information spe <- spe[!is.na(match_genes), ] match_genes <- match_genes[!is.na(match_genes)] ## Keep only some columns from the gtf mcols(gtf) <- mcols(gtf)[, c("source", "type", "gene_id", "gene_name", "gene_type")] ## Add the gene info to our SPE object rowRanges(spe) <- gtf[match_genes] ## Inspect the gene annotation data we added rowRanges(spe) ``` ### Enable a friendlier gene search Regardless of which method you used to obtain the gene annotation information, we can now proceed by adding the gene symbol and gene ID information that helps users search for genes in the shiny app produced by `spatialLIBD`. This will enable users to search genes by gene symbol or gene ID. If you didn't do this, users would only be able to search genes by gene ID which makes the web application harder to use. We also compute the total expression for the mitochondrial chromosome (chrM) as well as the ratio of chrM expression. Both of these continuous variables are interesting to explore and in some situations could be useful for biological interpretations. For instance, in our pilot data `r Citep(bib[['spatialLIBDpaper']])`, we noticed that the `expr_chrM_ratio` was associated to DLPFC layers. That is, spots with high `expr_chrM_ratio` were not randomly located in our Visium slides. ```{r "add_gene_info"} ## Add information used by spatialLIBD rowData(spe)$gene_search <- paste0( rowData(spe)$gene_name, "; ", rowData(spe)$gene_id ) ## Compute chrM expression and chrM expression ratio is_mito <- which(seqnames(spe) == "chrM") spe$expr_chrM <- colSums(counts(spe)[is_mito, , drop = FALSE]) spe$expr_chrM_ratio <- spe$expr_chrM / spe$sum_umi ``` ## Filter the spe object We can now continue with some filtering steps since this can help reduce the object size in memory as well as make it ready to use for downstream processing tools such as those from the `r Biocpkg("scran")` and `r Biocpkg("scuttle")` packages. Though these steps are not absolutely necessary. ```{r "filter_spe"} ## Remove genes with no data no_expr <- which(rowSums(counts(spe)) == 0) ## Number of genes with no counts length(no_expr) ## Compute the percent of genes with no counts length(no_expr) / nrow(spe) * 100 spe <- spe[-no_expr, , drop = FALSE] ## Remove spots without counts summary(spe$sum_umi) ## If we had spots with no counts, we would remove them if (any(spe$sum_umi == 0)) { spots_no_counts <- which(spe$sum_umi == 0) ## Number of spots with no counts print(length(spots_no_counts)) ## Percent of spots with no counts print(length(spots_no_counts) / ncol(spe) * 100) spe <- spe[, -spots_no_counts, drop = FALSE] } ``` ## Check object Next, we add the `ManualAnnotation` variable to the sample information table (`colData()`) with `"NA"`. That variable is used by the interactive website to store any manual annotations. ```{r "add_layer"} ## Add a variable for saving the manual annotations spe$ManualAnnotation <- "NA" ``` Finally, we can now check the final object using `spatialLIBD::check_spe()`. This is a helper function that will warn us if some important element is missing in `spe` that we use later for the interactive website. If it all goes well, it will return the original `spe` object. ```{r "check_spe"} ## Check the final dimensions and object size dim(spe) pryr::object_size(spe) ## Run check_spe() function check_spe(spe) ``` # Explore the data With our complete `spe` object, we can now use `r Biocpkg("spatialLIBD")` for visualizing our data. We can do so using functions such as `vis_gene()` and `vis_clus()` that are described in more detail at the _Introduction to spatialLIBD_ vignette available through [GitHub](http://research.libd.org/spatialLIBD/articles/spatialLIBD.html) or [Bioconductor](https://bioconductor.org/packages/spatialLIBD). ```{r "test_visualizations"} ## Example visualizations. Let's start with a continuous variable. spatialLIBD::vis_gene( spe = spe, sampleid = "lymph", geneid = "sum_umi", assayname = "counts" ) ## We next create a random cluster label to visualize set.seed(20210428) spe$random_cluster <- sample(1:7, ncol(spe), replace = TRUE) ## Next we visualize that random cluster spatialLIBD::vis_clus( spe = spe, sampleid = "lymph", clustervar = "random_cluster" ) ``` ## Run the interactive website We are now ready to create our interactive website for the human lymph node data. The interactive website is a `r CRANpkg("shiny")` web application that uses `r CRANpkg("plotly")` to power several of the interactive features. We can create the interactive website using the `spatialLIBD::run_app()` function. The default arguments of that function are customized for the data from our initial study `r Citep(bib[['spatialLIBDpaper']])`, so we will need to make some adjustments: * `sce_layer`, `modeling_results` and `sig_genes` will be set to `NULL` since do not have any pseudo-bulk results for this example data. * `title`: we will use a custom title that reflect our data * `spe_discreate_vars`: we don't have really any discrete variables to show beyond `ManualAnnotation` which is used for the manual annotations and `random_cluster` that we created in the previous section. * `spe_continous_vars`: we have computed several continuous variables while adapting our `spe` object for `r Biocpkg("spatialLIBD")`, so we'll list these variables below in order to visually inspect them. * `default_cluster`: this is used for indicating the default discrete variable and for now we'll set it to our `random_cluster`. ```{r "run_shiny_app"} ## Run our shiny app if (interactive()) { run_app( spe, sce_layer = NULL, modeling_results = NULL, sig_genes = NULL, title = "spatialLIBD: human lymph node by 10x Genomics", spe_discrete_vars = c("random_cluster", "ManualAnnotation"), spe_continuous_vars = c("sum_umi", "sum_gene", "expr_chrM", "expr_chrM_ratio"), default_cluster = "random_cluster" ) } ``` We also recommend creating custom website documentation files as described in the documentation of `spatialLIBD::run_app()`. Those documentation files will help you describe your project to your users in a more personalized way. The easiest way to start is to copy our documentation files to a new location and adapt them. You can located them at the following path. ```{r "locate_documentation_files"} ## Locate our documentation files docs_path <- system.file("app", "www", package = "spatialLIBD") docs_path list.files(docs_path) ``` # Limitations `spatialLIBD::run_app()` has limitations that are inherent to the methods used to implement it, such as: 1. the memory per user required by a server for hosting the web application, 1. response speeds for the interactive views due to the number of spots, 1. the resolution of the images displayed limiting the usefulness to magnify specific spots, 1. and customization of the web application by the end user. ## Memory Regarding the memory limitation, you can estimate how much memory you need per user by considering the memory required for the `spe` and `sce_layer` objects. ```{r "check_mem"} pryr::object_size(spe) ``` In our pilot data `r Citep(bib[['spatialLIBDpaper']])` our object uses about 2.1 GB of RAM since it contains the data for 12 Visium slides and we considered using about 3 GB of RAM per user. You could filter the genes more aggressively to drop lowly expressed genes or if you have many Visium slides, you could consider making multiple websites for different sets of slides. You could also have multiple mirrors to support several users, though in that case, we recommend linking users to a stable website instead of one that might not be available if you have too many users: for us our stable website is http://research.libd.org/spatialLIBD/ which includes the links to all the mirrors. Given these memory limitations, we chose to deploy our main web application at http://spatial.libd.org/spatialLIBD/ using an Amazon EC2 instance: an 'r5.4xlarge' EC2 instance with 16 vCPUs, 128 GB DRAM, 10 Gb max network, 1.008 USD/Hour. We also have deployed mirrors at https://www.shinyapps.io/ using the "3X-Large (8 GB)" instances they provide. ## Response speeds This limitation is mostly due to the number of spots shown under the "clusters (interactive)" section of the interactive website powered by `r CRANpkg("plotly")`. Each spot is shown four times which is about 16 thousand spots for one Visium slide (depending on any filter steps you applied). The response time will depend on your own computer RAM memory, that is, the _client_ side. This limitation might be more noticeable if you have a computer with 8GB of RAM or lower, as well as if you have other high-memory software open. Furthermore, if you are running web application locally through `spatialLIBD::run_app()` then you also need to consider the required memory for the R objects. That is, the _server_ side memory use. Thanks to [Jesús Vélez Santiago](http://orcid.org/0000-0001-5128-3838), the app is more responsive as of version 1.3.15 by using `plotly::toWebGL()`. ## Image resolution When you construct the SpatialExperiment `spe` object with `r Biocpkg("SpatialExperiment")`, you can read in higher resolution images. However, the benefit of loading the raw histology images (500 MB to 20 GB per image) is likely non-existent in this web application. The memory required would likely become prohibitive. Other solutions load these raw histology images in chunks and display the chunks necessary for a given visualization area. We thus recommend using other software if you want to zoom in at the spot and/or cell resolution. ## Customization While the documentation, title, icon and HTML footer are all customizable at `spatialLIBD::run_app()`, ultimately the panels shown are not unless you fork and adapt the internal code of this package. Thus, the interactive web applications powered by `r Biocpkg("spatialLIBD")` are not as easy to customize as say `r Biocpkg("iSEE")` web applications are. We think of our web application as a good enough prototype that can be useful for initial explorations of 10x Genomics Visium data. We welcome additions to our code, though we recognize that you might want to build your own production-level solution. # Reproducibility The `r Biocpkg("spatialLIBD")` package `r Citep(bib[["spatialLIBD"]])` was made possible thanks to: * R `r Citep(bib[["R"]])` * `r Biocpkg("BiocFileCache")` `r Citep(bib[["BiocFileCache"]])` * `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` * `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` * `r CRANpkg("pryr")` `r Citep(bib[["pryr"]])` * `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` * `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` * `r Biocpkg("rtracklayer")` `r Citep(bib[["rtracklayer"]])` * `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` * `r Biocpkg("SpatialExperiment")` `r Citep(bib[["SpatialExperiment"]])` * `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("TenX_data_download.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("TenX_data_download.Rmd", tangle = TRUE) ``` Date the vignette was generated. ```{r reproduce1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproduce2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`. ```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```