--- title: 'Introduction to clustifyr' date: '`r Sys.Date()`' package: clustifyr author: - name: Rui Fu affiliation: RNA Bioscience Initative, University of Colorado School of Medicine - name: Austin Gillen affiliation: RNA Bioscience Initative, University of Colorado School of Medicine - name: Ryan Sheridan affiliation: RNA Bioscience Initative, University of Colorado School of Medicine - name: Chengzhe Tian affiliation: Department of Biochemistry, University of Colorado Boulder - name: Michelle Daya affiliation: Biomedical Informatics & Personalized Medicine, University of Colorado Anschutz Medical Campus - name: Yue Hao affiliation: Bioinformatics Research Center, North Carolina State University - name: Jay Hesselberth affiliation: RNA Bioscience Initative, University of Colorado School of Medicine - name: Kent Riemondy affiliation: RNA Bioscience Initative, University of Colorado School of Medicine output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Introduction to clustifyr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r "knitr options", echo = FALSE, message = FALSE, warning = FALSE} knitr::opts_chunk$set( message = FALSE, warning = FALSE, collapse = TRUE, fig.align = "center", comment = "#>", crop = NULL ) ``` # Introduction: Why use `clustifyr`? Single cell transcriptomes are difficult to annotate without extensive knowledge of the underlying biology of the system in question. Even with this knowledge, accurate identification can be challenging due to the lack of detectable expression of common marker genes defined by bulk RNA-seq, flow cytometry, other single cell RNA-seq platforms, etc. `clustifyr` solves this problem by providing functions to automatically annotate single cells or clusters using bulk RNA-seq data or marker gene lists (ranked or unranked). Additional functions allow for exploratory analysis of calculated similarities between single cell RNA-seq datasets and reference data. # Installation To install `clustifyr` BiocManager must be installed. ```{r "Installation", eval = FALSE} install.packages("BiocManager") BiocManager::install("clustifyr") ``` # A simple example: 10x Genomics PBMCs In this example, we take a 10x Genomics 3' scRNA-seq dataset from peripheral blood mononuclear cells (PBMCs) and annotate the cell clusters (identified using `Seurat`) using scRNA-seq cell clusters assigned from a CITE-seq experiment. ```{r "Load data"} library(clustifyr) library(ggplot2) library(cowplot) # Matrix of normalized single-cell RNA-seq counts pbmc_matrix <- clustifyr::pbmc_matrix_small # meta.data table containing cluster assignments for each cell # The table that we are using also contains the known cell identities in the "classified" column pbmc_meta <- clustifyr::pbmc_meta ``` # Calculate correlation coefficients To identify cell types, the `clustifyr()` function requires several inputs: * `input`: an SingleCellExperiment or Seurat object or a matrix of normalized single-cell RNA-seq counts * `metadata`: a meta.data table containing the cluster assignments for each cell (not required if a Seurat object is given) * `ref_mat`: a reference matrix containing RNA-seq expression data for each cell type of interest * `query_genes`: a list of genes to use for comparison (optional but recommended) When using a matrix of scRNA-seq counts `clustifyr()` will return a matrix of correlation coefficients for each cell type and cluster, with the rownames corresponding to the cluster number. ```{r "Run clustifyr()"} # Calculate correlation coefficients for each cluster (spearman by default) vargenes <- pbmc_vargenes[1:500] res <- clustify( input = pbmc_matrix, # matrix of normalized scRNA-seq counts (or SCE/Seurat object) metadata = pbmc_meta, # meta.data table containing cell clusters cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type query_genes = vargenes # list of highly varible genes identified with Seurat ) # Peek at correlation matrix res[1:5, 1:5] # Call cell types res2 <- cor_to_call( cor_mat = res, # matrix correlation coefficients cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters ) res2[1:5, ] # Insert into original metadata as "type" column pbmc_meta2 <- call_to_metadata( res = res2, # data.frame of called cell type for each cluster metadata = pbmc_meta, # original meta.data table containing cell clusters cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters ) ``` To visualize the `clustifyr()` results we can use the `plot_cor_heatmap()` function to plot the correlation coefficients for each cluster and each cell type. ```{r "Create correlation heatmap", fig.height = 5, fig.width = 7} # Create heatmap of correlation coefficients using clustifyr() output plot_cor_heatmap(cor_mat = res) ``` # Plot cluster identities and correlation coefficients `clustifyr` also provides functions to overlay correlation coefficients on pre-calculated tSNE embeddings (or those from any other dimensionality reduction method). ```{r "Overlay corr coefficients on UMAP", fig.height = 3.5, fig.width = 9} # Overlay correlation coefficients on UMAPs for the first two cell types corr_umaps <- plot_cor( cor_mat = res, # matrix of correlation coefficients from clustifyr() metadata = pbmc_meta, # meta.data table containing UMAP or tSNE data data_to_plot = colnames(res)[1:2], # name of cell type(s) to plot correlation coefficients cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters ) plot_grid( plotlist = corr_umaps, rel_widths = c(0.47, 0.53) ) ```
The `plot_best_call()` function can be used to label each cluster with the cell type that gives the highest corelation coefficient. Using the `plot_dims()` function, we can also plot the known identities of each cluster, which were stored in the "classified" column of the meta.data table. The plots below show that the highest correlations between the reference RNA-seq data and the 10x Genomics scRNA-seq dataset are restricted to the correct cell clusters. ```{r "Label clusters", fig.height = 5.5, fig.width = 12} # Label clusters with clustifyr cell identities clustifyr_types <- plot_best_call( cor_mat = res, # matrix of correlation coefficients from clustifyr() metadata = pbmc_meta, # meta.data table containing UMAP or tSNE data do_label = TRUE, # should the feature label be shown on each cluster? do_legend = FALSE, # should the legend be shown? do_repel = FALSE, # use ggrepel to avoid overlapping labels cluster_col = "seurat_clusters" ) + ggtitle("clustifyr cell types") # Compare clustifyr results with known cell identities known_types <- plot_dims( data = pbmc_meta, # meta.data table containing UMAP or tSNE data feature = "classified", # name of column in meta.data to color clusters by do_label = TRUE, # should the feature label be shown on each cluster? do_legend = FALSE, # should the legend be shown? do_repel = FALSE ) + ggtitle("Known cell types") plot_grid(known_types, clustifyr_types) ``` # Classify cells using known marker genes The `clustify_lists()` function allows cell types to be assigned based on known marker genes. This function requires a table containing markers for each cell type of interest. Cell types can be assigned using several statistical tests including, hypergeometric, Jaccard, Spearman, and GSEA. ```{r "clustifyr with gene lists", fig.height = 4, fig.width = 6} # Take a peek at marker gene table cbmc_m # Available metrics include: "hyper", "jaccard", "spearman", "gsea" list_res <- clustify_lists( input = pbmc_matrix, # matrix of normalized single-cell RNA-seq counts metadata = pbmc_meta, # meta.data table containing cell clusters cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters marker = cbmc_m, # list of known marker genes metric = "pct" # test to use for assigning cell types ) # View as heatmap, or plot_best_call plot_cor_heatmap( cor_mat = list_res, # matrix of correlation coefficients from clustify_lists() cluster_rows = FALSE, # cluster by row? cluster_columns = FALSE, # cluster by column? legend_title = "% expressed" # title of heatmap legend ) # Downstream functions same as clustify() # Call cell types list_res2 <- cor_to_call( cor_mat = list_res, # matrix correlation coefficients cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters ) # Insert into original metadata as "list_type" column pbmc_meta3 <- call_to_metadata( res = list_res2, # data.frame of called cell type for each cluster metadata = pbmc_meta, # original meta.data table containing cell clusters cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters rename_prefix = "list_" # set a prefix for the new column ) ``` # Direct handling of `SingleCellExperiment` objects `clustifyr` can also use a `SingleCellExperiment` object as input and return a new `SingleCellExperiment` object with the cell types added as a column in the colData. ```{r} library(SingleCellExperiment) sce <- sce_pbmc() res <- clustify( input = sce, # an SCE object ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type cluster_col = "clusters", # name of column in meta.data containing cell clusters obj_out = TRUE # output SCE object with cell type inserted as "type" column ) colData(res)[1:10, c("type", "r")] ``` # Direct handling of `Seurat` objects `clustifyr` can also use a `Seurat` object as input and return a new `Seurat` object with the cell types added as a column in the meta.data. ```{r} so <- so_pbmc() res <- clustify( input = so, # a Seurat object ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters obj_out = TRUE # output Seurat object with cell type inserted as "type" column ) res@meta.data[1:10, c("type", "r")] ``` # Building reference matrix from single cell expression matrix In its simplest form, a reference matrix is built by averaging expression (also includes an option to take the median) of a single cell RNA-seq expression matrix by cluster. Both log transformed or raw count matrices are supported. ```{r "average"} new_ref_matrix <- average_clusters( mat = pbmc_matrix, metadata = pbmc_meta$classified, # or use metadata = pbmc_meta, cluster_col = "classified" if_log = TRUE # whether the expression matrix is already log transformed ) head(new_ref_matrix) # For further convenience, a shortcut function for generating reference matrix from `SingleCellExperiment` or `seurat` object is used. new_ref_matrix_sce <- object_ref( input = sce, # SCE object cluster_col = "clusters" # name of column in colData containing cell identities ) new_ref_matrix_so <- seurat_ref( seurat_object = so, # Seurat object cluster_col = "seurat_clusters" # name of column in meta.data containing cell identities ) tail(new_ref_matrix_so) ``` More reference data, including tabula muris, and code used to generate them are available at https://github.com/rnabioco/clustifyrdata Also see list for individual downloads at https://rnabioco.github.io/clustifyrdata/articles/download_refs.html Additional tutorials at https://rnabioco.github.io/clustifyrdata/articles/otherformats.html # Session info ```{r} sessionInfo() ```