--- title: "Introduction to `scater`: Single-cell analysis toolkit for expression in R" author: - name: "Davis McCarthy" affiliation: - EMBL-EBI package: scater output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{An introduction to the scater package} %\VignetteEngine{knitr::rmarkdown} %VignetteEncoding{UTF-8} --- ```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE} ## To render an HTML version that works nicely with github and web pages, do: ## rmarkdown::render("vignettes/vignette.Rmd", "all") library(knitr) opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png') library(ggplot2) theme_set(theme_bw(12)) ``` This document gives an introduction to and overview of the functionality of the `scater` package. The `scater` package is contains tools to help with the analysis of single-cell transcriptomic data, with the focus on RNA-seq data. The package features: * Use of the `SingleCellExperiment` class as a data container for interoperability with a wide range of other Bioconductor packages; * Wrappers to [`kallisto`](http://pachterlab.github.io/kallisto/) and ['Salmon'](https://combine-lab.github.io/salmon/) for rapid quantification of transcript abundance and tight integration with `scater`; * Simple calculation of many quality control metrics from the expression data; * Many tools for visualising scRNA-seq data, especially diagnostic plots for quality control; * Subsetting and many other methods for filtering out problematic cells and features; * Methods for identifying important experimental variables and normalising data ahead of downstream statistical analysis and modeling. To get up and running as quickly as possible, see the [Quick Start](#quickstart) section below. For see the various in-depth sections on various aspects of the functionality that follow. NB: as of July 2017, `scater` has switched from the `SCESet` class previously defined within the package to the more widely applicable `SingleCellExperiment` class. The functions `toSingleCellExperiment` and `updateSCESet` (for backwards compatibility) can be used to convert an old `SCESet` object to a `SingleCellExperiment` object. # Quick Start Assuming you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.), then we first need to form a `SingleCellExperiment` object containing the data. A `SingleCellExperiment` object is the basic data container used in `scater` and many other Bioconductor packages for single-cell data analysis. Here we use the example data provided with the package, which gives us two objects, a matrix of counts and a dataframe with information about the cells we are studying: ```{r quickstart-load-data, message=FALSE, warning=FALSE} suppressPackageStartupMessages(library(scater)) data("sc_example_counts") data("sc_example_cell_info") ``` We use these objects to form a `SingleCellExperiment` object containing all of the necessary information for our analysis: ```{r quickstart-make-sce, results='hide'} example_sce <- SingleCellExperiment( assays = list(counts = sc_example_counts), colData = sc_example_cell_info) ``` We always expect to have (raw) count data in a `SingleCellExperiment` object. In almost all cases we will also want to have a log2-scale representation of the data. We expect this to be stored as the `exprs` assay. Here we use log2-counts-per-million with an offset of 1 as the `exprs` values. ```{r quickstart-add-exprs, results='hide'} exprs(example_sce) <- log2( calculateCPM(example_sce, use.size.factors = FALSE) + 1) ``` Subsetting is very convenient with this class. For example, we can filter out features (genes) that are not expressed in any cells: ```{r filter-no-exprs} keep_feature <- rowSums(exprs(example_sce) > 0) > 0 example_sce <- example_sce[keep_feature,] ``` Now we have the expression data neatly stored in a structure that can be used for lots of exciting analyses. It is straight-forward to compute many quality control metrics. We typically provide one or more sets of "feature controls", that is sets of genes or features that represent technical features of the expression data or are not of primary biological interest. QC metrics are computed especially for these feature sets are used to assess the quality of cells. Spike-in genes (such as the commonly-used ERCC set) and mitochondrial genes are typically useful as "feature controls". Here, for demonstration, we just use the first 40 features. ```{r quick-start-calc-qc-metrics, eval=TRUE} example_sce <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40)) ``` Now you can play around with your data using the graphical user interface (GUI), which opens an interactive dashboard in your browser! ```{r quick-start-gui, eval=FALSE} scater_gui(example_sce) ``` Many plotting functions are available for visualising the data: * `plotScater`: a plot method exists for `SingleCellExperiment` objects, which gives an overview of expression across cells. * `plotQC`: various methods are available for producing QC diagnostic plots. * `plotPCA`: produce a principal components plot for the cells. * `plotTSNE`: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells. * `plotDiffusionMap`: produce a diffusion map (reduced dimension) plot for the cells. * `plotMDS`: produce a multi-dimensional scaling plot for the cells. * `plotReducedDim`: plot a reduced-dimension representation of the cells. * `plotExpression`: plot expression levels for a defined set of features. * `plotPlatePosition`: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level. * `plotColData`: plot cell metadata and QC metrics. * `plotRowData`: plot feature metadata and QC metrics. More detail on all plotting methods is provided in the data visualisation vignette. Visualisations can highlight features and cells to be filtered out, which can be done easily with the subsetting capabilities of scater. The QC plotting functions also enable the identification of important experimental variables, which can be conditioned out in the data normalisation step. After QC and data normalisation (methods are available in `scater`), the dataset is ready for downstream statistical modeling. # Where to find out more 1. For more information about using the `SingleCellExperiment` class, including transitioning from `SCESet` objects in previous versions of `scater`, see the `"Transitioning from SCESet to SingleCellExperiment"` vignette; 2. For guidance on using `scater` for quality control, see the `Quality control with scater` vignette; 3. For demonstrations of the data visualisation capabilities of `scater`, see the `Data visualisation methods in scater` vignette; 4. For more details about importing expression data into `scater`, see the `Expression quantification and import` vignette.