--- title: "Transitioning from SCESet to SingleCellExperiment" author: - name: "Davis McCarthy" affiliation: - EMBL-EBI package: scater output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Transition from SCESet to SingleCellExperiment} %\VignetteEngine{knitr::rmarkdown} %VignetteEncoding{UTF-8} --- ```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE} ## To render an HTML version that works nicely with github and web pages, do: ## rmarkdown::render("vignettes/vignette.Rmd", "all") library(knitr) opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png') library(ggplot2) theme_set(theme_bw(12)) ``` This document provides advice for users of early versions of `scater` who will need to transition from the use of the `SCESet` class to the `SingleCellExperiment` class. As of July 2017, `scater` has switched from the `SCESet` class previously defined within the package to the more widely applicable `SingleCellExperiment` class. From Bioconductor 3.6 (October 2017), the release version of `scater` will use `SingleCellExperiment`. `SingleCellExperiment` is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets. It should be straight-forward to convert existing scripts based on `SCESet` objects to `SingleCellExperiment` objects, with key changes outlined immediately below. # Executive summary * The functions `toSingleCellExperiment` and `updateSCESet` (for backwards compatibility) can be used to convert an old `SCESet` object to a `SingleCellExperiment` object; * Create a new `SingleCellExperiment` object with the function `SingleCellExperiment` (actually less fiddly than creating a new `SCESet`); * `scater` functions have been refactored to take `SingleCellExperiment` objects, so once data is in a `SingleCellExperiment` object, the user experience is almost identical to that with the `SCESet` class. Potential "gotchas": * Cell names can now be accessed/assigned with the `colnames` function (instead of `sampleNames` or `cellNames` for an `SCESet` object); * Feature (gene/transcript) names should now be accessed/assigned with the `rownames` function (instead of `featureNames`); * Cell metadata, stored as `phenoData` in an `SCESet`, corresponds to `colData` in a `SingleCellExperiment` object and is accessed/assigned with the `colData` function (this replaces the `pData` function); * Individual cell-level variables can still be accessed with the `$` operator (e.g. `sce$total_counts`); * Feature metadata, stored as `featureData` in an `SCESet`, corresponds to `rowData` in a `SingleCellExperiment` object and is accessed/assigned with the `rowData` function (this replaces the `fData` function); * `plotScater`, which produces a cumulative expression, overview plot, replaces the generic `plot` function for `SCESet` objects. # A note on terminology In Bioconductor terminology we assay numerous "features" for a number of "samples". Features, in the context of `scater`, correspond most commonly to genes or transcripts, but could be any general genomic or transcriptomic regions (e.g. exon) of interest for which we take measurements. Samples correspond to cells. With the switch to using the `SingleCellExperiment` class, the terminology has become more general again. Now we have "rows" representing features and "cols" representing samples (cells). Thus, applying the `rownames` function returns the names of the features defined for a `SingleCellExperiment` object, which in typical `scater` usage would correspond to gene IDs. In much of what follows, it may be more intuitive to mentally replace "feature" with "gene" or "transcript" (depending on the context of the study) wherever "feature" appears. In the `scater` context, "samples" refer to individual cells that we have assayed. This differs from common usage of "sample" in other contexts, where we might usually use "sample" to refer to an individual subject, a biological replicate or similar. A "sample" in this sense in `scater` may be referred to as a "block" in the more classical statistical sense. Within a "block" (e.g. individual) we may have assayed numerous cells. Thus, the function `colnames`, when applied to a `SingleCellExperiment` object returns the cell IDs. # The `SingleCellExperiment` class and methods In `scater` we organise single-cell expression data in objects of the `SingleCellExperiment` class. The class inherits the Bioconductor `SummarizedExperiment` class, which provides a common interface across many Bioconductor packages. For more details about other features inherited from Bioconductor's `SummarizedExperiment` class, type `?SummarizedExperiment` at the R prompt. The class only requires some "assay data" (i.e. expression values of some sort) as input. Most commonly, these will be "counts" (e.g. molecule or read counts) and/or log2-scale transformed counts. Cell metadata can be supplied as a `DataFrame` object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.). Feature metadata can be supplied as a `DataFrame` object, where rows are features (e.g. genes), and columns are feature attributes, such as Ensembl ID, biotype, gc content, etc. We can create a minimal `SingleCellExperiment` object as follows: ```{r sceset-make-sce-minimal} data("sc_example_counts") example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts)) example_sce ``` The requirements for the `SingleCellExperiment` class (as with other S4 classes in R and Bioconductor) are strict. The idea is that strictness with generating a valid class object ensures that downstream methods applied to the class will work reliably. Thus, if we supply `colData` and/or `rowData` when building an obejct, the expression value matrix *must* have the same number of columns as the `colData` `DataFrame` has rows, and it must have the same number of rows as the `rowData` `DataFrame` has rows. Row names of the `colData` object need to match the column names of the expression matrix and row names of the `rowData` object need to match row names of the expression matrix. We can create a new `SingleCellExperiment` object with count data, cell metadata and gene metadata as follows. ```{r sceset-make-sceset-counts-only} data("sc_example_cell_info") gene_df <- DataFrame(Gene = rownames(sc_example_counts)) rownames(gene_df) <- gene_df$Gene example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), colData = sc_example_cell_info, rowData = gene_df) example_sce ``` Frequently (typically), we will want both raw counts and log2-scale counts in our `SingleCellExperiment` object. It is straight-forward to add log2-counts-per-million to an object containing counts. We can use the `normalise` (or, if you prefer, `normalize`) function: ```{r normalise, eval=TRUE} example_sce <- normalise(example_sce) ``` (This gives a warning to let us know that as size factors for normalisation have not yet been defined, library sizes (total counts) are used instead. This function can also be used for more sophisticated size-factor normalisation once size factors have been calculated.) Or, we use `calculateCPM` directly (with equivalent results): ```{r cpm, eval=TRUE} logcounts(example_sce) <- log2(calculateCPM(example_sce, use.size.factors = FALSE) + 1) ``` The log-scale count data is stored in the `logcounts` assay slot of a `SingleCellExperiment` object. The `exprs` getter/setter function also accesses this `logcounts` slot, to enable equivalent usage as in previous versions of `scater`. # Subsetting, accessing and assigning data in a `SingleCellExperiment` object We have accessor functions to access elements of the `SingleCellExperiment` object. Furthermore, subsetting `SingleCellExperiment` objects is straightforward and reliable, using the usual R `[]` notation, with rows representing features and columns representing cells. * `counts(object)`: returns the matrix of read counts. As you can see above, if no counts are defined for the object, then the counts matrix slot is simpy `NULL`. ```{r counts-accessor, eval=TRUE} counts(example_sce)[1:3, 1:6] ``` * `exprs(object)`: returns the matrix of (log-counts) expression values, in fact accessing the `logcounts` slot of the object (synonym for `logcounts`). Typically these should be log2(counts-per-million) values or log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course. The package will generally assume that these are the values to use for expression. ```{r exprs-accessor, eval=TRUE} exprs(example_sce)[1:3, 1:6] ``` * Generically, we can access any assay data from the object with the `assay` function. We simply supply the function with the `SingleCellExperiment` object and the name of the desired expression matrix: ```{r assay, eval=FALSE} assay(example_sce, "counts")[1:3, 1:6] ``` Similarly we can assign a new (say, transformed) expression matrix to an `SingleCellExperiment` object using `assay` as follows: ```{r assay-set, eval=TRUE} assay(example_sce, "counts") <- counts(example_sce) ``` For convenience (and backwards compatibility) getters and setters are provided as follows: `exprs`, `tpm`, `cpm`, `fpkm` and versions of these with the prefix "norm_"): Handily, it is also easy to replace other data in slots of the `SCESet` object using generic accessor and replacement functions. ```{r sce-demo-replacement, eval=TRUE} gene_df <- DataFrame(Gene = rownames(sc_example_counts)) rownames(gene_df) <- gene_df$Gene ## replace rowData (previously featureData) rowData(example_sce) <- gene_df ## replace colData (previously phenotype data) colData(example_sce) <- DataFrame(sc_example_cell_info) ``` After gaining familiarity with creating and manipulating `SingleCellExperiment` objects, see the other `scater` vignettes for guidance on using `scater` for quality control, data visualisation and more.