--- title: "Expression quantification and import" author: - name: "Davis McCarthy" affiliation: - EMBL-EBI package: scater output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Expression quantification and import} %\VignetteEngine{knitr::rmarkdown} %VignetteEncoding{UTF-8} --- ```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE} ## To render an HTML version that works nicely with github and web pages, do: ## rmarkdown::render("vignettes/vignette.Rmd", "all") library(knitr) opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png') library(ggplot2) theme_set(theme_bw(12)) ``` This document provides details on the import of single-cell data into a `SingleCellExperiment` object for use with the `scater` package (frequent use case) and how to quantify single-cell gene expression from within an R session using the popular RNA-seq quantification tools `Salmon` and `kallisto`. # Importing expression data into R # Read 10x results # Read Salmon or kallisto results # Using `tximport` # Using **kallisto** and **Salmon** to quantify transcript abundance from within R Lior Pachter's group at Berkeley has recently released a new software tool called **kallisto** for rapid quantification of transcript abundance from RNA-seq data. Similarly, Rob Patro at Stony Brook University and colleagues have released **Salmon**, a similarly fast and accurate RNA-seq quantification tool. In `scater`, a couple of wrapper functions to `kallisto` and `Salmon` enable easy and extremely fast quantification of transcript abundance from within R. For simplicity, examples are shown below demonstrating the use of `kallisto`, but the same could be done with `Salmon` using almost identical interfaces (replacing `runKallisto` with `runSalmon`, `readKallistoResults` with `readSalmonResults` and so on). ```{r kallisto-demo-kallisto-test-data, eval=FALSE} ################################################################################ ### Tests and Examples # Example if in the kallisto/test directory setwd("/home/davis/kallisto/test") kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE, output_prefix="output", verbose=TRUE, n_bootstrap_samples=10) sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE) sce_test ``` An example using real data from a project investigating cell cycle. Note that this analysis is not 'live' (the raw data are not included in the package), but it demonstrates what can be done with `scater` and `kallisto`. ```{r kallisto-cell-cycle-example, eval=FALSE} setwd("/home/davis/021_Cell_Cycle/data/fastq") system("wc -l targets.txt") ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690)) kallisto_test <- runKallisto("targets.txt", "Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx", output_prefix="kallisto_output_Mmus", n_cores=12, fragment_length=ave_frag_len, verbose=TRUE) sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE) sce_kall_mmus sce_kall_mmus <- readKallistoResults(kallisto_test) sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus) head(fData(sce_kall_mmus)) head(pData(sce_kall_mmus)) sce_kall_mmus[["start_time"]] counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6] ## Summarise expression at the gene level sce_kall_mmus_gene <- summariseExprsAcrossFeatures( sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id") datatable(fData(sce_kall_mmus_gene)) sce_kall_mmus_gene <- getBMFeatureAnnos( sce_kall_mmus_gene, filters="ensembl_gene_id", attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name", "gene_biotype", "start_position", "end_position", "percentage_gc_content", "description"), feature_symbol="mgi_symbol", feature_id="ensembl_gene_id", biomart="ensembl", dataset="mmusculus_gene_ensembl") datatable(fData(sce_kall_mmus_gene)) ## Add gene symbols to featureNames to make them more intuitive new_feature_names <- featureNames(sce_kall_mmus_gene) notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol) new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb] notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id) new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid], fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_") sum(duplicated(new_feature_names)) featureNames(sce_kall_mmus_gene) <- new_feature_names head(featureNames(sce_kall_mmus_gene)) tail(featureNames(sce_kall_mmus_gene)) sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol)) ```