---
title: "Expression quantification and import"
author:
- name: "Davis McCarthy"
  affiliation: 
  - EMBL-EBI
package: scater
output:
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Expression quantification and import}
  %\VignetteEngine{knitr::rmarkdown}
  %VignetteEncoding{UTF-8}
---


```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE}
## To render an HTML version that works nicely with github and web pages, do:
## rmarkdown::render("vignettes/vignette.Rmd", "all")
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))
```

This document provides details on the import of single-cell data into a 
`SingleCellExperiment` object for use with the `scater` package (frequent use 
case) and how to quantify single-cell gene expression from within an R session
using the popular RNA-seq quantification tools `Salmon` and `kallisto`.


# Importing expression data into R


# Read 10x results


# Read Salmon or kallisto results


# Using `tximport`


#  Using **kallisto** and **Salmon** to quantify transcript abundance from within R

Lior Pachter's group at Berkeley has recently released a new software tool
called **kallisto** for rapid quantification of transcript abundance from RNA-seq
data. Similarly, Rob Patro at Stony Brook University and colleagues have 
released **Salmon**, a similarly fast and accurate RNA-seq quantification tool.
In `scater`, a couple of wrapper functions to `kallisto` and `Salmon` enable easy and
extremely fast quantification of transcript abundance from within R.

For simplicity, examples are shown below demonstrating the use of `kallisto`, 
but the same could be done with `Salmon` using almost identical interfaces 
(replacing `runKallisto` with `runSalmon`, `readKallistoResults` with 
`readSalmonResults` and so on).

```{r kallisto-demo-kallisto-test-data, eval=FALSE}
################################################################################
### Tests and Examples

# Example if in the kallisto/test directory
setwd("/home/davis/kallisto/test")
kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE,
            output_prefix="output", verbose=TRUE, n_bootstrap_samples=10)

sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE)
sce_test
```

An example using real data from a project investigating cell cycle. Note that
this analysis is not 'live' (the raw data are not included in the package), but
it demonstrates what can be done with `scater` and `kallisto`.

```{r kallisto-cell-cycle-example, eval=FALSE}
setwd("/home/davis/021_Cell_Cycle/data/fastq")
system("wc -l targets.txt")
ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690))

kallisto_test <- runKallisto("targets.txt",
                             "Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx",
                             output_prefix="kallisto_output_Mmus", n_cores=12,
                             fragment_length=ave_frag_len, verbose=TRUE)
sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE)
sce_kall_mmus

sce_kall_mmus <- readKallistoResults(kallisto_test)

sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus)

head(fData(sce_kall_mmus))
head(pData(sce_kall_mmus))
sce_kall_mmus[["start_time"]]

counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6]

## Summarise expression at the gene level
sce_kall_mmus_gene <- summariseExprsAcrossFeatures(
    sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id")

datatable(fData(sce_kall_mmus_gene))

sce_kall_mmus_gene <- getBMFeatureAnnos(
    sce_kall_mmus_gene, filters="ensembl_gene_id",
    attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
                 "gene_biotype", "start_position", "end_position",
                 "percentage_gc_content", "description"),
    feature_symbol="mgi_symbol", feature_id="ensembl_gene_id",
    biomart="ensembl", dataset="mmusculus_gene_ensembl")

datatable(fData(sce_kall_mmus_gene))

## Add gene symbols to featureNames to make them more intuitive
new_feature_names <- featureNames(sce_kall_mmus_gene)
notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol)
new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb]
notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id)
new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid],
          fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_")
sum(duplicated(new_feature_names))
featureNames(sce_kall_mmus_gene) <- new_feature_names
head(featureNames(sce_kall_mmus_gene))
tail(featureNames(sce_kall_mmus_gene))
sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol))

```