# Grun human pancreas (CEL-seq2)
## Introduction
This workflow performs an analysis of the @grun2016denovo CEL-seq2 dataset consisting of human pancreas cells from various donors.
## Data loading
``` r
library(scRNAseq)
sce.grun <- GrunPancreasData()
```
We convert to Ensembl identifiers, and we remove duplicated genes or genes without Ensembl IDs.
``` r
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol,
keytype="SYMBOL", column="ENSEMBL")
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
sce.grun <- sce.grun[keep,]
rownames(sce.grun) <- gene.ids[keep]
```
## Quality control
``` r
unfiltered <- sce.grun
```
This dataset lacks mitochondrial genes so we will do without them for quality control.
We compute the median and MAD while blocking on the donor;
for donors where the assumption of a majority of high-quality cells seems to be violated (Figure \@ref(fig:unref-grun-qc-dist)),
we compute an appropriate threshold using the other donors as specified in the `subset=` argument.
``` r
library(scater)
stats <- perCellQCMetrics(sce.grun)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
batch=sce.grun$donor,
subset=sce.grun$donor %in% c("D17", "D7", "D2"))
sce.grun <- sce.grun[,!qc$discard]
```
``` r
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard
gridExtra::grid.arrange(
plotColData(unfiltered, x="donor", y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count"),
plotColData(unfiltered, x="donor", y="detected", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features"),
plotColData(unfiltered, x="donor", y="altexps_ERCC_percent",
colour_by="discard") + ggtitle("ERCC percent"),
ncol=2
)
```
(\#fig:unref-grun-qc-dist)Distribution of each QC metric across cells from each donor of the Grun pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded.
``` r
colSums(as.matrix(qc), na.rm=TRUE)
```
```
## low_lib_size low_n_features high_altexps_ERCC_percent
## 451 511 605
## discard
## 664
```
## Normalization
``` r
library(scran)
set.seed(1000) # for irlba.
clusters <- quickCluster(sce.grun)
sce.grun <- computeSumFactors(sce.grun, clusters=clusters)
sce.grun <- logNormCounts(sce.grun)
```
``` r
summary(sizeFactors(sce.grun))
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.102 0.505 0.794 1.000 1.231 11.600
```
``` r
plot(librarySizeFactors(sce.grun), sizeFactors(sce.grun), pch=16,
xlab="Library size factors", ylab="Deconvolution factors", log="xy")
```
(\#fig:unref-grun-norm)Relationship between the library size factors and the deconvolution size factors in the Grun pancreas dataset.
## Variance modelling
We block on a combined plate and donor factor.
``` r
block <- paste0(sce.grun$sample, "_", sce.grun$donor)
dec.grun <- modelGeneVarWithSpikes(sce.grun, spikes="ERCC", block=block)
top.grun <- getTopHVGs(dec.grun, prop=0.1)
```
We examine the number of cells in each level of the blocking factor.
``` r
table(block)
```
```
## block
## CD13+ sorted cells_D17 CD24+ CD44+ live sorted cells_D17
## 86 87
## CD63+ sorted cells_D10 TGFBR3+ sorted cells_D17
## 40 90
## exocrine fraction, live sorted cells_D2 exocrine fraction, live sorted cells_D3
## 82 7
## live sorted cells, library 1_D10 live sorted cells, library 1_D17
## 33 88
## live sorted cells, library 1_D3 live sorted cells, library 1_D7
## 25 85
## live sorted cells, library 2_D10 live sorted cells, library 2_D17
## 35 83
## live sorted cells, library 2_D3 live sorted cells, library 2_D7
## 27 84
## live sorted cells, library 3_D3 live sorted cells, library 3_D7
## 17 83
## live sorted cells, library 4_D3 live sorted cells, library 4_D7
## 29 83
```
``` r
par(mfrow=c(6,3))
blocked.stats <- dec.grun$per.block
for (i in colnames(blocked.stats)) {
current <- blocked.stats[[i]]
plot(current$mean, current$total, main=i, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(current)
points(curfit$mean, curfit$var, col="red", pch=16)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}
```
(\#fig:unref-416b-variance)Per-gene variance as a function of the mean for the log-expression values in the Grun pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor.
(\#fig:unref-grun-tsne)Obligatory $t$-SNE plots of the Grun pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right).