---
author: "Belinda Phipson"
title: "speckle: statistical methods for analysing single cell RNA-seq data"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('speckle')`"
vignette: >
    %\VignetteEncoding{UTF-8}
    %\VignetteIndexEntry{speckle: statistical methods for analysing single cell RNA-seq data}
    %\VignetteEngine{knitr::rmarkdown}
output: >
    BiocStyle::html_document
html_document:
    fig_caption: yes
    fig_retina: FALSE
    keep_md: FALSE
editor_options: 
    chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)
```

# Introduction

The `r BiocStyle::Biocpkg("speckle")` package contains functions to analyse 
differences in cell type proportions in single cell RNA-seq data. As our 
research into specialised analyses of single cell data continues we anticipate 
that the package will be updated with new functions. 

The propeller method has now been published in *Bioinformatics*:\
Belinda Phipson, Choon Boon Sim, Enzo R Porrello, Alex W Hewitt, Joseph Powell, Alicia Oshlack, propeller: testing for differences in cell type proportions in single cell data, Bioinformatics, 2022;, btac582, [https://doi.org/10.1093/bioinformatics/btac582](https://doi.org/10.1093/bioinformatics/btac582)

The analysis of single cell RNA-seq data consists of a large number of steps, 
which can be iterative and also depend on the research question. There are many 
R packages that can do some or most of these steps. The analysis steps are 
described here briefly. 

Once the sequencing data has been summarised into counts over genes, quality 
control is performed to remove poor quality cells. Poor quality cells are often 
characterised as having very low total counts (library size) and very few genes 
detected. Lowly expressed and uninformative genes are filtered out, followed by 
appropriate normalisation. Dimensionality reduction and clustering of the cells
is then performed. Cells that have similar transcriptional profiles cluster 
together, and these clusters (hopefully) correspond to something biologically 
relevant, such as different cell types. Differential expression between each 
cluster compared to all other clusters can highlight genes that are more highly
expressed in each cluster. These marker genes help to determine the cell type 
each cluster corresponds to. Cell type identification is a process that often 
uses marker genes as well as a list of curated genes that are known to be 
expressed in each cell type. It is always helpful to visualise the data in a lot
of different ways to aid in interpretation of the clusters using tSNE/UMAP 
plots, clustering trees and heatmaps of known marker genes. An alternative to 
clustering is classification or label transfer approaches, where 
reference datasets can be used to annotate new datasets.

# Installation

```{r eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("speckle")
```

# Finding significant differences in cell type proportions using propeller

In order to determine whether there are statistically significant compositional 
differences between groups, there must be some form of biological replication in
the experiment. This is so that we can estimate the variability of the cell type
proportion estimates for each group. A classical statistical test for 
differences between two proportions is typically very sensitive to small changes
and will almost always yield a significant p-value. Hence `propeller` is only 
suitable to use in single cell experiments where there are multiple groups and 
multiple biological replicates in at least one of the groups. The absolute 
minimum sample size is 2 in one group and 1 in the other group/s. Variance 
estimates are obtained from the group with more than 1 biological replicate 
which assumes that the cell type proportion variances estimates are similar 
between experimental conditions.

The `propeller` test is performed after initial analysis of the single cell data
has been done, i.e. after clustering and cell type assignment. The `propeller` 
function can take a `SingleCellExperiment` or `Seurat` object and extract the 
necessary information from the metadata. The basic model for `propeller` is that
the cell type proportions for each sample are estimated based on the clustering 
information provided by the user or extracted from the relevant slots in the 
data objects. The proportions are then transformed using either an arcsin square
root transformation, or logit transformation. For 
each cell type $i$, we fit a linear model with group as the explanatory variable
using functions from the R Bioconductor package `r BiocStyle::Biocpkg("limma")`.
Using `r BiocStyle::Biocpkg("limma")` to obtain p-values has the added benefit 
of performing empirical Bayes shrinkage of the variances. For every cell type 
we obtain a p-value that indicates whether that cell type proportion is 
statistically significantly different between two (or more) groups.

# Load the libraries

```{r}
library(speckle)
library(SingleCellExperiment)
library(CellBench)
library(limma)
library(ggplot2)
library(scater)
library(patchwork)
library(edgeR)
library(statmod)
```


# Loading data into R

We are using single cell data from the `r BiocStyle::Biocpkg("CellBench")` 
package to illustrate how `propeller` works. This is an artificial dataset that 
is made up of an equal mixture of 3 different cell lines. There are three 
datasets corresponding to three different technologies: 10x genomics, CelSeq and
DropSeq.

```{r}
sc_data <- load_sc_data()
```

The way that `propeller` is designed to be used is in the context of a designed 
experiment where there are multiple biological replicates and multiple groups. 
Comparing cell type proportions without biological replication should be done 
with caution as there will be a large degree of variability in the cell type 
proportions between samples due to technical factors (cell capture bias, 
sampling, clustering errors), as well as biological variability. The 
`r BiocStyle::Biocpkg("CellBench")` dataset does not have biological 
replication, so we will create several artificial biological replicates by 
bootstrapping the data. Bootstrapping has the advantage that it induces 
variability between bootstrap samples by sampling with replacement. Here we will
treat the three technologies as the groups, and create artifical biological 
replicates within each group. Note that bootstrapping only induces sampling 
variability between our biological replicates, which will almost certainly be 
much smaller than biological variability we would expect to see in a real 
dataset.

The three single cell experiment objects in `sc_data` all have differing numbers
of genes. The first step is to find all the common genes between all three 
experiments in order to create one large dataset.

```{r}
commongenes1 <- rownames(sc_data$sc_dropseq)[rownames(sc_data$sc_dropseq) %in% 
                                                rownames(sc_data$sc_celseq)]
commongenes2 <-  commongenes1[commongenes1 %in% rownames(sc_data$sc_10x)]

sce_10x <- sc_data$sc_10x[commongenes2,]
sce_celseq <- sc_data$sc_celseq[commongenes2,] 
sce_dropseq <- sc_data$sc_dropseq[commongenes2,] 

dim(sce_10x)
dim(sce_celseq)
dim(sce_dropseq)

table(rownames(sce_10x) == rownames(sce_celseq))
table(rownames(sce_10x) == rownames(sce_dropseq))
```

# Bootstrap additional samples

This dataset does not have any biological replicates, so we will bootstrap 
additional samples and pretend that they are biological replicates. 
Bootstrapping won't replicate true biological variation between samples, but we 
will ignore that for the purpose of demonstrating how `propeller` works. Note 
that we don't need to simulate gene expression measurements; `propeller` only 
uses cluster information, hence we simply bootstrap the column indices of the 
single cell count matrices.

```{r}
i.10x <- seq_len(ncol(sce_10x))
i.celseq <- seq_len(ncol(sce_celseq))
i.dropseq <- seq_len(ncol(sce_dropseq))

set.seed(10)
boot.10x <- sample(i.10x, replace=TRUE)
boot.celseq <- sample(i.celseq, replace=TRUE)
boot.dropseq <- sample(i.dropseq, replace=TRUE)

sce_10x_rep2 <- sce_10x[,boot.10x]
sce_celseq_rep2 <- sce_celseq[,boot.celseq]
sce_dropseq_rep2 <- sce_dropseq[,boot.dropseq]
```

# Combine all SingleCellExperiment objects

The `SingleCellExperiment` objects don't combine very easily, so I will create a
new object manually, and retain only the information needed to run `propeller`.

```{r}
sample <- rep(c("S1","S2","S3","S4","S5","S6"), 
                c(ncol(sce_10x),ncol(sce_10x_rep2),ncol(sce_celseq),
                ncol(sce_celseq_rep2), 
                ncol(sce_dropseq),ncol(sce_dropseq_rep2)))
cluster <- c(sce_10x$cell_line,sce_10x_rep2$cell_line,sce_celseq$cell_line,
                sce_celseq_rep2$cell_line,sce_dropseq$cell_line,
                sce_dropseq_rep2$cell_line)
group <- rep(c("10x","celseq","dropseq"),
                c(2*ncol(sce_10x),2*ncol(sce_celseq),2*ncol(sce_dropseq)))

allcounts <- cbind(counts(sce_10x),counts(sce_10x_rep2), 
                    counts(sce_celseq), counts(sce_celseq_rep2),
                    counts(sce_dropseq), counts(sce_dropseq_rep2))

sce_all <- SingleCellExperiment(assays = list(counts = allcounts))
sce_all$sample <- sample
sce_all$group <- group
sce_all$cluster <- cluster
```

# Visualise the data

Here I am going to use the Bioconductor package `r BiocStyle::Biocpkg("scater")`
to visualise the data. The `r BiocStyle::Biocpkg("scater")` vignette goes 
quite deeply into quality 
control of the cells and the kinds of QC plots we like to look at. Here we will 
simply log-normalise the gene expression counts, perform dimensionality 
reduction (PCA) and generate PCA/TSNE/UMAP plots to visualise the relationships 
between the cells.

```{r}
sce_all <- scater::logNormCounts(sce_all)
sce_all <- scater::runPCA(sce_all)
sce_all <- scater::runUMAP(sce_all)
```

Plot PC1 vs PC2 colouring by cell line and technology:

```{r, fig.width=12, fig.height=6}
pca1 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "cluster") +
    ggtitle("Cell line")
pca2 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "group") +
    ggtitle("Technology")
pca1 + pca2
```

Plot UMAP highlighting cell line and technology:

```{r, fig.width=12, fig.height=6}
umap1 <- scater::plotReducedDim(sce_all, dimred = "UMAP", 
                                colour_by = "cluster") + 
    ggtitle("Cell line")
umap2 <- scater::plotReducedDim(sce_all, dimred = "UMAP", colour_by = "group") +
    ggtitle("Technology")
umap1 + umap2
```

For this dataset UMAP is a little bit of an overkill, the PCA plots show the 
relationships between the cells quite well. PC1 separates cells based on 
technology, and PC2 separates cells based on the cell line (clusters). From the 
PCA plots we can see that 10x is quite different to CelSeq and DropSeq, and the 
H2228 cell line is quite different to the remaining 2 cell lines.

# Test for differences in cell line proportions in the three technologies

In order to demonstrate `propeller` I will assume that the cell line information
corresponds to clusters and all the analysis steps have beeen performed. Here we
are interested in testing whether there are compositional differences between 
the three technologies: 10x, CelSeq and DropSeq. Since there are more than 2 
groups, `propeller` will perform an ANOVA to determine whether there is a 
significant shift in the cell type proportions between these three groups.

The `propeller` function can take a `SingleCellExperiment` object or `Seurat` 
object as input and extract the three necessary pieces of information from the 
cell information stored in `colData`. The three essential pieces of information 
are

* cluster
* sample
* group

If these arguments are not explicitly passed to the `propeller` function, then 
these are extracted from the `SingleCellExperiment` or `Seurat` object. Upper 
or lower case is acceptable, but 
the variables need to be named exactly as stated in the list above. For a 
`Seurat` object, the cluster information is extracted from `Idents(x)`.

The default of propeller is to perform the logit transformation:
```{r}
# Perform logit transformation
propeller(sce_all)
```

An alternative variance stabilising transformation is the arcsin square root
transformation. 

```{r}
# Perform arcsin square root transformation
propeller(sce_all, transform="asin")
```

The results from using the two different transforms are a little bit different, 
with the H1975 cell line being statistically significant using the arc sin 
square root transform, and not significant after using the logit transform.

Another option for running `propeller` is for the user to supply the cluster, 
sample and group information explicitly to the `propeller` function.

```{r}
propeller(clusters=sce_all$cluster, sample=sce_all$sample, group=sce_all$group)
```

The cell lines were mixed together in roughly equal proportions (~0.33) and 
hence we don't expect to see significant differences between the three 
clusters. However, because bootstrapping the samples doesn't incorporate 
enough variability between the samples to mimic true biological variability, 
we can see that the H1975 cluster looks significantly different between 
the three technologies. The proportion of this cell line is closer to 0.4 
for CelSeq and DropSeq, and 0.34 for the 10x data.

# Visualise the results

In the `r BiocStyle::Biocpkg("speckle")` package there is a plotting function 
`plotCellTypeProps` that takes a `Seurat` or `SingleCellExperiment` object, 
extracts sample and cluster information and outputs a barplot of cell type 
proportions between the samples. The user also has the option of supplying the
cluster and sample cell information instead of an R object. The output is a 
`ggplot2` object that the user can then manipulate however they please.

```{r,fig.height=4,fig.width=7}
plotCellTypeProps(sce_all)
```

Alternatively, we can obtain the cell type proportions and transformed 
proportions directly by running the `getTransformedProps` function which takes 
the cluster and sample information as input. The output from 
`getTransformedProps` is a list with the cell type counts, transformed 
proportions and proportions as elements.

```{r,fig.height=5,fig.width=7}
props <- getTransformedProps(sce_all$cluster, sce_all$sample, transform="logit")
barplot(props$Proportions, col = c("orange","purple","dark green"),legend=TRUE, 
        ylab="Proportions")
```

Call me old-school, but I still like looking at stripcharts to visualise results
and see whether the significant p-values make sense.

```{r,fig.height=4,fig.width=10}
par(mfrow=c(1,3))
for(i in seq(1,3,1)){
stripchart(props$Proportions[i,]~rep(c("10x","celseq","dropseq"),each=2),
            vertical=TRUE, pch=16, method="jitter",
            col = c("orange","purple","dark green"),cex=2, ylab="Proportions")
title(rownames(props$Proportions)[i])
}
```

If you are interested in seeing which models best fit the data in terms of the
cell type variances, there are two plotting functions that can do this: 
`plotCellTypeMeanVar` and `plotCellTypePropsMeanVar`. For this particular 
dataset it isn't very informative because there are only three cell "types" 
and no biogical variability.

```{r}
par(mfrow=c(1,1))
plotCellTypeMeanVar(props$Counts)
plotCellTypePropsMeanVar(props$Counts)
```


# Fitting linear models using the transformed proportions directly

If you are like me, you won't feel very comfortable with a black-box approach 
where one function simply spits out a table of results. If you would like to 
have more control over your linear model and include extra covariates then you 
can fit a linear model in a more direct way using the transformed proportions 
that can be obtained by running the `getTransformedProps` function.

We have already obtained the proportions and transformed proportions when we ran
the `getTransformedProps` function above. This function outputs a list object 
with three elements: `Counts`, `TransformedProps` and `Proportions`. These are 
all matrices with clusters/cell types in the rows and samples in the columns.

```{r}
names(props)

props$TransformedProps
```

First we need to set up our sample information in much the same way we would if 
we were analysing bulk RNA-seq data. We can pretend that we have pairing 
information which corresponds to our original vs bootstrapped samples to make 
our model a little more complicated for demonstration purposes. 

```{r}
group <- rep(c("10x","celseq","dropseq"),each=2)
pair <- rep(c(1,2),3)
data.frame(group,pair)
```

We can set up a design matrix taking into account group and pairing information.
Please note that the way that `propeller` has been designed is such that the 
group information is *always* first in the design matrix specification, and 
there is NO intercept. If you are new to design matrices and linear modelling, I
would highly recommend reading the `r BiocStyle::Biocpkg("limma")` manual, which
is incredibly extensive and covers a variety of different experimental set ups.

```{r}
design <- model.matrix(~ 0 + group + pair)
design
```

In our example, we have three groups, 10x, CelSeq and DropSeq. In order to 
perform an ANOVA to test for cell type composition differences between these
3 technologies, we can use the `propeller.anova` function. The `coef` argument
tells the function which columns of the design matrix correspond to the groups 
we are interested in testing. Here we are interested in the first three columns.

```{r}
propeller.anova(prop.list=props, design=design, coef = c(1,2,3), 
                robust=TRUE, trend=FALSE, sort=TRUE)
```

Note that the p-values are smaller here than before because we have specified
a pairing vector that states which samples were bootstrapped and which are the 
original samples.

If we were interested in testing only 10x versus DropSeq we could alternatively
use the `propeller.ttest` function and specify a contrast that tests for this
comparison with our design matrix.

```{r}
design
mycontr <- makeContrasts(group10x-groupdropseq, levels=design)
propeller.ttest(props, design, contrasts = mycontr, robust=TRUE, trend=FALSE, 
                sort=TRUE)
```

Finally note that the `robust` and `trend` parameters are parameters for the 
`eBayes` function in `r BiocStyle::Biocpkg("limma")`. When `robust = TRUE`, 
robust empirical Bayes shrinkage of the variances is performed which mitigates 
the effects of outlying observations. We set `trend = FALSE` as we don't expect 
a mean-variance trend after performing our variance stabilising transformation. 
There may also be an error when `trend` is set to TRUE because there are often 
not enough data points to estimate the trend.

# More complex (or just different) experimental designs

## Fitting a continuous variable rather than groups

Let us assume that we expect that the different technologies have a meaningful 
ordering to them, and we would like to find the cell types that are increasing 
or decreasing along this trend. In more complex scenarios beyond group 
comparisons I would recommend taking the transformed proportions from the 
`getTransformedProps` function and using the linear model fitting functions 
from the `r BiocStyle::Biocpkg("limma")` package directly.

Let us assume that the ordering of the technologies is 10x->celseq->dropseq. 
Then we can recode them 1, 2, 3 and treat the technologies as a
continuous variable. Obviously this scenario doesn't make much sense 
biologically, but we will continue for demonstration purposes.

```{r}
group
dose <- rep(c(1,2,3), each=2) 

des.dose <- model.matrix(~dose)
des.dose

fit <- lmFit(props$TransformedProps,des.dose)
fit <- eBayes(fit, robust=TRUE)
topTable(fit)
```

Here the log fold changes are reported on the transformed data, so they are 
not as easy to interpret directly. The positive logFC indicates that the cell
type proportions are increasing (for example for H1975), and a negative
logFC indicates that the proportions are decreasing across the ordered 
technologies 10x -> celseq -> dropseq.

You can get the estimates from the model on the proportions directly by fitting
the same model to the proportions. Here the `logFC` is the slope of the trend 
line on the proportions, and the `AveExpr` is the average of the proportions 
across all technologies.

```{r}
fit.prop <- lmFit(props$Proportions,des.dose)
fit.prop <- eBayes(fit.prop, robust=TRUE)
topTable(fit.prop)
```

You could plot the continuous variable `dose` against the proportions and add 
trend lines, for example.

```{r,fig.height=4,fig.width=10}
fit.prop$coefficients

par(mfrow=c(1,3))
for(i in seq(1,3,1)){
    plot(dose, props$Proportions[i,], main = rownames(props$Proportions)[i], 
        pch=16, cex=2, ylab="Proportions", cex.lab=1.5, cex.axis=1.5,
        cex.main=2)
    abline(a=fit.prop$coefficients[i,1], b=fit.prop$coefficients[i,2], col=4, 
            lwd=2)
}
```

What I recommend in this instance is using the p-values from the analysis on the
transformed data, and the reported statistics (i.e. the coefficients from the 
model) obtained from the analysis on the proportions for visualisation 
purposes. 

## Including random effects

If you have a random effect that you would like to account for in your
analysis, for example repeated measures on the same individual, then you 
can use the `duplicateCorrelation` function from 
the `r BiocStyle::Biocpkg("limma")`.

For illustration purposes, let us assume that `pair` indicates samples taken 
from the same individual (or they could represent technical replicates), and we 
would like to account for this in our analysis 
using a random effect. Again, we fit our models on the transformed proportions
in order to obtain the p-values.

We will formulate the design matrix with an intercept for this example, and test
the differences in technologies relative to 10x. The `block` parameter will be
the `pair` variable. Note that the design matrix now does not include `pair` as 
a fixed effect.

```{r}
des.tech <- model.matrix(~group)

dupcor <- duplicateCorrelation(props$TransformedProps, design=des.tech,
                                block=pair)
dupcor
```

The consensus correlation is quite high (`r dupcor$consensus.correlation`), 
which we expect because we bootstrapped these additional samples.

```{r}
# Fitting the linear model accounting for pair as a random effect
fit1 <- lmFit(props$TransformedProps, design=des.tech, block=pair, 
                correlation=dupcor$consensus)
fit1 <- eBayes(fit1)
summary(decideTests(fit1))

# Differences between celseq vs 10x
topTable(fit1,coef=2)

# Differences between dropseq vs 10x
topTable(fit1, coef=3)
```

For celseq vs 10x, H1975 and H2228 are significantly different, with a greater
proportion of H1975
cells detected in celseq, and fewer H2228 cells. For dropseq vs 10x, there is a 
higher proportion of H1975 cells.

If you want to do an ANOVA between the three groups:
```{r}
topTable(fit1, coef=2:3)
```

Generally, you can perform any analysis on the transformed proportions that you
would normally do when using limma (i.e. on roughly normally distributed data). 
For more complex random effects models with 2 or more random effects, you can 
use the ``r BiocStyle::Biocpkg("lme4")` package. You could also try the 
``r BiocStyle::Biocpkg("dream")` package.

# Using propeller on any proportions data 

The statistical methods used in propeller are generalisable to any proportions 
data. For example, we have successfully applied propeller to Nanostring GeoMx 
data, where cell type proportions were inferred using deconvolution methods.

An example of how to do this analysis is shown here for single cell data from
young and old PBMC single cell data, where only the cell type proportions data
is included as a resource in the following paper: \
Huang Z.  et al.  (2021) Effects of sex and aging on the immune cell landscape 
as assessed by single-cell transcriptomic analysis. Proc. Natl. Acad. Sci. USA, 
118, e2023216118.

The proportions data is available in the `r BiocStyle::Biocpkg("speckle")` 
package and can be accessed with the `data` function. It is a list object that 
contains three components: a data frame of cell type proportions, a data frame
of sample information including sex and age, and the total number of cells.

```{r}
data("pbmc_props")
pbmc.props <- pbmc_props$proportions
pbmc.sample.info <- pbmc_props$sample_info
tot.cells <- pbmc_props$total_cells
```

```{r, fig.width=10, fig.height=6}
par(mfrow=c(1,1))
barplot(as.matrix(pbmc.props),col=ggplotColors(nrow(pbmc.props)),
        ylab="Cell type proportions", xlab="Samples")
```

We can convert the proportions data into the list object that the `propeller` 
functions expect using the `convertDataToList` function. This function can take
either counts or proportions, specified with the `data.type` parameter.

```{r}
prop.list <- convertDataToList(pbmc.props,data.type="proportions", 
                               transform="logit", scale.fac=tot.cells/20)
```

The `scale.fac` parameter can be a vector or scalar. It is the number of cells 
sequenced per sample. Here we are taking a best guess because from the published
paper all we know is that in total `r tot.cells` were sequenced for the whole
dataset. As there are 20 samples, we divide the total number of cells by 20
to get a rough estimate of total number of cells per sample. This is used 
specifically when the logit transform is used, but not needed for the arcsin 
square root transformation. If `scale.fac` is a vector, it should be the same
length as the number of samples in the study.

We can now test young vs old, taking sex into account.

```{r}
designAS <- model.matrix(~0+pbmc.sample.info$age + pbmc.sample.info$sex)
colnames(designAS) <- c("old","young","MvsF")

# Young vs old
mycontr <- makeContrasts(young-old, levels=designAS)
propeller.ttest(prop.list = prop.list,design = designAS, contrasts = mycontr,
                robust=TRUE,trend=FALSE,sort=TRUE)
```

We see that the CD8 naive cells are enriched in the young samples, while CD16
cells are enriched in the old samples. We can visualise the significant cell 
types to check they make sense:

```{r}
group.immune <- paste(pbmc.sample.info$sex, pbmc.sample.info$age, sep=".")
par(mfrow=c(1,2))
stripchart(as.numeric(pbmc.props["CD8.Naive",])~group.immune,
           vertical=TRUE, pch=c(8,16), method="jitter",
           col = c(ggplotColors(20)[20],"hotpink",4, "darkblue"),cex=2,
           ylab="Proportions", cex.axis=1.25, cex.lab=1.5,
           group.names=c("F.Old","F.Young","M.Old","M.Young"))
title("CD8.Naive: Young Vs Old", cex.main=1.5, adj=0)

stripchart(as.numeric(pbmc.props["CD16",])~group.immune,
           vertical=TRUE, pch=c(8,16), method="jitter",
           col = c(ggplotColors(20)[20],"hotpink",4, "darkblue"),cex=2,
           ylab="Proportions", cex.axis=1.25, cex.lab=1.5,
           group.names=c("F.Old","F.Young","M.Old","M.Young"))
title("CD16: Young Vs Old", cex.main=1.5, adj=0)
```



# Tips for clustering

The experimental groups are likely to contribute large sources of variation in 
the data. In the `r BiocStyle::Biocpkg("CellBench")` data the technology effect 
is larger than the cell line effect. In order to cluster the data to produce 
meaningful cell types that will then feed into meaningful tests for 
proportions, the cell types should be represented in as many samples as 
possible. Users should consider using integration techniques on their 
single cell data prior to clustering, integrating on biological sample or 
perhaps experimental group. See methods such as Harmony, Liger and Seurat's 
integration technique for more information.

Cell type label assignment should not be too refined such that every sample has
many unique cell types. The `propeller` function can handle proportions of 0 and
1 without breaking, but it is not very meaningful if every cell type difference
is statistically significant. Consider testing cell type categories that are 
broader for more meaningful results, perhaps by combining clusters that are 
highly similar. The refined clusters can always be explored in terms of gene 
expression differences later on. One can also test for cell type proportion
differences within a broader cell type lineage using  `propeller`.

It may be appropriate to perform cell type assignment using classification 
methods rather than clustering. This allows 
the user to classify cells into known cell types, but you may run the risk of 
missing novel cell types.
A combination of approaches may be necessary depending on the dataset. 
Good luck. The more heterogenous the dataset, the more tricky this becomes.


# Session Info

```{r}
sessionInfo()
```