---
title: "NeuCA Package User's Guide"
author:
- name: Ziyi Li
  affiliation: Department of Biostatistics, The University of Texas MD Anderson Cancer Center
- name: Hao Feng
  affiliation: Department of Population and Quantitative Health Sciences, Case Western Reserve University
  email: hxf155@case.edu
package: NeuCA
output:
  BiocStyle::html_document
abstract: |
  NEUral-network based Cell Annotation, `NeuCA`, is a tool for cell type annotation using single-cell RNA-seq data. It is a supervised cell label assignment method that uses existing scRNA-seq data with known labels to train a neural network-based classifier, and then predict cell labels in single-cell RNA-seq data of interest.
vignette: |
  %\VignetteIndexEntry{NeuCA Package User's Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction
The fast advancing single cell RNA sequencing (scRNA-seq) technology enables transcriptome study in heterogeneous tissues at a single cell level. The initial important step of analyzing scRNA-seq data is to accurately annotate cell labels. We present a neural-network based cell annotation method `NeuCA`. 
When closely correlated cell types exist, `NeuCA` uses the cell type tree information through a hierarchical structure of neural networks to improve annotation accuracy. Feature selection is performed in hierarchical structure to further improve classification accuracy. When cell type correlations are not high, a feed-forward neural network is adopted.

`NeuCA` depends on the following packages:

* `r CRANpkg("keras")`, for neural-network interface in _R_,
* `r Biocpkg("limma")`, for linear model framework and testing markers,
* `r Biocpkg("SingleCellExperiment")`, for data organization formatting,
* `r CRANpkg("e1071")`, for probability and predictive functions.


# Preparing NeuCA input files: `SingleCellExperiment` class
The scRNA-seq data input for `NeuCA` must be objects of the _Bioconductor_ `r Biocpkg("SingleCellExperiment")`. You may need to read corresponding vignettes on how to create a SingleCellExperiment from your own data. An example is provided here to show how to do that, but please note this is not a comprehensive guidance for `r Biocpkg("SingleCellExperiment")`.


**Step 1**: Load in example scRNA-seq data.

We are using two example datasets here: `Baron_scRNA` and `Seg_scRNA`. `Baron_scRNA` is a droplet(inDrop)-based, single-cell RNA-seq data generated from pancrease ([Baron et al.](https://doi.org/10.1016/j.cels.2016.08.011)). Around 10,000 human and 2,000 mouse pancreatic cells from four cadaveric donors and two strains of mice were sequenced. `Seg_scRNA` is a Smart-Seq2 based, single-cell RNA-seq dataset ([Segerstolpe et al.](https://doi.org/10.1016/j.cmet.2016.08.020)). It has thousands of human islet cells from healthy and type-2 diabetic donors. A total of 3,386 cells were collected, with around 350 cells from each donor. Here, subsets of these two datasets (with cell type labels for each cell) were included as examples.
```{r, eval = TRUE, message = FALSE}
library(NeuCA)
data("Baron_scRNA")
data("Seg_scRNA")
```

**Step 2a**: Prepare training data as a SingleCellExperiment object. 
```{r, eval = TRUE, message = FALSE}
Baron_anno = data.frame(Baron_true_cell_label, row.names = colnames(Baron_counts))
Baron_sce = SingleCellExperiment(
    assays = list(normcounts = as.matrix(Baron_counts)),
    colData = Baron_anno
    )
# use gene names as feature symbols
rowData(Baron_sce)$feature_symbol <- rownames(Baron_sce)
# remove features with duplicated names
Baron_sce <- Baron_sce[!duplicated(rownames(Baron_sce)), ]
```


**Step 2b**: Similarly, prepare testing data as a SingleCellExperiment object. Note the true cell type labels are not necessary (and of course often not available). 
```{r, eval = TRUE, message = FALSE}
Seg_anno = data.frame(Seg_true_cell_label, row.names = colnames(Seg_counts))
Seg_sce <- SingleCellExperiment(
    assays = list(normcounts = as.matrix(Seg_counts)),
    colData = Seg_anno
)
# use gene names as feature symbols
rowData(Seg_sce)$feature_symbol <- rownames(Seg_sce)
# remove features with duplicated names
Seg_sce <- Seg_sce[!duplicated(rownames(Seg_sce)), ]
```

# NeuCA training and prediction
**Step 3**: with both training and testing data as objects in `SingleCellExperiment` class, now we can train the classifier in `NeuCA` and predict testing dataset's cell types. This process can be achieved with one line of code: 

```{r, eval = TRUE, message = FALSE}
predicted.label = NeuCA(train = Baron_sce, test = Seg_sce, 
                        model.size = "big", verbose = FALSE)
#Baron_scRNA is used as the training scRNA-seq dataset
#Seg_scRNA is used as the testing scRNA-seq dataset
```

`NeuCA` can detect whether highly-correlated cell types exist in the training dataset, and automatically determine if a general neural-network model will be adopted or a marker-guided hierarchical neural-network will be adopted for classification. 

[**Tuning parameter**] In neural-network, the numbers of layers and nodes are tunable parameters. Users have the option to determine the complexity of the neural-network used in `NeuCA` by specifying the desired `model.size` argument. Here, "big", "medium" and "small" are 3 possible choices, reflecting large, medium and small number of nodes and layers in neural-network, respectively. The model size details are shown in the following Table 1. From our experience, "big" or "medium" can often produce 
high accuracy predictions.

```{r, eval = TRUE, message = FALSE, echo=FALSE}
library(knitr)
library(kableExtra)
df <- data.frame(Cat = c("Small", "Medium", "Big"), 
                 Layers = c("3", "4", "5"), 
                 Nodes = c("64", "64,128", "64,128,256"))
kable(df, col.names = c("", "Number of layers", "Number of nodes in hidden layers"), 
      escape = FALSE, caption = "Tuning model sizes in the neural-network classifier training.") %>%
  kable_styling(latex_options = "striped")
```


# Predicted cell types
`predicted.label` is a vector of the same length with the number of cells in the testing dataset, containing all cell's predicted cell type. It can be viewed directly: 

```{r, eval = TRUE}
head(predicted.label)
table(predicted.label)
```
[**Optional**] If you have the true cell type labels for the testing dataset, you may evaluate the predictive performance by a confusion matrix: 
```{r, eval = TRUE}
table(predicted.label, Seg_true_cell_label)
```
You may also draw a Sankey diagram to visualize the prediction accuracy: 

```{r, eval = TRUE, message = FALSE, echo=F}
library(networkD3)
source = rep(NA, choose(length(unique(Seg_true_cell_label)),2)+length(unique(Seg_true_cell_label)))
target = rep(NA, choose(length(unique(Seg_true_cell_label)),2)+length(unique(Seg_true_cell_label)))
value = rep(NA, choose(length(unique(Seg_true_cell_label)),2)+length(unique(Seg_true_cell_label)))
links = data.frame(source, target, value)
cfm = table(predicted.label, Seg_true_cell_label)
id = 1
for(i in 1:ncol(cfm)){
  for(j in i:nrow(cfm)){
    links[id,1] = paste0(colnames(cfm)[i], "_true")
   links[id,2] = paste0(rownames(cfm)[j], "_pred")
   links[id,3] = cfm[j,i]
    id = id + 1
  }
}

nodes <- data.frame(
  name=c(paste0(colnames(cfm), "_true"), paste0(rownames(cfm), "_pred"))
)

links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

p <- sankeyNetwork(Links = links, Nodes = nodes,
              Source = "IDsource", Target = "IDtarget",
              Value = "value", NodeID = "name", 
              sinksRight=FALSE)
p
```


# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```