---
title: "Load Single-Cell Proteomics data using `readSCP`"
author:
  - name: Laurent Gatto
  - name: Christophe Vanderaa
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
bibliography: scp.bib
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('scp')`"
vignette: >
  %\VignetteIndexEntry{Load data using readSCP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
    
```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

# The `scp` data framework

Our data structure is relying on two curated data classes: `QFeatures`
(@Gatto2020-ry) and `SingleCellExperiment` (@Amezquita2019-bf).
`QFeatures` is dedicated to the manipulation and processing of 
MS-based quantitative data. It explicitly records the successive steps
to allow users to navigate up and down the different MS levels.
`SingleCellExperiment` is another class designed as an efficient data
container that serves as an interface to state-of-the-art methods and
algorithms for single-cell data. Our framework combines the two 
classes to inherit from their respective advantages. 

Because mass spectrometry (MS)-based single-cell proteomics (SCP) only
captures the proteome of between one and a few tens of single-cells in
a single run, the data is usually acquired across many MS batches.
Therefore, the data for each run should conceptually be stored in its
own container, that we here call an *assay*. The expected input for 
working with the `scp` package is quantification data of peptide to 
spectrum matches (PSM). These data can then be processed to reconstruct
peptide and protein data. The links between related features across 
different assays are stored to facilitate manipulation and 
visualization of of PSM, peptide and protein data. This is 
conceptually shown below. 

```{r scp_framework, results='markup', fig.cap="`scp` relies on `SingleCellExperiment` and `QFeatures` objects.", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/SCP_framework.png", error = FALSE)
```

There are two input tables required for starting an analysis with 
`scp`:

1. The feature data
2. The sample data

# Feature data

The feature data are generated after the identification and 
quantification of the MS spectra by a pre-processing software such as
MaxQuant, ProteomeDiscoverer or MSFragger (the 
[list](https://en.wikipedia.org/wiki/List_of_mass_spectrometry_software)
of available software is actually much longer). We will here use as an
example a data table that has been generated by MaxQuant. The table is
available from the `scp` package and is called `mqScpData` (for 
MaxQuant generated SCP data). 

```{r, message = FALSE}
library(scp)
data("mqScpData")
dim(mqScpData)
```

In this toy example, there are 1361 rows corresponding to features 
(quantified PSMs) and 149 columns corresponding to different data 
fields recorded by MaxQuant during the processing of the MS spectra. 
The columns can be divided into three categories: 

- Columns holding feature quantifications
- Columns holding feature metadata
- Columns holding MS run metadata

### Feature quantifications

The quantification data can be composed of one (in case of label-free
acquisition) up to 16 columns (in case of TMT-16 multiplexing). The 
columns holding the quantification start with `Reporter.intensity.`
followed by a number. 

```{r}
(quantCols <- grep("Reporter.intensity.\\d", colnames(mqScpData),
                  value = TRUE))
```

As you may notice, the example data was acquired using a TMT-16 
protocol since we retrieve 16 quantification columns. Actually, some 
runs were acquired using a TMT-11 protocol (11 labels) but we will 
come back to this later. 

```{r}
head(mqScpData[, quantCols])
```

### Feature metadata

Most columns in the `mqScpData` table contain information used or 
generated during the identification of the MS spectra. For instance, 
you may find the charge of the parent ion, the score and probability 
of a correct match between the MS spectrum and a peptide sequence, the
sequence of the best matching peptide, its length, its modifications,
the retention time of the peptide on the LC, the protein(s) the peptide
originates from and much more. 

```{r}
head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
                   "Retention.time", "Proteins")])
```

### MS run metadata

This type of metadata is related to the MS instrument. In MaxQuant, 
only the file name generated by the MS instrument is stored. There is one 
file for each MS run, hence the file name can be used as a batch 
identifier.

```{r}
unique(mqScpData$Raw.file)
```

# Sample data

Next to the quantification data and the feature data, sample metadata 
contains the experimental design generated by the researcher. The 
rows of sample metadata correspond to a sample in the experiment and 
the columns correspond to the available information about the sample.
We will here use the second example table:

```{r}
data("sampleAnnotation")
head(sampleAnnotation)
```

This table may contain any information about the samples. For example,
useful information could be the type of sample that is analysed, a 
phenotype known from the experimental design, the MS batch, the
acquisition date, MS settings used to acquire the sample, the LC 
batch, the sample preparation batch, etc... However, `scp` 
**requires** 2 specific fields in the sample metadata:

1. One column containing the MS run names (`Raw.file` in this case). 
   It must have the same name as the name of the column containing the 
   MS run names in the quantification table. This will allow `scp` to 
   correctly match and split data that were acquired separately. This 
   is illustrated below by the linked circles.
2. One column that tells `scp` the names of the columns in the feature
   data holds the quantification of the corresponding sample. This is
   illustrated below by the arrow.

```{r, fig.cap = "Linking the sample data to the feature data", echo = FALSE}
knitr::include_graphics("figures/readSCP_relation.png")
```

# `readSCP`

`readSCP` is the function that converts the sample and the feature 
data into a `QFeatures` object following the data structure described
above, that is storing the data belonging to each MS batch in a 
separate `SingleCellExperiment` object. We therefore provide the 
feature data, the sample data to the function as well as the name of 
the column that holds the batch name in both tables and the name of 
the column in the sample data that points to the quantification 
columns in the feature data. 

## Sample names and `suffix`

`readSCP` automatically assigns names that are unique across all 
samples in all assays. This is performed by appending the name of the 
batch where a given sample is found in with the name of the 
quantification column for that sample. Suppose a sample belongs to 
batch `190222S_LCA9_X_FP94BM` and the quantification values in the 
feature data are found in the column called `Reporter.intensity.3`, 
then the sample name will become `190222S_LCA9_X_FP94BMReporter.intensity.3`.
Optionally, to improve the readability of sample names, `readSCP` can
take a suffix instead of the quantification column name. For instance,
in the example below, we will provide a short suffix with the TMT 
index to remind that samples were multiplexed using TMT.

## Special case: empty samples

In some rare cases, it can be beneficial to remove empty samples (all
quantifications are `NA`) from the assays. Such samples can occur when
samples that were acquired with different multiplexing labels are 
merged in a single table. For instance, the SCoPE2 data we provide as
an example contains runs that were acquired with two TMT protocols. 
The 3 first assays were acquired using the TMT-11 protocol and the 
last assay was acquired using a TMT-16 protocol. When exporting the 
table, the authors combined the data in a single table, were missing 
channels in the TMT-11 data are filled with `NA`. This is essential 
when working in table format, but since `scp` keeps the runs separated 
we can allow for different numbers of channels per run.  When setting 
`removeEmptyCols = TRUE`, `readSCP` automatically detects and removes
columns that contain only `NA`s, 

## Running `readSCP`

We convert the sample and the feature data into a `QFeatures` object
in a single command thanks to `readSCP`. 

```{r readSCP}
scp <- readSCP(featureData = mqScpData,
               colData = sampleAnnotation,
               batchCol = "Raw.file",
               channelCol = "Channel",
               suffix = paste0("_TMT", 1:16),
               removeEmptyCols = TRUE)
```

As indicated by the output on the console, `readSCP` proceeds as follows:

1. If `featureData` is the path to a CSV file, it reads the file using
`read.csv`. The table is converted to a `SingleCellExperiment`
object. `readSCP` needs to know in which field(s) the quantitative
data is stored. Those field name(s) is/are provided by the
`channelCol` field in the `metaData` table. So in this example, the
column names holding the quantitative data in `mqScpData` are
stored in the column named `Channel` in `sampleAnnotation`.

2. The `SingleCellExperiment` object is then split according to
batch. The split is performed depending on the `batchCol` field in
`featureData`, in this case the data will be split according to the
`Raw.file` column in `mqScpData`. `Raw.file` contains the names of the 
acquisition runs that was used by MaxQuant to retrieve the raw data 
files.

3. The sample metadata is generated from the supplied `colData`. Note 
that in  order for `readSCP` to correctly match the feature data 
with the metadata, `colData` must contain the same `batchCol` field
with batch names.

4. Finally, the split feature data and the sample metadata are stored
in a single `QFeatures` object.

Here is a compact overview of the data:

```{r overview}
scp
```

We can see that the object returned by `readSCP` is a `QFeatures` 
object containing 4 `SingleCellExperiment` assays that have
been named after the 4 MS batches. Each assay contains either 11 or 16
columns (samples) depending on the TMT labelling strategy and a
variable number of rows (quantified PSMs). Each piece of information 
can easily be retrieved thanks to the `Qfeatures` architectures.
As mentioned in the previous vignette, sample data is retrieved from 
the `colData`. Note that unique sample names were automatically 
generated by combining the batch name and the suffix provided to 
`readSCP`:

```{r colData}
head(colData(scp))
```

Notice that the sample names were suffixed with TMT indexes.

The feature metadata is retrieved from the `rowData`. Since the 
feature metadata is specific to each assay, we need to tell from which 
assay we want to get the `rowData`:

```{r rowData}
head(rowData(scp[["190222S_LCA9_X_FP94BM"]]))[, 1:5]
```

Finally, we can also retrieve the quantification matrix for an assay
of interest:

```{r assay}
head(assay(scp, "190222S_LCA9_X_FP94BM"))
```

# What about label-free SCP?

The `scp` package is meant for both label-free and multiplexed SCP
data. Like in the example above, the label-free data should contain 
the batch names in both the feature data and the sample data. 
The sample data must also contain a column that points to the columns 
of the feature data that contains the quantifications. Since 
label-free SCP acquires one single-cell per run, this sample data 
column should point the same column for all samples. Moreover, this 
means that each PSM assay will contain a single column. 

# What about other input formats?

`readSCP` should work with any PSM quantification table that is output
by a pre-processing software. For instance, you can easily import the 
PSM tables generated by ProteomeDiscoverer. The batch names are 
contained in the `File ID` column (that should be supplied as the 
`batchCol` argument to `readSCP`). The quantification columns are 
contained in the columns starting with `Abundance `, eventually 
followed by a multiplexing tag name. These columns should be stored in
a dedicated column of the sample data to be supplied as `channelCol`
to `readSCP`. 

If your input cannot be loaded using the procedure described in this 
vignette, you can submit a feature request (see next section).

# Need help?

You can open an issue on the
[GitHub repository](https://github.com/UCLouvain-CBIO/scp/issues) in
case of troubles when loading your SCP data with `readSCP`. Any 
suggestion or feature request about the function or the documentation
are also warmly welcome. 

# Session information {-}

```{r setup2, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    crop = NULL
)
```


```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# Reference