---
title: "HiResTEC-workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{HiResTEC-workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`HiResTEC` aims to find and validate measurement signals in mass spectrometry data
that indicate the incorporation of a tracer molecule, i.e. $^{13}C$ which is a
common approach in metabolic flux analysis.

To this end, at least 3 measurement files containing non-labeled samples and 
3 measurement files containing labeled samples need to be available. This vignette 
will demonstrate the general workflow on a set of 18 samples from two groups and
three time points.

Let's start with loading the example data set provided with the package and
setting up a sample table describing these files.

```{r load_raw_data}
raw <- HiResTEC::raw
sam <- data.frame(
  "gr" = rep(rep(c("A","B"), each=3), 3), 
  "tp" = rep(0:2, each=6), 
  "rep" = rep(1:3,6)
)
sam[,"ids"] <- apply(sam, 1, paste0, collapse="_")
head(sam)
```

It contains 18 measurements stored in a list as `xcmsRawLike` objects. The samples
belong to two groups (annotated as **A** and **B** in column `gr`) and were taken 
at three different time points (annotated as **0**, **1** and **2** in column `tp`). 
We can plot the base peak chromatograms (BPCs) for these files to observe that they 
contain only a small retention time window (appr. 10s) and a limited mass range.

```{r show_raw_bpc, fig.dim=c(8,6), out.width="98%"}
length(raw)
class(raw[[1]])
raw[[1]]@mzrange
bpc <- lapply(raw, HiResTEC::getMultipleBPC)
HiResTEC::plotBPC(bpc, type = "bpc", mfrow = c(3,6), ids = sam[,"ids"])
```

There appears to be a peak at rt = 1026.5s in all files. Let's deconvolute the
ion intensities at this rt and plot the spectrum.

```{r show_spectrum}
rt <- 1026.5
s <- HiResTEC::DeconvoluteSpectrum(dat = raw, rt = rt)
InterpretMSSpectrum::PlotSpec(x = s)
```

The base peak is observed at m/z = 556.263. We can extract the mass isotopomer 
distribution (MID) of m/z 556.263 and its isotopes for all files. For better
visibility we omit plotting the intermediate time point **1** and group **B**.

```{r show_mid, fig.dim=c(8,6), out.width="98%"}
mz <- 556.263
mz_dev <- 0.4
bpc <- lapply(raw, function(x) { HiResTEC::getMultipleBPC(x = x, mz = mz + c(0:6)*1.003355, mz_dev = mz_dev) })
flt <- sam[,"tp"] %in% c(0, 2) & sam[,"gr"] == "A"
HiResTEC::plotBPC(bpc[flt], mfrow = c(2,3), ids = sam[flt,"ids"])
```

The figure shows that the MID changes systematically between replicate samples
of time points 0 and 2. To test hypotheses of m/z pairs indicating label 
incorporation `HiResTEC` needs a peak list. You can use any peak picking 
algorithm to get such a peak list. `HiResTEC` accepts an `xcmsSet` result as input 
or alternatively a numeric matrix containing `mz` and `rt` information in 
the first two columns followed by peak intensities of all samples.

We can prepare such a peak list for the example data using the maxima from the
BPCs we just extracted.

```{r get_peak_list}
int <- sapply(bpc, function(x) { x[attr(x, "maxBPC"),] })
colnames(int) <- sam[,"ids"]
xg <- data.frame("mz" = as.numeric(rownames(int)), "rt" = rep(rt,7), int, row.names = NULL)
xg[,1:8]
```

Finally, we can apply the two main functions of `HiResTEC`. First, we evaluate our 
peak list to identify interesting m/z pairs.

```{r EvaluatePairsFromXCMSSet, results="hide"}
preCL <- HiResTEC::EvaluatePairsFromXCMSSet(xg = xg, tp = sam$tp, gr = sam$gr, dmz = 0.04)
```
```{r EvaluatePairsFromXCMSSet_result}
head(preCL[order(preCL[,"P"]),3:7])
```


The function finds all relevant pairs within the specified parameters (allowed 
mass difference, allowed rt window, see help file of *EvaluatePairsFromXCMSSet()* 
for details) and tests each pair using an ANOVA model based on the group and 
time point information of each sample. 

Above, we sorted the output according to the P-value (column `P`). We could also 
have sorted according to column `dR` which gives the change in the intensity ratio 
between the first and the last time point specified. However, the lowest P-value 
is obtained for an m/z pair {559, 561} within the MID of our peak. The correct 
solution would be m/z pair {556, 561}.

The task to pick the best candidate, to apply rigorous quality control and avoid 
redundancy in the final result is achieved by the second step of the evaluation,
where we cross check the preliminary candidate list **preCL** against the raw data.

```{r EvaluateCandidateListAgainstRawData, results="hide"}
finCL <- HiResTEC::EvaluateCandidateListAgainstRawData(
  x = preCL, tp = sam$tp, gr = sam$gr, dat = raw, dmz = 0.04, rolp = "all"
)
```
*EvaluateCandidateListAgainstRawData()* will result in a l*oooo*ng list in a real
non-targeted experiment. However, as we only tested the MID of a single peak and 
decided to remove redundancy (parameter **rolp**), only a single candidate remains.

This candidate can be exported to Excel, and can be checked by generating
quality control plots in a PDF using function *GenerateQCPlots()*. Four QC plots
are generated per candidate. First, the enrichment in all samples is depicted.

```{r QCplot1, fig.dim=c(9,5), out.width="98%", echo=FALSE}
HiResTEC:::CandidateBoxplot(res = finCL[[1]])
```

The enrichment is estimated based on the assumption that **n**, the difference 
between mz1 and mz2 or the number of incorporated tracer atoms, is equivalent to
the total number of tracer atoms in the molecule. This assumption is generally
incorrect, but the obtained values are a good approximation for the amount of 
labeling. 

Time points are color coded and different plotting symbols are used for groups. 
Individual samples can be identified in the left subplot due to their number from 
the sample list. The annotation 'dE' in the left subplot provides the maximum 
difference in enrichment between samples. The P-values annotated in the right
subplot are obtained from the ANOVA result. In short, if 'dE' is large and 'P_tp'
is small, it is worth looking at the other QC plots.

The second QC plot provides the BPCs as shown before. Let's limit this figure 
this time to the first replicate of each time point and group.

```{r QCplot2, fig.dim=c(8,6), out.width="98%", echo=FALSE}
flt <- sam[,"rep"] == 1
bpc <- finCL[[1]][["bpc"]]
HiResTEC::plotBPC(bpc[flt], mfrow = c(3,2), ids = sam[flt,"ids"])
```

Numerous things can be checked in this plot. Obviously the peak shape of all masses
should be nice and co-located. 
***Note!*** that the intensities are depicted log10-transformed to enhance small 
signals. 
The spectra on the right hand subplots are not log10-transformed to emphasize 
differences between time points. Each spectrum depicts the intensities from the 
scan in the left subplot indicated by the grey line, usually the peak center. 
In the spectra, one should confirm that the ratio of mz1 (black) and mz2 (purple) 
is really changing. This sounds trivial, but is sometimes not the case if 
errors inpeak picking or deconvolution occurred.

Also, the small numbers at the top of each mass intensity are very informative.
The indicate the difference of the measured mass from the theoretical mass in mDa.
The theoretical mass of a M+5 isotope in a tracer experiment using $^{13}C$ would be
calculated by $M + 5 \times 1.0034$ (because 1.0034 is the difference between $^{12}C$ 
and $^{13}C$). In GC-APCI-MS (which this example is coming from), analytes are
derivatized before analysis, often using TMS groups (Tri-Methyl-Silyl). TMS 
contains silicon and silicon isotopes have a different mass difference than carbon.
This leads to the effect of negative mass deviations in the above spectra for
higher isotopes in non labeled samples, as here silicon determines the mass and
not carbon. In labeled samples (tp=1 and tp=2), carbon starts to determine the
mass deviation instead of silicon.

In consequence, observing a strong negative mass deviation in tp=0 samples for 
mz2 (purple) and a minor mass deviation in labeled samples is a strong indicator
for successful tracer incorporation.

The third and fourth QC plots provide spectra deconvoluted from raw data. These
should be checked to confirm if the candidate m/z pair is representative for
the compound (is it the base peak? is it the likely M+H?).

```{r QCplot3, fig.dim=c(8,6), out.width="98%", echo=FALSE}
x <- finCL[[1]]
mz1 <- x[["mz1"]]
mz2 <- x[["mz2"]]
tmp.txt <- data.frame("pos" = c(mz1, mz2), "val" = c("mz1", "mz2"))
ylim <- c(0, max(c(x[["s"]][abs(x[["s"]][, "mz"] - (mz1 + 4)) < 10, "int"], x[["s2"]][abs(x[["s2"]][, "mz"] - (mz1 + 4)) < 10, "int"]), na.rm = T))
graphics::layout(matrix(c(1, 2), ncol = 1))
InterpretMSSpectrum::PlotSpec(x[["s"]], rellab = x[["s"]][which.min(abs(x[["s"]][, 1] - mz1)), 1], masslab = 0, xlim = mz1 + c(-6, 15), txt = tmp.txt, ylim = ylim)
InterpretMSSpectrum::PlotSpec(x[["s2"]], rellab = x[["s2"]][which.min(abs(x[["s2"]][, 1] - mz2)), 1], masslab = 0, xlim = mz + c(-6, 15), txt = tmp.txt, ylim = ylim)
```

For those candidates looking promising, one can use the `InterpretMSSpectrum` 
package to identify likely sum formulas for each compound, and use the `CorMID` 
package to correct the intensities obtained by `HiResTEC` for natural abundance.

To demonstrate this shortly, `InterpretMSSpectrum` suggests C~22~H~46~N~5~O~4~Si~4~ 
as a possible sum formula.

```{r CorMID}
fml <- "C22H46N5O4Si4"
attr(fml, "nbio") <- 5

# re-extract the BPCs including [M+]
mz <- finCL[[1]]$mz1
rt <- finCL[[1]]$rt
bpc <- lapply(raw, function(x) { 
  HiResTEC::getMultipleBPC(x = x, mz = mz + c(-1:6)*1.003355, mz_dev = 0.04, rt = rt) 
})

# BPCs show about 2.5% [M+] intensity, define r accordingly
r <- setNames(c(0.975, 0.025), nm = c("M+H", "M+"))
int <- sapply(bpc, function(x) { x[attr(x, "maxBPC"),] })
rownames(int) <- paste0("M",-1:6)
colnames(int) <- sam$ids
mid <- apply(int, 2, function(x) { 
  CorMID::CorMID(int = x, fml = fml, r = r) 
})

# show mean group corrected MID
sapply(split(as.data.frame(t(mid)), interaction(sam[,"gr"], sam[,"tp"])), function(x) {
  round(apply(x,2,mean),1)
})
```

As can be seen from the mean corrected MID per group, the estimate of enrichment
shown in the QC plots above (70% and 40% at tp 2 for group A and B respectively) 
is pretty close to the results obtained by `CorMID`.

Have fun, using `HiResTEC` and, in the likely event that something does not work
as expected, let me know.