---
title: "Using MSGFplus"
author: "Thomas Lin Pedersen"
date: "August 1st 2014"
output: 
    rmarkdown::html_vignette:
        css: styles.css
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Using MSGFgui}
  %\usepackage[utf8]{inputenc}
---

* * *

This document describe how to use the MSGFplus package to perform MS-GF+ 
analyses on liquid-chromatography tandem mass-spectroscopy (LC-MS/MS) data. It 
will walk you through creating a parameter set and initialising the MS-GF+ 
analysis in different ways.

The different parameters needed to run MS-GF+ will not be discussed here as they
are well documented on the 
[MS-GF+ webpage](http://proteomics.ucsd.edu/software-tools/ms-gf/).

```{r, eval=TRUE, echo=FALSE, results='hide', message=FALSE}
require(MSGFplus)
```

## Introduction

One of the most ubiquitous steps of modern proteomics is the detection of 
peptides from MS/MS fragmentation data. Usually the results of such an analysis
will be used to infer the presence and possibly the quantity of proteins in your
starting material. Automatic identification of peptides from LC-MS/MS 
experiments was made possible with the SEQUEST algorithm from 1994, but the 
process has constantly been refined and improved. Currently there exists a range
of different algorithms for performing the identification task, all with 
strengths and weaknesses, and MS-GF+ is an increasingly popular one of them.
This package makes it possible to run MS-GF+ directly from within R, but is not
a reimplementation of the algorithm. Underneath it all it is still the same java
code performing the analysis.

Related in functionality to this package is 
[rTANDEM](http://www.bioconductor.org/packages/release/bioc/html/rTANDEM.html),
which provides an R interface for the X! Tandem algorithm, in much the same way 
as this package does for MS-GF+. As stated above each algorithm can have an 
upper hand on certain kind of data and the consensus is increasingly to use 
multiple algorithms on your data and combine the results. Having multiple 
algorithms available from R is only making this easier. Other packages in the 
Bioconductor project are concerned with other aspects of the proteomic workflow 
and interested readers can get an overview in the 
[proteomics package directory](http://www.bioconductor.org/packages/release/BiocViews.html#___Proteomics).

### Data in the package
As LC-MS/MS data tend to be very big, example files are not included in the 
package. If you do not have access to any LC-MS/MS data but wish to experiment
with the package, raw data can be obtained from different data repositories 
(e.g. [PeptideAtlas](http://www.peptideatlas.org/repository/)). The only 
included data is a very short fasta file (milk-proteins.fasta), that contains 
the 3 caseins and 2 whey proteins that make up the bulk of proteins in milk.

## Creating a parameter set

The cornerstone of the MSGFplus package is the msgfPar class. Instances of this
class can be created in a number of ways, and after creation they can be 
modified in many ways.

### Building a parameter set iteratively

To create an empty msgfPar object use the eponymous creator function:

```{r, eval=TRUE, echo=TRUE}
par <- msgfPar()
show(par)
```

In general parameters that are not set will use the default value. As a minimum
it is needed to specify the fasta file used as a database during peptide search

```{r, eval=TRUE, echo=TRUE}
databaseFile <- system.file('extdata', 'milk-proteins.fasta', package='MSGFplus')
db(par) <- databaseFile
```

Usually the default parameters aren't a good match (at least not all of them) 
and blindly running an analysis with the default values are not recommended. All
the different parameters can be accessed and modified using relevant setter and
getter methods. In the following they are all illustrated:

```{r, eval=TRUE, echo=TRUE}
tolerance(par) <- '20 ppm'      # Set parent ion tolerance
chargeRange(par) <- c(2, 6)     # Set the range of charge states to look after
lengthRange(par) <- c(6, 25)    # Set the range of peptide length to look after
instrument(par) <- 'QExactive'  # Set the instrument used for acquisition
enzyme(par) <- 'Trypsin'        # Set the enzyme used for digestion
fragmentation(par) <- 0         # Set the fragmentation method
protocol(par) <- 0              # Set the protocol type
isotopeError(par) <- c(0,2)     # Set the isotope error
matches(par) <- 2               # Set the number of matches to report per scan
ntt(par) <- 1                   # Set number of tolerable termini
tda(par) <- TRUE                # Use target decoy approach

par
```

As can be seen the show method of the object gives a clear overview over the
parameters that have been set.

The last parameter that can be added are modifications to expect in your data. 
The syntax for this is a bit different as the nature of the parameter is more 
complex:

```{r, eval=TRUE, echo=TRUE}
mods(par)[[1]] <- msgfParModification(name = 'Carbamidomethyl', 
                                      composition = 'C2H3N1O1', 
                                      residues = 'C', 
                                      type = 'opt', 
                                      position = 'any')
mods(par)[[2]] <- msgfParModification(name = 'Oxidation', 
                                      mass = 15.994915, 
                                      residues = 'M', 
                                      type = 'opt', 
                                      position = 'any')
nMod(par) <- 2                  # Set max number of modifications per peptide

par
```

### Other ways to define parameters

Apart from building up the parameters iteratively, they can also be specified at
creation time.

```{r, eval=TRUE, echo=TRUE}
par <- msgfPar(database = databaseFile, 
               tolerance = '20 ppm', 
               tda=TRUE,
               instrument='QExactive')

par
```

A third method is to read parameter data from a result file generated by MS-GF+.
This makes it easy to quickly replicate the parameter used for a certain search
in order to compare results.

```{r, eval=FALSE, echo=TRUE, results='hide'}
par <- msgfParFromID('/path/to/results/file.mzid')

par
```

The last method of setting up a parameter set is with a simple gui that is 
pretty much self-explanatory:

```{r, eval=FALSE}
require(gWidgets)

par <- msgfParGUI()
```

## Running MS-GF+

When a parameter set has been defined to your likening you can start an MS-GF+
analysis in two ways, depending on your aim.

### Running MS-GF+ in batch mode

To start an MS-GF+ run for one or several raw data files the method 
`runMSGF()` is used:

```{r, eval=FALSE, echo=TRUE}
res <- runMSGF(par, 'your_rawfile.mzML')
```

If a vector of file paths are provided, these will be run in succession. By 
default result files are written besides the original rawfiles with an mzid
extension instead of their original extension (silently overriding exiting
files). Alternatively a vector of the same length as the number of raw files can
be provided to use as save names.

The results are reimported into R as either an mzID or mzIDCollection object,
depending on the number of raw files. If importing is not desired it can be 
avoided with `import=FALSE`.

### Running MS-GF+ asynchronously

If `async=TRUE` is set in `runMSGF()` an asynchronous run of MS-GF+ is started.
In contrast to running in batch mode, only one file at a time can be run
asynchronously and as such it is mostly useful for embedding in code, where you
want to use the time when MS-GF+ is running to perform some other tasks in R.
When running asynchronously `runMSGF()` returns an object of class msgfAsync.
This object is used to check whether MS-GF+ has finished analysing your code, 
and in the case that it has finished, import the results.

```{r, eval=FALSE, echo=TRUE}
msgf <- runMSGF(par, 'your_rawfile.mzML', async=TRUE)

while(running(msgf)) {
    Sys.sleep(1)             # You could arguably do more meaningfull stuff here
}
if(finished(msgf)) {
    res <- import(msgf)
}
```

### Running MS-GF+ from a GUI
GUI's are so nice that it got it's own package. Check out 
[MSGFgui](https://github.com/thomasp85/MSGFgui) (also part of Bioconductor) if 
that is more to your taste. Besides running analysis it got a whole range of 
ways to investigate your results.

## Session
```{r, eval=TRUE, echo=TRUE}
sessionInfo()
```