---
title: "Proteomics Data in Annotation Hub"
output:
  BiocStyle::html_document:
     toc: true
     toc_depth: 1
vignette: >
  % \VignetteIndexEntry{Proteomics Data in Annotation Hub}
  % \VignetteEngine{knitr::rmarkdown}
  % \VignetteKeyword{proteomics, mass spectrometry, data}
  % \VignettePackage{ProteomicsAnnotationHubData}
---


```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```

```{r env, echo=FALSE, message=FALSE}
suppressPackageStartupMessages(library("AnnotationHub"))
suppressPackageStartupMessages(library("ProteomicsAnnotationHubData"))
suppressPackageStartupMessages(library("mzR"))
```

**Package**: `r Biocpkg("ProteomicsAnnotationHubData")`<br />
**Authors**: `r packageDescription("ProteomicsAnnotationHubData")[["Author"]]`<br />
**Modified:** `r file.info("ProteomicsAnnotationHubData.Rmd")$mtime`<br />
**Compiled**: `r date()`

# Introduction

About `r Biocpkg("AnnotationHub")`:

> This package provides a client for the Bioconductor AnnotationHub
> web resource. The AnnotationHub web resource provides a central
> location where genomic files (e.g., VCF, bed, wig) and other
> resources from standard locations (e.g., UCSC, Ensembl) can be
> discovered. The resource includes metadata about each resource,
> e.g., a textual description, tags, and date of modification. The
> client creates and manages a local cache of files retrieved by the
> user, helping with quick and reproducible access.

The goal of `r Biocpkg("ProteomicsAnnotationHubData")` is to expand
this functionality to mass spectrometry and proteomics data.


See the `AnnotationHub`'s
[How-To](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html)
and
[Access the AnnotationHub Web Service](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html)
vignettes for a description on how to use it.

# Accessing proteomics data 

```{r ahinit}
library("AnnotationHub")
ah <- AnnotationHub()
ah
```

We can extract the entries that originate from the PRIDE database:

```{r provider}
query(ah, "PRIDE")
```

Or those of a specific project

```{r title}
query(ah, "PXD000001")
```

To see the metadata of a specific entry, we use its AnnotationHub
entry number inside single `[`

```{r ah49008}
ah["AH49008"]
```

To access the actual data, raw mass spectrometry data in this case, we
double the `[[`

```{r rawmsdata}
library("mzR")
rw <- ah[["AH49008"]]
rw
```

In this case, we have an instance of class `r as.character(class(rw))`,
that can be processed as anticipated

```{r msdataplot}
plot(peaks(rw, 1), type = "h", xlab = "M/Z", ylab = "Intensity")
```

In the short demonstration above, we had **direct** and
**standardised** access to the raw data, without a need to manually
open this raw data or worry about the file format. The data was
prepared and converted into a **standard Bioconductor data types** for
immediate consumption by the user. This is also valid for other
relevant data types such as identification results, fasta files or
protein of peptide quantitation data.

# Available datasets

To list all available proteomics datasets, one can query
`AnnotationHub`, as described above, or using the following variable
defined in the `ProteomicsAnnotationHubData` package:

```{r availablepahd}
library("ProteomicsAnnotationHubData")
availableProteomicsAnnotationHubData
```

## PXD000001

Description

> Four human TMT spliked-in proteins in an Erwinia carotovora
> background. Expected reporter ion ratios: Erwinia peptides:
> 1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST):
> 10:5:2.5:1:2.5:10; BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1;
> PhosB spike (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
> (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.


Four data files from the PRIDE
[PXD000001](http://www.ebi.ac.uk/pride/archive/projects/PXD000001)
experiment are served through `AnnotationHub`.

1. The raw mass spectrometry data from the
   `TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML`
   file from the PRDIE ftp site, served as an `mzRpwiz` object, from
   the `r Biocpkg("mzR")` package.

2. The peptide-level quantitation data from the
   `F063721.dat-mztab.txt` file from the PRIDE ftp site, served as an
   `MSnSet` object, from the `r Biocpkg("MSnbase")` package.

3. The protein data base, via the `erwinia_carotovora.fasta` file from
	the PRIDE ftp server, served as a `AAStringSet` object, from the
	`r Biocpkg("Biostrings")` package.

4. The identification results, produced using the `MSGF+` search
   engine, served as a `mzRident` object, from the `r Biocpkg("mzR")`
   package.


# Adding new datasets

To suggest updates and/or new mass spectrometry and/or proteomics
data, please post your suggestions/request on the
[Bioconductor support site](https://support.bioconductor.org/) or open
a
[github issue](https://github.com/lgatto/ProteomicsAnnotationHubData/issues).
Contributions can also be made using
[github pull requests](https://github.com/lgatto/ProteomicsAnnotationHubData).

## Input files

Starting with `r Biocpkg("ProteomicsAnnotationHubData")` version
`1.1.2`, preparing data for submission can be done by writing simple
metadata files in Debian Control File (DCF) format. DCF is a simple
format for storing key:value pairs in plain text files that can easily
be directly read and written by humans. For example, package
DESCRIPTION files follow the DCF format. See the details section in
[`?read.dcf`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/dcf.html)
for details about the format.

Each DCF file can document one or more data files and, as opposed to
the default R specification, comment lines starting with a `#` are
supported (inline comments are not supported).  The fields that
**must** be documented in these ProteomicsAnnotationhubData (PAHD)
files are detailed in the next section. 

An example, taken from `r dir(system.file("extdata", package =
"ProteomicsAnnotationHubData"), full.names = TRUE, pattern =
"PXD000001.dcf")` is illustrated below:

```{r, echo=FALSE}
cat(readLines(dir(system.file("extdata", package = "ProteomicsAnnotationHubData"), full.names = TRUE, pattern = "PXD000001.dcf"))[10:30], sep = "\n")
```

The `writePahdTemplate` function prepares a PAHD DCF template.

```{r}
writePahdTemplate()
```

## Required data and metadata

This section describes how `ProteomicsAnnotationHubData` metadata
ojects are described and generated. See also the 
`r Biocpkg("AnnotationHub")` package for additional documentation. Below
is an excerpt of `PXD000001.dcf`

```{r PXD000001.dcf, echo=FALSE}
f <- list.files(system.file("extdata", package = "ProteomicsAnnotationHubData"),
                full.names = TRUE, pattern = "PXD000001.dcf")
cat(readLines(f)[10:30], sep = "\n")
```

### Title {-}

The title of a file should always be prefixed with its experiment
identifier, such as

### Description {-}

A short description of the experiment, generally a couple of lines.

### Source types {-}

These 3 field document the type/format of the original file and the R
data class the file will be converted to. 

|-------------------|---------|---------|-----------|--------------|--------|
| **SourceType**    |  mzML   |   mzTab |   mzid    |      FASTA   | MSnSet |
| **DispatchClass** | mzRpwiz |  MSnSet |  mzRident |  AAStringSet | MSnSet |
| **RDataClass**    | mzRpwiz |  MSnSet |  mzRident |  AAStringSet | MSnSet |

### Recipe {-}

The function that converts the data into its R data class. See below
for further details.

### RDataPath {-}

The path to the R data file (see the scenarios below for more details).

### Location_prefix {-}

The path to the location of the file. Use `S3` if the file will be
stored on the Amazon S3 instance or `PRIDE` if the file is to be
retrieved from the PRIDE resource.

### SourceUrl {-}

The URL of the original source file. Use `S3` if the file will be
stored on the Amazon S3 instance or `PRIDE` if the file is to be
retrieved from the PRIDE resource.

### Species {-}

Scientific species name.

### TaxonomyId {-}

The NCBI taxonomy identifier. Can be found by searching the species
name in http://www.ncbi.nlm.nih.gov/taxonomy.

### File {-}

The name of the source file.

### DataProvider  {-}

The original provider of the data.  A list of predefined/tested
providers.

```{r providers, echo=FALSE}
tab <- do.call(rbind, ProteomicsAnnotationHubData:::ProteomicsAnnotationHubDataProviders)
knitr::kable(tab, row.names = FALSE)
```

### Maintainer {-}

Resource maintainer name and email address.

### Tags {-}

Frer from tags. A list of suggested tags is shown below. These
suggestions will be updated and completed over time.

```{r tags, echo=FALSE}
ProteomicsAnnotationHubData:::ProteomicsAnnotationHubDataTags
```

## Data location and associated metadata

### Overview {-}

The data accessed through the `r Biocpkg("AnnotationHub")`
infrastructure exists, in different forms, in different
locations. These locations can be the user's computer, the
AnnotationHub Amazon S3 instance and the original data
provider. Multiple scenarios are can occur:

1. The data originates from the provider's public repository. It is
   directly served to the user, from that third-party server, with
   possible local processing/coercion and made accessible as a
   Bioconductor data object.

2. The data originates from the provider's public repository. However,
   conversion to a Bioconductor data object is time-consuming or it is
   anticipated that this would be repeated many times. The data is
   therefor copied, processed and stored on the AnnotationHub Amazon
   S3 instance and server from there upon request.

3. The original file is not available from a data provider, and is
   stored on the AnnotationHub Amazon S3 instance and, possibly
   pre-processed. Upon request, it is served to the user.

### Definitions {-}

- The `Recipe` is a short function, typically named
  `NameOfDataOrigformatToFinalformat`, that generally converts the
  original data into on compatible with R/Bioconductor or enable to
  read the data directly using a special data accessor.

    For example, for some `fasta` files, the recipe function uses the
    `Rsamtools::indexFa` function to create an index file without
    converting the original file. Similarly, raw mass-spectrometry
    files are not converted into objects *per se*, but an accessor
    object is produced to extract data directly from the data file.

- `Location_Prefix` is either `S3`, when the file to be loaded/read by
  the user exists on the AH Amazon S3 instance, or `PRIDE` when it
  lives on the PRIDE ftp server. (These will be replaced by
  `.amazonBaseUrl` and `.prideBaseUrl` respectively during data
  preparation.)

- `SourceUrl` is the full location of the original file. This is
  generally the third-party server, but not necessarily.

- `RDataPath` is the path and filename of the file to be read into R
  and provided to the user. This field does not contain the server
  address (`.prideBaseUrl`/`PRIDE` or `.amazonBaseUrl`/`S3`, see
  `Location_Prefix`).

- The metadata list, used to create the `AnnotationHubResources` also
  uses a `SourceBaseUrl`, which is the full url minus file name (that
  is in `File`) of the original file. Used to construct `SourceUrl`.

### Examples {-}

Refering back to the scenarios described above

#### Scenario 1 {-}

Files that are downloaded from the third-party resource, in our case
PRIDE, and loaded directly into R without any pre-processing:

- the `Recipe` argument **must be** `NA`. Leave empty in the DCF file.
- the `Location_Prefix` should be `PRIDE` (`.prideBaseUrl`).
- the `RDdataPath` should be `sub(.prideBaseUrl, "", SourceUrl)`
- the `SourceUrl` should be the actual full url on third-party server.

If the data is pre-processed, a `Recipe` must be provided.

An example from the `PXD000001` data set is the raw `mzML` file, which
is directly downloaded from the PRIDE server and read into R as an
`mzRpwiz` object:

```
SourceType: mzML
RDataClass: mzRpwiz
Recipe: NA
Location_Prefix: ftp://ftp.pride.ebi.ac.uk/
RDataPath: pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
SourceUrl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
```

#### Scenario 2 {-}

Files that need to be downloaded from a third-party provider such as
the PRIDE server, pre-processed and the pre-processed product is
stored on AnnotationHub Amazon s3 machine. The user directly gets the
object from Amazon S3 instance:

- the `Recipe` argument **should not be** `NA`.
- the `Location_Prefix` **should be** the `.amazonBaseUrl`.
- the `RDataPath` should correspond to the directory structure after
  `.amazonBaseUrl` on the Amazon s3 instance. Typically, the directory
  structure on the Amazon S3 instance mimics the directory structure
  on the original server.
- the `SourceUrl` should be the actual url on third-party server.

An example from the `PXD000001` data set is the `fasta` file. It
originates from the
[PRIDE ftp server](ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/),
but is processed into and `AAStringSet` and stored/server on the
AnnotationHub Amazon S3 instance.

```
SourceType: FASTA
RDataClass: AAStringSet
Recipe: ProteomicsAnnotationHubData:::PXD000001FastaToAAStringSet
Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath: pride/data/archive/2012/03/PXD000001/erwinia_carotovora.rda
SourceUrl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta
```

Another example is the `mzTab` file with peptide-level quantitation
data, that is served from the Amazon instance as an `MSnSet` object.

```
SourceType: mzTab
RDataClass: MSnSet
Recipe: ProteomicsAnnotationHubData:::PXD000001MzTabToMSnSet
Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath: pride/data/archive/2012/03/PXD000001/F063721.dat-MSnSet.rda
SourceUrl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt
```

#### Scenario 3 {-}

The original data file and the Bioconductor data object are stored on
the AnnotationHub Amazon S3 instance and directly served to the user
upon request.

An example from the `PXD000001` data set is the `mzid` file, which is
not available from the PRIDE ftp server (only a Macot `dat` file is
provided).

```
SourceType: mzid
RDataClass: mzRident
Recipe: NA
Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath: pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
SourceUrl: http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
```

## Data preparation script

The fully completed DCF files are added to `r
Biocpkg("ProteomicsAnnotationHubData")`'s `extdata` directory and
named accordin to the dataset's identifier using the `dcf` extension.

Once the above metadata is prepared in one or multiple DCF files,
these can be read into R with `PAHD`. Data preparation scripts are
added to If you have new types of data, please contact `r
Biocpkg("ProteomicsAnnotationHubData")`'s maintainer.  `r
Biocpkg("ProteomicsAnnotationHubData")`'s `scripts` directory. Below
are the first four lines of `PXD000001.R`:


```{r PXD000001.R, echo=FALSE}
f <- list.files(system.file("scripts", package = "ProteomicsAnnotationHubData"),
                full.names = TRUE, pattern = "PXD000001.R")
cat(readLines(f)[1:4], sep = "\n")
```

The rest of the preparation script calls various functions from the
`r Biocpkg("AnnotationHubData")` package to create valid
`AnnotationHubMetadata` instances. At the end, it is important to
serialise the metadata objects in the `extdata` directory, as these
will be used in the unit tests described below.

## Preparer functions {-}

Preparer functions and recipes are only required if the `rda` file is
prepared on the `AnnotationHub` Amazon S3 instance.

Below are the relevant functions for `mzRpwiz`, `mzRIdent`, `MSnSet`
and `AAStringSet` resources. These are defined in `r
Biocpkg("AnnotationHub")` `/R/AnnotationHubProteomicsResource-class.R`
file. 

```{r, eval=FALSE}
setClass("mzRpwizResource", contains="AnnotationHubResource")
setMethod(".get1", "mzRpwizResource",
    function(x, ...) 
{
    .require("mzR")
    yy <- cache(.hub(x))
    mzR::openMSfile(yy, backend = "pwiz")
})
```

```{r, eval=FALSE}
setClass("mzRidentResource", contains="AnnotationHubResource")
setMethod(".get1", "mzRidentResource",
    function(x, ...) 
{
    .require("mzR")
    yy <- cache(.hub(x))
    mzR::openIDfile(yy)
})
```
```{r, eval=FALSE}
setClass("MSnSetResource", contains="RdaResource")
setMethod(".get1", "MSnSetResource",
    function(x, ...) 
{
    .require("MSnbase")
    callNextMethod(x, ...) 
})
```
```{r, eval=FALSE}
setClass("AAStringSetResource", contains="AnnotationHubResource")
setMethod(".get1", "AAStringSetResource",
     function(x, ...) 
{
    .require("Biostrings")
    yy <- cache(.hub(x))
    Biostrings::readAAStringSet(yy)
})
```

If you have new types of data, please contact `r
Biocpkg("ProteomicsAnnotationHubData")`'s maintainer.

## Testing

<!-- ### Generic tests {-} -->

<!-- **TODO**, when generating the AH data -->

<!-- - Check that `SourceType`, `DispatchClass` and `RDataClass` match. -->
<!-- - Checking file types that have a base URL on amazon or PRIDE. -->

### Experiment/data unit tests {-}

When new data/experiments or even file types are added, the procedure
to add new `AnnotationHub` items will be streamlined, revised,
simplified and hopefully automated. To make sure that any of these
updates do not alter the format/annotation, a set of
experiment-specific unit tests are set up, that compare the metadata
created in this package and the metadata extracted from
`AnnotationHub`.


See for example `./tests/testthat/test_PXD000001.R`.