---
title: "Mapping experimental MS data to genomic coordinates"
output:
BiocStyle::html_document:
toc: true
---
```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```
**Package:** [`Pbase`](http://bioconductor.org/packages/devel/bioc/html/Pbase.html)
**Author:** [Laurent Gatto](http://cpu.sysbiol.cam.ac.uk/)
**Last compiled:** `r date()`
**Last modified:** `r file.info("mapping.Rmd")$mtime`
```{r env, message=FALSE, echo=FALSE, cache=FALSE, warning=FALSE}
library("Pbase")
```
## Introduction
The aim of this vignette is to document the mapping of proteins and
the tandem mass spectrometry-derived peptides to genomic locations.
```{r schema, echo=FALSE, fig.cap='Mapping proteins to a genome reference.', fig.align='center'}
Pbase:::mapplot()
```
## Protein and genome data
We will use a small `Proteins` object from the `r Biocpkg("Pbase")`
package to illustrate how to retrieve genome coordinates and map a
peptides back to genomic coordinates. See the `Pbase-data` vignette
for an introduction to `Proteins` data. The main information needed in
this vignette consists of protein UniProt identifiers and a set of
peptides positions along the protein sequence.
```{r p, message=FALSE}
library("Pbase")
data(p)
p
seqnames(p)
```
```{r startend}
pcols(p)[1, c("start", "end")]
```
We will also require an identifier relating the protein feature of
interest to the genome. Below, we use `r Biocpkg("biomaRt")` to query
the matching Ensembl transcript identifier. We start by create a
`Mart` object that stores the connection details to the latest human
Ensembl biomart server.
```{r bm, cache=TRUE, message=FALSE}
library("biomaRt")
ens <- useMart("ensembl", "hsapiens_gene_ensembl")
ens
bm <- select(ens, keys = seqnames(p),
keytype = "uniprot_swissprot",
columns = c(
"uniprot_swissprot",
"hgnc_symbol",
"ensembl_transcript_id"))
bm
```
As can be seen, there can be multiple transcripts for one protein
accession. We have defined the transcripts of interest for our
proteins in `p`; they are stored as protein elements metadata:
```{r enst}
acols(p)$ENST
```
## Genomic transcript coordinates
The `etrid2grl` function takes our transcript identifiers and will
query the Ensembl biomart server (note the `ens` argument) and return
a `GRangesList` object. For each of Ensembl transcript identifiers
provided as input, we have the genomic coordinates of that
transcript's exons as well as additional information such as the type
of exons (protein coding or untranslated region).
```{r etris2grl}
grl <- etrid2grl(acols(p)$ENST, ens)
all.equal(names(grl), acols(p)$ENST,
check.attributes=FALSE)
grl
```
We also need to retain only coding exons and discard untranslated
regions for later peptide mapping, using the `proteinCoding` function.
```{r pc}
pcgrl <- proteinCoding(grl)
pcgrl
```
## Visualisation with `r Biocpkg("Gviz")` and `r Biocpkg("Pviz")`
### Peptides along proteins
```{r echo=FALSE}
pp <- p[5]
n <- elementLengths(pranges(pp))
pepstring <- paste(unique(pcols(pp)[[1]]$pepseq), collapse = ", ")
```
The peptides that have been experimentally observed are available as
ranges (coordinates) along the protein sequences. For example, below,
we isolate and visualise the `r n` peptides (`r pepstring`) have been
identified for our protein of interest `r seqnames(p)[5]`.
```{r}
sort(pranges(p)[5])
plot(p[5])
```
### Exons along the genome
We can also plot the transcript regions inluding (`grl`) or exclusing
(`pcgrl`) the untranslated regions.
```{r}
plotAsGeneRegionTrack(grl[[5]], pcgrl[[5]])
```
## Mapping peptides back to the genome
The aim of this document is to document the mapping of peptides,
i.e. ranges along a protein sequence to ranges along the genome
reference. In other words, our aim is the convert protein coordinates
to genome coordinates.
### Comparing protein and translated DNA sequences
The first check that we want to implement is to verify that we can
regenerate the protein amino acid sequence from the genome regions
that we have extracted.
We also need the actual genome sequence (so far, we have only dealt
with regions and features). The exons coordinates have been retrieved
from the latest Ensembl release, which is based on the human genome
assembly `GRCh38`. We will use a genome package that is based on the
same reference genome, namely `r Biocannopkg("BSgenome.Hsapiens.NCBI.GRCh38")`.
We need to make sure that the chromosomes are named the same way in
the genome sequence data and our genomics ranges (`"chrX"`, as seen
above).
```{r protfromgenome}
library("BSgenome.Hsapiens.NCBI.GRCh38")
head(seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
if (!"chr1" %in% seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23] <-
paste0("chr", seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23])
seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[21:27]
```
Once we have extracted the actual sequences, we must also make sure
that we we reverse the sequences in case out genomic features are on
the reverse strand. We the combine (`unlist`) the exons (coding
sequences only, `pcgrl`) and translate then into a protein sequence.
```{r aaseq}
s <- getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl[[5]])
s
if (isReverse(pcgrl[[5]]))
s <- rev(s)
aaseq <- translate(unlist(s))
aaseq
```
We verify that the translated genome sequence and the protein squence
we started with match by aligning them.
```{r aln}
writePairwiseAlignments(pairwiseAlignment(aa(p[5]), aaseq))
```
### Calculating new coordinates
We can now calculate the peptide coordinate along the genome using the
position of the peptides along the protein (in `p`) and the position
of the exons of the protein's transcript along the genome (in `pcgrl`)
using the `mapToGenome` function.
```{r map}
res <- mapToGenome(p[5], pcgrl[5])
res[[1]]
```
### Plotting
Based on the new peptide genomic coordinates, it is now
straightforward to create a new `AnnotationTrack` and add it the the
track visualisation.
```{r pepcoords, fig.align='center'}
plotAsAnnotationTrack(res[[1]], grl[[5]])
```
### Detailed annotation track
Finally, we customise the figure by adding a track with the $MS^2$
spectra. The raw data used to search the protein database an create
`p` are available as an `MSnExp` object.
```{r msmsspectra, fig.align='center'}
data(pms)
library("ggplot2")
details <- function(identifier, ...) {
p <- plot(pms[[as.numeric(identifier)]], full=TRUE, plot=FALSE) + ggtitle("")
p <- p + theme_bw() + theme(axis.text.y = element_blank(),
axis.text.x = element_blank()) +
labs(x = NULL, y = NULL)
print(p, newpage=FALSE)
}
res <- res[[1]]
deTrack <- AnnotationTrack(start = start(res),
end = end(res),
genome = "hg38", chromosom = "chrX",
id = pcols(p)[[5]]$acquisitionNum,
name = "MS2 spectra",
stacking = "squish", fun = details)
grTrack <- GeneRegionTrack(grl[[5]],
name = acols(p)$ENST[5])
ideoTrack <- IdeogramTrack(genome = "hg38",
chromosome = "chrX")
axisTrack <- GenomeAxisTrack()
plotTracks(list(ideoTrack, axisTrack, deTrack, grTrack),
add53 = TRUE, add35 = TRUE)
```
### Mapping all peptide sets in the `Proteins` object
Above, we have demonstrated the `mapToGenome` functionality on one
protein only. The same operation can be performed on all the `r length(p)`
of the `p` object using the `pmapToGenome` equivalent using
the `r length(pcgrl)` ranges calculated with `etrid2grl`. The `pmapToGenome`
will map the peptides of i-th protein to the i-th genomic location of `pcgrl`.
```{r pmap}
pres <- pmapToGenome(p, pcgrl)
pres
```
## Dealing with multiple transcipts per protein
```{r}
k <- 6
seqnames(p)[k]
```
In the code chunk below, we remind ourselves that, querying the
Ensembl Biomart server for `r seqnames(p)[k]`, we obtain several
possible transcript identifiers, including the identifier of interest
`r acols(p)$ENST[k]`.
```{r remindbm}
sel <- bm$uniprot_swissprot == seqnames(p)[k]
bm[sel, ]
acols(p)$ENST[k]
```
Let's fetch the coordinates of all possible transcipts, making sure
that the names of the Ensembl identifiers are used to name the `grl`
ranges (using `use.names = TRUE`).
```{r etris2grl2}
eid <- bm[sel, 3]
names(eid) <- bm[sel, 1]
eid
grl <- etrid2grl(eid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
```
```{r plot5}
plotAsGeneRegionTrack(pcgrl)
```
### Descriminating transcripts
We extract the transcript sequences, translate them into protein
sequences and align each to our protein sequence (originally imported
from the fasta database, see `?Proteins` for the construction of `p`).
```{r getseq2, warning=FALSE}
lseq <- lapply(getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl),
function(s) translate(unlist(s)))
laln <- sapply(lseq, pairwiseAlignment, aa(p[k]))
sapply(laln, nmatch)/width(aa(p[k]))
```
```{r ki, echo=FALSE}
ki <- which.max(sapply(laln, nmatch))
```
We see that transcript number `r ki`, `r eid[ki]`, perfectly aligns
with our protein sequence. This is also the transcipt that corresponds
to the curated Ensembl transcript in `acols(p)$ENST`.
```{r checkk}
ki <- which.max(sapply(laln, nmatch))
stopifnot(eid[ki] == acols(p)$ENST[k])
```
```{r map2}
res <- pmapToGenome(p[k], pcgrl[ki])
```
As shown on the next figure, peptides that span over exon junctions
are grouped together and, below, colour-coded.
```{r pepcoords2}
plotAsAnnotationTrack(res[[1]], pcgrl[[ki]])
```
One can also apply a many-to-one mapping approach to all proteins in
the `p` object and all the transcripts identifiers fetched with
`etrid2grl` as shown below.
```{r coordall}
alleid <- bm[, 3]
names(alleid) <- bm[, 1]
grl <- etrid2grl(alleid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
res <- mapToGenome(p, pcgrl)
length(res)
```
The messages indicate that one protein accession number was not found
in the `pcgrl` ranges (no transcript was found) and several mapping
failed. In total, we obtain `r length(res)` mapping for
`r length(unique(names(pcgrl)))` protein accession numbers.
## Session information
```{r si}
sessionInfo()
```