---
title: "Introduction to ImmuneSpaceR, BioC 2016"
author: Renan Sauteraud
date: June 25, 2016
autosize: true
output:
html_document:
toc: true
toc_float: true
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Introduction to ImmuneSpaceR}
---
```{r knitr-opts, echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE)
opts_chunk$set(fig.align = "center", fig.width = 10)
```
# Use case: System biology of Influenza
Reproducing results of a publication. **Systems biology of vaccination for seasonal
influenza in humans**. Nakaya et al (2011).
The paper uses system biology approach to study immune response to vaccination
against influenza in three seasons.
Two cohorts vaccinated with TIV (2007 & 2008). And one with LAIV (2008).
# Without ImmuneSpaceR
## Fetching data from GEO
The data was made available on GEO after publication. The [SuperSeries](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29619)
is referenced in the paper.
```{r usecase}
library(GEOquery)
gse <- getGEO("GSE29619")
```
The returned object is a `list` of 3 `ExpressionSet`s. One per SubSeries/cohort.
```{r}
gse
```
The seasons cannot be identified without looking in the objects.
```{r}
names(gse)
```
## phenoData
Let's look at the metadata available on GEO
```{r}
library(Biobase)
es_LAIV <- gse[[1]]
head(pData(es_LAIV), 3)
```
### Issues
* The metadata is poorly formated
* Subject ID & visit in the same column
* HAI results mapped using "characteristic_ch1" columns
* No demographics (age, gender, race)
* No mapping to additional datasets (ELISA, ELISPOT, Flow, GE, etc...)
## Combining data
Without any additional information the various seasons cannot be combined.
```{r combine-es, error = TRUE}
combine(gse[[1]], gse[[2]])
```
The platforms are different accross seasons.
## Summary
In order to do something simple like differential gene expression analyses:
* Parse the phenoData of each `ExpressionSet` for visit and patient ID
* Use `annotate` to map probes to genes
* `combine` into one expression matrix
* Run DGEA (limma)
Combining this information with another dataset would require additional data
cleaning (assuming the IDs are consistent accros experiment).
# With ImmuneSpaceR
This study was funded by HIPC. As a result, all of its data has been
curated and standardized by [ImmPort](http://www.immport.org/immport-open/public/home/home)
and is now publicly available on ImmuneSpace. We will use ImmuneSpaceR to
retrieve the gene-expression data and associated metadata.
## Connect to ImmuneSpace
The study SDY269 holds the data for the two 2008 cohorts, LAIV and TIV.
![](/home/rsautera/Pictures/ImmuneSpace/sdy269.png)
A study overview page on ImmuneSpace
First, we instantiate a connection to the study of interest.
```{r CreateConnection269}
library(ImmuneSpaceR)
sdy269 <- CreateConnection("SDY269")
sdy269
```
The returned `ImmuneSpaceConnection` object shows available datasets and
expression matrices available for the selected study. ImmuneSpace uses
Reference classes or `R5`. Functions are members of the object, thus the $
semantics to access member functions in the following samples.
## Download gene expression
```{r getGEM_TIV2008}
TIV2008 <- sdy269$getGEMatrix("TIV_2008")
TIV2008
```
```{r pdata_TIV2008}
head(pData(TIV2008))
```
Only essential information is displayed in the phenoData,
## Do I have to do this for every cohort?
`getGEMatrix` accepts multiple IDs as input. Additionally setting `summary = TRUE`
returns the matrix with expression values averaged by gene. Allowing an easy
merge of the two ExpressionSets.
```{r getGEM_multicohorts}
es269 <- sdy269$getGEMatrix(c("TIV_2008", "LAIV_2008"), summary = TRUE)
es269
```
For convenience, cohorts can also be downloaded by names.
```{r getGEM_cohortnames}
es269 <- sdy269$getGEMatrix(cohort = c("TIV Group 2008", "LAIV group 2008"), summary = TRUE)
```
## Do I have to do this for every study?
SDY269 only contains the two cohorts asociated with the 2008 season but the GEO
SuperSeries had three cohorts. The data has been split between multiple studies.
SDY61 contains the data for the 2007 season. We have to combine the gene-expression
results accross these two studies to get the data that was used in the publication.
In order to query multiple studies, we can instantiate a connection to all
studies at once.
```{r}
all <- CreateConnection("") #All studies
es <- all$getGEMatrix(c("TIV_2007", "TIV_2008", "LAIV_2008"), summary = TRUE)
head(pData(es))
```
With a single command, we can download the three relevant normalized and
transformed expression matrices, summarize them by gene and combine into a
unique ExpressionSet with relevant phenoData.
It is also a lot faster than GEO as each matrix is available as a single file.
## How about the other datasets?
Unlike expresssion matrices, the other datasets are stored as tables in the
database.
![](/home/rsautera/Pictures/ImmuneSpace/sdy269hai.png)
SDY269 HAI dataset as visible on ImmuneSpace
### Downloading
They can be accessed using the `getDataset` function.
```{r listds}
library(data.table)
sdy269$listDatasets()
hai <- sdy269$getDataset("hai")
hai
```
For performance, the function returns `data.table` objects. It is especially
interesting when querying accross all studies.
```{r getDataset_cross}
ahai <- all$getDataset("hai")
```
### Dataset filtering
Because `getDataset` is a wrapper around `Rlabkey`'s `labkey.selectRows` function,
we use `makeFilter` to create filters that can be interpreted by LabKey.
```{r makeFilter}
library(Rlabkey)
virus_filter <- makeFilter(c("virus", "CONTAINS", "H1N1"))
hai_f <- sdy269$getDataset("hai", colFilter = virus_filter)
virus_filter2 <- makeFilter(c("virus", "EQUAL", "A/Brisbane/59/2007 (H1N1)"))
hai_f <- sdy269$getDataset("hai", colFilter = virus_filter2)
#multiple filters can be specified
analyte_filter <- makeFilter(c("Analyte", "EQUAL", "IFNg"), c("Study time collected", "IN", "0;7"))
elisa <- sdy269$getDataset("elisa", colFilter = analyte_filter)
```
## Cross assay analysis
With the same information accross datasets, we can merge results from various assay
types.
Here we use the elispot and flow cytometry results to reproduce Figure 1d of
Nakaya et al.
```{r cross-assay}
# Elispot
analyte_filter2 <- makeFilter(c("Analyte", "EQUAL", "IgG"), c("Study time collected", "EQUAL", "7"))
elispot <- sdy269$getDataset("elispot", colFilter = analyte_filter2, reload = TRUE)
elispot <- elispot[, elispot_response := spot_number_reported + 1]
elispot <- elispot[, list(participant_id, elispot_response)]
# Flow
fcs <- sdy269$getDataset("fcs_analyzed_result")
fcs <- fcs[, fcs_response := (as.double(population_cell_number) + 1) / as.double(base_parent_population)][study_time_collected == 7]
res <- merge(elispot, fcs, by = "participant_id")
library(ggplot2)
ggplot(res, aes(x = as.double(fcs_response), y = elispot_response, color = cohort)) +
geom_point() + scale_y_log10() + scale_x_log10() + geom_smooth(method = "lm") +
xlab("Total plasmablasts (%)") + ylab("Influenza specific cells\n (per 10^6 PBMCs)") +
theme_IS()
```
Only a few lines were needed to calculate fold-change and a sinle `merge` operation
is enough to combine data accross assays and bind metadata.
# What else?
## Quick plot
For the exploration of most datasets, `quick_plot` is a function that takes
advantage of the standardized datasets to plot the data in a relevant fashion.
It has a limited number of options and should be used for having a first look at
the data.
More advanced plots can be achieved by downloading the data in the R session, as
seen in the use case above.
```{r quick-plot}
sdy269$quick_plot("hai", normalize = FALSE)
sdy269$quick_plot("hai", filter = virus_filter2, normalize = FALSE, color = "Age", shape = "Gender")
```
This is the same function that is used by the interactive
[Data Explorer Module](https://www.immunespace.org/DataExplorer/Studies/begin.view).
```{r qp_cross}
virus_filter3 <- makeFilter(c("cohort", "contains", "TIV"), c("study_time_collected", "IN", "0;21;28;30;180"))
all$quick_plot("hai", filter = virus_filter3, normalize = TRUE, color = "Age")
```
## Online R/Rmd reports
The reports available on ImmuneSpace are written using ImmuneSpaceR. The code is
available and will work both through LabKey's report on the web portal and locally.
![](/home/rsautera/workspace/git/bioc2016/inst/img/sdy180report_cropped.png)
Example of live report available in SDY180
## Technical details
* Reference classes for performance and caching
* Connections to the database are made with `RCurl`
* `data.table` for handling downloaded datasets
## Future
* Rework the instantiation to allow connection to a subset of studies.
* Use [Data Finder](https://www.immunespace.org/project/Studies/begin.view) filters in R.
* Integration of Rstudio to ImmuneSpace (spawn pre-filtered R sessions).
* Accept user made R/Rmd reports on ImmuneSpace.
* Improve testing of both ImmuneSpaceR & ImmuneSpace.