---
title: "Downloading expression matrices using ImmuneSpaceR"
author: "Renan Sauteraud"
date: "`r Sys.Date()`"
output: html_document
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Handling expression matrices with ImmuneSpaceR}
---
```{r knitr, echo = FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
opts_chunk$set(cache = FALSE)
```
```{r netrc_req, echo = FALSE}
# This chunk is only useful for BioConductor checks and shouldn't affect any other setup
ISR_login <- Sys.getenv("ISR_login")
ISR_pwd <- Sys.getenv("ISR_pwd")
if(ISR_login != "" & ISR_pwd != ""){
netrc_file <- tempfile("ImmuneSpaceR_tmp_netrc")
netrc_string <- paste("machine www.immunespace.org login", ISR_login, "password", ISR_pwd)
write(x = netrc_string, file = netrc_file)
labkey.netrc.file <- netrc_file
}
```
This vignette shows detailed examples for the `getGEMatrix` function.
## Contents
1. [Connections](#cons)
1. [List the expression matrices](#matrices)
2. [Download an expression matrix](#download)
3. [Summarized matrices](#summary)
4. [Combining matrices](#multi)
5. [Caching](#caching)
6. [sessionInfo](#sessioninfo)
## Connections
As explained into the user guide vignette, datasets must be downloaded from
`ImmuneSpaceConnection` objects. We must first instantiate a connection to
the study or studies of interest. Throughout this vignette, we will use two
connections, one to a single study, and one to to all available data.
```{r CreateConection, cache=FALSE, message = FALSE}
library(ImmuneSpaceR)
sdy269 <- CreateConnection("SDY269")
all <- CreateConnection("")
```
## List the expression matrices
Now that the connections have been instantiated, we can start downloading from
them. But we need to figure out which processed matrices are available within
our chosen studies.
On www.immunespace.org, in the study of interest or at the project level, the
**Gene expression matrices** table will show the available runs.
Printing the connections will, among other information, list the datasets
availables. The `listDatasets` method will only display the downloadable data.
looking for. With `which = "expression"`, the datasets wont be printed.
```{r listDatasets}
sdy269$listDatasets()
```
Using `wich = "expression"`, we can remove the datasets from the output.
```{r listDatasets-which}
all$listDatasets(which = "expression")
```
Naturally, `all` contains every processed matrices available on ImmuneSpace as
it combines all available studies.
[Back to top](#contents)
## Download
### By run name
The `getGEMatrix` function will accept any of the run names listed in the
connection.
```{r getGEMatrix}
TIV_2008 <- sdy269$getGEMatrix("TIV_2008")
TIV_2011 <- all$getGEMatrix(x = "TIV_2011")
```
The matrices are returned as `ExpressionSet` where the phenoData slot contains
basic demographic information and the featureData slot shows a mapping of probe
to official gene symbols.
```{r ExpressionSet}
TIV_2008
```
### By cohort
The `cohort` argument can be used in place of the run name (`x`). Likewise, the
list of valid cohorts can be found in the Gene expression matrices table.
```{r getGEMatrix-cohorts}
LAIV_2008 <- sdy269$getGEMatrix(cohort = "LAIV group 2008")
```
Note that when cohort is used, `x` is ignored.
[Back to top](#contents)
## Summarized matrices
By default, the returned `ExpressionSet`s have probe names as features (or rows).
However, multiple probes often match the same gene and merging experiments from
different arrays is impossible at feature level.
When they are available, the `summary` argument allows to return the matrices
with gene symbols instead of probes.
```{r summary}
TIV_2008_sum <- sdy269$getGEMatrix("TIV_2008", summary = TRUE)
```
Probes that do not map to a unique gene are removed and expression is averaged
by gene.
```{r summary-print}
TIV_2008_sum
```
[Back to top](#contents)
## Combining matrices
In order to faciliate analysis across experiments and studies, when multiple
runs or cohorts are specified, `getGEMatrix` will attempt to combine the
selected expression matrices into a single `ExpressionSet`.
To avoid returning an empty object, it is usually recommended to use the
summarized version of the matrices, thus combining by genes. This is almost
always necessary when combining data from multiple studies.
```{r multi}
# Within a study
em269 <- sdy269$getGEMatrix(c("TIV_2008", "LAIV_2008"))
# Combining accross studies
TIV_seasons <- all$getGEMatrix(c("TIV_2008", "TIV_2011"), summary = TRUE)
```
## Caching
As explained in the user guide, the `ImmuneSpaceConnection` class is a Reference
class. It means its objects have fields accessed by reference. As a consequence,
they can be modified without making a copy of the entire object.
ImmuneSpaceR uses this feature to store downloaded datasets and expression
matrices. Subsequent calls to `getGEMatrix` with the same input will be faster.
See `?setRefClass` for more information about reference classes.
We can see a list of already downloaded runs and feature sets the `data_cache`
field. This is not intended to be used for data manipulation and only displayed
here to explain what gets cached.
```{r caching-dataset}
names(sdy269$data_cache)
```
If, for any reason, a specific marix needs to be redownloaded, the `reload`
argument will clear the cache for that specific `getGEMatrix` call and download
the file and metadata again.
```{r caching-reload}
TIV_2008 <- sdy269$getGEMatrix("TIV_2008", reload = TRUE)
```
Finally, it is possible to clear every cached expression matrix (and dataset).
```{r caching-clear}
sdy269$clear_cache()
```
Again, the `data.cache` field should never be modified manually. When in doubt,
simply reload the expression matrix.
[Back to top](#contents)
## sessionInfo()
```{r sessionInfo}
sessionInfo()
```