---
title: "BgeeDB: an R package for datasets retrieval from Bgee database"
author: "Andrea Komljenovic, Julien Roux, Marc Robinson-Rechavi, Frederic Bastian"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```BgeeDB``` is a collection of functions to import data from the Bgee database (<http://bgee.org/>) directly into R, and to facilitate downstream analyses, such as gene set enrichment test based on expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from humans and many animals.
 
The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the data reprocessed by the Bgee pipeline. It works for all the species in Bgee. The package also allows to run GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by expression patterns. This is the same as the TopAnat web-service available at (<http://bgee.org/?page=top_anat#/>), but with more flexibility in the choice of parameters and developmental stages, and is based on the ```topGO``` package.

This package allows: 
1. Listing annotation files gene of expression data available in the current version of Bgee database
2. Downloading the processed gene expression data available in the current version of Bgee database
3. Downloading the gene expression calls and annotations and using them to perform TopAnat analyses 

## How to use BgeeDB package

### Running example for Mus musculus

#### List available species in current version of Bgee

``` {r, message = FALSE, warning = FALSE}
library(BgeeDB)
```

```{r}
listBgeeSpecies()
```

#### Choose the species and create a new Bgee species object

From the list above, please choose one species (e.g., "Homo\_sapiens", "Mus\_musculus",...) and platform ("rna\_seq" or "affymetrix").

``` {r}
bgee <- Bgee$new(species = "Mus_musculus", datatype = "rna_seq")
```

#### Retrieve annotation for Mus musculus RNA-seq data

The ```bgee$get_annotation()``` will output the list of experiments and libraries currently available in Bgee for RNA-seq of Mus musculus. The ```bgee$get_annotation()``` loads the annotation in R, but also creates the Mus musculus folder in your current path, where it saves the downloaded annotation locally, so you can use the annotation for later as well.

```{r, eval=FALSE}
# check the path where your folder with annotation will be saved. The folder is named after your chosen species.
getwd()
```

```{r}
annotation_bgee_mouse <- bgee$get_annotation()
# head the experiments and libraries
lapply(annotation_bgee_mouse, head)
```

#### Download the processed RNA-seq data for Mus musculus

The ```bgee$get_data()``` will download read counts and RPKMs for Mus musculus from all available experiments in Bgee database as a list (see below). In case of downloaded data from all experiments for Mus musculus, ```bgee$get_data()``` will save the downloaded data in your current folder for later usage. 

```{r}
# download all RPKMs and counts for Mus musculus
data_bgee_mouse <- bgee$get_data()
```

```{r}
# the number of experiments downloaded from Bgee
length(data_bgee_mouse)
```

```{r}
# see your first experiment
data_bgee_experiment1 <- data_bgee_mouse[[1]]
head(data_bgee_experiment1)
```

Alternatively, you can choose to download only one experiment from Bgee, as in the example below. The data is then saved as a .tsv file in your current folder.

``` {r}
# download RPKMs and counts only for GSE30617 for Mus musculus
data_bgee_mouse_gse30617 <- bgee$get_data(experiment.id = "GSE30617")
```

The data from different samples will be listed in rows, one after the other. It is sometimes easier to work with data organized as a matrix, where different columns represent different samples. To transform the data into a matrix with genes in rows and samples in columns, you can use the ```bgee$format_data()``` function that separates data according to anatomical term. This function also allows to filter out genes that are not called present in a given sample (giving them NA values).

```{r}
# only present calls and rpkm values for GSE30617
gene.expression.mouse.rpkm <- bgee$format_data(data_bgee_mouse_gse30617, calltype = "expressed", stats = "rpkm")
names(gene.expression.mouse.rpkm)
```

```{r}
head(gene.expression.mouse.rpkm$"Ammon's horn")
```

#### Download the data to perform GO-like enrichment test for anatomical terms

The ```loadTopAnatData()``` function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.

```{r}
# Loading calls of expression based on RNA-seq data only
myTopAnatData <- loadTopAnatData(species=10090, datatype="rna_seq")
# Loading calls observed in embryonic stages only
myTopAnatData <- loadTopAnatData(species=10090, datatype="rna_seq", stage="UBERON:0000068")

# Look at the data
lapply(myTopAnatData, head)
```

*Note*: the results are stored in files (see the ```pathToData``` arguments). To save time, if you query again with the exact same parameters, these cache files will be read instead of querying the web-service. So do not delete the files in the working folder if you plan to perform additional queries.

#### Prepare a topGO object allowing to perform GO-like enrichment test for anatomical terms, for Mus musculus

First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a ```topGOdata``` object in the ```topGO``` package: a vector with background genes as names, and 0 or 1 values depending if a gene is in the foreground or not. In this example we will look at genes, annotated with "spermatogenesis" in the Gene Ontology (using the ```biomaRt``` package). 

```{r}
source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)

# Foreground genes are those with a GO annotation "spermatogenesis"
myGenes <- getBM(attributes= "ensembl_gene_id", filters=c("go_id"), values=list(c("GO:0007283")), mart=ensembl)

# Background are all genes with a GO annotation
universe <- getBM(attributes= "ensembl_gene_id", filters=c("with_go_go"), values=list(c(TRUE)), mart=ensembl)

# Prepares the gene list vector 
geneList <- factor(as.integer(universe[,1] %in% myGenes[,1]))
names(geneList) <- universe[,1]
head(geneList)
summary(geneList == 1)

# Prepares the topGO object
myTopAnatObject <-  topAnat(myTopAnatData, geneList)
```

*Warning*: This can be long, especially if the gene list is large, since the anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.

#### Launch an enrichment test for anatomical terms

For this step, see the vignette of the ```topGO``` package for more details, as you have to directly use the tests implemented in the ```topGO``` package, as shown in this example:

```{r}
results <- runTest(myTopAnatObject, algorithm = 'classic', statistic = 'fisher')
# You can also use the topGO decorrelation methods, for example the "weight" method to get less redundant results
results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')
```

*Warning*: This can be long because of the size of the ontology. Consider running a script in batch mode if you have multiple analyses to do.

#### Format the table of results after an enrichment test for anatomical terms

We built the ```makeTable``` function to filter and format the test results. Results are sorted by p-value, and FDR values are calculated. 

```{r}
# Display results sigificant at a 1% FDR threshold
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, 0.01)
# Display all results
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, 1)
```

*Warning*: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the ```topGO``` package.