```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
# The AnnotationHubData Package
**Package**: `r Biocpkg("AnnotationHubData")`
**Authors**: `r packageDescription("AnnotationHubData")[["Author"]] `
**Modified**: 11 October, 2014
**Compiled**: `r date()`
## Overview
The AnnotationHubData package provides tools to acquire, annotate,
convert and store data for use in Bioconductor's AnnotationHub. BED
files from the ENCODE project, GTF files from Ensembl, and annotation
tracks from UCSC are examples of data that can be downloaded,
described with metadata, transformed to standard Bioconductor data
types, and stored so that they may be conveniently served up on demand
to users via the AnnotationHub client.
This document describes the nuts and bolts of how the AnnotationHub
back end works.
## How are recipes created and then run?
For the first part of the description of how to put new data into the
AnnotationHub, please see the vignette titled "How to write recipes
for new resources for the AnnotationHub" in the AnnotationHub package.
That vignette explains how to create a recipe and an
AnnotationHubMetadata generating function.

After those initial two steps, you will have to actually run the
recipe, following the instructions described here.
### Calling the makeAnnotationHubResource helper
Once you have a recipe and an AnnotationHubMetadata generating
function, you will need to call the makeAnnotationHubResource function
to do some automatic setup. This function has only two required
arguments. The first is the name of a class that describes the kind
of resource you are writing code to import. It just needs to be a
unique class name; it will be used internally to create a class for
dispatch. The second argument is the name of your metadata processing
function from the first step in the other vignette. If this function
is namespaced in another package (in other words, if it's a
contributed recipe) you might have to import it for the next step to
work.
```{r, makeAnnotationHubResource, eval=FALSE}
library(AnnotationHubData)
makeAnnotationHubResource("Inparanoid8ImportPreparer",
                          makeinparanoid8ToAHMs)
```
Once you have added this call to AnnotationHubData, the only step
left is to export the class name in the NAMESPACE file (remember that
this is the string you provided as the first argument). Then
reinstall AnnotationHubData and go to the next step.
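The NAMESPACE addition might look like the following sketch (whether
export() or exportClasses() is the right directive depends on how the
class is created; check the existing NAMESPACE for the convention
used):
```{}
exportClasses("Inparanoid8ImportPreparer")
```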
### Testing your new recipe
Next, install the modified AnnotationHubData package and see if the
recipe works correctly. You just need to call the function that will
try to update the resources for your recipe: updateResources(). The
updateResources() function takes several arguments:
* ahroot: the directory where any generated files should go.
* BiocVersion: character vector of Bioconductor versions that your
  objects will be compatible with.
* preparerClasses: character vector of the preparer classes whose
  recipes you want to run.
* insert: try to insert the metadata into the DB? As a recipe author,
  you don't get to do this, so it should be FALSE when you call the
  function or you will get an error.
* metadataOnly: if TRUE, only the metadata objects are generated; the
  recipes are not run to create the processed files.
* filtering: normally TRUE, to indicate that we don't want to
  regenerate AHMs for records that are already in the local cached
  database. However, you may sometimes want to set this to FALSE for
  testing.
```{r, updateResources, eval=FALSE}
library(AnnotationHubData)
mdinp <- updateResources(ahroot = getwd(),
                         BiocVersion = biocVersion(),
                         preparerClasses = "Inparanoid8ImportPreparer",
                         insert = FALSE,
                         metadataOnly = TRUE)
```
When you call updateResources() like this, a set of
AnnotationHubMetadata objects should be returned. Inspect those to
make sure that they contain all the metadata that you intended to
use. If you then run it again with metadataOnly set to FALSE, it
should generate the final objects for you in the ahroot directory
that you specified. If both of these things happen to your
satisfaction, then your recipe is ready to be run.
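A quick sanity check might look like this (a sketch; mdinp is the
value returned by the chunk above):
```{r, inspectAHMs, eval=FALSE}
## How many AnnotationHubMetadata objects were generated?
length(mdinp)
## Print the first one and eyeball its metadata fields
mdinp[[1]]
## Rerun with metadataOnly=FALSE to also generate the files in ahroot
mdinp <- updateResources(ahroot = getwd(),
                         BiocVersion = biocVersion(),
                         preparerClasses = "Inparanoid8ImportPreparer",
                         insert = FALSE,
                         metadataOnly = FALSE)
```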
## How is the back end metadata stored?
When a recipe is run, a lot of metadata is also collected about the
data that is processed by that recipe. This metadata is stored as a
record in a MySQL database back end. When the user runs their client,
this database is downloaded (if needed) as a SQLite dump of that
MySQL back end. The performance benefits of MySQL come into play
whenever we need to upload new records into the back end, but the
portability of a SQLite DB is better for local cached access, so both
kinds of databases are used in this implementation. The 'parent'
database is the MySQL db; the SQLite ones are just cached copies of
it.

Normally when we work with this database we will first do testing on a
local instance on "MachineName" and then update the production machine
later.
To actually push data to this local database you need to set an option
in your .Rprofile indicating where the JSON should be posted. That
should look like this:
```{r, eval=FALSE}
options("AH_SERVER_POST_URL"="http://MachineName:9393/resource")
```
To log in to the database on "MachineName" use this (you will need the
ahuser password):
```{}
mysql -p -u ahuser annotationhub
```
And for production you will need to connect here:
```{}
ssh ubuntu@annotationhub.bioconductor.org
```
And then connect to the database:
```{}
mysql -p -u ahuser annotationhub
```
Whenever you have updated the MySQL back end, a cron job should take
care of creating the locally cached SQLite database from the MySQL
source db. This is done by a ruby script called convert_db.rb. This
file is part of the code that is checked in at the following URL on
github:
```{}
https://github.com/Bioconductor/AnnotationHubServer3.0
```
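The cron job (shown later via crontab -l) runs this script every 10
minutes; a hypothetical entry might look like this (the actual path
and schedule on the server may differ):
```{}
*/10 * * * * cd ~/AnnotationHubServer3.0 && ruby convert_db.rb
```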
### How do we update the production database?
Updating the production server should follow this process:
#### Run the recipe on gamay to generate the files and to insert them into the local copy.
You can run the recipe by calling updateResources with the insert and
metadataOnly arguments set up like this:
```{r, eval=FALSE}
mdinp <- updateResources(ahroot,
                         BiocVersion,
                         preparerClasses = "Inparanoid8ImportPreparer",
                         insert = TRUE,
                         metadataOnly = FALSE)
```
#### Test the AnnotationHub locally by pointing it to gamay
See the section below titled "Final test that it works on gamay".
#### Make a dump of the DB from production and then copy it over to/set it up on gamay.
For this step you need to log in and back up the database on
production. That should look like this (use the ahuser password for
this). Instead of 2014_11_05, use the current date.
```{}
ssh ubuntu@annotationhub.bioconductor.org
cd ~/AnnotationHubDumps/prodDumps/
mysqldump -p -u ahuser annotationhub | gzip > dbdump_2014_11_05_fromProd.sql.gz
```
Then copy it down (from gamay):
```{}
cd /mnt/cpb_anno/mcarlson/AnnotationHubDumps/prodDumps/
scp ubuntu@annotationhub.bioconductor.org:/home/ubuntu/AnnotationHubDumps/prodDumps/dbdump_2014_11_05_fromProd.sql.gz .
```
And then unpack it on gamay (use the MySQL root password for this):
```{}
mysql -p -u root
```
Then in SQL, drop and re-create the DB:
```{}
drop database annotationhub;
create database annotationhub;
\q
```
At this point there is no data in the database, so AnnotationHub is
not working! Be sure to do the next step right away.

Then unpack the dump into the DB:
```{}
zcat dbdump_2014_11_05_fromProd.sql.gz | mysql -p -u root annotationhub
```
#### Rerun the recipe on gamay to insert the rows into a clean copy of the data.
```{r, eval=FALSE}
mdinp <- updateResources(ahroot,
                         BiocVersion,
                         preparerClasses = "Inparanoid8ImportPreparer",
                         insert = TRUE,
                         metadataOnly = FALSE)
```
### Final test that it works on gamay
The first part of this is that the server on gamay has to have
produced the .sqlite version of the MySQL database we just modified.
To make that happen, you can start by impersonating dan (dtenenba) and
going to his directory with the ruby scripts:
```{}
sudo su - dtenenba
cd AnnotationHubServer3.0/
```
From there you can look at his cron job to see how it runs every 10
minutes:
```{}
crontab -l
```
But we don't want to wait ten minutes! So run the following ruby
script to make the DB right now:
```{}
ruby convert_db.rb
exit
```
Sometimes, if you are rolling back, there may be an issue where the
ruby conversion script doesn't run because it thinks the new DB has an
older timestamp. If that happens, remove the timestamp file and run
it again:
```{}
rm dbtimestamp.cache
ruby convert_db.rb
exit
```
Once this is done, you then have to actually test the client and make
sure that your data is in there. This means that you need to launch
AnnotationHub so that it points to gamay for metadata. You can do
that like this:
```{r, test_client, eval=FALSE}
options(ANNOTATION_HUB_URL="http://gamay:9393")
library(AnnotationHub)
ah = AnnotationHub()
length(ah) ## should be larger than you started with
tail(ah) ## should have some new records
```
To manually inspect the metadata you could just do this:
```{r, test_client2, eval=FALSE}
foo = mcols(ah)
```
Then look for your data in 'foo'. It's also a good idea to try to
download that data by its "AHID".
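For example (a sketch; "AH12345" is a placeholder for one of your new
record IDs):
```{r, test_client3, eval=FALSE}
## Fetch one of the new resources by its AHID (placeholder ID)
res <- ah[["AH12345"]]
res
```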
#### Dump the database and copy it to production
```{}
mysqldump -p -u ahuser annotationhub | gzip > dbdump_2014_11_05_fromGamay.sql.gz
```
Then copy it up from gamay back to production:
```{}
scp /mnt/cpb_anno/mcarlson/AnnotationHubDumps/gamayDumps/dbdump_2014_11_05_fromGamay.sql.gz ubuntu@annotationhub.bioconductor.org:/home/ubuntu/AnnotationHubDumps/gamayDumps/
```
And then unpack it (use the MySQL root password for this):
```{}
mysql -p -u root
```
Then in SQL, drop and re-create the DB:
```{}
drop database annotationhub;
create database annotationhub;
\q
```
Then unpack the dump into the DB:
```{}
zcat dbdump_2014_11_05_fromGamay.sql.gz | mysql -p -u root annotationhub
```
## Where is the actual file (cloud) data stored?
Unlike the metadata, the data that is stored after a recipe is run
lives in something that looks like a file system. When testing (on
gamay), this data is stored in /var/FastRWeb/, but our production
server stores and retrieves data from S3. So to get the data moved up
to production you need to have access to these S3 resources.
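If the aws command line tools have not yet been set up with
credentials on the machine you are using, the standard way to do that
(an assumption about your setup, not specific to this server) is:
```{}
aws configure
```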
To copy things up to the S3 buckets you need to use the aws command
line tools. You can see help like this:
```{}
aws s3 help
```
And you can get "copying" help like this:
```{}
aws s3 cp help
```
And you can copy something to be read only by using the --acl argument
like this:
```{}
aws s3 cp --acl public-read hello.txt s3://annotationhub/foo/bar/hello.txt
```
To sign in to the AWS console, go here (use your username and
password; the account is bioconductor):
```{}
https://bioconductor.signin.aws.amazon.com/console
```
And then click 'S3', 'annotationhub' etc.
And to recursively copy back down from the S3 bucket, do this:
```{}
aws s3 cp --dryrun --recursive s3://annotationhub/ensembl/release-75/fasta/ .
```
And then to copy back up you could do it like this:
```{}
aws s3 cp --dryrun --recursive --acl public-read ./fasta/ s3://annotationhub/ensembl/release-75/fasta/
```
But that will copy everything, which is not very efficient since some
things may be the same and already there. So to actually copy back
up, we should really use 'aws s3 sync' instead of 'aws s3 cp'.
```{}
aws s3 sync help
```
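A sync version of the upload above might look like this (a sketch;
sync transfers only the files that differ from what is already in the
bucket):
```{}
aws s3 sync --dryrun --acl public-read ./fasta/ s3://annotationhub/ensembl/release-75/fasta/
```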
## The AnnotationHub Server
The server is implemented in Ruby. Its code is
[here](https://github.com/Bioconductor/AnnotationHubServer3.0)
with a README that explains how to install it.
(more to come...)