---
title: "How to create new _AnnotationHub_ resources"
author: "Marc Carlson, Sonali Arora"
date: "Modified 11 November 2015; compiled `r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('AnnotationHub')`"
vignette: >
  %\VignetteIndexEntry{How to create new AnnotationHub resources}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

# How to write recipes for new _AnnotationHub_ resources

## Overview of the process

This vignette is for those wishing to process online resources into [AnnotationHub][] resources. There are four steps involved in creating resources.

## Setup

This vignette is intended for users who are comfortable with _R_. Be sure to use [Bioc-devel][]. Install the [AnnotationHubData][] package using `biocLite("AnnotationHubData")`.

## Introducing _AnnotationHubMetadata_ Objects

The [AnnotationHubData][] package provides a place where we store code to process online resources into _AnnotationHub_ resources. `AnnotationHubMetadata` objects are used to describe an online resource, including a function ('recipe') that creates the resource.

The steps involved in writing a recipe are:

1. Write a function which takes the metadata about the resource and processes it into `AnnotationHubMetadata` objects.
2. Optional step: Write an additional function specifying how original source files are pre-processed to _R_ or other objects. Not all source files are pre-processed; one might pre-process data to make it easier or faster for the user to access the resource.
3. Optional step: Write a function specifying how a resource is to be read into the user's _R_ session after being downloaded to the user's local cache.

# Step 1: Writing `AnnotationHubMetadata` Generating Functions

The following example function takes files from the latest release of inparanoid and processes them into a list of `AnnotationHubMetadata` objects (e.g., using `Map()`).
The function body can contain hard-coded information, but many recipes will 'scrape' internet resources, creating many _AnnotationHub_ resources; in the example below, `.inparanoidMetadataFromUrl()` visits the inparanoid web site to find current data. The `Recipe` field indicates the function that will perform optional step 2, transforming the original source file(s) to an object that is more convenient for the user to input. This function returns a list of `AnnotationHubMetadata` objects.

```{r exampleInpProcessing}
makeinparanoid8ToAHMs <-
    function(currentMetadata, justRunUnitTests=FALSE, BiocVersion=biocVersion())
{
    baseUrl <- 'http://inparanoid.sbc.su.se/download/current/Orthologs_other_formats'
    ## Make list of metadata in a helper function
    meta <- .inparanoidMetadataFromUrl(baseUrl)
    ## then make AnnotationHubMetadata objects.
    ## Per-resource fields are vectorized over by Map(); fields shared by
    ## every resource are passed once through MoreArgs.
    Map(AnnotationHubMetadata,
        Description=meta$description,
        Genome=meta$genome,
        SourceFile=meta$sourceFile,
        SourceUrl=meta$sourceUrl,
        SourceVersion=meta$sourceVersion,
        Species=meta$species,
        TaxonomyId=meta$taxonomyId,
        Title=meta$title,
        RDataPath=meta$rDataPath,
        MoreArgs=list(
          Coordinate_1_based = TRUE,
          DataProvider = baseUrl,
          Maintainer = "Marc Carlson ",
          RDataClass = "SQLiteFile",
          RDataDateAdded = Sys.time(),
          RDataVersion = "0.0.1",
          Recipe = "AnnotationHubData:::inparanoid8ToDbsRecipe",
          Tags = c("Inparanoid", "Gene", "Homology", "Annotation")))
}
```

Here is a listing of `AnnotationHubMetadata` arguments:

- `AnnotationHubRoot`: 'character(1)' Absolute path to directory structure containing resources to be added to AnnotationHub.
- `SourceUrl`: 'character()' URL where resource(s) can be found.
- `SourceType`: 'character()' Indicates what kind of resource was initially processed. The preference is to name the type of resource if it is a single file type, and to name where the resources came from if it is a compound resource. Typical values are 'BED', 'FASTA', or 'Inparanoid'.
- `SourceVersion`: 'character(1)' Version of original file.
- `SourceLastModifiedDate`: 'POSIXct()' The date when the source was last modified. Leaving this blank should allow the values to be retrieved for you (if your `SourceUrl` is valid).
- `SourceMd5`: 'character()' md5 hash of original file.
- `SourceSize`: 'numeric(1)' Number of bytes in original file.
- `DataProvider`: 'character(1)' Where did this resource come from?
- `Title`: 'character(1)' Title for this resource.
- `Description`: 'character(1)' Description of the resource.
- `Species`: 'character(1)' Species name.
- `TaxonomyId`: 'character(1)' NCBI taxonomy code.
- `Genome`: 'character(1)' Name of genome build.
- `Tags`: 'character()' Free-form tags.
- `Recipe`: 'character(1)' Name of recipe function.
- `RDataClass`: 'character(1)' Class of derived object (e.g. 'GRanges').
- `RDataDateAdded`: 'POSIXct()' Date added to AnnotationHub. Used to determine snapshots.
- `RDataPath`: 'character(1)' File path to serialized form.
- `Maintainer`: 'character(1)' Maintainer name and email address, 'A Maintainer '.
- `BiocVersion`: 'character(1)' _Bioconductor_ version under which the resource was built.
- `Coordinate_1_based`: 'logical(1)' Do coordinates start with 1 or 0?
- `DispatchClass`: 'character(1)' String used to indicate which code should be called by the client when the resource is downloaded. This is often the same as the `RDataClass`, but it is allowed to be a different value so that the client can do something different internally if required.
- `Location_Prefix`: 'character(1)' The base path where the resource is coming from; the default is the _Bioconductor_ _AnnotationHub_ server.
- `Notes`: 'character()' Notes about the resource.

# Step 2: Recipes for Pre-Processing Resources

An (optional) recipe function transforms the source data into an object served by _AnnotationHub_ to the user. It takes a single `AnnotationHubMetadata` object as an argument.
Below is a recipe that generates an inparanoid database object from the metadata stored in its `AnnotationHubMetadata` object. Note that the recipe will be invoked on the hub, so it should output data to the location specified in the input `AnnotationHubMetadata` object.

```{r exampleRecipe}
inparanoid8ToDbsRecipe <- function(ahm)
{
    require(AnnotationForge)
    ## Locate the source files described by the metadata object
    inputFiles <- metadata(ahm)$SourceFile
    ## Build the inparanoid database in a temporary directory and load it
    dbname <- makeInpDb(dir=file.path(inputFiles,""),
                        dataDir=tempdir())
    db <- loadDb(file=dbname)
    ## Write the result to the location recorded in the metadata
    outputPath <- file.path(metadata(ahm)$AnnotationHubRoot,
                            metadata(ahm)$RDataPath)
    saveDb(db, file=outputPath)
    outputFile(ahm)
}
```

## Note for Steps 1 and 2

While writing these functions, take care with the following fields.

Case 1 (common): the _AnnotationHub_ resource is downloaded directly to the user cache without any pre-processing. Then:

1. `SourceUrl` specifies the original resource location, and is equal to `Location_Prefix` + `RDataPath`.
2. `Recipe` = `NA_character_`.

Example:

```{}
SourceUrl = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz",
Location_Prefix = "http://hgdownload.cse.ucsc.edu/",
RDataPath = "goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz",
Recipe = NA_character_
```

Case 2: the _AnnotationHub_ resource requires pre-processing. Then:

1. `SourceUrl` should merely document the original location of the untouched file.
2. `Location_Prefix` + `RDataPath` should be equal to the file path on the Amazon machine where all pre-processed files are stored.
3. `Recipe` = the helper function that specifies how to pre-process the original file.

Example:

```{}
SourceUrl = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz",
Location_Prefix = "http://s3.amazonaws.com/annotationhub/",
RDataPath = "chainfile/dummy.Rda"
```

If this seems confusing, note that in both of these cases `Location_Prefix` + `RDataPath` needs to reflect the location that the resource will actually come from when the client is in use.

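The relationship between these fields in the two cases can be checked with a few lines of base _R_. This is only a minimal sketch: `resourceLocation()` and `checkLocationFields()` are hypothetical helpers written for illustration and are not part of [AnnotationHubData][], and the recipe name used for Case 2 is made up.

```r
## Hypothetical helper (not an AnnotationHubData function): the location
## the client downloads from is Location_Prefix followed by RDataPath.
resourceLocation <- function(Location_Prefix, RDataPath) {
    paste0(Location_Prefix, RDataPath)
}

## Hypothetical check of the rule described above.
checkLocationFields <- function(SourceUrl, Location_Prefix, RDataPath,
                                Recipe = NA_character_) {
    location <- resourceLocation(Location_Prefix, RDataPath)
    if (is.na(Recipe)) {
        ## Case 1: no recipe, so the client fetches the original file and
        ## SourceUrl must equal Location_Prefix + RDataPath.
        identical(SourceUrl, location)
    } else {
        ## Case 2: a recipe writes a pre-processed file elsewhere;
        ## SourceUrl only documents the origin, so nothing is compared.
        TRUE
    }
}

src <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz"

## Case 1, using the values from the first example
case1 <- checkLocationFields(
    SourceUrl = src,
    Location_Prefix = "http://hgdownload.cse.ucsc.edu/",
    RDataPath = "goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz")

## Case 2, using the values from the second example; the recipe name is
## a placeholder, not a real function.
case2Location <- resourceLocation("http://s3.amazonaws.com/annotationhub/",
                                  "chainfile/dummy.Rda")
case2 <- checkLocationFields(
    SourceUrl = src,
    Location_Prefix = "http://s3.amazonaws.com/annotationhub/",
    RDataPath = "chainfile/dummy.Rda",
    Recipe = "myPackage:::hypotheticalChainRecipe")
```

A check of this kind could be run over the list returned by a metadata generating function before submitting a recipe, assuming the field values are first extracted from each object.
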

# Step 3: Post-Processing Resources

One can post-process a file when it is instantiated into _AnnotationHub_ from the user's cache. An example would be a BED file that is downloaded to the user's cache but input into _R_ as a `GRanges` using `rtracklayer::import`. Implement this by defining a class that extends `AnnotationHubResource` and that implements a `.get1()` method.

```{r, eval=FALSE}
setClass("BEDFileResource", contains="AnnotationHubResource")

setMethod(".get1", "BEDFileResource",
    function(x, ...)
{
    .require("rtracklayer")
    yy <- getHub(x)
    ## Wrap the cached file so that import() treats it as BED
    dat <- rtracklayer::BEDFile(cache(yy))
    rtracklayer::import(dat, format="bed", genome=yy$genome, ...)
})
```

The class and method definition typically need to be added to [AnnotationHub][], and require coordination with us.

# Step 4: Testing and Next Steps

At this point, make sure that the `AnnotationHubMetadata` generating function produces a list of `AnnotationHubMetadata` objects and that the recipe (if needed) produces an appropriate output path. Contact us to add your recipe to the production hub.

# Session Information

```{r SessionInfo, echo=FALSE}
sessionInfo()
```

[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[AnnotationHubData]: https://bioconductor.org/packages/AnnotationHubData
[Bioc-devel]: http://bioconductor.org/developers/how-to/useDevel