When using the first version of metaseqR, one had to either embed the annotation to the gene/exon/3’ UTR counts, or to download and construct on-the-fly the required annotation when starting from BAM files. Although the counting and gene model construction results could (anf still can) be saved and re-used with other analysis parameters changed (e.g. statistical algorithms), one could not easily add for example new data to an existing dataset without re-running the whole pipeline and re-downloading annotation. On top of that, many times, the main Ensembl servers (when using Ensembl annotations) do not respond well to biomaRt calls, so the whole pipeline may stall until the servers are back.
Another main issue with the annotation used by metaseqR was that there was no straightforward way, provided by metaseqR, to archive and version the annotation used by a specific analysis and was up to the user to take care of reproducibility at this level. Furthermore, there was no straightforward way for a user to plugin own annotation elements (e.g. in the form of a GTF file) and use it in the same manner as standard annotations supported by metaseqR, e.g. when analyzing data from not-so-often studied organisms such as insects. Plugging-in own annotation was possible but usually a painful procedure, which has become now very easy.
The annotation database builder for metaseqR2 remedies the above situations. The
buildAnnotationDatabase
function should be run once with the organisms one
requires to have locally to work with and then that’s it! Of course you can
manage your database by adding and removing specific annotations (and you even
can play with an SQLite browser, although not advised, as the database structure
is rather simple). Furthermore, you can use the metaseqR2 annotation database
and management mechanism for any other type of analysis where you require to
have a simple tab-delimited annotation file, acquired with very little effort.
The following organisms (essentially genome versions) are supported for automatic database builds:
To install the metaseqR2 package, start R and enter:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("metaseqR2")
By default, the database file will be written in the
system.file(package="metaseqR2")
directory. You can specify another prefered
destination for it using the db
argument in the function call, but if you do
that, you will have to supply the localDb
argument pointing to the SQLite
database file you created to every metaseqr2 call you perform, otherwise, the
pipeline will download and use annotations on-the-fly.
In this vignette, we will build a minimal database comprising only the mouse
mm9 genome version from Ensembl. The database will be build in a temporary
directory inside session tempdir()
.
Important note: As the annotation build function makes use of Kent utilities for creating 3’UTR annotations from RefSeq and UCSC, the latter cannot be built in Windows. Therefore it is advised to either build the annotation database in a Linux system or use our pre-built databases.
library(metaseqR2)
buildDir <- file.path(tempdir(),"test_anndb")
dir.create(buildDir)
# The location of the custom database
myDb <- file.path(buildDir,"testann.sqlite")
# Since we are using Ensembl, we can also ask for a version
organisms <- list(mm9=67)
sources <- ifelse(.Platform$OS.type=="unix",c("ensembl","refseq"),"ensembl")
# If the example is not running in a multicore system, rc is ignored
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,db=myDb,rc=0.5)
## Opening metaseqR2 SQLite database /tmp/Rtmpu5V6f6/test_anndb/testann.sqlite
## Retrieving genome information for mm9 from ensembl
## Retrieving gene annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Retrieving transcript annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Merging transcripts for mm9 from ensembl version 67
## Retrieving 3' UTR annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Merging gene 3' UTRs for mm9 from ensembl version 67
## Merging transcript 3' UTRs for mm9 from ensembl version 67
## Retrieving exon annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Merging exons for mm9 from ensembl version 67
Now, that a small database is in place, let’s retrieve some data. Remember that
since the built database is not in the default location, we need to pass the
database file in each data retrieval function. The annotation is retrieved as
a GRanges
object by default.
# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="gene",
db=myDb)
genes
## GRanges object with 37583 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000090025 chr1 3044314-3044814 + | ENSMUSG00000090025
## ENSMUSG00000064842 chr1 3092097-3092206 + | ENSMUSG00000064842
## ENSMUSG00000051951 chr1 3195982-3661579 - | ENSMUSG00000051951
## ENSMUSG00000089699 chr1 3456668-3503634 + | ENSMUSG00000089699
## ENSMUSG00000088390 chr1 3668961-3669024 - | ENSMUSG00000088390
## ... ... ... ... . ...
## ENSMUSG00000052831 chrY 2086590-2097768 + | ENSMUSG00000052831
## ENSMUSG00000069031 chrY 2118049-2129045 + | ENSMUSG00000069031
## ENSMUSG00000071960 chrY 2156899-2168120 + | ENSMUSG00000071960
## ENSMUSG00000091987 chrY 2390390-2398856 + | ENSMUSG00000091987
## ENSMUSG00000090600 chrY 2550262-2552957 + | ENSMUSG00000090600
## gc_content gene_name biotype
## <numeric> <character> <character>
## ENSMUSG00000090025 48.30 Gm16088 pseudogene
## ENSMUSG00000064842 36.36 U6 snRNA
## ENSMUSG00000051951 38.51 Xkr4 protein_coding
## ENSMUSG00000089699 39.27 Gm1992 antisense
## ENSMUSG00000088390 34.38 U7 snRNA
## ... ... ... ...
## ENSMUSG00000052831 36.60 Rbmy1a1 protein_coding
## ENSMUSG00000069031 37.06 Gm10256 protein_coding
## ENSMUSG00000071960 37.12 Gm10352 protein_coding
## ENSMUSG00000091987 34.84 Gm3376 protein_coding
## ENSMUSG00000090600 44.40 Gm3395 protein_coding
## -------
## seqinfo: 21 sequences from mm9 genome
# Load standard annotation based on 3' UTR coordinates
utrs <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="utr",
db=myDb)
utrs
## GRanges object with 161406 ranges and 4 metadata columns:
## seqnames ranges strand | transcript_id
## <Rle> <IRanges> <Rle> | <character>
## ENSMUST00000160944 chr1 3044815-3045092 + | ENSMUST00000160944
## ENSMUST00000082908 chr1 3092207-3092484 + | ENSMUST00000082908
## ENSMUST00000162897 chr1 3195704-3195981 - | ENSMUST00000162897
## ENSMUST00000159265 chr1 3196326-3196603 - | ENSMUST00000159265
## ENSMUST00000070533 chr1 3204285-3204562 - | ENSMUST00000070533
## ... ... ... ... . ...
## ENSMUST00000078383 chrY 2167760-2168120 + | ENSMUST00000078383
## ENSMUST00000078383 chrY 2168121-2168398 + | ENSMUST00000078383
## ENSMUST00000166474 chrY 2398496-2398856 + | ENSMUST00000166474
## ENSMUST00000166474 chrY 2398857-2399134 + | ENSMUST00000166474
## ENSMUST00000172100 chrY 2552084-2552957 + | ENSMUST00000172100
## gene_id gene_name biotype
## <character> <character> <character>
## ENSMUST00000160944 ENSMUSG00000090025 Gm16088 pseudogene
## ENSMUST00000082908 ENSMUSG00000064842 U6 snRNA
## ENSMUST00000162897 ENSMUSG00000051951 Xkr4 protein_coding
## ENSMUST00000159265 ENSMUSG00000051951 Xkr4 protein_coding
## ENSMUST00000070533 ENSMUSG00000051951 Xkr4 protein_coding
## ... ... ... ...
## ENSMUST00000078383 ENSMUSG00000071960 Gm10352 protein_coding
## ENSMUST00000078383 ENSMUSG00000071960 Gm10352 protein_coding
## ENSMUST00000166474 ENSMUSG00000091987 Gm3376 protein_coding
## ENSMUST00000166474 ENSMUSG00000091987 Gm3376 protein_coding
## ENSMUST00000172100 ENSMUSG00000090600 Gm3395 protein_coding
## -------
## seqinfo: 21 sequences from mm9 genome
# Load summarized exon annotation based used with RNA-Seq analysis
sumEx <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="exon",
summarized=TRUE,db=myDb)
sumEx
## GRanges object with 247808 ranges and 4 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSMUSG00000090025_MEX_1 chr1 3044314-3044814 + |
## ENSMUSG00000064842_MEX_1 chr1 3092097-3092206 + |
## ENSMUSG00000051951_MEX_1 chr1 3195982-3197398 - |
## ENSMUSG00000051951_MEX_2 chr1 3203520-3207049 - |
## ENSMUSG00000051951_MEX_3 chr1 3411783-3411982 - |
## ... ... ... ... .
## ENSMUSG00000091987_MEX_5 chrY 2394921-2395029 + |
## ENSMUSG00000091987_MEX_6 chrY 2396398-2396478 + |
## ENSMUSG00000091987_MEX_7 chrY 2397909-2397997 + |
## ENSMUSG00000091987_MEX_8 chrY 2398179-2398856 + |
## ENSMUSG00000090600_MEX_1 chrY 2550262-2552957 + |
## exon_id gene_id
## <character> <character>
## ENSMUSG00000090025_MEX_1 ENSMUSG00000090025_MEX_1 ENSMUSG00000090025
## ENSMUSG00000064842_MEX_1 ENSMUSG00000064842_MEX_1 ENSMUSG00000064842
## ENSMUSG00000051951_MEX_1 ENSMUSG00000051951_MEX_1 ENSMUSG00000051951
## ENSMUSG00000051951_MEX_2 ENSMUSG00000051951_MEX_2 ENSMUSG00000051951
## ENSMUSG00000051951_MEX_3 ENSMUSG00000051951_MEX_3 ENSMUSG00000051951
## ... ... ...
## ENSMUSG00000091987_MEX_5 ENSMUSG00000091987_MEX_5 ENSMUSG00000091987
## ENSMUSG00000091987_MEX_6 ENSMUSG00000091987_MEX_6 ENSMUSG00000091987
## ENSMUSG00000091987_MEX_7 ENSMUSG00000091987_MEX_7 ENSMUSG00000091987
## ENSMUSG00000091987_MEX_8 ENSMUSG00000091987_MEX_8 ENSMUSG00000091987
## ENSMUSG00000090600_MEX_1 ENSMUSG00000090600_MEX_1 ENSMUSG00000090600
## gene_name biotype
## <character> <character>
## ENSMUSG00000090025_MEX_1 Gm16088 pseudogene
## ENSMUSG00000064842_MEX_1 U6 snRNA
## ENSMUSG00000051951_MEX_1 Xkr4 protein_coding
## ENSMUSG00000051951_MEX_2 Xkr4 protein_coding
## ENSMUSG00000051951_MEX_3 Xkr4 protein_coding
## ... ... ...
## ENSMUSG00000091987_MEX_5 Gm3376 protein_coding
## ENSMUSG00000091987_MEX_6 Gm3376 protein_coding
## ENSMUSG00000091987_MEX_7 Gm3376 protein_coding
## ENSMUSG00000091987_MEX_8 Gm3376 protein_coding
## ENSMUSG00000090600_MEX_1 Gm3395 protein_coding
## -------
## seqinfo: 21 sequences from mm9 genome
# Load standard annotation based on gene body coordinates from RefSeq
if (.Platform$OS.type=="unix") {
refGenes <- loadAnnotation(genome="mm9",refdb="refseq",level="gene",
type="gene",db=myDb)
refGenes
}
## Getting latest annotation on the fly for mm9 from refseq
## Retrieving genome information for mm9 from refseq
## Retrieving latest gene annotation for mm9 from refseq
## Loading required namespace: RMySQL
## Converting annotation to GenomicRanges object...
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Getting DNA sequences...
## Getting GC content...
## GRanges object with 20956 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id gc_content
## <Rle> <IRanges> <Rle> | <character> <numeric>
## NM_001011874 chr1 3204562-3661579 - | NM_001011874 38.56
## NM_011283 chr1 4333587-4350395 - | NM_011283 38.73
## NM_011441 chr1 4481008-4487435 - | NM_011441 49.75
## NM_001177658 chr1 4763278-4775807 - | NM_001177658 42.59
## NM_008866 chr1 4797903-4836816 + | NM_008866 40.99
## ... ... ... ... . ... ...
## NM_148943 chrY 635403-796225 - | NM_148943 36.39
## NM_009571 chrY 1362122-1426357 - | NM_009571 36.58
## NM_011564 chrY 1918380-1919568 - | NM_011564 55.17
## NM_001177569 chrY 1976248-1976584 - | NM_001177569 49.55
## NM_001166384 chrY 2156898-2168120 + | NM_001166384 37.13
## gene_name biotype
## <character> <character>
## NM_001011874 Xkr4 protein_coding
## NM_011283 Rp1 protein_coding
## NM_011441 Sox17 protein_coding
## NM_001177658 Mrpl15 protein_coding
## NM_008866 Lypla1 protein_coding
## ... ... ...
## NM_148943 Usp9y protein_coding
## NM_009571 Zfy2 protein_coding
## NM_011564 Sry protein_coding
## NM_001177569 H2al2c protein_coding
## NM_001166384 Rbmy protein_coding
## -------
## seqinfo: 21 sequences (21 circular) from mm9 genome
Or as a data frame if you prefer using asdf=TRUE
. The data frame however does
not contain metadata like Seqinfo
to be used for any susequent validations:
# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="gene",
db=myDb,asdf=TRUE)
head(genes)
## chromosome start end gene_id gc_content strand gene_name
## 1 chr1 3044314 3044814 ENSMUSG00000090025 48.30 + Gm16088
## 2 chr1 3092097 3092206 ENSMUSG00000064842 36.36 + U6
## 3 chr1 3195982 3661579 ENSMUSG00000051951 38.51 - Xkr4
## 4 chr1 3456668 3503634 ENSMUSG00000089699 39.27 + Gm1992
## 5 chr1 3668961 3669024 ENSMUSG00000088390 34.38 - U7
## 6 chr1 3773818 3773879 ENSMUSG00000089420 38.71 - U7
## biotype
## 1 pseudogene
## 2 snRNA
## 3 protein_coding
## 4 antisense
## 5 snRNA
## 6 snRNA
Apart from the supported organisms and databases, you can add a custom annotation. Such an annotation can be:
This can be achieved through the usage of
GTF files, along with
some simple metadata that you have to provide for proper import to the
annotation database. This can be achieved through the usage of the
buildCustomAnnotation
function. Details on required metadata can be found
in the function’s help page.
Important note: Please note that importing a custom genome annotation
directly from UCSC (UCSC SQL database dumps) is not supported in Windows as the
process involves using the genePredToGtf
which is not available for Windows.
Let’s try a couple of exammples. The first one is a custom annotation for the Ebola virus from UCSC:
# Setup a temporary directory to download files etc.
customDir <- file.path(tempdir(),"test_custom")
dir.create(customDir)
# Convert from GenePred to GTF - Unix/Linux only!
if (.Platform$OS.type == "unix" && !grepl("^darwin",R.version$os)) {
# Download data from UCSC
goldenPath="http://hgdownload.cse.ucsc.edu/goldenPath/"
# Gene annotation dump
download.file(paste0(goldenPath,"eboVir3/database/ncbiGene.txt.gz"),
file.path(customDir,"eboVir3_ncbiGene.txt.gz"))
# Chromosome information
download.file(paste0(goldenPath,"eboVir3/database/chromInfo.txt.gz"),
file.path(customDir,"eboVir3_chromInfo.txt.gz"))
# Prepare the build
chromInfo <- read.delim(file.path(customDir,"eboVir3_chromInfo.txt.gz"),
header=FALSE)
chromInfo <- chromInfo[,1:2]
rownames(chromInfo) <- as.character(chromInfo[,1])
chromInfo <- chromInfo[,2,drop=FALSE]
# Coversion from genePred to GTF
genePredToGtf <- file.path(customDir,"genePredToGtf")
if (!file.exists(genePredToGtf)) {
download.file(
"http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
genePredToGtf
)
system(paste("chmod 775",genePredToGtf))
}
gtfFile <- file.path(customDir,"eboVir3.gtf")
tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
"tgtf",sep="."))
command <- paste0(
"zcat ",file.path(customDir,"eboVir3_ncbiGene.txt.gz"),
" | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
" -source=eboVir3"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
gtfFile
)
system(command)
# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
gtfFile=gtfFile,
metadata=list(
organism="eboVir3_test",
source="ucsc_test",
chromInfo=chromInfo
),
db=myDb
)
# Try to retrieve some data
eboGenes <- loadAnnotation(genome="eboVir3_test",refdb="ucsc_test",
level="gene",type="gene",db=myDb)
eboGenes
}
## Opening metaseqR2 SQLite database /tmp/Rtmpu5V6f6/test_anndb/testann.sqlite
## Importing GTF /tmp/Rtmpu5V6f6/test_custom/eboVir3.gtf as GTF to make id map
## Making id map
## Importing GTF /tmp/Rtmpu5V6f6/test_custom/eboVir3.gtf as TxDb
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Retrieving gene annotation for eboVir3_test from ucsc_test version 20200430 from /tmp/Rtmpu5V6f6/test_custom/eboVir3.gtf
## Retrieving transcript annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving summarized transcript annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving 3' UTR annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving summarized 3' UTR annotation per gene for eboVir3_test from ucsc_test version 20200430
## summarizing UTRs per gene for imported GTF
## Retrieving summarized 3' UTR annotation per transcript for eboVir3_test from ucsc_test version 20200430
## summarizing UTRs per gene for imported GTF
## Retrieving exon annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving summarized exon annotation for eboVir3_test from ucsc_test version 20200430
## summarizing exons per gene for imported GTF
## GRanges object with 9 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id gc_content gene_name
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## NP KM034562v1 56-3026 + | NP 50 NP
## VP35 KM034562v1 3032-4407 + | VP35 50 VP35
## VP40 KM034562v1 4390-5894 + | VP40 50 VP40
## GP KM034562v1 5900-8305 + | GP 50 GP
## sGP KM034562v1 5900-8305 + | sGP 50 sGP
## ssGP KM034562v1 5900-8305 + | ssGP 50 ssGP
## VP30 KM034562v1 8288-9740 + | VP30 50 VP30
## VP24 KM034562v1 9885-11518 + | VP24 50 VP24
## L KM034562v1 11501-18282 + | L 50 L
## biotype
## <character>
## NP gene
## VP35 gene
## VP40 gene
## GP gene
## sGP gene
## ssGP gene
## VP30 gene
## VP24 gene
## L gene
## -------
## seqinfo: 1 sequence from eboVir3_test genome
Another example, the Atlantic cod from UCSC. The same things apply for the operating system.
if (.Platform$OS.type == "unix") {
# Gene annotation dump
download.file(paste0(goldenPath,"gadMor1/database/augustusGene.txt.gz"),
file.path(customDir,"gadMori1_augustusGene.txt.gz"))
# Chromosome information
download.file(paste(goldenPath,"gadMor1/database/chromInfo.txt.gz",sep=""),
file.path(customDir,"gadMori1_chromInfo.txt.gz"))
# Prepare the build
chromInfo <- read.delim(file.path(customDir,"gadMori1_chromInfo.txt.gz"),
header=FALSE)
chromInfo <- chromInfo[,1:2]
rownames(chromInfo) <- as.character(chromInfo[,1])
chromInfo <- chromInfo[,2,drop=FALSE]
# Coversion from genePred to GTF
genePredToGtf <- file.path(customDir,"genePredToGtf")
if (!file.exists(genePredToGtf)) {
download.file(
"http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
genePredToGtf
)
system(paste("chmod 775",genePredToGtf))
}
gtfFile <- file.path(customDir,"gadMori1.gtf")
tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
"tgtf",sep="."))
command <- paste0(
"zcat ",file.path(customDir,"gadMori1_augustusGene.txt.gz"),
" | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
" -source=gadMori1"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
gtfFile
)
system(command)
# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
gtfFile=gtfFile,
metadata=list(
organism="gadMor1_test",
source="ucsc_test",
chromInfo=chromInfo
),
db=myDb
)
# Try to retrieve some data
gadGenes <- loadAnnotation(genome="gadMor1_test",refdb="ucsc_test",
level="gene",type="gene",db=myDb)
gadGenes
}
Another example, Armadillo from Ensembl. This should work irrespectively of operating system. We are downloading chromosomal information from UCSC.
# Gene annotation dump from Ensembl
download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/",
"dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"))
# Chromosome information will be provided from the following BAM file
# available from Ensembl. We have noticed that when using Windows as the OS,
# a remote BAM files cannot be opened by scanBamParam, so for this example,
# chromosome length information will not be available when running in Windows.
bamForInfo <- NULL
if (.Platform$OS.type == "unix")
bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/",
"dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam")
# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
metadata=list(
organism="dasNov3_test",
source="ensembl_test",
chromInfo=bamForInfo
),
db=myDb
)
## Opening metaseqR2 SQLite database /tmp/Rtmpu5V6f6/test_anndb/testann.sqlite
## Importing GTF /tmp/Rtmpu5V6f6/test_custom/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz as GTF to make id map
## Making id map
## Importing GTF /tmp/Rtmpu5V6f6/test_custom/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz as TxDb
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Retrieving gene annotation for dasNov3_test from ensembl_test version 20200430 from /tmp/Rtmpu5V6f6/test_custom/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz
## Retrieving transcript annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving summarized transcript annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving 3' UTR annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving summarized 3' UTR annotation per gene for dasNov3_test from ensembl_test version 20200430
## summarizing UTRs per gene for imported GTF
## Retrieving summarized 3' UTR annotation per transcript for dasNov3_test from ensembl_test version 20200430
## summarizing UTRs per gene for imported GTF
## Retrieving exon annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving summarized exon annotation for dasNov3_test from ensembl_test version 20200430
## summarizing exons per gene for imported GTF
# Try to retrieve some data
dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test",
level="gene",type="gene",db=myDb)
dasGenes
## GRanges object with 33374 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSDNOG00000040597 AAGV03000462.1 666-1525 + | ENSDNOG00000040597
## ENSDNOG00000040886 AAGV03001531.1 3-1269 - | ENSDNOG00000040886
## ENSDNOG00000049893 AAGV03002204.1 21-1190 - | ENSDNOG00000049893
## ENSDNOG00000042161 AAGV03003272.1 56-996 - | ENSDNOG00000042161
## ENSDNOG00000037889 AAGV03003599.1 30-296 + | ENSDNOG00000037889
## ... ... ... ... . ...
## ENSDNOG00000040274 NC_001821.1 13570-14097 - | ENSDNOG00000040274
## ENSDNOG00000033793 NC_001821.1 14100-14168 - | ENSDNOG00000033793
## ENSDNOG00000048587 NC_001821.1 14171-15310 + | ENSDNOG00000048587
## ENSDNOG00000037257 NC_001821.1 15312-15379 + | ENSDNOG00000037257
## ENSDNOG00000041926 NC_001821.1 15382-15452 - | ENSDNOG00000041926
## gc_content gene_name biotype
## <numeric> <character> <character>
## ENSDNOG00000040597 50 ENSDNOG00000040597 protein_coding
## ENSDNOG00000040886 50 ENSDNOG00000040886 protein_coding
## ENSDNOG00000049893 50 ENSDNOG00000049893 lincRNA
## ENSDNOG00000042161 50 ENSDNOG00000042161 protein_coding
## ENSDNOG00000037889 50 ENSDNOG00000037889 pseudogene
## ... ... ... ...
## ENSDNOG00000040274 50 ND6 protein_coding
## ENSDNOG00000033793 50 ENSDNOG00000033793 Mt_tRNA
## ENSDNOG00000048587 50 CYTB protein_coding
## ENSDNOG00000037257 50 ENSDNOG00000037257 Mt_tRNA
## ENSDNOG00000041926 50 ENSDNOG00000041926 Mt_tRNA
## -------
## seqinfo: 4823 sequences from dasNov3_test genome
A quite complete build (with latest versions of Ensembl annotations) would look like (supposing the default annotation database location):
organisms <- list(
hg18=67,
hg19=75,
hg38=97:98,
mm9=67,
mm10=97:98,
rn5=79,
rn6=97:98,
dm3=78,
dm6=97:98,
danrer7=79,
danrer10=91,
danrer11=97:98,
pantro4=90,
pantro5=97:98,
susscr3=89,
susscr11=97:98,
equcab2=97:98
)
sources <- c("ensembl","ucsc","refseq")
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,rc=0.5)
The aforementioned complete built can be found here Complete builts will become available from time to time (e.g. with every new Ensembl relrase) for users who do not wish to create annotation databases on their own. Root access may be required (depending on the metaseqR2 library location) to place it in the default location where it can be found automatically.
If for some reason you do not want to build and use an annotation database for
metaseqR2 analyses (not recommended) or you wish to perform an analysis with an
organism that does not yet exist in the database, the loadAnnotation
function
will perform all required actions (download and create a GRanges
object)
on-the-fly as long as there is an internet connection.
However, the above function does not handle custom annotations in GTF files.
In a scenario where you want to use a custom annotation only once, you should
supply the annotation
argument to the metaseqr2
function, which is almost
the same as the metadata
argument used in buildCustomAnnotation
, actually
augmented by a list member for the GTF
file, that is:
annotation <- list(
gtf="PATH_TO_GTF",
organism="ORGANISM_NAME",
source="SOURCE_NAME",
chromInfo="CHROM_INFO"
)
The above argument can be passed to the metaseqr2 call in the respective position.
For further details about custom annotations on the fly, please check
buildCustomAnnotation
and importCustomAnnotation
functions.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.56.0
## [3] rtracklayer_1.48.0 Biostrings_2.56.0
## [5] XVector_0.28.0 metaseqR2_1.0.11
## [7] locfit_1.5-9.4 limma_3.44.1
## [9] DESeq2_1.28.0 SummarizedExperiment_1.18.0
## [11] DelayedArray_0.14.0 matrixStats_0.56.0
## [13] Biobase_2.48.0 GenomicRanges_1.40.0
## [15] GenomeInfoDb_1.24.0 IRanges_2.22.1
## [17] S4Vectors_0.26.0 BiocGenerics_0.34.0
## [19] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] R.utils_2.9.2 tidyselect_1.0.0
## [3] heatmaply_1.1.0 RSQLite_2.2.0
## [5] AnnotationDbi_1.50.0 htmlwidgets_1.5.1
## [7] grid_4.0.0 TSP_1.1-10
## [9] BiocParallel_1.22.0 baySeq_2.22.0
## [11] DESeq_1.40.0 munsell_0.5.0
## [13] codetools_0.2-16 preprocessCore_1.50.0
## [15] DT_0.13 colorspace_1.4-1
## [17] knitr_1.28 GenomeInfoDbData_1.2.3
## [19] hwriter_1.3.2 bit64_0.9-7
## [21] rhdf5_2.32.0 vctrs_0.2.4
## [23] lambda.r_1.2.4 xfun_0.13
## [25] BiocFileCache_1.12.0 EDASeq_2.22.0
## [27] gclus_1.3.2 R6_2.4.1
## [29] rmdformats_0.3.7 seriation_1.2-8
## [31] NBPSeq_0.3.0 bitops_1.0-6
## [33] assertthat_0.2.1 scales_1.1.0
## [35] bsseq_1.24.0 gtable_0.3.0
## [37] affy_1.66.0 log4r_0.3.2
## [39] rlang_0.4.5 genefilter_1.70.0
## [41] lazyeval_0.2.2 DSS_2.36.0
## [43] BiocManager_1.30.10 yaml_2.2.1
## [45] reshape2_1.4.4 abind_1.4-5
## [47] GenomicFeatures_1.40.0 qvalue_2.20.0
## [49] RMySQL_0.10.20 tools_4.0.0
## [51] lava_1.6.7 bookdown_0.18
## [53] ggplot2_3.3.0 affyio_1.58.0
## [55] ellipsis_0.3.0 gplots_3.0.3
## [57] RColorBrewer_1.1-2 Rcpp_1.0.4.6
## [59] plyr_1.8.6 progress_1.2.2
## [61] zlibbioc_1.34.0 purrr_0.3.4
## [63] RCurl_1.98-1.2 prettyunits_1.1.1
## [65] openssl_1.4.1 viridis_0.5.1
## [67] zoo_1.8-7 cluster_2.1.0
## [69] magrittr_1.5 data.table_1.12.8
## [71] futile.options_1.0.1 survcomp_1.38.0
## [73] aroma.light_3.18.0 hms_0.5.3
## [75] evaluate_0.14 xtable_1.8-4
## [77] XML_3.99-0.3 VennDiagram_1.6.20
## [79] jpeg_0.1-8.1 gridExtra_2.3
## [81] compiler_4.0.0 biomaRt_2.44.0
## [83] tibble_3.0.1 KernSmooth_2.23-17
## [85] crayon_1.3.4 R.oo_1.23.0
## [87] htmltools_0.4.0 tidyr_1.0.2
## [89] geneplotter_1.66.0 DBI_1.1.0
## [91] SuppDists_1.1-9.5 formatR_1.7
## [93] corrplot_0.84 dbplyr_1.4.3
## [95] MASS_7.3-51.6 rappdirs_0.3.1
## [97] ShortRead_1.46.0 Matrix_1.2-18
## [99] rmeta_3.0 permute_0.9-5
## [101] vsn_3.56.0 R.methodsS3_1.8.0
## [103] gdata_2.18.0 pkgconfig_2.0.3
## [105] GenomicAlignments_1.24.0 registry_0.5-1
## [107] plotly_4.9.2.1 foreach_1.5.0
## [109] annotate_1.66.0 webshot_0.5.2
## [111] prodlim_2019.11.13 stringr_1.4.0
## [113] digest_0.6.25 rmarkdown_2.1
## [115] dendextend_1.13.4 edgeR_3.30.0
## [117] DelayedMatrixStats_1.10.0 curl_4.3
## [119] Rsamtools_2.4.0 gtools_3.8.2
## [121] NOISeq_2.32.0 lifecycle_0.2.0
## [123] jsonlite_1.6.1 ABSSeq_1.42.0
## [125] Rhdf5lib_1.10.0 survivalROC_1.0.3
## [127] futile.logger_1.4.3 viridisLite_0.3.0
## [129] askpass_1.1 pillar_1.4.3
## [131] lattice_0.20-41 httr_1.4.1
## [133] survival_3.1-12 glue_1.4.0
## [135] png_0.1-7 iterators_1.0.12
## [137] pander_0.6.3 bit_1.1-15.2
## [139] stringi_1.4.6 HDF5Array_1.16.0
## [141] bootstrap_2019.6 blob_1.2.1
## [143] latticeExtra_0.6-29 caTools_1.18.0
## [145] memoise_1.1.0 dplyr_0.8.5