The Organism.dplyr creates an on disk sqlite database to hold data of an organism combined from an ‘org’ package (e.g., org.Hs.eg.db) and a genome coordinate functionality of the ‘TxDb’ package (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene). It aims to provide an integrated presentation of identifiers and genomic coordinates. And a src_organism object is created to point to the database.
The src_organism object is created as an extension of src_sql and src_sqlite from dplyr, which inherited all dplyr methods. It also implements the select() interface from AnnotationDbi and genomic coordinates extractors from GenomicFeatures.
The src_organism() constructor creates an on disk sqlite database file with data from a given ‘TxDb’ package and corresponding ‘org’ package. When dbpath is given, file is created at the given path, otherwise temporary file is created.
library(Organism.dplyr)
Running src_organism() without a given path will save the sqlite file to a tempdir():
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")
Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date.
path <- "path/to/my.sqlite"
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path)
supportedOrganisms() provides a list of organisms with corresponding ‘org’ and ‘TxDb’ packages being supported.
supportedOrganisms()
## # A tibble: 21 x 3
##                   organism        OrgDb
##                      <chr>        <chr>
##  1              Bos taurus org.Bt.eg.db
##  2  Caenorhabditis elegans org.Ce.eg.db
##  3  Caenorhabditis elegans org.Ce.eg.db
##  4        Canis familiaris org.Cf.eg.db
##  5 Drosophila melanogaster org.Dm.eg.db
##  6 Drosophila melanogaster org.Dm.eg.db
##  7             Danio rerio org.Dr.eg.db
##  8           Gallus gallus org.Gg.eg.db
##  9            Homo sapiens org.Hs.eg.db
## 10            Homo sapiens org.Hs.eg.db
## # ... with 11 more rows, and 1 more variables: TxDb <chr>
Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent ‘TxDb’ package is used.
src <- src_ucsc("human", path)
An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with TxDb.Hsapiens.UCSC.hg38.knownGene, with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package
src <- src_organism(dbpath = hg38light())
src
## src:  sqlite 3.19.3 [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx
All methods from package dplyr can be used for a src_organism object.
Look at all available tables.
src_tbls(src)
##  [1] "id_accession"  "id_transcript" "id"            "id_omim_pm"   
##  [5] "id_protein"    "id_go"         "id_go_all"     "ranges_gene"  
##  [9] "ranges_tx"     "ranges_exon"   "ranges_cds"
Look at data from one specific table.
tbl(src, "id")
## # Source:   table<id> [?? x 6]
## # Database: sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
##    entrez      map         ensembl symbol               genename    alias
##     <chr>    <chr>           <chr>  <chr>                  <chr>    <chr>
##  1      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein      A1B
##  2      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein      ABG
##  3      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein      GAB
##  4      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein HYST2477
##  5      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein     A1BG
##  6     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2     AAC2
##  7     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2    NAT-2
##  8     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2     PNAT
##  9     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2     NAT2
## 10    100 20q13.12 ENSG00000196839    ADA    adenosine deaminase      ADA
## # ... with more rows
Look at fields of one table.
colnames(tbl(src, "id"))
## [1] "entrez"   "map"      "ensembl"  "symbol"   "genename" "alias"
Below are some examples of querying tables using dplyr.
SNORD% is from SQL, with % representing a wild-card match to any string)tbl(src, "id") %>%
    filter(symbol %like% "SNORD%") %>%
    dplyr::select(entrez, map, ensembl, symbol) %>%
    distinct() %>% arrange(symbol) %>% collect()
## # A tibble: 9 x 4
##      entrez     map         ensembl     symbol
##       <chr>   <chr>           <chr>      <chr>
## 1 100033413 15q11.2 ENSG00000207063 SNORD116-1
## 2 100033414 15q11.2 ENSG00000207001 SNORD116-2
## 3 100033415 15q11.2 ENSG00000207014 SNORD116-3
## 4 100033416 15q11.2 ENSG00000275529 SNORD116-4
## 5 100033417 15q11.2 ENSG00000207191 SNORD116-5
## 6 100033418 15q11.2 ENSG00000207442 SNORD116-6
## 7 100033419 15q11.2 ENSG00000207133 SNORD116-7
## 8 100033420 15q11.2 ENSG00000207093 SNORD116-8
## 9 100033421 15q11.2 ENSG00000206727 SNORD116-9
inner_join(tbl(src, "id"), tbl(src, "id_go")) %>%
    filter(symbol == "ADA") %>%
    dplyr::select(entrez, ensembl, symbol, go, evidence, ontology)
## Joining, by = "entrez"
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
##    entrez         ensembl symbol         go evidence ontology
##     <chr>           <chr>  <chr>      <chr>    <chr>    <chr>
##  1    100 ENSG00000196839    ADA GO:0001666      IDA       BP
##  2    100 ENSG00000196839    ADA GO:0001821      IEA       BP
##  3    100 ENSG00000196839    ADA GO:0001829      IEA       BP
##  4    100 ENSG00000196839    ADA GO:0001883      IEA       MF
##  5    100 ENSG00000196839    ADA GO:0001889      IEA       BP
##  6    100 ENSG00000196839    ADA GO:0001890      IEA       BP
##  7    100 ENSG00000196839    ADA GO:0002314      IEA       BP
##  8    100 ENSG00000196839    ADA GO:0002636      IEA       BP
##  9    100 ENSG00000196839    ADA GO:0002686      IEA       BP
## 10    100 ENSG00000196839    ADA GO:0002906      IEA       BP
## # ... with more rows
txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>%
    dplyr::select(symbol, tx_id) %>%
    group_by(symbol) %>%
    summarise(count = count(symbol)) %>%
    dplyr::select(symbol, count) %>%
    arrange(desc(count)) %>%
    collect(n=Inf)
## Joining, by = "entrez"
txcount
## # A tibble: 18 x 2
##        symbol count
##         <chr> <int>
##  1       AKT3   270
##  2   RNA5-8S5   128
##  3    NAALAD2    50
##  4       A1BG    40
##  5       MED6    39
##  6       CDH2    25
##  7      NR2E3    18
##  8    RNA28S5    18
##  9       NAT2     8
## 10   ANO1-AS2     6
## 11 SNORD116-2     4
## 12 SNORD116-3     4
## 13 SNORD116-5     4
## 14        ADA     3
## 15 SNORD116-1     3
## 16 SNORD116-4     2
## 17 SNORD116-8     2
## 18 ZBTB11-AS1     1
library(ggplot2)
ggplot(txcount, aes(x = symbol, y = count)) + 
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Transcript count") +
    labs(x = "Symbol") +
    labs(y = "Count")
inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
    filter(symbol %in% c("ADA", "NAT2")) %>%
    dplyr::select(gene_chrom, gene_start, gene_end, gene_strand,
                  symbol, map) %>%
    collect() %>% GenomicRanges::GRanges()
## Joining, by = "entrez"
## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames               ranges strand |      symbol         map
##          <Rle>            <IRanges>  <Rle> | <character> <character>
##   [1]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [2]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [3]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [4]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [5]    chr20 [44619522, 44651742]      - |         ADA    20q13.12
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
Methods select(), keytypes(), keys(), columns() and mapIds from AnnotationDbi are implemented for src_organism objects.
Use keytypes() to discover which keytypes can be passed to keytype argument of methods select() or keys().
keytypes(src)
##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"     
##  [5] "cds_id"       "cds_name"     "cds_start"    "cds_strand"  
##  [9] "ensembl"      "ensemblprot"  "ensembltrans" "entrez"      
## [13] "enzyme"       "evidence"     "evidenceall"  "exon_chrom"  
## [17] "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"    
## [25] "gene_start"   "gene_strand"  "genename"     "go"          
## [29] "goall"        "ipi"          "map"          "omim"        
## [33] "ontology"     "ontologyall"  "pfam"         "pmid"        
## [37] "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"    
## [45] "tx_strand"    "tx_type"      "unigene"      "uniprot"
Use columns() to discover which kinds of data can be returned for the src_organism object.
columns(src)
##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"     
##  [5] "cds_id"       "cds_name"     "cds_start"    "cds_strand"  
##  [9] "ensembl"      "ensemblprot"  "ensembltrans" "entrez"      
## [13] "enzyme"       "evidence"     "evidenceall"  "exon_chrom"  
## [17] "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"    
## [25] "gene_start"   "gene_strand"  "genename"     "go"          
## [29] "goall"        "ipi"          "map"          "omim"        
## [33] "ontology"     "ontologyall"  "pfam"         "pmid"        
## [37] "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"    
## [45] "tx_strand"    "tx_type"      "unigene"      "uniprot"
keys() returns keys for the src_organism object. By default it returns the primary keys for the database, and returns the keys from that keytype when the keytype argument is used.
Keys of entrez
head(keys(src))
## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"
Keys of symbol
head(keys(src, "symbol"))
## [1] "A1BG"     "ADA"      "AKT3"     "ANO1-AS2" "CDH2"     "DUXB"
select() retrieves the data as a tibble based on parameters for selected keys columns and keytype arguments. If requested columns that have multiple matches for the keys, select_tbl() will return a tibble with one row for each possible match, and select() will return a data frame.
keytype <- "symbol"
keys <- c("ADA", "NAT2")
columns <- c("entrez", "tx_id", "tx_name","exon_id")
select_tbl(src, keys, columns, keytype)
## Joining, by = "entrez"
## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
##    symbol entrez  tx_id    tx_name exon_id
##     <chr>  <chr>  <int>      <chr>   <int>
##  1    ADA    100 169786 uc061xfj.1  501379
##  2    ADA    100 169786 uc061xfj.1  501383
##  3    ADA    100 169786 uc061xfj.1  501385
##  4    ADA    100 169786 uc061xfj.1  501387
##  5    ADA    100 169786 uc061xfj.1  501389
##  6    ADA    100 169786 uc061xfj.1  501390
##  7    ADA    100 169786 uc061xfj.1  501392
##  8    ADA    100 169787 uc002xmj.4  501379
##  9    ADA    100 169787 uc002xmj.4  501382
## 10    ADA    100 169787 uc002xmj.4  501384
## # ... with more rows
mapIds() gets the mapped ids (column) for a set of keys that are of a particular keytype. Usually returned as a named character vector.
mapIds(src, keys, column = "tx_name", keytype)
## Joining, by = "entrez"
##          ADA         NAT2 
## "uc002xmj.4" "uc003wyw.2"
mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList")
## Joining, by = "entrez"
## CharacterList of length 2
## [["ADA"]] uc002xmj.4 uc061xfj.1 uc061xfl.1
## [["NAT2"]] uc003wyw.2 uc064kqw.1
Eleven genomic coordinates extractor methods are available in this package: transcripts(), exons(), cds(), genes(), promoters(), transcriptsBy(), exonsBy(), cdsBy(), intronsByTranscript(), fiveUTRsByTranscript(), threeUTRsByTranscript(). Data can be returned in two versions, for instance tibble (transcripts_tbl()) and GRanges or GRangesList (transcripts()).
Filters can be applied to all extractor functions. A named list of vectors can be used to restrict the output, valid filters can be retrieved by supportedFilters().
supportedFilters()
##  [1] "AccnumFilter"       "AliasFilter"        "CdsChromFilter"    
##  [4] "CdsNameFilter"      "CdsStrandFilter"    "EnsemblFilter"     
##  [7] "EnsemblprotFilter"  "EnsembltransFilter" "EntrezFilter"      
## [10] "EnzymeFilter"       "EvidenceFilter"     "EvidenceallFilter" 
## [13] "ExonChromFilter"    "ExonNameFilter"     "ExonStrandFilter"  
## [16] "FlybaseFilter"      "FlybaseCgFilter"    "FlybaseProtFilter" 
## [19] "GeneChromFilter"    "GeneStrandFilter"   "GenenameFilter"    
## [22] "GoFilter"           "GoallFilter"        "IpiFilter"         
## [25] "MapFilter"          "MgiFilter"          "OmimFilter"        
## [28] "OntologyFilter"     "OntologyallFilter"  "PfamFilter"        
## [31] "PmidFilter"         "PrositeFilter"      "RefseqFilter"      
## [34] "SymbolFilter"       "TxChromFilter"      "TxNameFilter"      
## [37] "TxStrandFilter"     "TxTypeFilter"       "UnigeneFilter"     
## [40] "UniprotFilter"      "WormbaseFilter"     "ZfinFilter"        
## [43] "CdsIdFilter"        "CdsStartFilter"     "CdsEndFilter"      
## [46] "ExonIdFilter"       "ExonStartFilter"    "ExonEndFilter"     
## [49] "ExonRankFilter"     "GeneStartFilter"    "GeneEndFilter"     
## [52] "TxIdFilter"         "TxStartFilter"      "TxEndFilter"
All filters take two parameters: value and condition, condition could be one of “==”, “!=”, “startsWith”, “endsWith”, “>”, “<”, “>=” and “<=”, default condition is “==”.
EnsemblFilter("ENSG00000196839")
## class: EnsemblFilter 
## condition: == 
## value: ENSG00000196839
SymbolFilter("SNORD", "startsWith")
## class: SymbolFilter 
## condition: startsWith 
## value: SNORD
A GRangesFilter() can also be used as filter for the methods with result displaying as GRanges or GRangesList.
filters <- list(SymbolFilter("SNORD", "startsWith"))
transcripts_tbl(src, filter=filters)
## Joining, by = "entrez"
## # Source:     lazy query [?? x 7]
## # Database:   sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## # Ordered by: tx_id
##   tx_chrom tx_start   tx_end tx_strand  tx_id    tx_name     symbol
##      <chr>    <int>    <int>     <chr>  <int>      <chr>      <chr>
## 1    chr15 25051477 25051571         + 123215 uc001yxg.4 SNORD116-1
## 2    chr15 25054210 25054304         + 123216 uc001yxi.4 SNORD116-2
## 3    chr15 25056860 25056954         + 123217 uc001yxk.4 SNORD116-3
## 4    chr15 25059538 25059633         + 123218 uc001yxl.5 SNORD116-4
## 5    chr15 25062333 25062427         + 123219 uc001yxo.4 SNORD116-5
## 6    chr15 25065026 25065121         + 123221 uc001yxp.5 SNORD116-2
## 7    chr15 25067788 25067882         + 123222 uc059gus.1 SNORD116-5
## 8    chr15 25070432 25070526         + 123223 uc001yxr.4 SNORD116-8
## 9    chr15 25073107 25073201         + 123224 uc001yxs.4 SNORD116-3
filters <- list(
    SymbolFilter("SNORD", "startsWith"),
    GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121"))
)
transcripts(src, filter=filters)
## Joining, by = "entrez"
## GRanges object with 2 ranges and 3 metadata columns:
##       seqnames               ranges strand |     tx_id     tx_name
##          <Rle>            <IRanges>  <Rle> | <integer> <character>
##   [1]    chr15 [25062333, 25062427]      + |    123219  uc001yxo.4
##   [2]    chr15 [25065026, 25065121]      + |    123221  uc001yxp.5
##            symbol
##       <character>
##   [1]  SNORD116-5
##   [2]  SNORD116-2
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
Transcript coordinates of gene symbol equal to “ADA” and transcript start position between 87863438 and 87933487.
transcripts_tbl(src, filter = list(
    SymbolFilter("ADA"),
    TxStartFilter(44619810,"<")
))
## Joining, by = "entrez"
## # Source:     lazy query [?? x 7]
## # Database:   sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## # Ordered by: tx_id
##   tx_chrom tx_start   tx_end tx_strand  tx_id    tx_name symbol
##      <chr>    <int>    <int>     <chr>  <int>      <chr>  <chr>
## 1    chr20 44619522 44626491         - 169786 uc061xfj.1    ADA
## 2    chr20 44619522 44651742         - 169787 uc002xmj.4    ADA
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2         ggplot2_2.2.1        GenomicRanges_1.28.6
##  [4] GenomeInfoDb_1.12.3  IRanges_2.10.5       S4Vectors_0.14.7    
##  [7] BiocGenerics_0.22.1  Organism.dplyr_1.2.2 dplyr_0.7.4         
## [10] BiocStyle_2.4.1     
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.6.5 lattice_0.20-35           
##  [3] colorspace_1.3-2           htmltools_0.3.6           
##  [5] BiocFileCache_1.0.1        rtracklayer_1.36.6        
##  [7] yaml_2.1.14                GenomicFeatures_1.28.5    
##  [9] blob_1.1.0                 XML_3.98-1.9              
## [11] rlang_0.1.2                glue_1.1.1                
## [13] DBI_0.7                    BiocParallel_1.10.1       
## [15] rappdirs_0.3.1             bit64_0.9-7               
## [17] dbplyr_1.1.0               plyr_1.8.4                
## [19] matrixStats_0.52.2         GenomeInfoDbData_0.99.0   
## [21] bindr_0.1                  stringr_1.2.0             
## [23] zlibbioc_1.22.0            Biostrings_2.44.2         
## [25] munsell_0.4.3              gtable_0.2.0              
## [27] memoise_1.1.0              evaluate_0.10.1           
## [29] labeling_0.3               Biobase_2.36.2            
## [31] knitr_1.17                 biomaRt_2.32.1            
## [33] AnnotationDbi_1.38.2       Rcpp_0.12.13              
## [35] scales_0.5.0               backports_1.1.1           
## [37] DelayedArray_0.2.7         XVector_0.16.0            
## [39] bit_1.1-12                 Rsamtools_1.28.0          
## [41] digest_0.6.12              stringi_1.1.5             
## [43] grid_3.4.2                 rprojroot_1.2             
## [45] tools_3.4.2                bitops_1.0-6              
## [47] magrittr_1.5               lazyeval_0.2.0            
## [49] RCurl_1.95-4.8             tibble_1.3.4              
## [51] RSQLite_2.0                pkgconfig_2.0.1           
## [53] Matrix_1.2-11              assertthat_0.2.0          
## [55] rmarkdown_1.6              httr_1.3.1                
## [57] R6_2.2.2                   GenomicAlignments_1.12.2  
## [59] compiler_3.4.2