1 Introduction

The Organism.dplyr creates an on disk sqlite database to hold data of an organism combined from an ‘org’ package (e.g., org.Hs.eg.db) and a genome coordinate functionality of the ‘TxDb’ package (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene). It aims to provide an integrated presentation of identifiers and genomic coordinates. And a src_organism object is created to point to the database.

The src_organism object is created as an extension of src_sql and src_sqlite from dplyr, which inherited all dplyr methods. It also implements the select() interface from AnnotationDbi and genomic coordinates extractors from GenomicFeatures.

2 Constructing a src_organism

2.1 Make sqlite datebase from ‘TxDb’ package

The src_organism() constructor creates an on disk sqlite database file with data from a given ‘TxDb’ package and corresponding ‘org’ package. When dbpath is given, file is created at the given path, otherwise temporary file is created.

library(Organism.dplyr)

Running src_organism() without a given path will save the sqlite file to a tempdir():

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date.

path <- "path/to/my.sqlite"
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path)

supportedOrganisms() provides a list of organisms with corresponding ‘org’ and ‘TxDb’ packages being supported.

supportedOrganisms()

## # A tibble: 21 x 3
##    organism                OrgDb        TxDb                                 
##    <chr>                   <chr>        <chr>                                
##  1 Bos taurus              org.Bt.eg.db TxDb.Btaurus.UCSC.bosTau8.refGene    
##  2 Caenorhabditis elegans  org.Ce.eg.db TxDb.Celegans.UCSC.ce11.refGene      
##  3 Caenorhabditis elegans  org.Ce.eg.db TxDb.Celegans.UCSC.ce6.ensGene       
##  4 Canis familiaris        org.Cf.eg.db TxDb.Cfamiliaris.UCSC.canFam3.refGene
##  5 Drosophila melanogaster org.Dm.eg.db TxDb.Dmelanogaster.UCSC.dm3.ensGene  
##  6 Drosophila melanogaster org.Dm.eg.db TxDb.Dmelanogaster.UCSC.dm6.ensGene  
##  7 Danio rerio             org.Dr.eg.db TxDb.Drerio.UCSC.danRer10.refGene    
##  8 Gallus gallus           org.Gg.eg.db TxDb.Ggallus.UCSC.galGal4.refGene    
##  9 Homo sapiens            org.Hs.eg.db TxDb.Hsapiens.UCSC.hg18.knownGene    
## 10 Homo sapiens            org.Hs.eg.db TxDb.Hsapiens.UCSC.hg19.knownGene    
## # … with 11 more rows

2.2 Make sqlite datebase from organism name

Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent ‘TxDb’ package is used.

src <- src_ucsc("human", path)

2.3 Access existing sqlite file

An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with TxDb.Hsapiens.UCSC.hg38.knownGene, with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package

src <- src_organism(dbpath = hg38light())
src

## src:  sqlite 3.35.5 [/tmp/RtmpllfGUU/Rinstacede2f52c28c/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

3 The “dplyr” interface

All methods from package dplyr can be used for a src_organism object.

Look at all available tables.

src_tbls(src)

##  [1] "id_accession"  "id_transcript" "id"            "id_omim_pm"   
##  [5] "id_protein"    "id_go"         "id_go_all"     "ranges_gene"  
##  [9] "ranges_tx"     "ranges_exon"   "ranges_cds"

Look at data from one specific table.

tbl(src, "id")

## # Source:   table<id> [?? x 6]
## # Database: sqlite 3.35.5 []
##    entrez map      ensembl         symbol genename               alias   
##    <chr>  <chr>    <chr>           <chr>  <chr>                  <chr>   
##  1 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein A1B     
##  2 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein ABG     
##  3 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein GAB     
##  4 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein HYST2477
##  5 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein A1BG    
##  6 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  AAC2    
##  7 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  NAT-2   
##  8 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  PNAT    
##  9 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  NAT2    
## 10 100    20q13.12 ENSG00000196839 ADA    adenosine deaminase    ADA1    
## # … with more rows

Look at fields of one table.

colnames(tbl(src, "id"))

## [1] "entrez"   "map"      "ensembl"  "symbol"   "genename" "alias"

Below are some examples of querying tables using dplyr.

Gene symbol starting with “SNORD” (the notation SNORD% is from SQL, with % representing a wild-card match to any string)

tbl(src, "id") %>%
    filter(symbol %like% "SNORD%") %>%
    dplyr::select(entrez, map, ensembl, symbol) %>%
    distinct() %>% arrange(symbol) %>% collect()

## # A tibble: 8 x 4
##   entrez    map     ensembl         symbol    
##   <chr>     <chr>   <chr>           <chr>     
## 1 100033413 15q11.2 ENSG00000207063 SNORD116-1
## 2 100033414 15q11.2 ENSG00000207001 SNORD116-2
## 3 100033415 15q11.2 ENSG00000207014 SNORD116-3
## 4 100033416 15q11.2 ENSG00000275529 SNORD116-4
## 5 100033417 15q11.2 ENSG00000207191 SNORD116-5
## 6 100033418 15q11.2 ENSG00000207442 SNORD116-6
## 7 100033419 15q11.2 ENSG00000207133 SNORD116-7
## 8 100033420 15q11.2 ENSG00000207093 SNORD116-8

Gene ontology (GO) info for gene symbol “ADA”

inner_join(tbl(src, "id"), tbl(src, "id_go")) %>%
    filter(symbol == "ADA") %>%
    dplyr::select(entrez, ensembl, symbol, go, evidence, ontology)

## Joining, by = "entrez"

## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.35.5 []
##    entrez ensembl         symbol go         evidence ontology
##    <chr>  <chr>           <chr>  <chr>      <chr>    <chr>   
##  1 100    ENSG00000196839 ADA    GO:0001666 IDA      BP      
##  2 100    ENSG00000196839 ADA    GO:0001821 IEA      BP      
##  3 100    ENSG00000196839 ADA    GO:0001829 IEA      BP      
##  4 100    ENSG00000196839 ADA    GO:0001883 IEA      MF      
##  5 100    ENSG00000196839 ADA    GO:0001889 IEA      BP      
##  6 100    ENSG00000196839 ADA    GO:0001890 IEA      BP      
##  7 100    ENSG00000196839 ADA    GO:0002314 IEA      BP      
##  8 100    ENSG00000196839 ADA    GO:0002636 IEA      BP      
##  9 100    ENSG00000196839 ADA    GO:0002686 IEA      BP      
## 10 100    ENSG00000196839 ADA    GO:0002906 IEA      BP      
## # … with more rows

Gene transcript counts per gene symbol

txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>%
    dplyr::select(symbol, tx_id) %>%
    group_by(symbol) %>%
    summarize(count = n()) %>%
    dplyr::select(symbol, count) %>%
    arrange(desc(count)) %>%
    collect(n=Inf)

## Joining, by = "entrez"

txcount

## # A tibble: 18 x 2
##    symbol     count
##    <chr>      <int>
##  1 AKT3         288
##  2 NAALAD2      100
##  3 CDH2          77
##  4 A1BG          40
##  5 MED6          39
##  6 ADA           18
##  7 NR2E3         18
##  8 LINC02584      8
##  9 NAT2           8
## 10 SNORD116-2     4
## 11 SNORD116-3     4
## 12 SNORD116-5     4
## 13 SNORD116-1     3
## 14 POU5F1P5       2
## 15 SNORD116-4     2
## 16 SNORD116-8     2
## 17 DUXB           1
## 18 ZBTB11-AS1     1

library(ggplot2)
ggplot(txcount, aes(x = symbol, y = count)) + 
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Transcript count") +
    labs(x = "Symbol") +
    labs(y = "Count")

Gene coordinates of symbol “ADA” and “NAT2” as GRanges

inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
    filter(symbol %in% c("ADA", "NAT2")) %>%
    dplyr::select(gene_chrom, gene_start, gene_end, gene_strand,
                  symbol, map) %>%
    collect() %>% GenomicRanges::GRanges()

## Joining, by = "entrez"

## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames            ranges strand |      symbol         map
##          <Rle>         <IRanges>  <Rle> | <character> <character>
##   [1]     chr8 18391282-18401218      + |        NAT2        8p22
##   [2]     chr8 18391282-18401218      + |        NAT2        8p22
##   [3]     chr8 18391282-18401218      + |        NAT2        8p22
##   [4]     chr8 18391282-18401218      + |        NAT2        8p22
##   [5]    chr20 44619522-44652233      - |         ADA    20q13.12
##   [6]    chr20 44619522-44652233      - |         ADA    20q13.12
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

4 The “select” interface

Methods select(), keytypes(), keys(), columns() and mapIds from AnnotationDbi are implemented for src_organism objects.

Use keytypes() to discover which keytypes can be passed to keytype argument of methods select() or keys().

keytypes(src)

##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"      "cds_id"      
##  [6] "cds_name"     "cds_start"    "cds_strand"   "ensembl"      "ensemblprot" 
## [11] "ensembltrans" "entrez"       "enzyme"       "evidence"     "evidenceall" 
## [16] "exon_chrom"   "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"     "gene_start"  
## [26] "gene_strand"  "genename"     "go"           "goall"        "ipi"         
## [31] "map"          "omim"         "ontology"     "ontologyall"  "pfam"        
## [36] "pmid"         "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"     "tx_strand"   
## [46] "tx_type"      "uniprot"

Use columns() to discover which kinds of data can be returned for the src_organism object.

columns(src)

##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"      "cds_id"      
##  [6] "cds_name"     "cds_start"    "cds_strand"   "ensembl"      "ensemblprot" 
## [11] "ensembltrans" "entrez"       "enzyme"       "evidence"     "evidenceall" 
## [16] "exon_chrom"   "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"     "gene_start"  
## [26] "gene_strand"  "genename"     "go"           "goall"        "ipi"         
## [31] "map"          "omim"         "ontology"     "ontologyall"  "pfam"        
## [36] "pmid"         "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"     "tx_strand"   
## [46] "tx_type"      "uniprot"

keys() returns keys for the src_organism object. By default it returns the primary keys for the database, and returns the keys from that keytype when the keytype argument is used.

Keys of entrez

head(keys(src))

## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"

Keys of symbol

head(keys(src, "symbol"))

## [1] "A1BG"    "NAT2"    "ADA"     "CDH2"    "AKT3"    "GAGE12F"

select() retrieves the data as a tibble based on parameters for selected keys columns and keytype arguments. If requested columns that have multiple matches for the keys, select_tbl() will return a tibble with one row for each possible match, and select() will return a data frame.

keytype <- "symbol"
keys <- c("ADA", "NAT2")
columns <- c("entrez", "tx_id", "tx_name","exon_id")
select_tbl(src, keys, columns, keytype)

## Joining, by = "entrez"

## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.35.5 []
##    symbol entrez tx_id tx_name           exon_id
##    <chr>  <chr>  <int> <chr>               <int>
##  1 NAT2   10     94538 ENST00000286479.4  262229
##  2 NAT2   10     94538 ENST00000286479.4  262231
##  3 NAT2   10     94539 ENST00000520116.1  262230
##  4 NAT2   10     94539 ENST00000520116.1  262232
##  5 NAT2   10     94538 ENST00000286479.4  262229
##  6 NAT2   10     94538 ENST00000286479.4  262231
##  7 NAT2   10     94539 ENST00000520116.1  262230
##  8 NAT2   10     94539 ENST00000520116.1  262232
##  9 NAT2   10     94538 ENST00000286479.4  262229
## 10 NAT2   10     94538 ENST00000286479.4  262231
## # … with more rows

mapIds() gets the mapped ids (column) for a set of keys that are of a particular keytype. Usually returned as a named character vector.

mapIds(src, keys, column = "tx_name", keytype)

## Joining, by = "entrez"

##                 ADA                NAT2 
## "ENST00000372874.9" "ENST00000286479.4"

mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList")

## Joining, by = "entrez"

## CharacterList of length 2
## [["ADA"]] ENST00000372874.9 ENST00000464097.5 ... ENST00000545776.5
## [["NAT2"]] ENST00000286479.4 ENST00000520116.1 ... ENST00000520116.1

5 Genomic Coordinates Extractor Interfaces

Eleven genomic coordinates extractor methods are available in this package: transcripts(), exons(), cds(), genes(), promoters(), transcriptsBy(), exonsBy(), cdsBy(), intronsByTranscript(), fiveUTRsByTranscript(), threeUTRsByTranscript(). Data can be returned in two versions, for instance tibble (transcripts_tbl()) and GRanges or GRangesList (transcripts()).

Filters can be applied to all extractor functions. The output can be resctricted by an AnnotationFilter, an AnnotationFilterList, or an expression that can be tranlated into an AnnotationFilterList. Valid filters can be retrieved by supportedFilters(src).

supportedFilters(src)

##                filter        field
## 1        AccnumFilter       accnum
## 2         AliasFilter        alias
## 3      CdsChromFilter    cds_chrom
## 44       CdsEndFilter      cds_end
## 42        CdsIdFilter       cds_id
## 4       CdsNameFilter     cds_name
## 43     CdsStartFilter    cds_start
## 5     CdsStrandFilter   cds_strand
## 6       EnsemblFilter      ensembl
## 7   EnsemblprotFilter  ensemblprot
## 8  EnsembltransFilter ensembltrans
## 9        EntrezFilter       entrez
## 10       EnzymeFilter       enzyme
## 11     EvidenceFilter     evidence
## 12  EvidenceallFilter  evidenceall
## 13    ExonChromFilter   exon_chrom
## 47      ExonEndFilter     exon_end
## 45       ExonIdFilter      exon_id
## 14     ExonNameFilter    exon_name
## 48     ExonRankFilter    exon_rank
## 46    ExonStartFilter   exon_start
## 15   ExonStrandFilter  exon_strand
## 17    FlybaseCgFilter   flybase_cg
## 16      FlybaseFilter      flybase
## 18  FlybaseProtFilter flybase_prot
## 54      GRangesFilter      granges
## 19    GeneChromFilter   gene_chrom
## 50      GeneEndFilter     gene_end
## 49    GeneStartFilter   gene_start
## 20   GeneStrandFilter  gene_strand
## 21     GenenameFilter     genename
## 22           GoFilter           go
## 23        GoallFilter        goall
## 24          IpiFilter          ipi
## 25          MapFilter          map
## 26          MgiFilter          mgi
## 27         OmimFilter         omim
## 28     OntologyFilter     ontology
## 29  OntologyallFilter  ontologyall
## 30         PfamFilter         pfam
## 31         PmidFilter         pmid
## 32      PrositeFilter      prosite
## 33       RefseqFilter       refseq
## 34       SymbolFilter       symbol
## 35      TxChromFilter     tx_chrom
## 53        TxEndFilter       tx_end
## 51         TxIdFilter        tx_id
## 36       TxNameFilter      tx_name
## 52      TxStartFilter     tx_start
## 37     TxStrandFilter    tx_strand
## 38       TxTypeFilter      tx_type
## 39      UniprotFilter      uniprot
## 40     WormbaseFilter     wormbase
## 41         ZfinFilter         zfin

All filters take two parameters: value and condition, condition could be one of “==”, “!=”, “startsWith”, “endsWith”, “>”, “<”, “>=” and “<=”, default condition is “==”.

EnsemblFilter("ENSG00000196839")

## class: EnsemblFilter 
## condition: == 
## value: ENSG00000196839

SymbolFilter("SNORD", "startsWith")

## class: SymbolFilter 
## condition: startsWith 
## value: SNORD

The following illustrates several ways of inputting filters to an extractor function.

smbl <- SymbolFilter("SNORD", "startsWith")
transcripts_tbl(src, filter=smbl)

## # A tibble: 9 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol    
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr>     
## 1 chr15    25051477 25051571 +         159137 ENST00000384335.1 SNORD116-1
## 2 chr15    25054210 25054304 +         159141 ENST00000384274.1 SNORD116-2
## 3 chr15    25056860 25056954 +         159142 ENST00000384287.1 SNORD116-3
## 4 chr15    25059538 25059633 +         159143 ENST00000384733.1 SNORD116-4
## 5 chr15    25062333 25062427 +         159145 ENST00000384462.1 SNORD116-5
## 6 chr15    25065026 25065121 +         159148 ENST00000384711.1 SNORD116-2
## 7 chr15    25067788 25067882 +         159156 ENST00000384404.1 SNORD116-5
## 8 chr15    25070432 25070526 +         159158 ENST00000384365.1 SNORD116-8
## 9 chr15    25073107 25073201 +         159159 ENST00000384000.1 SNORD116-3

filter <- AnnotationFilterList(smbl)
transcripts_tbl(src, filter=filter)

## # A tibble: 9 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol    
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr>     
## 1 chr15    25051477 25051571 +         159137 ENST00000384335.1 SNORD116-1
## 2 chr15    25054210 25054304 +         159141 ENST00000384274.1 SNORD116-2
## 3 chr15    25056860 25056954 +         159142 ENST00000384287.1 SNORD116-3
## 4 chr15    25059538 25059633 +         159143 ENST00000384733.1 SNORD116-4
## 5 chr15    25062333 25062427 +         159145 ENST00000384462.1 SNORD116-5
## 6 chr15    25065026 25065121 +         159148 ENST00000384711.1 SNORD116-2
## 7 chr15    25067788 25067882 +         159156 ENST00000384404.1 SNORD116-5
## 8 chr15    25070432 25070526 +         159158 ENST00000384365.1 SNORD116-8
## 9 chr15    25073107 25073201 +         159159 ENST00000384000.1 SNORD116-3

transcripts_tbl(src, filter=~smbl)

## # A tibble: 9 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol    
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr>     
## 1 chr15    25051477 25051571 +         159137 ENST00000384335.1 SNORD116-1
## 2 chr15    25054210 25054304 +         159141 ENST00000384274.1 SNORD116-2
## 3 chr15    25056860 25056954 +         159142 ENST00000384287.1 SNORD116-3
## 4 chr15    25059538 25059633 +         159143 ENST00000384733.1 SNORD116-4
## 5 chr15    25062333 25062427 +         159145 ENST00000384462.1 SNORD116-5
## 6 chr15    25065026 25065121 +         159148 ENST00000384711.1 SNORD116-2
## 7 chr15    25067788 25067882 +         159156 ENST00000384404.1 SNORD116-5
## 8 chr15    25070432 25070526 +         159158 ENST00000384365.1 SNORD116-8
## 9 chr15    25073107 25073201 +         159159 ENST00000384000.1 SNORD116-3

A GRangesFilter() can also be used as a filter for the methods with result displaying as GRanges or GRangesList.

gr <- GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121"))
transcripts(src, filter=~smbl & gr)

## GRanges object with 2 ranges and 3 metadata columns:
##       seqnames            ranges strand |     tx_id           tx_name
##          <Rle>         <IRanges>  <Rle> | <integer>       <character>
##   [1]    chr15 25062333-25062427      + |    159145 ENST00000384462.1
##   [2]    chr15 25065026-25065121      + |    159148 ENST00000384711.1
##            symbol
##       <character>
##   [1]  SNORD116-5
##   [2]  SNORD116-2
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

Filters in extractor functions support &, |, !(negation), and ()(grouping). Transcript coordinates of gene symbol equal to “ADA” and transcript start position less than 44619810.

transcripts_tbl(src, filter = AnnotationFilterList(
    SymbolFilter("ADA"),
    TxStartFilter(44619810,"<"),
    logicOp="&")
)

## # A tibble: 3 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr> 
## 1 chr20    44619522 44626491 -         215140 ENST00000464097.5 ADA   
## 2 chr20    44619522 44651699 -         215141 ENST00000372874.9 ADA   
## 3 chr20    44619579 44651681 -         215142 ENST00000536532.5 ADA

## Equivalent to
transcripts_tbl(src, filter = ~symbol == "ADA" & tx_start < 44619810)

## # A tibble: 3 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr> 
## 1 chr20    44619522 44626491 -         215140 ENST00000464097.5 ADA   
## 2 chr20    44619522 44651699 -         215141 ENST00000372874.9 ADA   
## 3 chr20    44619579 44651681 -         215142 ENST00000536532.5 ADA

Transcripts coordinates of gene symbol equal to “ADA” or transcript end position equal to 243843236.

txend <- TxEndFilter(243843236, '==')
transcripts_tbl(src, filter = ~symbol == "ADA" | txend)

## # A tibble: 10 x 7
##    tx_chrom  tx_start    tx_end tx_strand  tx_id tx_name           symbol
##    <chr>        <int>     <int> <chr>      <int> <chr>             <chr> 
##  1 chr1     243488233 243843236 -          20140 ENST00000336199.9 AKT3  
##  2 chr20     44619522  44626491 -         215140 ENST00000464097.5 ADA   
##  3 chr20     44619522  44651699 -         215141 ENST00000372874.9 ADA   
##  4 chr20     44619579  44651681 -         215142 ENST00000536532.5 ADA   
##  5 chr20     44619810  44651691 -         215143 ENST00000492931.5 ADA   
##  6 chr20     44619810  44651691 -         215144 ENST00000537820.1 ADA   
##  7 chr20     44619810  44651691 -         215145 ENST00000539235.5 ADA   
##  8 chr20     44626323  44651661 -         215146 ENST00000545776.5 ADA   
##  9 chr20     44626517  44652114 -         215147 ENST00000536076.1 ADA   
## 10 chr20     44636071  44652233 -         215148 ENST00000535573.1 ADA

Using negation to find transcript coordinates of gene symbol equal to “ADA” and transcript start positions NOT less than 44619810.

transcripts_tbl(src, filter = ~symbol == "ADA" & !tx_start < 44618910)

## # A tibble: 9 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr> 
## 1 chr20    44619522 44626491 -         215140 ENST00000464097.5 ADA   
## 2 chr20    44619522 44651699 -         215141 ENST00000372874.9 ADA   
## 3 chr20    44619579 44651681 -         215142 ENST00000536532.5 ADA   
## 4 chr20    44619810 44651691 -         215143 ENST00000492931.5 ADA   
## 5 chr20    44619810 44651691 -         215144 ENST00000537820.1 ADA   
## 6 chr20    44619810 44651691 -         215145 ENST00000539235.5 ADA   
## 7 chr20    44626323 44651661 -         215146 ENST00000545776.5 ADA   
## 8 chr20    44626517 44652114 -         215147 ENST00000536076.1 ADA   
## 9 chr20    44636071 44652233 -         215148 ENST00000535573.1 ADA

Using grouping to find transcript coordinates of a long filter statement.

transcripts_tbl(src,
    filter = ~(symbol == 'ADA' & !(tx_start >= 44619810 | tx_end < 44651742)) | 
              (smbl & !tx_end > 25056954)
)

## # A tibble: 3 x 7
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol    
##   <chr>       <int>    <int> <chr>      <int> <chr>             <chr>     
## 1 chr15    25051477 25051571 +         159137 ENST00000384335.1 SNORD116-1
## 2 chr15    25054210 25054304 +         159141 ENST00000384274.1 SNORD116-2
## 3 chr15    25056860 25056954 +         159142 ENST00000384287.1 SNORD116-3

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.3.3           GenomicRanges_1.44.0    GenomeInfoDb_1.28.0    
##  [4] IRanges_2.26.0          S4Vectors_0.30.0        BiocGenerics_0.38.0    
##  [7] Organism.dplyr_1.20.0   AnnotationFilter_1.16.0 dplyr_1.0.6            
## [10] BiocStyle_2.20.0       
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                matrixStats_0.58.0         
##  [3] bit64_4.0.5                 filelock_1.0.2             
##  [5] progress_1.2.2              httr_1.4.2                 
##  [7] tools_4.1.0                 bslib_0.2.5.1              
##  [9] utf8_1.2.1                  R6_2.5.0                   
## [11] DBI_1.1.1                   lazyeval_0.2.2             
## [13] colorspace_2.0-1            withr_2.4.2                
## [15] tidyselect_1.1.1            prettyunits_1.1.1          
## [17] bit_4.0.4                   curl_4.3.1                 
## [19] compiler_4.1.0              cli_2.5.0                  
## [21] Biobase_2.52.0              DelayedArray_0.18.0        
## [23] rtracklayer_1.52.0          labeling_0.4.2             
## [25] bookdown_0.22               sass_0.4.0                 
## [27] scales_1.1.1                rappdirs_0.3.3             
## [29] stringr_1.4.0               digest_0.6.27              
## [31] Rsamtools_2.8.0             rmarkdown_2.8              
## [33] XVector_0.32.0              pkgconfig_2.0.3            
## [35] htmltools_0.5.1.1           MatrixGenerics_1.4.0       
## [37] dbplyr_2.1.1                fastmap_1.1.0              
## [39] highr_0.9                   rlang_0.4.11               
## [41] rstudioapi_0.13             RSQLite_2.2.7              
## [43] jquerylib_0.1.4             BiocIO_1.2.0               
## [45] generics_0.1.0              farver_2.1.0               
## [47] jsonlite_1.7.2              BiocParallel_1.26.0        
## [49] RCurl_1.98-1.3              magrittr_2.0.1             
## [51] GenomeInfoDbData_1.2.6      Matrix_1.3-3               
## [53] Rcpp_1.0.6                  munsell_0.5.0              
## [55] fansi_0.4.2                 lifecycle_1.0.0            
## [57] stringi_1.6.2               yaml_2.2.1                 
## [59] SummarizedExperiment_1.22.0 zlibbioc_1.38.0            
## [61] BiocFileCache_2.0.0         grid_4.1.0                 
## [63] blob_1.2.1                  crayon_1.4.1               
## [65] lattice_0.20-44             Biostrings_2.60.0          
## [67] GenomicFeatures_1.44.0      hms_1.1.0                  
## [69] KEGGREST_1.32.0             magick_2.7.2               
## [71] knitr_1.33                  ps_1.6.0                   
## [73] pillar_1.6.1                rjson_0.2.20               
## [75] biomaRt_2.48.0              XML_3.99-0.6               
## [77] glue_1.4.2                  evaluate_0.14              
## [79] BiocManager_1.30.15         png_0.1-7                  
## [81] vctrs_0.3.8                 gtable_0.3.0               
## [83] purrr_0.3.4                 assertthat_0.2.1           
## [85] cachem_1.0.5                xfun_0.23                  
## [87] restfulr_0.0.13             tibble_3.1.2               
## [89] GenomicAlignments_1.28.0    AnnotationDbi_1.54.0       
## [91] memoise_2.0.0               ellipsis_0.3.2

Organism.dplyr

19 May 2021

Contents