BgeeCall, a R package for automatic gene expression calls generation

BgeeCall is a collection of functions that uses Bgee expertise to create gene expression present/absent calls

If you find a bug or have any issues with BgeeCall please write a bug report in our GitHub issues manager available at (URL).

How present/absent calls are generated

In Bgee present/absent gene expression calls for RNA-seq are generated using a threshold specific of each RNA-Seq library, calculated using reads mapped to reference intergenic regions. This is unlike the more usual use of an arbitrary threshold below which a gene is not considered as expressed (e.g log2(TPM) = 1).

Bgee database

Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It notably integrates RNA-Seq libraries for 29 species.

Reference intergenic regions

Reference intergenic regions are defined in the Bgee RNA-Seq pipeline. Candidate intergenic regions are defined using gene annotation data. For each species, over all available libraries, reads are mapped to these intergenic regions with kallisto, as well as to genes. This “intergenic expression” is deconvoluted to distinguish reference intergenic from non annotated genes, which have higher expression. Reference intergenic regions are then defined as intergenic regions with low expression level over all RNA-Seq libraries, relative to genes. This step allows not to consider regions wrongly considered as intergenic because of potential gene annotation quality problem as intergenic. For more information please refer to the Bgee RNA-Seq pipeline.

Threshold of present/absent

BgeeCall pipeline allows to download reference intergenic regions resulting from the expertise of the Bgee team. Moreover BgeeCall allows to use these reference intergenic regions to automatically generate gene expression calls for your own RNA-Seq libraries as long as the species is integrated to Bgee The present/absent abundance threshold is calculated for each library using the formula :

\[ \frac {proportion\ of\ reference\ intergenic\ present}{proportion\ of\ protein\ coding\ present} = 0.05 \]

Installation

In R:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BgeeCall")

How to use the BgeeCall package

BgeeCall is highly tunable. Do not hesitate to have a look at the reference manual to have a precise descripton of all slots of the 4 main S4 classes (AbundanceMetadata, KallistoMetadata, BgeeMetadata and UserMetadata) or of all available functions. BgeeCall needs kallisto to run. If you do not have kallisto installed you will find more information how to install it here

Load the package

library(BgeeCall)

Quick start

With the BgeeCall package it is easy to generate present/absent gene expression calls. The most time comsuming task of this calls generation is the generation of the kallisto transcriptome index. As the time needed for this step depend on the size of the transcriptome, we choose, as an example, the smallest transcriptome file among all species available on Bgee (C. elegans). To generate these calls you will need :

a transcriptome
gene annotations
your RNA-Seq reads in fastq files

For this vignette we created a toy fastq file example based on the SRX099901 library using the ShortRead R package

library("ShortRead")
# keep 48.000 reads
sampler <- FastqSampler(file.path("absolute_path","/SRX099901/SRR350955.fastq.gz"), 48000)
set.seed(1); SRR350955 <- yield(sampler)
writeFastq(object = SRR350955, file =file.path( "absolute_path","SRX099901_subset", "SRR350955_subset.fastq.gz"), mode = "w", full = FALSE, compress = TRUE)

In this example we used the Bioconductor AnnotationHub to load transcriptome and gene annotations but you can load them from wherever you want.

ah <- AnnotationHub::AnnotationHub()
ah_resources <- AnnotationHub::query(ah, c("Ensembl", "Caenorhabditis elegans", "84"))
annotation_object <- ah_resources[["AH50789"]]
transcriptome_object <- rtracklayer::import.2bit(ah_resources[["AH50453"]])

Once you have access to transcriptome, gene annotations and your RNA-Seq library, an object of class UserMetadata has to be created.

# create an object of class UserMetadata and specify the species ID
user_BgeeCall <- new("UserMetadata", species_id = "6239")
# import annotation and transcriptome in the user_BgeeCall object
# it is possible to import them using an S4 object (GRanges, DNAStringSet) or a file (gtf, fasta)
user_BgeeCall <- setAnnotationFromObject(user_BgeeCall, annotation_object, "WBcel235_84")
user_BgeeCall <- setTranscriptomeFromObject(user_BgeeCall, transcriptome_object, "WBcel235")
# provide path to the directory of your RNA-Seq library
user_BgeeCall <- setRNASeqLibPath(user_BgeeCall, 
                                  system.file("extdata", 
                                              "SRX099901_subset", 
                                              package = "BgeeCall"))

And that’s it… You can run the generation of your present/absent gene expression calls

calls_output <- generate_calls_workflow(userMetadata = user_BgeeCall)

#> 
#> Querying Bgee to get intergenic release information...
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read_tsv
#> 1 
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read_tsv
#> 1 
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Generate present/absent expression calls.
#> 
#> TPM cutoff for which 95% of the expressed genes are coding found at TPM = 4.64724e-06

Each analyze generates 4 files and return path to each one of them.

calls_tsv_path : path to main tsv file with TPM, count, length, biotype, type, and presence/absence of expression summarized at gene level (or at the transcript level if it was requested)

head.DataTable(x = read.table(calls_output$calls_tsv_path, header = TRUE), n = 5)
#>               id abundance counts  length        biotype  type    call
#> 1 WBGene00000001   27.0993      2 1556.20 protein_coding genic present
#> 2 WBGene00000002    0.0000      0 1761.00 protein_coding genic  absent
#> 3 WBGene00000003    0.0000      0 1549.00 protein_coding genic  absent
#> 4 WBGene00000004   13.8730      1 1519.92 protein_coding genic present
#> 5 WBGene00000005    0.0000      0 1487.00 protein_coding genic  absent

cutoff_info_file_path : path to tsv summary of the analyze containing the proportion of gene, protein coding and intergenic defined as expressed. It also contains the library ID and the present/absent TPM threshold

read.table(calls_output$cutoff_info_file_path)
#>                              V1                V2
#> 1                     libraryId  SRX099901_subset
#> 2                     cutoffTPM       4.64724e-06
#> 3        proportionGenicPresent  25.3210509597495
#> 4            numberGenicPresent              5580
#> 5                   numberGenic             22037
#> 6       proportionCodingPresent  27.1433462121583
#> 7           numberPresentCoding              5550
#> 8                  numberCoding             20447
#> 9   proportionIntergenicPresent 0.490998363338789
#> 10      numberIntergenicPresent                21
#> 11             numberIntergenic              4277
#> 12 ratioIntergenicCodingPresent              0.05

abundance_tsv : path to tsv kallisto quant output file

head.DataTable(x = read.table(calls_output$abundance_tsv, header = TRUE), n = 5)
#>    target_id length eff_length est_counts     tpm
#> 1 Y110A7A.10   1787    1556.20          2 27.0993
#> 2    F27C8.1   1940    1761.00          0  0.0000
#> 3    F07C3.7   1728    1549.00          0  0.0000
#> 4    F52H2.2   1739    1519.92          1 13.8730
#> 5 T13A10.10a   1734    1555.00          0  0.0000
calls_output$TPM_distribution_path
#> [1] "/tmp/RtmpHIDSuI/Rinst4fc9444d5903/BgeeCall/extdata/intergenic_0.1/all_results/SRX099901_subset/gene_TPM_genic_intergenic+cutoff.pdf"
calls_output$abundance_tsv
#> [1] "/tmp/RtmpHIDSuI/Rinst4fc9444d5903/BgeeCall/extdata/intergenic_0.1/all_results/SRX099901_subset/abundance.tsv"

TPM_distribution_path : path to plot in pdf reprensting density distribution of TPM values for all sequences, protein coding sequences, and intergenic sequences. The grey line corresponds to TPM threshold used to generate present/absent calls.

Biobase::openPDF(calls_output$TPM_distribution_path)

Generate present/absent calls for more than one RNA-Seq library

The function run_from_object() is perfect to generate calls for one library. You will potentialy be also interested to run more than one call generation at the same time. It is possible to do that by using the run_from_file() or the run_from_dataframe() functions. With these functions you will be able to run calls generation for different:

RNA-Seq libraries
transcriptome
gene annotation
runs from the same RNA-Seq library
species as long as they are part of Bgee

A template of the file usable as input of the function run_from_file() is available at the root directory of the package with the name userMetadataTemplate.tsv. In this template each column correspond to one parameter used to generate gene expression calls. Each line will correspond to one expression calls generation analyze. It is not mandatory to add a value to the run_ids column except if you want to generate expression calls for a subset of the runs of one RNA-Seq library as described in Generate calls for a subset of RNA-Seq runs Once the file has been fill in expression calls can be generated with :

generate_calls_workflow(userFile = "path_to_your_file.tsv")

list species available on Bgee

BgeeCall allows to generate gene expression call for any RNA-Seq libraries as long as the species is present in Bgee. To see all species in the last version of Bgee run :

list_bgee_species()
#> 
#> 
#> Querying Bgee to get release information...
#> 
#> Building URL to query species in Bgee release 14...
#> 
#> Submitting URL to Bgee webservice... (https://r.bgee.org/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.10.0)
#> 
#> Query to Bgee webservice successful!
#>       ID           GENUS     SPECIES_NAME         COMMON_NAME AFFYMETRIX
#> 1   6239  Caenorhabditis          elegans            nematode       TRUE
#> 2   7217      Drosophila        ananassae                          FALSE
#> 3   7227      Drosophila     melanogaster           fruit fly       TRUE
#> 4   7230      Drosophila       mojavensis                          FALSE
#> 5   7237      Drosophila    pseudoobscura                          FALSE
#> 6   7240      Drosophila         simulans                          FALSE
#> 7   7244      Drosophila          virilis                          FALSE
#> 8   7245      Drosophila           yakuba                          FALSE
#> 9   7955           Danio            rerio           zebrafish       TRUE
#> 10  8364         Xenopus       tropicalis western clawed frog      FALSE
#> 11  9031          Gallus           gallus             chicken      FALSE
#> 12  9258 Ornithorhynchus         anatinus            platypus      FALSE
#> 13  9365       Erinaceus        europaeus            hedgehog      FALSE
#> 14  9544          Macaca          mulatta             macaque       TRUE
#> 15  9593         Gorilla          gorilla             gorilla      FALSE
#> 16  9597             Pan         paniscus              bonobo      FALSE
#> 17  9598             Pan      troglodytes          chimpanzee      FALSE
#> 18  9606            Homo          sapiens               human       TRUE
#> 19  9615           Canis lupus familiaris                 dog      FALSE
#> 20  9685           Felis            catus                 cat      FALSE
#> 21  9796           Equus         caballus               horse      FALSE
#> 22  9823             Sus           scrofa                 pig      FALSE
#> 23  9913             Bos           taurus              cattle      FALSE
#> 24  9986     Oryctolagus        cuniculus              rabbit      FALSE
#> 25 10090             Mus         musculus               mouse       TRUE
#> 26 10116          Rattus       norvegicus                 rat       TRUE
#> 27 10141           Cavia        porcellus          guinea pig      FALSE
#> 28 13616     Monodelphis        domestica             opossum      FALSE
#> 29 28377          Anolis     carolinensis         green anole      FALSE
#>      EST IN_SITU RNA_SEQ
#> 1  FALSE    TRUE    TRUE
#> 2  FALSE   FALSE    TRUE
#> 3   TRUE    TRUE    TRUE
#> 4  FALSE   FALSE    TRUE
#> 5  FALSE   FALSE    TRUE
#> 6  FALSE   FALSE    TRUE
#> 7  FALSE   FALSE    TRUE
#> 8  FALSE   FALSE    TRUE
#> 9   TRUE    TRUE    TRUE
#> 10  TRUE    TRUE    TRUE
#> 11 FALSE   FALSE    TRUE
#> 12 FALSE   FALSE    TRUE
#> 13 FALSE   FALSE    TRUE
#> 14 FALSE   FALSE    TRUE
#> 15 FALSE   FALSE    TRUE
#> 16 FALSE   FALSE    TRUE
#> 17 FALSE   FALSE    TRUE
#> 18  TRUE   FALSE    TRUE
#> 19 FALSE   FALSE    TRUE
#> 20 FALSE   FALSE    TRUE
#> 21 FALSE   FALSE    TRUE
#> 22 FALSE   FALSE    TRUE
#> 23 FALSE   FALSE    TRUE
#> 24 FALSE   FALSE    TRUE
#> 25  TRUE    TRUE    TRUE
#> 26 FALSE   FALSE    TRUE
#> 27 FALSE   FALSE    TRUE
#> 28 FALSE   FALSE    TRUE
#> 29 FALSE   FALSE    TRUE

list reference intergenic releases

Different releases of Bgee reference intergenic sequences are available. It is possible to list all these releases :

list_intergenic_release()
#> Downloading release information of Bgee intergenic regions...
#>   release releaseDate                             FTPURL
#> 1     0.1  2018-12-21 ftp://ftp.bgee.org/intergenic/0.1/
#> 2     0.2  2019-02-07 ftp://ftp.bgee.org/intergenic/0.2/
#>                                                    referenceIntergenicFastaURL
#> 1 ftp://ftp.bgee.org/intergenic/0.1/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 2 ftp://ftp.bgee.org/intergenic/0.2/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#>   minimumVersionBgeeCall
#> 1                  0.9.9
#> 2                  0.9.9
#>                                                                                                                             description
#> 1                                                                                          intergenic regions used to generate Bgee 14.
#> 2  cleaned intergenic sequences based on release 0.1 (remove blocks of Ns longer than 100 and sequences containing more than 5% of Ns).
#>                                                    messageToUsers
#> 1                                                                
#> 2 be careful, this intergenic release has not been tested by Bgee

It is then possible to choose one specific release to create a BgeeMetadata object.

bgee <- new("BgeeMetadata", intergenic_release = "0.1")
#> 
#> Querying Bgee to get intergenic release information...

By default the intergenic used when a BgeeMetadataobject is created is the last created one.

Generate present/absent calls at transcript level (beta version)

kallisto generates TPMs at the transcript level. In the Bgee pipeline we summarize this expression at the gene level to calculate our present/absent calls. In BgeeCall it is now possible to generate present/absent calls at the transcript level. Be careful when using this feature as it has not been tested for the moment. To generate such calls you only have to create one object of the class KallistoMetadata and edit the value of one attribute

kallisto <- new("KallistoMetadata", txOut = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

Tune how to use kallisto

Download or reuse your own kallisto

By default BgeeCall suppose that kallisto is installed. If kallisto is not installed on your computer you can either :

let BgeeCall automatically download the version 0.45 of kallisto. BgeeCall will use it to quantify abundance of transcripts. It will only be used by this package and will have no impact on your potential already existing version of kallisto.

kallisto <- new("KallistoMetadata", install_kallisto = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

You can download and install kallisto by following the instructions on the official website : http://pachterlab.github.io/kallisto/download

Edit kallisto quant attributes

By default kallisto is run with the same parameters that we use in the RNA-Seq Bgee pipeline:

single end : “-t 1 –single -l 180 -s 30 –bias”
paired end : “-t 1 –bias”

It is possible to modify them and use your favourite kallisto parameters

kallisto <- new("KallistoMetadata", single_end_parameters = "-t 3 --single -l 150 -s 30", pair_end_parameters = "-t 2 -b --seed 36")
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

Choose between two kmer size

By default 2 indexes with 2 different kmer sizes can be used by BgeeCall The default kmer size of kallisto (31) is used for libraries with reads length equal or larger than 50 nt. A kmer size of 21 is used for libraries with reads length smaller than 50 nt. We decided not to allow to tune kmers size because the generation of the index is time consuming and index generation takes even more time with small kmers size (< 21 nt). However it is possible to modify the threshold of read length allowing to choose between default and small kmer size.

# libraries with reads smaller than 70nt will use the index with kmer size = 21
kallisto <- new("KallistoMetadata", read_size_kmer_threshold = 70)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

Generate calls for a subset of RNA-Seq runs

By default gene expression calls are generated using all runs of the RNA-Seq library. It is possible to select only a subset of these runs.

# RNA-Seq run SRR350955_subsetof from the RNA-Seq library will be used to generate the calls
user_BgeeCall <- setRunIds(user_BgeeCall, c("SRR350955_subset"))
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)

When run IDs are selected, the name output directory combine the library ID and all selected run IDs. In our example the expression calls will be stored in the directory SRX099901_SRR350955_subset.

Modify present/absent threshold

By default the threshold of present/absent is calculated with the formula :

proportion of ref intergenic present / proportion of protein coding present = 0.05

This 0.05 corresponds to the ratio used in the Bgee pipeline. However it is possible to edit this value. Be careful when editing this value as it has a big impact on your present absent.

kallisto <- new("KallistoMetadata", cutoff = 0.1)

Generate calls with a simple arborescence of directories

By default the arborescence of directories created by BgeeCall is complex. This complexity allows to generate gene expression calls for the same RNA-Seq library using different transcriptomes or gene annotations. The UserMetadata class has an attribute allowing to simplify this arborescence and store the result of all libraries in the same directory.

user_BgeeCall <- setRunIds(user_BgeeCall, "")
user_BgeeCall <- setSimpleArborescence(user_BgeeCall, TRUE)
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)

Be careful when you use this option. If you run different analysis for the same RNA-Seq library the results will be overwritten.

#Session Info

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] rtracklayer_1.44.4   GenomicRanges_1.36.1 GenomeInfoDb_1.20.0 
#> [4] IRanges_2.18.3       S4Vectors_0.22.1     BiocGenerics_0.30.0 
#> [7] BgeeCall_1.0.1      
#> 
#> loaded via a namespace (and not attached):
#>  [1] Biobase_2.44.0                httr_1.4.1                   
#>  [3] tidyr_1.0.0                   bit64_0.9-7                  
#>  [5] AnnotationHub_2.16.1          topGO_2.36.0                 
#>  [7] shiny_1.4.0                   assertthat_0.2.1             
#>  [9] interactiveDisplayBase_1.22.0 BiocManager_1.30.7           
#> [11] BiocFileCache_1.8.0           blob_1.2.0                   
#> [13] BSgenome_1.52.0               GenomeInfoDbData_1.2.1       
#> [15] Rsamtools_2.0.3               yaml_2.2.0                   
#> [17] progress_1.2.2                pillar_1.4.2                 
#> [19] RSQLite_2.1.2                 backports_1.1.5              
#> [21] lattice_0.20-38               glue_1.3.1                   
#> [23] digest_0.6.21                 promises_1.1.0               
#> [25] XVector_0.24.0                htmltools_0.4.0              
#> [27] httpuv_1.5.2                  Matrix_1.2-17                
#> [29] XML_3.98-1.20                 pkgconfig_2.0.3              
#> [31] SparseM_1.77                  biomaRt_2.40.5               
#> [33] zlibbioc_1.30.0               GO.db_3.8.2                  
#> [35] purrr_0.3.2                   xtable_1.8-4                 
#> [37] later_1.0.0                   BiocParallel_1.18.1          
#> [39] tibble_2.1.3                  SummarizedExperiment_1.14.1  
#> [41] GenomicFeatures_1.36.4        magrittr_1.5                 
#> [43] crayon_1.3.4                  mime_0.7                     
#> [45] memoise_1.1.0                 evaluate_0.14                
#> [47] graph_1.62.0                  data.table_1.12.4            
#> [49] tools_3.6.1                   prettyunits_1.0.2            
#> [51] hms_0.5.1                     lifecycle_0.1.0              
#> [53] matrixStats_0.55.0            stringr_1.4.0                
#> [55] Rhdf5lib_1.6.2                DelayedArray_0.10.0          
#> [57] AnnotationDbi_1.46.1          Biostrings_2.52.0            
#> [59] compiler_3.6.1                rlang_0.4.0                  
#> [61] rhdf5_2.28.1                  grid_3.6.1                   
#> [63] RCurl_1.95-4.12               tximport_1.12.3              
#> [65] rappdirs_0.3.1                bitops_1.0-6                 
#> [67] rmarkdown_1.16                DBI_1.0.0                    
#> [69] curl_4.2                      R6_2.4.0                     
#> [71] GenomicAlignments_1.20.1      knitr_1.25                   
#> [73] dplyr_0.8.3                   fastmap_1.0.1                
#> [75] bit_1.1-14                    zeallot_0.1.0                
#> [77] readr_1.3.1                   stringi_1.4.3                
#> [79] Rcpp_1.0.2                    BgeeDB_2.10.0                
#> [81] vctrs_0.2.0                   dbplyr_1.4.2                 
#> [83] tidyselect_0.2.5              xfun_0.10