Contents

1 Introduction

Originally, this package was written when the kallisto | bustools concept was still experimental, to test a new and fast way to generate the gene count matrix from fastq files for scRNA-seq. In the past year, kallisto | bustools has matured. Now there’s a wrapper kb-python that can download a prebuilt kallisto index for human and mice and call kallisto bus and bustools to get the gene count matrix. So largely, the old way of calling kallisto bus and bustools, and some functionalities of BUSpaRse, such as getting transcript to gene mapping, are obsolete.

So now the focus of BUSpaRse has shifted to finer control of the transcripts that go into the transcriptome and more options. Now all tr2g_* functions (except tr2g_ensembl) can filter transcripts for gene and transcript biotypes, only keep standard chromosomes (so no scaffolds and haplotypes), and extract the filtered transcripts from the transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files from Ensembl and RefSeq, TxDb, and EnsDb can all be used here.

library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:Matrix':
#> 
#>     expand, unname
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: rtracklayer
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(EnsDb.Hsapiens.v86)
#> Loading required package: ensembldb
#> Loading required package: AnnotationFilter
#> 
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#> 
#>     filter

2 Downloading a transcriptome

The transcriptome can be downloaded from a specified version of Ensembl and filtered for biotypes and standard chromosomes, not only for the vertebrate database (www.ensembl.org and its mirrors), but also other Ensembl sites for plants, fungi, protists, and metazoa. The gene_biotype_use = "cellranger" means that the same gene biotypes Cell Ranger uses for its reference package are used here. By default, only standard chromosomes are kept. The dl_transcriptome function not only downloads the transcriptome and filters it, it also output the tr2g.tsv file of all transcripts in the filtered transcriptome, without column names, so can be directly used for bustools.

Wonder which biotypes are available? The lists of all gene and transcript biotypes from Ensembl are now provided in this package, and can be queried by data("ensembl_gene_biotypes") and data("ensembl_tx_biotypes").

Resources for common invertebrate model organisms such as Drosophila melanogaster and C. elegans are actually available on the vertebrate site (www.ensembl.org).

# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly", 
                 gene_biotype_use = "cellranger", verbose = FALSE)
#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"

#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("fly")
#> [1] "Drosophila_melanogaster.BDGP6.32.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                       
#> [3] "tx_filtered.fa"

The first file is the original fasta file. The second is the tr2g file without column names. The third is the filtered fasta file.

For C. elegans, from an archived version of Ensembl

dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
                 gene_biotype_use = "cellranger", ensembl_version = 97)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("worm")
#> [1] "Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                      
#> [3] "tx_filtered.fa"

For Saccharomyces cerevisiae. Note that the versioning of Ensembl for the plants, fungi, and etc. sites, that are actually www.ensemblgenomes.org, is different from that of the vertebrate site.

dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast", 
                 type = "fungus", gene_biotype_use = "cellranger", 
                 verbose = FALSE)
#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"

#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("yeast")
#> [1] "Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                       
#> [3] "tx_filtered.fa"

3 Obtaining transcript to gene information

3.1 From Ensembl

The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this will use the most recent version of Ensembl, but older versions can also be used. By default, Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of transcriptome used for kallisto index.

# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99, 
                        other_attrs = "description", 
                        gene_biotype_use = "cellranger")
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g_mm)
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
#> Version is only available to vertebrates.
#> Querying biomart for transcript and gene IDs of Arabidopsis thaliana
#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"
#> File /tmp/RtmpnRv5pA/Rbuild2b5c5c7582b9f9/BUSpaRse/vignettes/tr2g.tsv already exists.
head(tr2g_at)

3.2 From FASTA file

We need a FASTA file for the transcriptome used to build kallisto index. Transcriptome FASTA files from Ensembl contains gene annotation in the sequence name of each transcript. Transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files or FASTA files with sequence names formatted like in Ensembl are accepted.

By default, the tr2g.tsv file and filtered fasta file (if filtering for biotypes and chromosomes) are written to disk, but these can be turned off so only the tr2g data frame is returned into the R session.

# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE, save_filtered = FALSE)
head(tr2g_fa)

3.3 From GTF and GFF3 files

If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file. Now tr2g_gtf and tr2g_gff3 can extract transcriptome from a genome that is either a BSgenome or a DNAStringSet.

# Subset of a reral GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
                    gene_biotype_use = "cellranger",
                    out_path = "gtf")
#> 635 sequences in the genome are absent from the annotation.
head(tr2g_tg)

A new GTF or GFF3 file after filtering biotypes and chromosomes is also written, and this can be turned off by setting save_filtered_gtf = FALSE or save_filtered_gff = FALSE. The transcriptome, with biotypes filtered and only standard chromosomes, is in transcriptome.fa. Use compress_fa = TRUE to gzip it.

list.files("gtf")
#> [1] "gtf_filtered.gtf" "tr2g.tsv"         "transcriptome.fa"

3.4 From TxDb

TxDb is a class for storing transcript annotations from the Bioconductor package GenomicFeatures. Unfortunately, TxDb.Hsapiens.UCSC.hg38.knownGene does not have biotype information or gene symbols.

tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, get_transcriptome = FALSE,
                     write_tr2g = FALSE)
#> 'select()' returned 1:1 mapping between keys and columns
head(tr2g_hs)

3.5 From EnsDb

EnsDb is a class for Ensembl gene annotations, from the Bioconductor package ensembldb. Ensembl annotations as EnsDb are available on AnnotationHub (since version 87), and some older versions are stand alone packages (e.g. EnsDb.Hsapiens.v86).

tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, 
                        write_tr2g = FALSE, gene_biotype_use = "cellranger",
                        use_gene_version = FALSE, use_transcript_version = FALSE)
head(tr2g_hs86)

3.6 Deprecation

There used to be sections about sort_tr2g and save_tr2g_bustools, but these functions have been superseded by the new version of tr2g functions and dl_transcriptome, which sort the transcriptome after extracting it so the tr2g and the transcriptome are in the same order. The new version of tr2g functions and dl_transcriptome also by default writes the tr2g.tsv without column names with the first column as transcript and the second as gene to disk.

sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] EnsDb.Hsapiens.v86_2.99.0               
#>  [2] ensembldb_2.20.0                        
#>  [3] AnnotationFilter_1.20.0                 
#>  [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.15.0
#>  [5] GenomicFeatures_1.48.0                  
#>  [6] AnnotationDbi_1.58.0                    
#>  [7] Biobase_2.56.0                          
#>  [8] BSgenome.Hsapiens.UCSC.hg38_1.4.4       
#>  [9] BSgenome_1.64.0                         
#> [10] rtracklayer_1.56.0                      
#> [11] Biostrings_2.64.0                       
#> [12] XVector_0.36.0                          
#> [13] GenomicRanges_1.48.0                    
#> [14] GenomeInfoDb_1.32.0                     
#> [15] IRanges_2.30.0                          
#> [16] S4Vectors_0.34.0                        
#> [17] BiocGenerics_0.42.0                     
#> [18] ggplot2_3.3.5                           
#> [19] zeallot_0.1.0                           
#> [20] Matrix_1.4-1                            
#> [21] BUSpaRse_1.10.0                         
#> [22] TENxBUSData_1.9.0                       
#> [23] BiocStyle_2.24.0                        
#> 
#> loaded via a namespace (and not attached):
#>  [1] colorspace_2.0-3              rjson_0.2.21                 
#>  [3] ellipsis_0.3.2                farver_2.1.0                 
#>  [5] bit64_4.0.5                   interactiveDisplayBase_1.34.0
#>  [7] fansi_1.0.3                   xml2_1.3.3                   
#>  [9] cachem_1.0.6                  knitr_1.38                   
#> [11] jsonlite_1.8.0                Rsamtools_2.12.0             
#> [13] dbplyr_2.1.1                  png_0.1-7                    
#> [15] shiny_1.7.1                   BiocManager_1.30.17          
#> [17] compiler_4.2.0                httr_1.4.2                   
#> [19] assertthat_0.2.1              fastmap_1.1.0                
#> [21] lazyeval_0.2.2                cli_3.3.0                    
#> [23] later_1.3.0                   htmltools_0.5.2              
#> [25] prettyunits_1.1.1             tools_4.2.0                  
#> [27] gtable_0.3.0                  glue_1.6.2                   
#> [29] GenomeInfoDbData_1.2.8        dplyr_1.0.8                  
#> [31] rappdirs_0.3.3                Rcpp_1.0.8.3                 
#> [33] jquerylib_0.1.4               vctrs_0.4.1                  
#> [35] ExperimentHub_2.4.0           xfun_0.30                    
#> [37] stringr_1.4.0                 plyranges_1.16.0             
#> [39] mime_0.12                     lifecycle_1.0.1              
#> [41] restfulr_0.0.13               XML_3.99-0.9                 
#> [43] AnnotationHub_3.4.0           zlibbioc_1.42.0              
#> [45] scales_1.2.0                  hms_1.1.1                    
#> [47] promises_1.2.0.1              MatrixGenerics_1.8.0         
#> [49] ProtGenerics_1.28.0           parallel_4.2.0               
#> [51] SummarizedExperiment_1.26.0   yaml_2.3.5                   
#> [53] curl_4.3.2                    memoise_2.0.1                
#> [55] sass_0.4.1                    biomaRt_2.52.0               
#> [57] stringi_1.7.6                 RSQLite_2.2.12               
#> [59] BiocVersion_3.15.2            highr_0.9                    
#> [61] BiocIO_1.6.0                  filelock_1.0.2               
#> [63] BiocParallel_1.30.0           rlang_1.0.2                  
#> [65] pkgconfig_2.0.3               matrixStats_0.62.0           
#> [67] bitops_1.0-7                  evaluate_0.15                
#> [69] lattice_0.20-45               purrr_0.3.4                  
#> [71] GenomicAlignments_1.32.0      bit_4.0.4                    
#> [73] tidyselect_1.1.2              magrittr_2.0.3               
#> [75] bookdown_0.26                 R6_2.5.1                     
#> [77] magick_2.7.3                  generics_0.1.2               
#> [79] DelayedArray_0.22.0           DBI_1.1.2                    
#> [81] pillar_1.7.0                  withr_2.5.0                  
#> [83] KEGGREST_1.36.0               RCurl_1.98-1.6               
#> [85] tibble_3.1.6                  crayon_1.5.1                 
#> [87] utf8_1.2.2                    BiocFileCache_2.4.0          
#> [89] rmarkdown_2.14                progress_1.2.2               
#> [91] grid_4.2.0                    blob_1.2.3                   
#> [93] digest_0.6.29                 xtable_1.8-4                 
#> [95] tidyr_1.2.0                   httpuv_1.6.5                 
#> [97] munsell_0.5.0                 bslib_0.3.1