txcutr 1.6.0
Some RNA-sequencing library preparations generate sequencing reads from specific locations in transcripts. For example, 3’-end tag methods, which include methods like Lexogen’s Quant-Seq or 10X’s Chromium Single Cell 3’, generate reads from the 3’ ends of transcripts. For applications interested in quantifying alternative cleavage site usage, using a truncated form of transcriptome annotation can help simplify downstream analysis and, in some pipelines, provide for more accurate estimates. This package implements methods to simplify the generation of such truncated annotations.
txcutr
R
is an open-source statistical environment which can be easily modified to enhance its functionality via packages. txcutr is a R
package available via the Bioconductor repository for packages. R
can be installed on any operating system from CRAN after which you can install txcutr by using the following commands in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("txcutr")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
txcutr is based on many other packages and in particular on those that have implemented the infrastructure needed for dealing with transcriptome annotations. That is, packages like GenomicRanges and GenomicFeatures. While we anticipate most use cases for txcutr
to not require manipulating GRanges
or TxDb
objects directly, it would be beneficial to be familiar with the vignettes provided in these packages.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R
and Bioconductor
have a steep learning curve so it is critical to learn where to ask for help. We would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the txcutr
tag and check any previous posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
txcutr
We hope that txcutr will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation("txcutr")
#> To cite package 'txcutr' in publications use:
#>
#> Fansler M (2023). _txcutr: Transcriptome CUTteR_.
#> doi:10.18129/B9.bioc.txcutr
#> <https://doi.org/10.18129/B9.bioc.txcutr>, R package version 1.6.0,
#> <https://bioconductor.org/packages/txcutr>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {txcutr: Transcriptome CUTteR},
#> author = {Mervin Fansler},
#> year = {2023},
#> note = {R package version 1.6.0},
#> url = {https://bioconductor.org/packages/txcutr},
#> doi = {10.18129/B9.bioc.txcutr},
#> }
txcutr
Many transcriptome annotations, especially for model organisms, are already directly available as TxDb
annotation packages. One can browse a list of available TxDb
objects on the Bioconductor website.
For demonstration purposes, we will work with the SGD genes for the yeast Saccharomyces cerevisiae, and restrict the processing to chrI.
library(GenomicFeatures)
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
## constrain to "chrI"
seqlevels(txdb) <- "chrI"
## view the lengths of first ten transcripts
transcriptLengths(txdb)[1:10,]
#> tx_id tx_name gene_id nexon tx_len
#> 1 1 YAL069W YAL069W 1 315
#> 2 2 YAL068W-A YAL069W 1 255
#> 3 3 YAL067W-A YAL067W-A 1 228
#> 4 4 YAL066W YAL066W 1 309
#> 5 5 YAL064W-B YAL064W-B 1 381
#> 6 6 YAL064W YAL064W 1 285
#> 7 7 YAL062W YAL062W 1 1374
#> 8 8 YAL061W YAL061W 1 1254
#> 9 9 YAL060W YAL060W 1 1149
#> 10 10 YAL059W YAL059W 1 639
As seen from the above output, the transcript lengths are variable, but in a 3’-end tag-based RNA-sequencing library, we expect reads to only come from a narrow region at the 3’ end. We can use the methods in the txcutr package to restrict the annotation to that region for each transcript.
Let’s use the truncateTxome
method to restrict transcripts to their last 500 nucleotides (nts).
library(txcutr)
txdb_w500 <- truncateTxome(txdb, maxTxLength=500)
#> 'select()' returned 1:1 mapping between keys and columns
#> Truncating transcripts...
#> Done.
#> Checking for duplicate transcripts...
#> Removed 0 duplicates.
#> Creating exon ranges...
#> Done.
#> Creating tx ranges...
#> Done.
#> Creating gene ranges...
#> Done.
transcriptLengths(txdb_w500)[1:10,]
#> tx_id tx_name gene_id nexon tx_len
#> 1 1 YAL069W YAL069W 1 315
#> 2 2 YAL068W-A YAL069W 1 255
#> 3 3 YAL067W-A YAL067W-A 1 228
#> 4 4 YAL066W YAL066W 1 309
#> 5 5 YAL064W-B YAL064W-B 1 381
#> 6 6 YAL064W YAL064W 1 285
#> 7 7 YAL062W YAL062W 1 500
#> 8 8 YAL061W YAL061W 1 500
#> 9 9 YAL060W YAL060W 1 500
#> 10 10 YAL059W YAL059W 1 500
Observe how the all transcripts that were previously longer than 500 nts are now exactly 500 nts. Also, any transcripts that were already short enough remain unchanged.
We can now export this new transcriptome as a GTF file that could be used in an alignment pipeline or genome viewer. Note that ending the file name with .gz will automatically signal to that the file should be exported with gzip compression.
gtf_file <- tempfile("sacCer3_chrI.sgd.txcutr_w500", fileext=".gtf.gz")
exportGTF(txdb_w500, file=gtf_file)
If we need to prepare an index for alignment or pseudo-alignment, this requires exporting the corresponding sequences of the truncated transcripts. To do this, will need to load a BSgenome
object for sacCer3.
library(BSgenome.Scerevisiae.UCSC.sacCer3)
sacCer3 <- BSgenome.Scerevisiae.UCSC.sacCer3
fasta_file <- tempfile("sacCer3_chrI.sgd.txcutr_w500", fileext=".fa.gz")
exportFASTA(txdb_w500, genome=sacCer3, file=fasta_file)
Another file that might be needed by some quantification tools is a merge table. That is, a file that represents a map from the input transcripts to either output transcripts or genes. In some pipelines, one may wish to merge any transcripts that have any overlap. In others, there might be some minimum amount of distance between 3’ ends below which it may be uninteresting to discern between. For this, txcutr
provides a generateMergeTable
method, together with a minDistance
argument.
To create a merge for any overlaps whatsoever, one would set the minDistance
to be identical to the maximum transcript length.
df_merge <- generateMergeTable(txdb_w500, minDistance=500)
head(df_merge, 10)
#> tx_in tx_out gene_out
#> 1 YAL001C YAL001C YAL001C
#> 2 YAL002W YAL002W YAL002W
#> 3 YAL003W YAL003W YAL003W
#> 4 YAL004W YAL004W YAL004W
#> 5 YAL005C YAL005C YAL005C
#> 6 YAL007C YAL007C YAL007C
#> 7 YAL008W YAL008W YAL008W
#> 8 YAL009W YAL009W YAL009W
#> 9 YAL010C YAL010C YAL010C
#> 10 YAL011W YAL011W YAL011W
In this particular case, these first transcripts each map to themselves. This should not be a surprise, since very few genes in the input annotation had alternative transcripts. However, we can filter to highlight any transcripts that would get merged.
df_merge[df_merge$tx_in != df_merge$tx_out,]
#> tx_in tx_out gene_out
#> 28 YAL026C YAL026C-A YAL026C
#> 84 YAL069W YAL068W-A YAL069W
#> 118 YAR073W YAR075W YAR073W
Algorithmically, txcutr
will give preference to more distal transcripts in the merge assignment. For example, looking at transcripts from gene YAL026C:
transcripts(txdb_w500, columns=c("tx_name", "gene_id"),
filter=list(gene_id=c("YAL026C")))
#> GRanges object with 2 ranges and 2 metadata columns:
#> seqnames ranges strand | tx_name gene_id
#> <Rle> <IRanges> <Rle> | <character> <CharacterList>
#> [1] chrI 95386-95823 - | YAL026C-A YAL026C
#> [2] chrI 95630-96129 - | YAL026C YAL026C
#> -------
#> seqinfo: 1 sequence from sacCer3 genome
we see that they are on the negative strand and YAL026C-A is more distal in this orientation. This is consistent with YAL026C being merged into YAL026C-A if we are merging all overlapping transcripts.
Note that one can also directly export the merge table using the exportMergeTable
function.
The GenomicFeatures package provides a number of means to create custom TxDb
objects that can then be used by txcutr
. Alternative workflows might include starting from a GTF/GFF3 transcriptome annotation and using the GenomicRanges::makeTxDbFromGFF
method, or using the GenomicRanges::makeTxDbFromBiomart
method to create the object from Biomart resources.
The truncation step can be computationally intensive, especially for species whose annotations include many alternative transcript isoforms, such as mouse or human. The truncateTxome
method makes use of BiocParallel::bplapply
when applicable, and will respect the BiocParallel::register()
setting, unless an explicit BPPARAM
argument is supplied.
We encourage use of a parallel parameter option, such as MulticoreParam
, but it should be noted that with mammalian transcriptomes this results in a maximum RAM usage between 3-4GB per core. Please ensure adequate memory per core is available.
This tool developed out of a pipeline for quantifying 3’ UTR isoform expression from scRNA-seq. In such an application, it can be helpful to pre-filter transcripts that may be included in an annotation, but do not have validated 3’ ends. Of particular note are the GENCODE annotations: we recommend pre-filtering GENCODE GTF/GFF files to remove any entries with the tag mRNA_end_NF.
The txcutr package (Fansler, 2023) was made possible thanks to:
This package was developed using biocthis.
Code for creating the vignette
## Create the vignette
library("rmarkdown")
system.time(render("intro.Rmd", "BiocStyle::html_document"))
## Extract the R code
library("knitr")
knit("intro.Rmd", tangle = TRUE)
Date the vignette was generated.
#> [1] "2023-04-25 19:09:32 EDT"
Wallclock time spent generating the vignette.
#> Time difference of 34.322 secs
R
session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 RC (2023-04-13 r84269)
#> os Ubuntu 22.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-04-25
#> pandoc 2.7.3 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> AnnotationDbi * 1.62.0 2023-04-25 [2] Bioconductor
#> backports 1.4.1 2021-12-13 [2] CRAN (R 4.3.0)
#> bibtex 0.5.1 2023-01-26 [2] CRAN (R 4.3.0)
#> Biobase * 2.60.0 2023-04-25 [2] Bioconductor
#> BiocFileCache 2.8.0 2023-04-25 [2] Bioconductor
#> BiocGenerics * 0.46.0 2023-04-25 [2] Bioconductor
#> BiocIO 1.10.0 2023-04-25 [2] Bioconductor
#> BiocManager 1.30.20 2023-02-24 [2] CRAN (R 4.3.0)
#> BiocParallel 1.34.0 2023-04-25 [2] Bioconductor
#> BiocStyle * 2.28.0 2023-04-25 [2] Bioconductor
#> biomaRt 2.56.0 2023-04-25 [2] Bioconductor
#> Biostrings 2.68.0 2023-04-25 [2] Bioconductor
#> bit 4.0.5 2022-11-15 [2] CRAN (R 4.3.0)
#> bit64 4.0.5 2020-08-30 [2] CRAN (R 4.3.0)
#> bitops 1.0-7 2021-04-24 [2] CRAN (R 4.3.0)
#> blob 1.2.4 2023-03-17 [2] CRAN (R 4.3.0)
#> bookdown 0.33 2023-03-06 [2] CRAN (R 4.3.0)
#> bslib 0.4.2 2022-12-16 [2] CRAN (R 4.3.0)
#> cachem 1.0.7 2023-02-24 [2] CRAN (R 4.3.0)
#> cli 3.6.1 2023-03-23 [2] CRAN (R 4.3.0)
#> codetools 0.2-19 2023-02-01 [3] CRAN (R 4.3.0)
#> crayon 1.5.2 2022-09-29 [2] CRAN (R 4.3.0)
#> curl 5.0.0 2023-01-12 [2] CRAN (R 4.3.0)
#> DBI 1.1.3 2022-06-18 [2] CRAN (R 4.3.0)
#> dbplyr 2.3.2 2023-03-21 [2] CRAN (R 4.3.0)
#> DelayedArray 0.26.0 2023-04-25 [2] Bioconductor
#> digest 0.6.31 2022-12-11 [2] CRAN (R 4.3.0)
#> dplyr 1.1.2 2023-04-20 [2] CRAN (R 4.3.0)
#> evaluate 0.20 2023-01-17 [2] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [2] CRAN (R 4.3.0)
#> fastmap 1.1.1 2023-02-24 [2] CRAN (R 4.3.0)
#> filelock 1.0.2 2018-10-05 [2] CRAN (R 4.3.0)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.3.0)
#> GenomeInfoDb * 1.36.0 2023-04-25 [2] Bioconductor
#> GenomeInfoDbData 1.2.10 2023-04-17 [2] Bioconductor
#> GenomicAlignments 1.36.0 2023-04-25 [2] Bioconductor
#> GenomicFeatures * 1.52.0 2023-04-25 [2] Bioconductor
#> GenomicRanges * 1.52.0 2023-04-25 [2] Bioconductor
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.3.0)
#> hms 1.1.3 2023-03-21 [2] CRAN (R 4.3.0)
#> htmltools 0.5.5 2023-03-23 [2] CRAN (R 4.3.0)
#> httr 1.4.5 2023-02-24 [2] CRAN (R 4.3.0)
#> IRanges * 2.34.0 2023-04-25 [2] Bioconductor
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.3.0)
#> jsonlite 1.8.4 2022-12-06 [2] CRAN (R 4.3.0)
#> KEGGREST 1.40.0 2023-04-25 [2] Bioconductor
#> knitr 1.42 2023-01-25 [2] CRAN (R 4.3.0)
#> lattice 0.21-8 2023-04-05 [3] CRAN (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [2] CRAN (R 4.3.0)
#> lubridate 1.9.2 2023-02-10 [2] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.3.0)
#> Matrix 1.5-4 2023-04-04 [3] CRAN (R 4.3.0)
#> MatrixGenerics 1.12.0 2023-04-25 [2] Bioconductor
#> matrixStats 0.63.0 2022-11-18 [2] CRAN (R 4.3.0)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [2] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.3.0)
#> plyr 1.8.8 2022-11-11 [2] CRAN (R 4.3.0)
#> png 0.1-8 2022-11-29 [2] CRAN (R 4.3.0)
#> prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.3.0)
#> progress 1.2.2 2019-05-16 [2] CRAN (R 4.3.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.3.0)
#> rappdirs 0.3.3 2021-01-31 [2] CRAN (R 4.3.0)
#> Rcpp 1.0.10 2023-01-22 [2] CRAN (R 4.3.0)
#> RCurl 1.98-1.12 2023-03-27 [2] CRAN (R 4.3.0)
#> RefManageR * 1.4.0 2022-09-30 [2] CRAN (R 4.3.0)
#> restfulr 0.0.15 2022-06-16 [2] CRAN (R 4.3.0)
#> rjson 0.2.21 2022-01-09 [2] CRAN (R 4.3.0)
#> rlang 1.1.0 2023-03-14 [2] CRAN (R 4.3.0)
#> rmarkdown 2.21 2023-03-26 [2] CRAN (R 4.3.0)
#> Rsamtools 2.16.0 2023-04-25 [2] Bioconductor
#> RSQLite 2.3.1 2023-04-03 [2] CRAN (R 4.3.0)
#> rtracklayer 1.60.0 2023-04-25 [2] Bioconductor
#> S4Vectors * 0.38.0 2023-04-25 [2] Bioconductor
#> sass 0.4.5 2023-01-24 [2] CRAN (R 4.3.0)
#> sessioninfo * 1.2.2 2021-12-06 [2] CRAN (R 4.3.0)
#> stringi 1.7.12 2023-01-11 [2] CRAN (R 4.3.0)
#> stringr 1.5.0 2022-12-02 [2] CRAN (R 4.3.0)
#> SummarizedExperiment 1.30.0 2023-04-25 [2] Bioconductor
#> tibble 3.2.1 2023-03-20 [2] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [2] CRAN (R 4.3.0)
#> timechange 0.2.0 2023-01-11 [2] CRAN (R 4.3.0)
#> txcutr * 1.6.0 2023-04-25 [1] Bioconductor
#> TxDb.Scerevisiae.UCSC.sacCer3.sgdGene * 3.2.2 2023-04-17 [2] Bioconductor
#> utf8 1.2.3 2023-01-31 [2] CRAN (R 4.3.0)
#> vctrs 0.6.2 2023-04-19 [2] CRAN (R 4.3.0)
#> xfun 0.39 2023-04-20 [2] CRAN (R 4.3.0)
#> XML 3.99-0.14 2023-03-19 [2] CRAN (R 4.3.0)
#> xml2 1.3.3 2021-11-30 [2] CRAN (R 4.3.0)
#> XVector 0.40.0 2023-04-25 [2] Bioconductor
#> yaml 2.3.7 2023-01-23 [2] CRAN (R 4.3.0)
#> zlibbioc 1.46.0 2023-04-25 [2] Bioconductor
#>
#> [1] /tmp/RtmpR544Gd/Rinst2870126f3d23f7
#> [2] /home/biocbuild/bbs-3.17-bioc/R/site-library
#> [3] /home/biocbuild/bbs-3.17-bioc/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
This vignette was generated using BiocStyle (Oleś, 2023) with knitr (Xie, 2023) and rmarkdown (Allaire, Xie, Dervieux et al., 2023) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.21. 2023. URL: https://github.com/rstudio/rmarkdown.
[2] M. Fansler. txcutr: Transcriptome CUTteR. R package version 1.6.0. 2023. DOI: 10.18129/B9.bioc.txcutr. URL: https://bioconductor.org/packages/txcutr.
[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.28.0. 2023. DOI: 10.18129/B9.bioc.BiocStyle. URL: https://bioconductor.org/packages/BiocStyle.
[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2023. URL: https://www.R-project.org/.
[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[7] H. Wickham, W. Chang, R. Flight, et al. sessioninfo: R Session Information. R package version 1.2.2. 2021. URL: https://CRAN.R-project.org/package=sessioninfo.
[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.42. 2023. URL: https://yihui.org/knitr/.