--- title: "Bioc.gff: GFF3 File Format Support" author: "Michael Lawrence" date: "`r format(Sys.Date(), '%B %d, %Y')`" vignette: > %\VignetteIndexEntry{Bioc.gff: GFF3 File Format Support} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: number_sections: yes toc: true package: Bioc.gff --- ```{r setup, include=FALSE} knitr::opts_chunk$set( cache = TRUE, comment = "#>" ) library(BiocStyle) ``` # Bioc.gff ## Introduction The `r Biocpkg("Bioc.gff")` package provides support for the General Feature Format (GFF) file format, which is widely used for representing genomic features and annotations. This package allows users to read, write, and manipulate GFF versions 1, 2 (GTF), and 3 in R, making it easier to work with genomic data. While the `r Biocpkg("rtracklayer")` package offers robust GFF support, it is a large package with many features beyond file import. `Bioc.gff` fills a specific niche in the Bioconductor ecosystem by providing a lightweight, focused solution with minimal dependencies. This modularity is beneficial for developers of other packages who need to parse GFF files without inheriting the extensive dependency footprint of `rtracklayer`. For the end user, it offers a simple and direct interface for GFF manipulation. Note that much of the code in the package is ported and adapted from the `r Biocpkg("rtracklayer")` Bioconductor package. The intention is that the GFF functionality residing in `r Biocpkg("rtracklayer")` will be removed from that package in favor of using `Bioc.gff`, thereby reducing its size and complexity. # Installation You can install the `Bioc.gff` package from Bioconductor using the following: ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Bioc.gff") ``` # Loading packages ```{r load} library(Bioc.gff) library(GenomicRanges) ``` # Usage The `Bioc.gff` package provides functions to read and write GFF files, as well as to manipulate GFF data structures. The following sections provide some examples of how to use the package. ## Reading GFF Files You can read GFF files using the `readGFF` function. This function supports GFF versions 1, 2, and 3. The following example demonstrates how to read a GFF file: ```{r find-gff} gff_file <- system.file("extdata", "genes.gff3", package = "Bioc.gff") ``` ### Using the `import` function The `import` function deduces that the input string is a GFF file type and calls the appropriate method for reading the GFF file. ```{r read-gff} import(gff_file) ``` Here the output object is a `GRanges` class, which contains genomic coordinates and associated metadata. The `GRanges` object indicates the ranges with the `seqnames`, `ranges`, and `strand`, columns. The associated metadata is stored in the `mcols` slot, which (in this example) includes attributes like `source`, `type`, `phase`, `ID`, `Name`, `geneName`, `Alias` and `genome`. #### Selectively importing ranges You can also selectively import ranges from the GFF file using the `which` argument. This allows you to filter the data based on specific criteria, such as chromosome or strand. For example, to import only the ranges on chromosome 1: ```{r read-gff-which} which <- GRanges("chr10:90000-93000") import(gff_file, which = which, genome = "hg19") ``` Note that you can indicate the genome build using the `genome` argument, which is useful for ensuring that the imported data is compatible with other genomic data you may be working with. ### Using the `readGFF` function As an alternative, you can use the `readGFF` function directly, which is more explicit about the GFF version: ```{r read-gff-version} readGFF(gff_file, version = "3") ``` This function returns a `DataFrame` object, which contains the same information as the `GRanges` object but in a tabular format. Note that one can use the `makeGRangesFromDataFrame` function to convert the `DataFrame` returned by `readGFF` into a `GRanges` object, which is often more convenient for downstream analysis: ```{r read-gff-granges} readGFF(gff_file, version = "3") |> makeGRangesFromDataFrame( keep.extra.columns = TRUE ) ``` ### Reading remote GFF files You can also read GFF files from remote URLs. The `import` function can handle URLs directly, allowing you to read GFF files hosted on remote servers. In this example, we read a GFF3 file from the miRBase database: ```{r read-gff-remote-direct,eval=FALSE} remote_gff_url <- "https://www.mirbase.org/download/hsa.gff3" import(remote_gff_url, version = "3") ``` To show the example output in the vignette, a more advanced approach is used below. This is done to avoid repeated downloads of the remote file when the vignette is re-built on the Bioconductor Build System (BBS). We use the `BiocFileCache` package to cache the file locally. ```{r read-gff-remote} library(BiocFileCache) bfc <- BiocFileCache::BiocFileCache() remote_gff_url <- "https://www.mirbase.org/download/hsa.gff3" bquery <- bfcquery(bfc, remote_gff_url, "rname", exact = TRUE) if (!nrow(bquery)) bfcadd(x = bfc, rname = remote_gff_url, rtype = "web", download = TRUE) gff_local <- bfcrpath( bfc, rnames = remote_gff_url, exact = TRUE, download = FALSE, rtype = "web" ) ``` Finally, the relevant function remains the same for reading a GFF file i.e., via `import()`: ```{r read-gff-remote2} import(gff_local, version = "3") ``` ## Conversion to GFF While `TxDb` objects are highly efficient for querying transcript annotations within R, it is often necessary to export this data to a standard, portable format for use with external tools or for sharing with collaborators. The GFF3 format is a widely accepted standard for this purpose. Converting a `TxDb` or a derivative object (like a `GRangesList` of exons) into a GFF3-compatible `GRanges` object allows for easy export. This is particularly useful for visualizing annotations in genome browsers like IGV or UCSC, or for input into downstream analysis pipelines that expect GFF3 files. ```{r as-gff} library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene exonsBy(txdb, by = "tx") |> asGFF() ``` ## GFF to TxDb You can convert a `GFF` object to a `TxDb` object using the `makeTxDbFromGFF` in the `r Biocpkg("txdbmaker")` package. This is useful for creating a transcript database from GFF annotations, which can then be used for various genomic analyses. ```{r make-txdb-from-gff} library(txdbmaker) txdb <- makeTxDbFromGFF( file = gff_local, format = "gff3", dataSource = "https://www.mirbase.org/download/hsa.gff3", organism = "Homo sapiens", taxonomyId = 9606 ) genome <- grepv("genome-build-id", readLines(gff_local)) |> strsplit("# genome-build-id:\\s+") |> unlist() |> tail(1L) genome(txdb) <- genome txdb ``` # SessionInfo ```{r session-info} sessionInfo() ```