---
title: "Bioc.gff: GFF3 File Format Support"
author: "Michael Lawrence"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{Bioc.gff: GFF3 File Format Support}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
package: Bioc.gff
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    cache = TRUE,
    comment = "#>"
)
library(BiocStyle)
```

# Bioc.gff

## Introduction

The `r Biocpkg("Bioc.gff")` package provides support for the General Feature
Format (GFF) file format, which is widely used for representing genomic features
and annotations. This package allows users to read, write, and manipulate GFF
versions 1, 2 (GTF), and 3 in R, making it easier to work with genomic data.

While the `r Biocpkg("rtracklayer")` package offers robust GFF support, it is a
large package with many features beyond file import. `Bioc.gff` fills a specific
niche in the Bioconductor ecosystem by providing a lightweight, focused solution
with minimal dependencies. This modularity is beneficial for developers of other
packages who need to parse GFF files without inheriting the extensive dependency
footprint of `rtracklayer`. For the end user, it offers a simple and direct
interface for GFF manipulation.

Note that much of the code in the package is ported and adapted from the
`r Biocpkg("rtracklayer")` Bioconductor package. The intention is that the GFF
functionality residing in `r Biocpkg("rtracklayer")` will be removed from 
that package in favor of using `Bioc.gff`, thereby reducing its size and
complexity.

# Installation

You can install the `Bioc.gff` package from Bioconductor using the following:

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Bioc.gff")
```

# Loading packages

```{r load}
library(Bioc.gff)
library(GenomicRanges)
```

# Usage

The `Bioc.gff` package provides functions to read and write GFF files, as well
as to manipulate GFF data structures. The following sections provide some
examples of how to use the package.

## Reading GFF Files

You can read GFF files using the `readGFF` function. This function supports
GFF versions 1, 2, and 3. The following example demonstrates how to read a GFF
file:

```{r find-gff}
gff_file <- system.file("extdata", "genes.gff3", package = "Bioc.gff")
```

### Using the `import` function

The `import` function deduces that the input string is a GFF file type and calls
the appropriate method for reading the GFF file.

```{r read-gff}
import(gff_file)
```

Here the output object is a `GRanges` class, which contains genomic coordinates
and associated metadata. The `GRanges` object indicates the ranges with the
`seqnames`, `ranges`, and `strand`,  columns. The associated metadata is
stored in the `mcols` slot, which (in this example) includes attributes like
`source`, `type`, `phase`, `ID`, `Name`, `geneName`, `Alias` and `genome`.

#### Selectively importing ranges

You can also selectively import ranges from the GFF file using the `which`
argument. This allows you to filter the data based on specific criteria, such as
chromosome or strand. For example, to import only the ranges on chromosome 1:

```{r read-gff-which}
which <- GRanges("chr10:90000-93000")
import(gff_file, which = which, genome = "hg19")
```

Note that you can indicate the genome build using the `genome` argument, which
is useful for ensuring that the imported data is compatible with other genomic
data you may be working with.

### Using the `readGFF` function

As an alternative, you can use the `readGFF` function directly, which is
more explicit about the GFF version:

```{r read-gff-version}
readGFF(gff_file, version = "3")
```

This function returns a `DataFrame` object, which contains the same information
as the `GRanges` object but in a tabular format. Note that one can use the
`makeGRangesFromDataFrame` function to convert the `DataFrame` returned by
`readGFF` into a `GRanges` object, which is often more convenient for downstream
analysis:

```{r read-gff-granges}
readGFF(gff_file, version = "3") |>
    makeGRangesFromDataFrame(
        keep.extra.columns = TRUE
    )
```

### Reading remote GFF files

You can also read GFF files from remote URLs. The `import` function can handle
URLs directly, allowing you to read GFF files hosted on remote servers. 
In this example, we read a GFF3 file from the miRBase database:

```{r read-gff-remote-direct,eval=FALSE}
remote_gff_url <- "https://www.mirbase.org/download/hsa.gff3"
import(remote_gff_url, version = "3")
```

To show the example output in the vignette, a more advanced approach is used
below. This is done to avoid repeated downloads of the remote file when the
vignette is re-built on the Bioconductor Build System (BBS). We use the
`BiocFileCache` package to cache the file locally.

```{r read-gff-remote}
library(BiocFileCache)
bfc <- BiocFileCache::BiocFileCache()
remote_gff_url <- "https://www.mirbase.org/download/hsa.gff3"
bquery <- bfcquery(bfc, remote_gff_url, "rname", exact = TRUE)
if (!nrow(bquery))
    bfcadd(x = bfc, rname = remote_gff_url, rtype = "web", download = TRUE)
gff_local <- bfcrpath(
    bfc, rnames = remote_gff_url, exact = TRUE, download = FALSE, rtype = "web"
)
```

Finally, the relevant function remains the same for reading a GFF file i.e.,
via `import()`:

```{r read-gff-remote2}
import(gff_local, version = "3")
```

## Conversion to GFF

While `TxDb` objects are highly efficient for querying transcript annotations
within R, it is often necessary to export this data to a standard, portable
format for use with external tools or for sharing with collaborators. The GFF3
format is a widely accepted standard for this purpose. Converting a `TxDb` or a
derivative object (like a `GRangesList` of exons) into a GFF3-compatible
`GRanges` object allows for easy export. This is particularly useful for
visualizing annotations in genome browsers like IGV or UCSC, or for input into
downstream analysis pipelines that expect GFF3 files.

```{r as-gff}
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exonsBy(txdb, by = "tx") |>
    asGFF()
```

## GFF to TxDb

You can convert a `GFF` object to a `TxDb` object using the `makeTxDbFromGFF`
in the `r Biocpkg("txdbmaker")` package. This is useful for creating a
transcript database from GFF annotations, which can then be used for various
genomic analyses.

```{r make-txdb-from-gff}
library(txdbmaker)
txdb <- makeTxDbFromGFF(
    file = gff_local,
    format = "gff3",
    dataSource = "https://www.mirbase.org/download/hsa.gff3",
    organism = "Homo sapiens",
    taxonomyId = 9606
)
genome <- grepv("genome-build-id", readLines(gff_local)) |>
    strsplit("# genome-build-id:\\s+") |>
    unlist() |>
    tail(1L)
genome(txdb) <- genome
txdb
```

# SessionInfo

```{r session-info}
sessionInfo()
```