---
title: "Creating a Variant-Modified Reference"
author:
- name: Stevie Pederson
  affiliation: Black Ochre Data Labs, The Kids Research Institute Australia, Adelaide, Australia
  email: stevie.pederson@thekids.org.au
package: transmogR
bibliography: '`r system.file("references.bib", package = "transmogR")`'
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
abstract: |
  The use of personalised or population-level variants opens the door to the 
  possibility of creating a variant-modified reference genome, or transcriptome.
  The package transmogR allows the creation of both, with a focus on combining
  both in order to created a custom reference transcriptome, along with decoy 
  transcripts for use with the pseudo-aligner salmon.
vignette: |
  %\VignetteIndexEntry{Creating a Variant-Modified Reference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo = FALSE}
knitr::opts_chunk$set(message = FALSE, crop = NULL)
```


# Introduction

The incorporation of personalised or population-level variants into a 
*reference genome* has been shown to have a significant impact on subsequent 
alignments [@Kaminow2022-dz].
Whilst implemented for splice-aware alignment or RNA-Seq data using 
*STARconsensus*, the package `transmogR` enables the creation of a 
variant-modified *reference transcriptome* for use with pseudo aligners such as 
*salmon* [@Srivastava2020-tm].
In addition, multiple visualisation and summarisation methods are included for a
cursory analysis of any custom variant sets being used.

Whilst the subsequent code is demonstrated on a small genomic region, the 
complete process for creating a modified a reference can run in under 20 minutes 
if using 4 or more cores.

Importantly, it is expected that the user will have carefully prepared their set of variants such that only the following exact variant types are included.

- **SNV**/**SNP**: The associated range must be of width 1 with a single base as the replacement allele
- **Insertions**: The position being replaced must be a single nucleotide with the replacement bases being > 1nt
- **Deletions**: The position being replaced must be > 1nt, whilst the replacement allele must be a single base

Any variants which do not conform to these criteria will likely cause a series of frustrating errors when attempting to use this package.

# Setup

## Installation

In order to perform the operations in this vignette, the following packages 
require installation.

```r 
if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
BiocManager::install("transmogR")
```

Once these packages are installed, we can load them easily

```{r load-packages}
library(VariantAnnotation)
library(rtracklayer)
library(extraChIPs)
library(transmogR)
library(GenomicFeatures)
library(BSgenome.Hsapiens.UCSC.hg38)
```

## Required Data

In order to create a modified reference, three primary data objects are 
required: 1) a reference genome; 2) a set of genomic variants; and 3) a set 
of exon-level co-ordinates defining transcript structure.

For this vignette, we'll use GRCh38 as our primary reference genome, but 
restricting the sequences to *chr1* only.
The package can take either a `DNAStringSet` or `BSgenome` object as the 
reference genome.

```{r chr1}
chr1 <- getSeq(BSgenome.Hsapiens.UCSC.hg38, "chr1")
chr1 <- as(chr1, "DNAStringSet")
names(chr1) <- "chr1"
chr1
```

Given this represents the complete reference genome, we can also setup a `Seqinfo` object for all downstream analysis, and to ensure all objects are working with the same reference genome.

```{r sq}
sq <- seqinfo(chr1)
genome(sq) <- "GRCh38"
sq
```

A small set of variants from the 1000 Genomes Project^[https://www.internationalgenome.org] has been provided with the package in VCF format.

```{r vcf}
vcf <- system.file("extdata/1000GP_subset.vcf.gz", package = "transmogR")
vcf <- VcfFile(vcf)
```

Variant callers can often produce variant calls which are incompatible and the function `checkVariants()` will check for overlapping InDels, as well as returning both the `REF` and `ALT` columns as a character vectors.
This function can be applied to any set of variants parsed by other tools, or can be applied directly to a `VcfFile` to check variants whilst parsing.


```{r var}
var <- cleanVariants(vcf)
seqinfo(var) <- sq
var
```

In this case, no incompatible variants were included, but if they are present, an error will be produced by default.
This behaviour can be modified by setting the `ol_vars` argument to return no overlapping InDels (`ol_vars = 'none'`), the longest or shortest in terms of the largest or smallest change to the chromosome length, or even the first or last of the overlapping variants.

An additional set of transcripts derived from Gencode v44^[https://www.gencodegenes.org/human/] has also been provided.

```{r gtf}
f <- system.file("extdata/gencode.v44.subset.gtf.gz", package = "transmogR")
gtf <- import.gff(f, which = GRanges(sq))
seqinfo(gtf) <- sq
gtf
```

Splitting this gtf into feature types can also be very handy for downstream 
processes.

```{r gtf-split}
gtf <- splitAsList(gtf, gtf$type)
```

# Inspecting Variants

Knowing where our variants lie, and how they relate to each other can be 
informative, and as such, some simple visualisation and summarisation functions
have been included.
In the following, we can check to see how many exons directly overlap a variant,
showing how many unique genes this summarises to.
Any ids, which don't overlap a variants are also described in the plot title.

```{r upset-var, fig.cap = "Included variants which overlap exonic sequences, summarised by unique gene ids"}
upsetVarByCol(gtf$exon, var, mcol = "gene_id")
```

In addition, we can obtain a simple breakdown of overlapping regions using a 
GRangesList.
We can use the function `defineRegions()` from `extraChIPs` to define regions 
based on gene & transcript locations.
This function assigns each position in the genome uniquely to a given feature
hierarchically, using all provided transcripts, meaning some exons will be 
considered as promoters.
To ensure that all exons are included as exons, we can just substitute in the 
values from our gtf for this feature type.

```{r overlaps-by-var}
regions <- defineRegions(gtf$gene, gtf$transcript, gtf$exon, proximal = 0)
regions$exon <- granges(gtf$exon)
overlapsByVar(regions, var)
```

# Creating Modified Reference Sequences

Several approaches are available for creating modified reference genomes or transcriptomes, as outlines below.

## Modifying a Reference Transcriptome

A simple option for creating a variant-modified reference transcriptome is to 
call the function `transmogrify()`, providing the reference sequence, a set of
variants and the genomic co-ordinates of sequence features (e.g. transcripts) 
for which the sequences are required.

An optional tag can be added to all transcripts to indicate which have been 
modified, with an additional tag able to be included which indicates which type
of variant has been incorporated into the new transcript sequence.
Variant tags will be one of `s`, `i` or `d` to indicate SNV, Insertions or 
Deletions respectively.
In our example dataset, only one transcript contains both SNVs and an insertion.

```{r trans-mod}
trans_mod <- transmogrify(chr1, var, gtf$exon, tag = "1000GP", var_tags = TRUE)
trans_mod
names(trans_mod)[grepl("_si", names(trans_mod))]
```

This can be simply exported again using `writeXStringSet()`.

This approach effectively steps through all the provided transcripts inserting
the relevant variants where required, and as such can be somewhat time-consuming
when working with genome-wide variants and a complete transcriptome.

## Modifying a Reference Genome

In addition to creating a set of modified transcripts, the genomic sequence can 
also be modified with a call to `genomogrify()`, using either a `DNAStringSet` 
or `BSgenome` object to provide the reference sequences.
A tag can be optionally added to all sequence names to avoid confusion, however,
in the below example this option has not been utilised

```{r chr1-mod}
chr1_mod <- genomogrify(chr1, var)
chr1_mod
```

The new reference genome can be exported to fasta format using 
`writeXStringSet()`.
Incorporation of both of these into a 'gentrome' for use with the pseudo-aligner `salmon` [@Srivastava2020-tm] is a straight-forward concatenation.

```{r gentrome, eval=FALSE}
c(trans_mod, chr1_mod) %>% writeXStringSet("gentrome.fa.gz", compress = TRUE)
writeLines(names(chr1_mod), "decoys.txt")
```

## Finding Modified Co-ordinates

Both `genomogrify()` and `transmogrify()` rely on the lower-level functions `owl()` and 
`indelcator()` which *overwrite letters* or substitute *indels* respectively.
These are also able to be called individually as shown in the help pages.
However, `transmogrify()` can run for an extended time if modifying tens of thousands of transcripts, as would be the case for creating an entirely new reference transcriptome.

In addition to the above, the co-ordinates from a GRanges object can be modified to account for all length-modifying variants, i.e. InDels.
These shifted co-ordinates can then be used with a variant-modified reference genome to directly extract the modified transcript sequences using `extractTranscriptSeq()` from `GenomicFeatures` [@Lawrence2013].
This is traditionally performed using a `GRangesList` with each element containing the exons for a given transcript.

```{r ref-trans}
ref_exons_by_trans <- gtf$exon %>% splitAsList(.$transcript_id) 
ref_trans <- extractTranscriptSeqs(chr1, ref_exons_by_trans)
ref_trans
```

In order to perform an analogous operation on a variant-modified reference, we need to shift the co-ordinates of any feature in accordance with the variants incoporated into the modified reference.
Whilst not being a concern if only SNVs are used as variants, any Insertion or Deletion will shift genomic co-ordinates.
This can be performed using `shiftByVar()` which returns a set of GRanges with co-ordinates shifted to account for any variants

```{r new-exons}
new_exons <- shiftByVar(gtf$exon, var)
```

This will return an object with a modified `Seqinfo()` object so that sequence
lengths will match the modified chromosomes.

```{r compare-seqinfo}
seqinfo(new_exons)
seqinfo(chr1_mod)
```

This enables simple extraction of transcript sequences using the capabilities 
of `GenomicFeatures`

```{r new-exon-by-trans}
new_exon_by_trans <- splitAsList(new_exons, new_exons$transcript_id)
```

Whilst `transmogrify` will add tags to transcript names when requested, if choosing this approach, tags may be added manually, with `varTags()` being designed for this possbility.
Importantly, checking for overlaps with variants needs to be performed *before shifting the co-ordinates*, and these can then be added at any subsequent point in the process where the order of the transcript names is known.

```{r tags}
tags <- varTags(ref_exons_by_trans, var, tag = "1000GP")
names(new_exon_by_trans) <- paste0(names(new_exon_by_trans), tags)
```

Now, the transcript sequences can be extracted, with tags incorporated

```{r new-trans-mod}
new_trans_mod <- extractTranscriptSeqs(chr1_mod, new_exon_by_trans)
```

This will return all unmodified transcripts, as unmodified, despite the new co-ordinates, and will return modified transcripts which are identical to those returned by `transmogrify()`

```{r check-modified}
modified <- grepl("_", names(new_trans_mod))
all(new_trans_mod[!modified] == ref_trans[!modified])
all(new_trans_mod[modified] == trans_mod[modified])
```


In general, this approach is much faster and less resource hungry than using `transmogrify()`, however both approaches will require significant RAM.
Whilst some common-use laptops may have the required capacity, resources provided by HPC access will guarantee the process will execute to completion.

# Additional Capabilities

## Pseudo-Autosomal Regions (PAR-Y)

Beyond these lower-level functions, PAR-Y regions for hg38, hg19 and CHM13 are
able to obtained using `parY()` and passing the appropriate `seqinfo` object.
This `seqinfo` object checks the length of the Y-chromosome and guesses which 
reference genome has been used.


```{r par-y}
sq_hg38 <- seqinfo(BSgenome.Hsapiens.UCSC.hg38)
parY(sq_hg38)
```

If the user wishes to exclude transcripts in the PAR-Y region, these ranges can 
be passed to `transmogrify()` and any transcripts which overlap the PAR-Y
region will be excluded.
Alternatively, passing the entire Y-chromosome to `transmogrify()` will 
exclude all transcripts in the Y-chromosome, as may be preferred for 
female-specific references.

These regions can also be passed to `genomogrify()` as a mask, with all bases 
within the masked region being re-assigned as an N.

## Splice Junctions

In addition, the set of splice junctions associated with any transcript can be 
obtained using our known exons.

```{r sj}
ec <- c("transcript_id", "transcript_name", "gene_id", "gene_name")
sj <- sjFromExons(gtf$exon, extra_cols = ec)
sj
```

Many splice junctions will be shared across multiple transcripts, so the 
returned set of junctions can also be simplified using `chopMC()` from 
`extraChIPs`.

```{r chop-sj}
chopMC(sj)
```

As a further alternative, splice junctions can be returned as a set of 
interactions, with donor sites being assigned to the `anchorOne` element, and
acceptor sites being placed in the `anchorTwo` element
This enables the identification of all splice junctions for specific 
transcripts.

```{r sj-as-gi}
sj <- sjFromExons(gtf$exon, extra_cols = ec, as = "GInteractions")
subset(sj, transcript_name == "DDX11L17-201")
```


# Session info

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References