---
title: "Getting started with ISAnalytics"
author: 
  - name: Giulia Pais
    affiliation: | 
     San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, 
     Via Olgettina 60, 20132 Milano - Italia
    email: giuliapais1@gmail.com, calabria.andrea@hsr.it
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ISAnalytics')`"
vignette: >
  %\VignetteIndexEntry{ISAnalytics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r GenSetup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1],
    ClonalTrackingPaper = BibEntry(
        bibtype = "Article",
        title = paste(
            "Efficient gene editing of human long-term hematopoietic",
            "stem cells validated by clonal tracking"
        ),
        author = "Ferrari Samuele, Jacob Aurelien, Beretta Stefano",
        journaltitle = "Nat Biotechnol 38, 1298–1308",
        date = "2020-11",
        doi = "https://doi.org/10.1038/s41587-020-0551-y"
    )
)

ngs_exp_fig <- fs::path("../man", "figures", "ngs_data_exp.png")
```

# Introduction

ISAnalytics is an R package for analyzing gene therapy vector insertion site data, identified from next-generation sequencing (NGS) reads, in clonal tracking studies.

# Installation and options

`ISAnalytics` can be installed in two ways:

* You can install it via [Bioconductor](http://bioconductor.org)
* You can install it via GitHub using the package `devtools`

Two versions of the package are always active:

* `RELEASE` is the latest stable version
* `DEVEL` is the development version: it is the most up-to-date version, where
all new features are introduced first

## Installation from Bioconductor

RELEASE version:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ISAnalytics")
```

DEVEL version:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("ISAnalytics")
```

## Installation from GitHub

RELEASE:
```{r eval=FALSE}
# Check for devtools without attaching it to the search path
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "RELEASE_3_17",
                         dependencies = TRUE,
                         build_vignettes = TRUE)
```

DEVEL:
```{r eval=FALSE}
# Check for devtools without attaching it to the search path
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "devel",
                         dependencies = TRUE,
                         build_vignettes = TRUE)
```
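
After either installation route, it can be useful to confirm which version 
ended up in your library. The following is a minimal sketch that uses only 
base R utilities; the variable name `isa_installed` is purely illustrative.

```{r eval=FALSE}
# Check whether ISAnalytics is available without attaching it
isa_installed <- requireNamespace("ISAnalytics", quietly = TRUE)

if (isa_installed) {
    # Report the installed version
    print(utils::packageVersion("ISAnalytics"))
} else {
    message("ISAnalytics is not installed yet")
}
```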

## Setting options

`ISAnalytics` has a verbose option that allows some functions to print 
additional information to the console while they are executing. 
To disable or enable this feature:

```{r OptVerbose, eval=FALSE}
# DISABLE
options("ISAnalytics.verbose" = FALSE)

# ENABLE
options("ISAnalytics.verbose" = TRUE)

```

Some functions also produce reports in a user-friendly HTML format. 
To disable or enable this feature:

```{r OptWidg, eval=FALSE}
# DISABLE HTML REPORTS
options("ISAnalytics.reports" = FALSE)

# ENABLE HTML REPORTS
options("ISAnalytics.reports" = TRUE)
```
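
Both settings are ordinary R options, so they can be inspected with 
`getOption()` and changed temporarily. The sketch below relies only on base R 
behaviour (`options()` invisibly returns the previous values); the name 
`old_opts` is illustrative, and the value you see before changing anything 
depends on your session.

```{r eval=FALSE}
# Inspect the current value of the verbose option
getOption("ISAnalytics.verbose")

# options() returns the previous values, so they can be restored later
old_opts <- options(
    "ISAnalytics.verbose" = FALSE,
    "ISAnalytics.reports" = FALSE
)

# ... run the part of the analysis that should stay quiet ...

# Restore whatever was set before
options(old_opts)
```

This pattern avoids hard-coding `TRUE` when re-enabling, which matters if the 
options were already customized earlier in the session.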

# Setting up the workflow

Recent versions of ISAnalytics introduce a "dynamic variables system" that
allows more flexibility in terms of input formats. Before starting the
analysis workflow, you can specify how your inputs are structured so that
the package can process them. For more information on how to do this,
take a look at `vignette("workflow_start", package = "ISAnalytics")`.
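
As a quick orientation, the current configuration can be inspected before 
changing it. The sketch below assumes that the getter functions 
`mandatory_IS_vars()` and `association_file_columns()` and their 
`include_types` argument are available in your installed version; treat it as 
an illustration and check the function reference if a call is not found.

```{r eval=FALSE}
library(ISAnalytics)

# Current definition of the mandatory integration site variables
# (assumed getter, see the note above)
mandatory_IS_vars(include_types = TRUE)

# Current definition of the association file columns
# (assumed getter, see the note above)
association_file_columns(include_types = TRUE)

# After experimenting with custom specifications, restore the defaults
reset_dyn_vars_config()
```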

# The first steps

The first steps of the analysis workflow involve the import and parsing of
data and metadata files from disk.

* Import metadata with `import_association_file()` and/or
`import_Vispa2_stats()`
* Import data with `import_single_Vispa2Matrix()` or
`import_parallel_Vispa2Matrices()`

Refer to the vignette 
`vignette("workflow_start", package = "ISAnalytics")` for
more details.

# Data cleaning and pre-processing  

ISAnalytics offers several different functions for cleaning and pre-processing
your data.

* Recalibration: identifies integration events that are near to each other
and condenses them into a single event whenever appropriate -
`compute_near_integrations()`
* Outliers identification and removal: identifies samples that are considered
outliers according to user-defined logic and filters them out - 
`outlier_filter()`
* Collision removal: identifies collision events between independent samples -
`remove_collisions()`, see also the dedicated vignette
`vignette("workflow_start", package = "ISAnalytics")`
* Filter based on cell lineage purity: identifies and removes contamination
between different cell types - `purity_filter()`
* Data and metadata aggregation: combines single PCR replicates into
biological samples, or performs other arbitrary aggregations -
`aggregate_values_by_key()`, `aggregate_metadata()`, see also the 
dedicated vignette 
`vignette("workflow_start", package = "ISAnalytics")`
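
To make the idea of aggregation by key concrete, here is a toy illustration 
in base R. It does not call any ISAnalytics function and is not the package's 
implementation: the data frame, its column names and its values are all made 
up for the example.

```{r}
# Toy data: two PCR replicates for subject S1, one replicate for subject S2.
# All names and values are illustrative.
toy_counts <- data.frame(
    SubjectID = c("S1", "S1", "S1", "S2"),
    ReplicateID = c("S1_r1", "S1_r1", "S1_r2", "S2_r1"),
    Locus = c("chr1:100:+", "chr2:500:-", "chr1:100:+", "chr1:100:+"),
    Value = c(10, 5, 7, 3)
)

# Sum the quantification per subject and locus, collapsing the replicates
toy_agg <- aggregate(Value ~ SubjectID + Locus, data = toy_counts, FUN = sum)
toy_agg
```

The locus observed in both replicates of `S1` ends up as a single row whose 
value is the sum of the two replicates, which is conceptually what happens 
when the aggregation key is coarser than the replicate identifier.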

# Answering biological questions

You can answer many different biological questions by using the
provided functions with appropriate inputs. 

* Descriptive statistics: `sample_statistics()`
* IS relative abundance: `compute_abundance()`, `integration_alluvial_plot()`
* Top abundant IS: `top_integrations()`
* Top targeted genes: `top_targeted_genes()`
* Grubbs test for common insertion sites (CIS): `CIS_grubbs()`,
`CIS_volcano_plot()`
* Fisher's exact test for gene frequency and IS distribution on target genome:
`gene_frequency_fisher()`, `fisher_scatterplot()`, `circos_genomic_density()`
* Clonal sharing analyses: `is_sharing()`, `iss_source()`, `sharing_heatmap()`,
`sharing_venn()`
* Estimate HSPCs population size: `HSC_population_size_estimate()`,
`HSC_population_plot()`

For more, please refer to the full function reference.
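
As a conceptual aid, relative abundance can be thought of as the share of a 
sample's total quantification that belongs to each integration site. The 
base R sketch below illustrates that idea on made-up data; it is not the 
implementation of `compute_abundance()`, and all column names are 
illustrative.

```{r}
# Toy quantifications for two samples (illustrative names and values)
toy_is <- data.frame(
    Sample = c("A", "A", "A", "B", "B"),
    Locus = c("is1", "is2", "is3", "is1", "is4"),
    Value = c(60, 30, 10, 20, 80)
)

# Total quantification of each sample, aligned to the rows of the table
toy_totals <- ave(toy_is$Value, toy_is$Sample, FUN = sum)

# Share of each integration site within its sample, as fraction and percentage
toy_is$RelAbundance <- toy_is$Value / toy_totals
toy_is$PercAbundance <- toy_is$RelAbundance * 100

# Most abundant integration site in each sample
toy_top <- toy_is[toy_is$Value == ave(toy_is$Value, toy_is$Sample, FUN = max), ]
toy_top
```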

# Working with other kinds of data

ISAnalytics is designed to be flexible concerning input formats, thus it is 
suited to process various kinds of data provided the correct dynamic 
configuration is set.

We demonstrate this with an example that uses barcode data. 
The matrix is publicly available [here](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE144340&format=file&file=GSE144340%5FMatrix%5F542%2Etsv%2Egz) `r Citep(bib[["ClonalTrackingPaper"]])`; the metadata
was provided to us by the authors and is available in the package's 
additional files.

```{r}
library(ISAnalytics)

# Set appropriate data and metadata specs ----
metadata_specs <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "ProjectID", "char", NULL, "required", "project_id",
    "SubjectID", "char", NULL, "required", "subject",
    "Tissue", "char", NULL, "required", "tissue",
    "TimePoint", "int", NULL, "required", "tp_days",
    "CellMarker", "char", NULL, "required", "cell_marker",
    "ID", "char", NULL, "required", "pcr_repl_id",
    "SourceFileName", "char", NULL, "optional", NA_character_,
    "Link", "char", NULL, "optional", NA_character_
)
set_af_columns_def(metadata_specs)

mandatory_specs <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "BarcodeSeq", "char", NULL, "required", NA_character_
)
set_mandatory_IS_vars(mandatory_specs)

# Files ----
data_folder <- tempdir()
utils::unzip(zipfile = system.file("testdata", 
                                   "testdata.zip",
                                   package = "ISAnalytics"), 
             exdir = data_folder, overwrite = TRUE)
meta_file <- "barcodes_example_af.tsv.xz"
matrix_file <- "GSE144340_Matrix_542.tsv.xz"

# Data import ----
af <- import_association_file(fs::path(data_folder, meta_file),
    report_path = NULL
)
af

matrix <- import_single_Vispa2Matrix(fs::path(data_folder, matrix_file),
    sample_names_to = "ID"
)
matrix

# Descriptive stats ----
desc_stats <- sample_statistics(matrix, af,
    sample_key = pcr_id_column(),
    value_columns = "Value"
)$metadata |>
    dplyr::rename(distinct_barcodes = "nIS")
desc_stats

# Aggregation and new stats ----
agg_key <- c("SubjectID")
agg <- aggregate_values_by_key(matrix, af,
    key = agg_key,
    group = "BarcodeSeq",
    join_af_by = pcr_id_column()
)
agg

agg_meta_functions <- tibble::tribble(
    ~Column, ~Function, ~Args, ~Output_colname,
    "TimePoint", ~ mean(.x, na.rm = TRUE), NA, "{.col}_avg",
    "CellMarker", ~ length(unique(.x)), NA, "distinct_cell_marker_count",
    "ID", ~ length(unique(.x)), NA, "distinct_id_count"
)
agg_meta <- aggregate_metadata(
    af,
    aggregating_functions = agg_meta_functions,
    grouping_keys = agg_key
)
agg_meta

agg_stats <- sample_statistics(agg, agg_meta,
    sample_key = agg_key,
    value_columns = "Value_sum"
)$metadata |>
    dplyr::rename(distinct_barcodes = "nIS")
agg_stats

# Abundance ----
abundance <- compute_abundance(agg, columns = "Value_sum", key = agg_key)
abundance

reset_dyn_vars_config()
```

# Using the Shiny interface

The package provides a simple Shiny interface for data exploration and plotting.
To start the interface use:

```{r eval=FALSE}
NGSdataExplorer()
```

The application's main page shows a file loading screen. 
Data can also be loaded from the R environment: for example, 
before opening the app, we can load the included association file:

```{r eval=FALSE}
data("association_file")
```

Once in the application, we can choose `"association_file"` from the 
R environment loading option screen and click on "Import data". Once the
data is imported, we can click on the "Explore" tab in the upper navbar.
Here we will see 2 tabs: one allows interactive exploration of the data in
tabular form, the other allows plotting. Several parameters of the plot can
be customized, and the plot can be saved to file with the 
dedicated button at the end of the page.

The Shiny interface is under active development and new features
will be added in the near future.


# Ensuring reproducibility of results

Several functions produce static HTML reports or tabular files that can be 
saved to disk.
Reports contain the relevant information on how the function was called,
input and output statistics, and session info for reproducibility.
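
For steps that do not produce a report, the session information can be 
recorded manually alongside the results. This is a minimal base R sketch; 
the file name and the use of a temporary file are illustrative choices.

```{r eval=FALSE}
# Write the session information to a text file
# (a temporary file is used here purely for illustration)
session_file <- tempfile(pattern = "session_info_", fileext = ".txt")
writeLines(utils::capture.output(utils::sessionInfo()), session_file)

# Confirm the file was written
file.exists(session_file)
```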

# Browse documentation online and keep updated

ISAnalytics has its own dedicated package website where you can browse the 
documentation and vignettes easily, and keep up to date with all
relevant updates. Visit the website at <https://calabrialab.github.io/ISAnalytics/>.

# Problems?

If you run into issues that the documentation can't solve, get in touch by 
opening an issue on GitHub or contacting the maintainers.

# Bibliography

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```