---
title: "TENxIO: Import Single Cell Data Files"
author: "Marcel Ramos"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{TENxIO Quick Start Guide}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: no
    toc: yes
    toc_depth: 4
Package: TENxIO
---

<!-- badges: start -->
<!-- badges: end -->

```{r, setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE)
```

# Introduction

`TENxIO` allows users to import 10X pipeline files into known Bioconductor
classes. The package is not comprehensive, there are files that are not
supported. It currently does not support Visium datasets. It does replace
some functionality in `DropletUtils`. If you would like a file format to be
supported. Please open an issue at https://github.com/waldronlab/TENxIO.

# Supported Formats

| **Extension**       | **Class**     | **Imported as**      |
|---------------------|---------------|----------------------|
| .h5                 | TENxH5        | SingleCellExperiment w/ TENxMatrix |
| .mtx / .mtx.gz      | TENxMTX       | SummarizedExperiment w/ dgCMatrix |
| .tar.gz             | TENxFileList  | SingleCellExperiment w/ dgCMatrix |
| peak_annotation.tsv | TENxPeaks     | GRanges              |
| fragments.tsv.gz    | TENxFragments | RaggedExperiment     |
| .tsv / .tsv.gz      | TSVFile\*     | tibble               |

**Note** (\*). The `TSVFile` class is used internally and not exported.

# Tested 10X Products

We have tested these functions with _some_ 
[datasets](https://www.10xgenomics.com/resources/datasets) from 10x Genomics
including those from:

* Single Cell Gene Expression
* Single Cell ATAC
* Single Cell Multiome ATAC + Gene Expression

Note. That extensive testing has not been performed and the codebase may require
some adaptation to ensure compatibility with all pipeline outputs.

# Currently not supported

* Spatial Gene Expression

# Bioconductor implementations

We are aware of existing functionality in both `DropletUtils` and
`SpatialExperiment`. We are working with the authors of those packages to cover
the use cases in both those packages and possibly port I/O functionality into
`TENxIO`. We are using long tests and the `DropletTestFiles` package to
cover example datasets on `ExperimentHub`, if you would like to know more, see
the `longtests` directory on GitHub.

# Installation

```{r,eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("waldronlab/TENxIO")
```

# Load the package

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(TENxIO)
```

# Description

`TENxIO` offers an set of classes that allow users to easily work with files
typically obtained from the 10X Genomics website. Generally, these are outputs
from the Cell Ranger pipeline. 

# Procedure 

Loading the data into a Bioconductor class is a two step process. First, the
file must be identified by either the user or the `TENxFile` function. The
appropriate function will be evoked to provide a `TENxIO` class representation,
e.g., `TENxH5` for HDF5 files with an `.h5` extension. Secondly, the `import`
method for that particular file class will render a common Bioconductor class
representation for the user. The main representations used by the package are
`SingleCellExperiment`, `SummarizedExperiment`, `GRanges`, and
`RaggedExperiment`.

# Dataset versioning

The versioning schema in the package mostly applies to HDF5 resources and is
loosely based on versions of 10X datasets. For the most part, version 3 datasets
usually contain ranged information at specific locations in the data file.
Version 2 datasets will usually contain a `genes.tsv` file, rather than
`features.tsv` as in version 3. If the file version is unknown, the software
will attempt to derive the version from the data where possible.

# File classes

## TENxFile

The `TENxFile` class is the catch-all class superclass that allows transition
to subclasses pertinent to specific files. It inherits from the `BiocFile`
class and allows for easy dispatching `import` methods.

```{r}
showClass("TENxFile")
```

### `ExperimentHub` resources

`TENxFile` can handle resources from `ExperimentHub` with careful inputs. 
For example, one can import a `TENxBrainData` dataset via the appropriate
`ExperimentHub` identifier (`EH1039`):

```{r}
hub <- ExperimentHub::ExperimentHub()
hub["EH1039"]
```

Currently, `ExperimentHub` resources do not have an extension and it is best to
provide that to the `TENxFile` constructor function.

```{r,eval=FALSE}
fname <- hub[["EH1039"]]
TENxFile(fname, extension = "h5", group = "mm10", version = "2")
```

Note. `EH1039` is a large ~ 4GB file and files without extension as those
obtained from `ExperimentHub` will emit a warning so that the user is aware that
the import operation may fail, esp. if the internal structure of the file is
modified.

### TENxH5

`TENxIO` mainly supports version 3 and 2 type of H5 files. These are files
with specific groups and names as seen in `h5.version.map`, an internal
`data.frame` map that guides the import operations.

```{r}
TENxIO:::h5.version.map
```

In the case that, there is a file without genomic coordinate information, the
constructor function can take an `NA_character_` input for the `ranges`
argument.

The `TENxH5` constructor function can be used on either version of these H5
files. In this example, we use a subset of the PBMC granulocyte
H5 file obtained from the [10X website](https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5).

```{r}
h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
library(rhdf5)
h5ls(h5f)
```

Note. The `h5ls` function gives an overview of the structure of the file.
It matches version 3 in our version map. 

The show method gives an overview of the data components in the file:

```{r}
con <- TENxH5(h5f)
con
```

### import TENxH5 method

We can simply use the import method to convert the file representation to
a Bioconductor class representation, typically a `SingleCellExperiment`.

```{r}
import(con)
```

**Note**. Although the main representation in the package is
`SingleCellExperiment`, there could be a need for alternative data class
representations of the data. The `projection` field in the `TENxH5` show method
is an initial attempt to allow alternative representations.

## TENxMTX

Matrix Market formats are also supported (`.mtx` extension). These are typically
imported as SummarizedExperiment as they usually contain count data.

```{r}
mtxf <- system.file(
    "extdata", "pbmc_3k_ff_bc_ex.mtx",
    package = "TENxIO", mustWork = TRUE
)
con <- TENxMTX(mtxf)
con
```

## import MTX method

The `import` method yields a `SummarizedExperiment` without colnames or
rownames. 

```{r}
import(con)
```

## TENxFileList

Generally, the 10X website will provide tarballs (with a `.tar.gz`
extension) which can be imported with the `TENxFileList` class. The tarball
can contain components of a gene expression experiment including the matrix
data, row data (aka 'features') expressed as Ensembl identifiers, gene symbols,
etc. and barcode information for the columns.

The `TENxFileList` class allows importing multiple files within a `tar.gz`
archive. The `untar` function with the `list = TRUE` argument shows all the file
names in the tarball.

```{r}
fl <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ff_bc_ex_matrix.tar.gz",
    package = "TENxIO", mustWork = TRUE
)
untar(fl, list = TRUE)
```

We then use the `import` method across all file types to obtain an integrated
Bioconductor representation that is ready for analysis. Files in `TENxFileList`
can be represented as a `SingleCellExperiment` with row names and column names.

```{r}
con <- TENxFileList(fl)
import(con)
```

## TENxPeaks

Peak files can be handled with the `TENxPeaks` class. These files are usually
named `*peak_annotation` files with a `.tsv` extension. Peak files are
represented as `GRanges`.

```{r}
pfl <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ex_atac_peak_annotation.tsv",
    package = "TENxIO", mustWork = TRUE
)
tenxp <- TENxPeaks(pfl)
peak_anno <- import(tenxp)
peak_anno
```

## TENxFragments

Fragment files are quite large and we make use of the `Rsamtools` package to
import them with the `yieldSize` parameter. By default, we use a `yieldSize` of
200.

```{r}
fr <- system.file(
    "extdata", "pbmc_3k_atac_ex_fragments.tsv.gz",
    package = "TENxIO", mustWork = TRUE
)
```

Internally, we use the `TabixFile` constructor function to work with indexed
`tsv.gz` files.

**Note**. A warning is emitted whenever a `yieldSize` parameter is not set.

```{r}
tfr <- TENxFragments(fr)
tfr
```

Because there may be a variable number of fragments per barcode, we use a
`RaggedExperiment` representation for this file type.

```{r}
fra <- import(tfr)
fra
```

Similar operations to those used with SummarizedExperiment are supported. For
example, the genomic ranges can be displayed via `rowRanges`:

```{r}
rowRanges(fra)
```

# Session Information

```{r}
sessionInfo()
```