---
title: "HiCool"
author: "Jacques Serizay"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{HiCool}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, eval = TRUE, echo=FALSE, results="hide", message = FALSE, warning = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
suppressPackageStartupMessages({
    library(HiCool)
})
```

# Processing sequencing Hi-C libraries with `HiCool`

The `HiCool` R/Bioconductor package provides an **end-to-end interface** to process and normalize Hi-C paired-end fastq reads into `.(m)cool` files.

1. The heavy lifting (fastq mapping, pairs parsing and pairs filtering) is performed by the underlying lightweight `hicstuff` python library ([https://github.com/koszullab/hicstuff](https://github.com/koszullab/hicstuff)).
2. Pairs filtering is done using the approach described in [Cournac et al., 2012](https://doi.org/10.1186/1471-2164-13-436) and implemented in `hicstuff`.
3. The `cooler` library ([https://github.com/open2c/cooler](https://github.com/open2c/cooler)) is used to parse pairs into a multi-resolution, balanced `.mcool` file. `.(m)cool` is a compact, indexed HDF5 file format specifically tailored for efficiently storing Hi-C data. The `.(m)cool` file format was developed by Abdennur and Mirny and [published in 2019](https://doi.org/10.1093/bioinformatics/btz540).
4. Internally, all these external dependencies are automatically installed and managed in R by a `basilisk` environment.

The main processing function offered in this package is `HiCool()`. To process `.fastq` reads into `.pairs` and `.mcool` files, one needs to provide:

- The path to each fastq file (`r1` and `r2`);
- The genome reference, as a path to a `.fasta` sequence file, a path to a pre-computed `bowtie2` index or a supported genome ID given as a character string (`hg38`, `mm10`, `dm6`, `R64-1-1`, `WBcel235`, `GRCz10`, `Galgal4`);
- The restriction enzyme(s) used for Hi-C.

```{r eval = FALSE}
x <- HiCool(
    r1 = '',
    r2 = '',
    restriction = '',
    resolutions = "",
    genome = ''
)
```

Here is a concrete example of Hi-C data processing.

- Example fastq files are retrieved using the `HiContactsData` package.
- Two restriction enzymes are used (these are the enzymes used in the Arima Kit).
- The final `.mcool` file will have three levels of resolution, from 4000 bp to 16000 bp.
- The data will be mapped on `R64-1-1`, the yeast genome reference.
- All output files will be placed in the `./HiCool/` directory.

```{r}
library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'),
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'),
    restriction = 'DpnII,HinfI',
    resolutions = c(4000, 8000, 16000),
    genome = 'R64-1-1',
    output = './HiCool/'
)
```

```{r}
hcf
S4Vectors::metadata(hcf)
```

# Optional parameters

Extra optional arguments can be passed to `HiCool()`:

- `iterative` (default: `TRUE`): By default, `hicstuff` first truncates the reads to 20 bp and attempts to align the truncated reads. Reads which could not be mapped are then truncated to 40 bp and aligned again, and so on. This procedure is slower than a traditional mapping but allows more pairs to be rescued. Set to `FALSE` to perform a standard alignment of the fastq files, without iterative alignment;
- `balancing_args` (default: `" --min-nnz 10 --mad-max 5 "`): Balancing arguments to be used by `cooler` when normalizing the binned contact matrices.
  The full list of options is available on the [cooler documentation website](https://cooler.readthedocs.io/en/latest/cli.html#cooler-balance);
- `threads` (default: `1L`): Number of CPUs to use to process data;
- `exclude_chr` (default: `'Mito|chrM|MT'`): Chromosomes to remove from the final contact matrix file;
- `keep_bam` (default: `FALSE`): Set to `TRUE` to keep the pair of `.bam` files;
- `scratch` (default: `tempdir()`): Path to a temporary directory to be used for processing.

# Output files

The important files generated by `HiCool` are the following:

- A log file: `/logs/^mapped-^.log`
- A multi-resolution, balanced contact matrix file: `/matrices/^mapped-^.mcool`
- A `.pairs` file: `/pairs/^mapped-^.pairs`
- Several diagnostic plots: `/plots/^mapped-^_*.pdf`.

The diagnostic plots illustrate how pairs were filtered during the processing, using the strategy described in [Cournac et al., BMC Genomics 2012](https://doi.org/10.1186/1471-2164-13-436).

- The `event_distance` chart represents the frequency of `++`, `+-`, `-+` and `--` pairs in the library, as a function of the number of restriction sites between the two ends of each pair, and shows the inferred filtering threshold.
- The `event_distribution` chart indicates the proportion of each type of pairs (e.g. `dangling`, `uncut`, `abnormal`, ...) and the total number of pairs retained (`3D intra` + `3D inter`).

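The output files described above can also be inspected directly from R. The following is only a minimal sketch using base R: `list_outputs_by_type()` is a hypothetical helper (it is not part of `HiCool`), and the `./HiCool/` path assumes the output directory used in the example above.

```r
# Hypothetical helper, not part of HiCool: group the files found in an
# output directory by file extension (log, mcool, pairs, pdf, ...).
list_outputs_by_type <- function(output_dir) {
    # Paths are returned relative to `output_dir`
    files <- list.files(output_dir, recursive = TRUE)
    split(files, tools::file_ext(files))
}

# Assumption: output files were written to './HiCool/' as in the example
if (dir.exists("./HiCool/")) {
    list_outputs_by_type("./HiCool/")
}
```

Grouping by extension is an arbitrary choice made here for readability; the sub-directories (`logs`, `matrices`, `pairs`, `plots`) carry the same information.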
```{r echo = FALSE}
## HiCool/
## |-- sample^mapped-R64-1-1^55IONQ.html
## |-- logs
## |   |-- sample^mapped-R64-1-1^55IONQ.log
## |-- matrices
## |   |-- sample^mapped-R64-1-1^55IONQ.mcool
## |-- pairs
## |   |-- sample^mapped-R64-1-1^55IONQ.pairs
## `-- plots
##     |-- sample^mapped-R64-1-1^55IONQ_event_distance.pdf
##     |-- sample^mapped-R64-1-1^55IONQ_event_distribution.pdf
```

**Notes:**

- The `.pairs` file format is defined by the [4DN consortium](https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md);
- The `.(m)cool` file format is defined by the `cooler` authors in the [supporting publication](https://doi.org/10.1093%2Fbioinformatics%2Fbtz540).

# System dependencies

Processing Hi-C sequencing libraries into `.pairs` and `.mcool` files requires several dependencies, to (1) align reads to a reference genome, (2) manage alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution and (5) balance the binned contact matrices.

All system dependencies are internally managed by `basilisk`. `HiCool` maintains a `basilisk` environment containing:

- `python 3.9.1`
- `bowtie2 2.4.5`
- `samtools 1.7`
- `hicstuff 3.1.5`
- `cooler 0.8.11`
- `chromosight 1.6.3`

The first time `HiCool()` is executed, a fresh `basilisk` environment is created and the required dependencies are automatically installed. This ensures compatibility between the different system dependencies needed to process Hi-C fastq files.

# Session info

```{r session}
sessionInfo()
```