HiCoolThe HiCool R/Bioconductor package provides an end-to-end interface to
process and normalize Hi-C paired-end fastq reads into .(m)cool files.
hicstuff python library
(https://github.com/koszullab/hicstuff).hicstuff.cooler (https://github.com/open2c/cooler)
library is used to parse pairs into a multi-resolution, balanced .mcool file.
.(m)cool is a compact, indexed HDF5 file format specifically tailored
for efficiently storing HiC-based data. The .(m)cool file format was
developed by Abdennur and Mirny and
published in 2019.basilisk environment.The main processing function offered in this package is HiCool().
To process .fastq reads into .pairs & .mcool files, one needs to provide:
r1 and r2);.fasta sequence file, a path to a pre-computed bowtie2 index
or a supported ID character (hg38, mm10, dm6, R64-1-1, WBcel235, GRCz10,
Galgal4);x <- HiCool(
    r1 = '<PATH-TO-R1.fq.gz>', 
    r2 = '<PATH-TO-R2.fq.gz>', 
    restriction = '<RE1(,RE2)>', 
    resolutions = "<resolutions of interest>", 
    genome = '<GENOME_ID>'
)Here is a concrete example of Hi-C data processing.
HiContactsData package..mcool file will have three levels of resolutions, from 1000bp to 8000bp.R64-1-1, the yeast genome reference.output/ directory.library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'), 
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'), 
    restriction = 'DpnII,HinfI', 
    resolutions = c(4000, 8000, 16000), 
    genome = 'R64-1-1', 
    output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Fetching bowtie genome index files from AWS iGenomes S3 bucket...
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> + /home/biocbuild/.cache/R/basilisk/1.18.0/0/bin/conda create --yes --prefix /home/biocbuild/.cache/R/basilisk/1.18.0/HiCool/1.6.0/env 'python=3.7.12' --quiet -c conda-forge -c bioconda --override-channels
#> + /home/biocbuild/.cache/R/basilisk/1.18.0/0/bin/conda install --yes --prefix /home/biocbuild/.cache/R/basilisk/1.18.0/HiCool/1.6.0/env 'python=3.7.12' -c conda-forge -c bioconda --override-channels
#> + /home/biocbuild/.cache/R/basilisk/1.18.0/0/bin/conda install --yes --prefix /home/biocbuild/.cache/R/basilisk/1.18.0/HiCool/1.6.0/env -c conda-forge -c bioconda 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1' --override-channels
#> HiCool :: Initiating processing of fastq files [tmp folder: /home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpUt0R5w/9DFR28]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpDU4WA5/Rbuild136ef35837c55/HiCool/vignettes/HiCool/189624601fcfea_7833^mapped-R64-1-1^9DFR28.html
#> HiCool :: All processing successfully achieved. Congrats!hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/189624601fcfea_7833^mapped-R64-1-1^9DFR28.mcool 
#> resolution: 4000 
#> pairs file: ./HiCool//pairs/189624601fcfea_7833^mapped-R64-1-1^9DFR28.pairs 
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/189624601fcfea_7833^mapped-R64-1-1^9DFR28.log"
#> 
#> $args
#> $args$r1
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/189624601fcfea_7833"
#> 
#> $args$r2
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/1896243d242a36_7834"
#> 
#> $args$genome
#> [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpUt0R5w/R64-1-1"
#> 
#> $args$resolutions
#> [1] "4000"
#> 
#> $args$resolutions
#> [1] "8000"
#> 
#> $args$resolutions
#> [1] "16000"
#> 
#> $args$restriction
#> [1] "DpnII,HinfI"
#> 
#> $args$iterative
#> [1] TRUE
#> 
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#> 
#> $args$threads
#> [1] 1
#> 
#> $args$output
#> [1] "./HiCool/"
#> 
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#> 
#> $args$keep_bam
#> [1] FALSE
#> 
#> $args$scratch
#> [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpUt0R5w"
#> 
#> $args$wd
#> [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpDU4WA5/Rbuild136ef35837c55/HiCool/vignettes"
#> 
#> 
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#> 
#> $stats$nPairs
#> [1] 73993
#> 
#> $stats$nDangling
#> [1] 10027
#> 
#> $stats$nSelf
#> [1] 2205
#> 
#> $stats$nDumped
#> [1] 83
#> 
#> $stats$nFiltered
#> [1] 61678
#> 
#> $stats$nDups
#> [1] 719
#> 
#> $stats$nUnique
#> [1] 60959
#> 
#> $stats$threshold_uncut
#> [1] 7
#> 
#> $stats$threshold_self
#> [1] 7Extra optional arguments can be passed to the hicstuff workhorse library:
iterative TRUE): By default, hicstuff first truncates your set of reads to 20bp and attempts to align the truncated reads, then moves on to aligning 40bp-truncated reads for those which could not be mapped, etc. This procedure is longer than a traditional mapping but allows for more pairs to be rescued. Set to FALSE if you want to perform standard alignment of fastq files without iterative alignment;balancing_args " --min-nnz 10 --mad-max 5 "): Specify here any balancing argument to be used by cooler when normalizing the binned contact matrices. Full list of options available at cooler documentation website;threads 1L): Number of CPUs to use to process data;exclude_chr 'Mito|chrM|MT'): List here any chromosome you wish to remove from the final contact matrix file;keep_bam FALSE): Set to TRUE if you wish to keep the pair of .bam files;scratch tempdir()): Points to a temporary directory to be used for processing.The important files generated by HiCool are the following:
<output_folder>/logs/<prefix>^mapped-<genome>^<hash>.log<output_folder>/matrices/<prefix>^mapped-<genome>^<hash>.mcool.pairs file: <output_folder>/pairs/<prefix>^mapped-<genome>^<hash>.pairs<output_folder>/plots/<prefix>^mapped-<genome>^<hash>_*.pdf.The diagnosis plots illustrate how pairs were filtered during the processing,
using a strategy described in Cournac et al., BMC Genomics 2012. The event_distance
chart represents the frequency of ++, +-, -+ and -- pairs in the library, as a function
of the number of restriction sites between each end of the pairs, and shows the inferred filtering threshold.
The event_distribution chart indicates the proportion of each type of pairs (e.g. dangling, uncut, abnormal, …)
and the total number of pairs retained (3D intra + 3D inter).
Notes:
.pairs file format is defined by the 4DN consortium;.(m)cool file format is defined by cooler authors in the supporting publication.Processing Hi-C sequencing libraries into .pairs and .mcool files requires
several dependencies, to (1) align reads to a reference genome, (2) manage
alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution
and (5)
All system dependencies are internally managed by basilisk. HiCool maintains
a basilisk environment containing:
python 3.9.1bowtie2 2.4.5samtools 1.7hicstuff 3.1.5cooler 0.8.11chromosight 1.6.3The first time HiCool() is executed, a fresh basilisk environment will
be created and required dependencies automatically installed. This ensures
compatibility between the different system dependencies needed to process
Hi-C fastq files.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] HiContactsData_1.7.0 ExperimentHub_2.14.0 AnnotationHub_3.14.0
#> [4] BiocFileCache_2.14.0 dbplyr_2.5.0         BiocGenerics_0.52.0 
#> [7] HiCool_1.6.0         HiCExperiment_1.6.0  BiocStyle_2.34.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3                   rlang_1.1.4                
#>   [3] magrittr_2.0.3              matrixStats_1.4.1          
#>   [5] compiler_4.4.1              RSQLite_2.3.7              
#>   [7] dir.expiry_1.14.0           png_0.1-8                  
#>   [9] vctrs_0.6.5                 stringr_1.5.1              
#>  [11] pkgconfig_2.0.3             crayon_1.5.3               
#>  [13] fastmap_1.2.0               XVector_0.46.0             
#>  [15] rmdformats_1.0.4            utf8_1.2.4                 
#>  [17] rmarkdown_2.28              sessioninfo_1.2.2          
#>  [19] tzdb_0.4.0                  UCSC.utils_1.2.0           
#>  [21] strawr_0.0.92               purrr_1.0.2                
#>  [23] bit_4.5.0                   xfun_0.48                  
#>  [25] zlibbioc_1.52.0             cachem_1.1.0               
#>  [27] GenomeInfoDb_1.42.0         jsonlite_1.8.9             
#>  [29] blob_1.2.4                  rhdf5filters_1.18.0        
#>  [31] DelayedArray_0.32.0         Rhdf5lib_1.28.0            
#>  [33] BiocParallel_1.40.0         parallel_4.4.1             
#>  [35] R6_2.5.1                    bslib_0.8.0                
#>  [37] stringi_1.8.4               reticulate_1.39.0          
#>  [39] GenomicRanges_1.58.0        jquerylib_0.1.4            
#>  [41] Rcpp_1.0.13                 bookdown_0.41              
#>  [43] SummarizedExperiment_1.36.0 knitr_1.48                 
#>  [45] IRanges_2.40.0              Matrix_1.7-1               
#>  [47] tidyselect_1.2.1            abind_1.4-8                
#>  [49] yaml_2.3.10                 codetools_0.2-20           
#>  [51] curl_5.2.3                  lattice_0.22-6             
#>  [53] tibble_3.2.1                withr_3.0.2                
#>  [55] KEGGREST_1.46.0             InteractionSet_1.34.0      
#>  [57] Biobase_2.66.0              basilisk.utils_1.18.0      
#>  [59] evaluate_1.0.1              Biostrings_2.74.0          
#>  [61] pillar_1.9.0                BiocManager_1.30.25        
#>  [63] filelock_1.0.3              MatrixGenerics_1.18.0      
#>  [65] stats4_4.4.1                plotly_4.10.4              
#>  [67] generics_0.1.3              vroom_1.6.5                
#>  [69] BiocVersion_3.20.0          S4Vectors_0.44.0           
#>  [71] ggplot2_3.5.1               munsell_0.5.1              
#>  [73] scales_1.3.0                glue_1.8.0                 
#>  [75] lazyeval_0.2.2              tools_4.4.1                
#>  [77] BiocIO_1.16.0               data.table_1.16.2          
#>  [79] rhdf5_2.50.0                grid_4.4.1                 
#>  [81] tidyr_1.3.1                 crosstalk_1.2.1            
#>  [83] AnnotationDbi_1.68.0        colorspace_2.1-1           
#>  [85] GenomeInfoDbData_1.2.13     basilisk_1.18.0            
#>  [87] cli_3.6.3                   rappdirs_0.3.3             
#>  [89] fansi_1.0.6                 S4Arrays_1.6.0             
#>  [91] viridisLite_0.4.2           dplyr_1.1.4                
#>  [93] gtable_0.3.6                sass_0.4.9                 
#>  [95] digest_0.6.37               SparseArray_1.6.0          
#>  [97] htmlwidgets_1.6.4           memoise_2.0.1              
#>  [99] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [101] httr_1.4.7                  mime_0.12                  
#> [103] bit64_4.5.2