scHiCcompare is designed for the imputation, joint normalization, and detection of differential chromatin interactions between two groups of chromosome-specific single-cell Hi-C datasets (scHi-C). The groups can be pre-defined based on biological conditions or created by clustering cells according to their chromatin interaction patterns. Clustering can be performed using methods like Higashi, scHiCcluster methods, etc.
scHiCcompare works with processed Hi-C data, specifically chromosome-specific chromatin interaction matrices, and accepts five-column tab-separated text files in a sparse matrix format.
The package provides two key functionalities:
scHiCcompare() function conducts a differential analysis
workflow, including imputation and normalization, using
chromosome-specific chromatin interaction matrices split into two
conditions (or two cell type groups).
Imputation - First, scHiC data in each group can
optionally undergo imputation to address data sparsity. As
resolution increases, the percentage of ‘0’ values or missing data also
rises drastically. To address this, scHiCompare applies the random
forest (RF) imputation method on each genomic distance
(no pooling) or pooled bands with the option of a pooling
technique. There are two pooling strategies:
Progressive pooling, where each subsequent band
combines interaction frequencies (IFs) within a linearly increasing
range of distances.
Fibonacci pooling, where each subsequent band
combines IFs within increasing distance ranges, and this increase
follows the Fibonacci sequence.
An example of how different pooling styles assign genomic distance into each band
Due to extreme sparsity at higher genomic distances and randomness of
long-range interactions, scHiCcompare, by default, workflow focuses on a
distance range of 1 to 10MB (main.Distance). If any pool
band falls outside this range and has a percentage of missing values
exceeding the specified threshold (default missPerc.threshold is 95%),
the missing interaction frequencies (IFs) values are imputed using the
mean IF values from the available data within that pool band.
Pseudo-bulk scHi-C - Second, imputed single-cell
Hi-C data in each group are transformed into group-specific
pseudo-bulk scHi-C matrices by summing all group-specific
single-cell Hi-C matrices.
Normalization - After obtaining two
group-specific pseudo-bulk matrices, normalization removes
global and local biases between two pseudo-bulk matrices. The scHiCcompare
workflow applies a LOESS regression model from HiCcompare
to jointly normalize two imputed pseudo-bulk matrices. Briefly,
Differential analysis -
differential analysis is performed on the processed
pseudo-bulk matrices to identify differential chromatin interactions
between the two cell types or conditions. This involves separating the
normalized log fold changes of interaction frequencies (the M-values)
into difference and non-difference groups. This analysis is performed on
a per-distance basis.
fprControl.logfc is applied to values in the difference
group, excluding low log fold changes.To use scHiCcompare, you’ll need to define two groups of cells to compare and save cell-specific scHi-C data (individual files in .txt format) in two folders.
Each cell-specific scHi-C .txt file should be formatted as modified sparse upper triangular matrices in R, which consist of five columns (chr1, start1, chr2, start2, IF). Since the full matrix of chromatin interactions is symmetric, only the upper triangular portion, including the diagonal and excluding any 0, is stored in a sparse matrix format. The required sparse matrix format of each single-cell Hi-C is:
The ‘.txt’ files need to be saved in tab-separated columns and no row names, column names, or quotes around character strings with the example format below.
## chr1 start1 chr2 start2 IF
## 17669 chr20 0 chr20 0 128
## 17670 chr20 0 chr20 1000000 1
## 17671 chr20 1000000 chr20 1000000 179
## 17672 chr20 0 chr20 2000000 1
## 17673 chr20 1000000 chr20 2000000 1
## 17674 chr20 2000000 chr20 2000000 174
To run scHiCcompare(), you need two folders with
condition-specific scHiC ‘.txt’ files. The condition-specific groups of
cells should be pre-defined based on criteria such as experimental
conditions, clustering results, or biological characteristics.
In section Others, we shows examples of steps to download and import scHi-C data into R. User can refer to it for more information.
Here is an example workflow using scHiC human brain datasets (Lee et al., 2019) with ODC and MG cell types at chromosome 20 with a 1MB resolution.
For the following example sections, we will load samples of 10
single-cell Hi-C data (in ‘.txt’) for each cell type group in two
example folders (ODCs_example and
MGs_axample). The files follow the same format as those
downloaded via download_schic() of Bandnorm.
You can extract the folder path by the code below, which could be used
as input for scHiCcompare() function.
## Load folder of ODC file path
ODCs_example_path <- system.file("extdata/ODCs_example",
package = "scHiCcompare"
)
## Load folder of MG file path
MGs_example_path <- system.file("extdata/MGs_example",
package = "scHiCcompare"
)Since the data downloaded by Bandnorm has the required input format (5 columns of [chr1, start1, chr2, start2, IF]), we don’t need an extra step for data modification. If, after importing your data into R, its format does not follow the sparse upper triangular input format requirement, you need to modify the data.
scHiCcompare(
file.path.1, file.path.2,
select.chromosome = "chr1",
imputation = "RF",
normalization = "LOESS",
differential.detect = "MD.cluster",
save.output.path = "results/",
...
)Core Parameter :
The core workflow consists of the following steps:
file.path.1, file.path.2 - Character strings specifying
paths to folders containing scHi-C data for the first and second cell
type or condition groups.select.chromosome - Integer or character indicating the
chromosome to be analyzed (e.g., ‘chr1’ or ‘chr10’.)imputation:
Handle missing values in sparse single-cell matrices. 'RF'
enables Random Forest-based imputation.normalization: Apply 'LOESS' normalization to
correct for systematic biases.differential.detect: Identify significant changes in
chromatin interactions. "MD.cluster" indicates
scHiCcompare’s differential detection test on MD plot.Optional Parameter :
In addition to core functionalities, scHiCcompare provides multiple optional parameters to customize imputation, normalization, differential analysis, and output handling.
pool.style), set the number of imputations
(n.imputation), and control iteration limits
(maxit). Additionally, users can include/exclude extreme
interaction frequency (IF) values (outlier.rm) and set a
missing data threshold (missPerc.threshold), which
determines the maximum allowable percentage of missing data in pool
bands outside the focal genomic distances.A.min parameter helps filter low
interaction frequencies, ensuring robust outlier detection during
differential analysis.fprControl.logfc and alpha regulate the false
positive rate for GMM difference clusters and define the significance
level for outlier detection.save.output.path allow external
storage of results (see output
example), while visualization settings (Plot,
Plot.normalize) generate MD plots for differential
detection and normalization effects.For further detail, user can refer to
?scHiCcompare()
In the following example, we will work with scHi-C data from 10 single cells in both ODC and MG cell types at a 1 MG resolution. We will focus on chromosome 20, applying the full workflow of scHiCcompare, which includes imputation, pseudo-bulk normalization, and differential analysis. Our goal is to detect differences for loci with genomic distances ranging from 1 to 10,000,000 bp. The progressive pooling style will be selected to create pool bands for the random forest imputation. For the differential analysis step, we will set the log fold change - false positive control threshold to 0.8.
The input file path was included in the package and conducted in the Prepare input folders section.
## Imputation with 'progressive' pooling
result <- scHiCcompare(
file.path.1 = ODCs_example_path,
file.path.2 = MGs_example_path,
select.chromosome = "chr20",
main.Distances = 1:10000000,
imputation = "RF",
normalization = "LOESS",
differential.detect = "MD.cluster",
pool.style = "progressive",
fprControl.logfc = 0.8,
Plot = TRUE
)Chromatin differential detection between ODC and MG in chromosome 20 in example above
From the visualizations above, normalization effectively reduces the irregular trend in the M values between the imputed pseudo-bulk matrices of the two cell types. At a 1MB resolution, the differential analysis reveals that most of the detected differences occur at closer genomic distances, particularly below 5MB.
The scHiCcompare() function will return an object that
contains plots, differential results, pseudo-bulk matrices, normalized
results, and imputation tables. The full differential results are
available in $Differential_Analysis. Intermediate results
can be accessed with $Intermediate, including the
imputation result table ($Intermediate$Imputation), the
pseudo-bulk matrix in sparse format
($Intermediate$PseudoBulk), and the normalization table
($Intermediate$Bulk.Normalization). These output table
objects have the following structure:
$Intermediate$PseudoBulk for each condition group
($condition1 and $condition2) has a standard
sparse upper triangular format with 3 columns of [region1, region2,
IF].
$Intermediate$Imputation for each condition group
($condition1 and $condition2) has modified
sparse upper triangular format:
$Intermediate$Bulk.Normalization has 15 columns
$Differential_Analysis has same structure as
$Intermediate$Bulk.Normalization with addition of 2
differential detection results columns
You also can have the option to save the results into the chosen
directory by a parameter in scHiCcompare() function. This will save the
normalization result table, differential result table, and imputed cell
scHi-C data (each group is a sub-folder). The sample of the saved output
folder structure is:
|– Bulk_normalization_table.txt
|– Differential_analysis_table.txt
|– Imputed_{group 1’s name}/
|– Imputed_{group 2’s name}/
The normalization result Bulk_normalization_table.txt
has the same format as the output object from the
scHiCcompare() function,
$Intermediate$Bulk.Normalization, which is shown in the
structure example below.
The differential result table
Differential_analysis_table.txt also has the same format as
the output object $Differential_Analysis from the
function.
The imputed cell’s scHiC data is saved in a folder for each group, which has a modified sparse upper triangular format of five columns [chr1, start1, chr2, start2, IF].
Below is a continuous example from Example of real anlysis above,
showing how you can extract different result options from the
scHiCcompare() function.
## -----------------------------------------------------------------
## ScHiCcompare - Differential Analysis for single cell Hi-C
## -----------------------------------------------------------------
## ScHiCcompare analyzes 10 cells of condition 1 group and 10 cells of condition 2 group at chromosome chr20.
##
## The process includes:
## Imputation, Bulk Normalization, Differential Analysis
##
## Note: See full differential result in $Differential_Analysis. Intermediate results can be accessed with $Intermediate.
### Extract imputed differential result
diff_result <- result$Differential_Analysis
DT::datatable(head(diff_result), options = list(scrollX = TRUE), width = 700)### Extract imputed pseudo bulk matrices normalization
norm_result <- result$Intermediate$Bulk.Normalization
DT::datatable(head(norm_result), options = list(scrollX = TRUE), width = 700)### Extract imputed ODC cell type table
imp_ODC_table <- result$Intermediate$Imputation$condition1
DT::datatable(head(imp_ODC_table), options = list(scrollX = TRUE), width = 700)## Extract Pseudo-bulk matrix from imputed scHi-C data
## Pseudo bulk matrix in standard sparse format
psudobulk_result <- result$Intermediate$PseudoBulk$condition1
DT::datatable(head(psudobulk_result),
options = list(scrollX = TRUE), width = 700
)Furthermore, you also have some parameter options in the function to indicate which plots to output and an option to save the results in a given directory.
There are several other functions included in
scHiCcompare package.
plot_HiCmatrix_heatmap() produces a heatmap
visualization for HiC and scHiC matrices. It requires, as input, a
modified sparse matrix, the same format from scHiCcompare()
Input with five columns of chr1, start1, chr2
start2, IF. More information can be found in its help document and the
example below.
data("ODC.bandnorm_chr20_1")
plot_HiCmatrix_heatmap(
scHiC.sparse = ODC.bandnorm_chr20_1,
main = "Figure 3. Heatmap of a single cell matrix", zlim = c(0, 5)
)## Matrix dimensions: 63x63
plot_imputed_distance_diagnostic() generates a
diagnostic visualization of imputation across genomic distances for all
single cells. It compares the distribution of all cells’ interaction
frequency at a given distance data before and after imputation. It
requires, as input, the scHiC table format of the original and imputed
scHiC datasets. ScHiC table format includes columns of genomic loci
coordinates and interaction frequencies (IF) of each cell (cell,
chromosome, start1, end1, IF1, IF2, IF3, etc).
The output of $Intermediate$Imputation of
scHiCcompare() function is directly compatible with this
format. For more details, see the sections on Output)
# Extract imputed table result
imp_MG_table <- result$Intermediate$Imputation$condition2
imp_ODC_table <- result$Intermediate$Imputation$condition1We need to create the table input for original IFs values in the same format. Below is a continuous example from Example of real anlysis above, showing how you can construct scHiC table for original IF values and compare them with the output of imputed IF values.
# Create scHiC table object for original ODC interaction frequencies (IF)
scHiC.table_ODC <- imp_ODC_table[c("region1", "region2", "cell", "chr")]
# List all files in the specified directory for original ODC data
file.names <- list.files(
path = ODCs_example_path,
full.names = TRUE, recursive = TRUE
)
# Loop through each file to read and merge data
for (i in 1:length(file.names)) {
# Read the current file into a data frame
data <- read.delim(file.names[[i]])
names(data) <- c("chr", "region1", "chr2", "region2", paste0("IF_", i))
data <- data[, names(data) %in%
c("chr", "region1", "region2", paste0("IF_", i))]
# Merge the newly read data with the existing scHiC.table_ODC
scHiC.table_ODC <- merge(scHiC.table_ODC, data,
by = c("region1", "region2", "chr"), all = TRUE
)
}
# Create scHiC table object for original MG interaction frequencies (IF)
scHiC.table_MG <- imp_MG_table[c("region1", "region2", "cell", "chr")]
# List all files in the specified directory for original MG data
file.names <- list.files(
path = MGs_example_path,
full.names = TRUE, recursive = TRUE
)
# Loop through each file to read and merge data
for (i in 1:length(file.names)) {
# Read the current file into a data frame
data <- read.delim(file.names[[i]])
names(data) <- c("chr", "region1", "chr2", "region2", paste0("IF_", i))
data <- data[, names(data) %in%
c("chr", "region1", "region2", paste0("IF_", i))]
# Merge the newly read data with the existing scHiC.table_MG
scHiC.table_MG <- merge(scHiC.table_MG, data,
by = c("region1", "region2", "chr"), all = TRUE
)
}# plot imputed Distance Diagnostic of MG
plot1 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 1
)
plot2 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 2
)
plot3 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 3
)
plot4 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 4
)
gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)Distribution diagnostic plot of imputed MG cells in some genomic distance
The diagnostic visualizations demonstrate that with a sample of only 10 single cells per group (note: this small sample size is for demonstration purposes only), the imputed values for MG closely match the original distribution only at shorter genomic distances (e.g., D1, D2). Increasing the number of single cells per group enhances imputation accuracy across distances. We recommend using a minimum of 80 single cells per group for optimal imputation performance.
To find and download single-cell Hi-C (scHi-C) data, you can use publicly available repositories and databases that host this type of data. Common sources include the Gene Expression Omnibus (GEO), the 4D Nucleome Data Portal (4DN), or data published in research articles, etc. Below are some examples of how to access and download scHi-C data.
You can search data on GEO using queries such as
single-cell Hi-C, scHi-C, or a GEO (GSE)
series numbers (e.g., GSE80006),
etc. Once you select a dataset, processed data (.txt, .csv, .bed, .hic,
etc.)
can be found under the “Supplementary files” section and downloaded via
the FTP links at the bottom of the page. Additionally, GEO offers various
download formats using different mechanisms. For more details about
downloading data in different formats, visit the GEO download guide: https://www.ncbi.nlm.nih.gov/geo/info/download.html.
You can also download these data by R. The example below shows steps to download the mouse scHi-C dataset (Flyamer et al.2017) on GEO.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("GEOquery")library(GEOquery)
# For example, we want to download (Flyamer et al.2017) data
geo_id <- "GSE80006"
# Download the information, notation, feature data, etc
gse <- getGEO(geo_id, GSEMatrix = TRUE)
# You can read more about this function
?getGEO()
# Download and extract the supplementary files (processed data)
getGEOSuppFiles(geo_id, baseDir = "path/to/save")Other sources:
Some research papers provide data from external sources, which are usually mentioned in the paper. For example, human brain datasets (Lee et al., 2019) are available through a public box directory located https://salkinstitute.box.com/s/fp63a4j36m5k255dhje3zcj5kfuzkyj1.
Additionally, some tools collect scHi-C data from various studies, like https://noble.gs.washington.edu/proj/schic-topic-model/. Below is an example of an R function from Bandnorm, which also accesses existing single-cell Hi-C data at a 1mbp resolution
To download human brain oligodendrocytes (ODC) and microglia (MG)
cell type (Lee et al., 2019), we used the download_schic()
function of Bandnorm
package to download the scHiC data of ODC and MG cell types groups in
1MB resolution.
After downloading scHi-C data, the next step is to import the data into R for analysis.
ScHi-C data is available in various formats from different sources. Below are examples of how to extract chromosome-specific data for analysis in R.
.bedepe, .csv, or .txt formats :If your raw scHiC data has been processed to ‘.bedepe’, ‘.csv’, or
‘.txt’ formats, it can be read using read.delim(),
read.table(), etc. Once the data is loaded, if you have
full Hi-C contact matrices, you can convert them to a sparse upper
triangular format using the full2sparse() function of the
HiCcompare
package, then reformatting the columns to achieve the sparse upper
triangular input format.
.hic format:To access and import .hic files into R, you can use
tools such as strawr for
reading and processing .hic files. You can read other
similar example in the HiCcompare
package vignette
An example of reading .hic with strawr is
shown below. straw() reads the .hic file of
each single-cell Hi-C and outputs a data.frame in a sparse upper
triangular format. This step must be repeated for single-cell Hi-C of
the same cell type (condition) group.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("strawr")library(strawr)
# Example to read the contact matrix from a .hic file of a single-cell
filepath <- "path/to/your/schic.hic"
contact_matrix <- straw(
norm = "NONE", # Normalization method (KR, VC, or NONE)
filepath, # Path to .hic file
chr1loc = "chr1", # Chromosome 1
chr2loc = "chr1", # Chromosome 2 (intra-chromosomal interactions)
unit = "BP", # Base pair (BP) resolution or fragment (FRAG) resolution
binsize = 200000 # Bin size (e.g., 200kb)
).cool format:To access and import .cool files into R, you can use
cooler2bedpe() function of HiCcompare
package or cooler
to access the data. You can read example of cooler
on HiCcompare
vignette
For example, the files can be read directly into R by
cooler2bedpe() function, which will return a list object in
the format of BEDPE, containing two elements: “cis” - Contains the
intra-chromosomal contact matrices, one per chromosome; “trans” -
Contains the inter-chromosomal contact matrix. You can read about this
in more detail by ?cooler2bedpe().
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mclust_6.1.1 data.table_1.17.8 lattice_0.22-7 gridExtra_2.3 ggplot2_4.0.0 tidyr_1.3.1 HiCcompare_1.31.0 dplyr_1.1.4 scHiCcompare_1.3.0 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.4 DBI_1.2.3 rlang_1.1.6 magrittr_2.0.4 miceadds_3.18-36 matrixStats_1.5.0 compiler_4.5.1 mgcv_1.9-3 vctrs_0.6.5 pkgconfig_2.0.3 shape_1.4.6.1 fastmap_1.2.0 backports_1.5.0 XVector_0.49.3 labeling_0.4.3 rmarkdown_2.30 nloptr_2.2.1 purrr_1.1.0 xfun_0.53 glmnet_4.1-10 jomo_2.7-6 cachem_1.1.0 jsonlite_2.0.0 rhdf5filters_1.23.0 DelayedArray_0.37.0 pan_1.9 Rhdf5lib_1.33.0 BiocParallel_1.43.4 broom_1.0.10 parallel_4.5.1 R6_2.6.1 bslib_0.9.0 RColorBrewer_1.1-3 ranger_0.17.0 car_3.1-3
## [36] boot_1.3-32 rpart_4.1.24 GenomicRanges_1.63.0 jquerylib_0.1.4 Rcpp_1.1.0 Seqinfo_0.99.4 SummarizedExperiment_1.39.2 iterators_1.0.14 knitr_1.50 IRanges_2.45.0 Matrix_1.7-4 splines_4.5.1 nnet_7.3-20 tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10 codetools_0.2-20 tibble_3.3.0 withr_3.0.2 InteractionSet_1.37.1 Biobase_2.69.1 S7_0.2.0 evaluate_1.0.5 survival_3.8-3 pillar_1.11.1 BiocManager_1.30.26 MatrixGenerics_1.21.0 carData_3.0-5 mice_3.18.0 KernSmooth_2.23-26 DT_0.34.0 foreach_1.5.2 stats4_4.5.1 reformulas_0.4.2 generics_0.1.4
## [71] S4Vectors_0.47.6 scales_1.4.0 minqa_1.2.8 gtools_3.9.5 glue_1.8.0 pheatmap_1.0.13 maketools_1.3.2 tools_4.5.1 sys_3.4.3 lme4_1.1-37 buildtools_1.0.0 rhdf5_2.55.0 grid_4.5.1 mitools_2.4 crosstalk_1.2.2 rbibutils_2.3 nlme_3.1-168 Formula_1.2-5 cli_3.6.5 S4Arrays_1.9.3 gtable_0.3.6 rstatix_0.7.3 sass_0.4.10 digest_0.6.37 BiocGenerics_0.55.4 SparseArray_1.9.2 htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4 mitml_0.4-5 MASS_7.3-65