1 Introduction

The EpiDISH package provides tools to infer the fractions of a priori known cell subtypes present in a sample that is a mixture of such cell-types. Inference proceeds via one of three methods, chosen by the user: Robust Partial Correlations (RPC) (Teschendorff et al. 2017), Cibersort (CBS) (Newman et al. 2015), or Constrained Projection (CP) (Houseman et al. 2012). The package also provides the function CellDMC, which identifies differentially methylated cell-types in Epigenome-Wide Association Studies (EWAS) (Zheng, Breeze, et al. 2018).

The package currently contains 4 references: two whole blood subtype references, one generic epithelial reference with epithelial cells, fibroblasts, and total immune cells, and one reference for breast tissue, as described in (Teschendorff et al. 2017) and (Zheng, Webster, et al. 2018).
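To make the RPC idea more concrete, the following is a minimal sketch on simulated data, not the package's own implementation. It assumes that RPC amounts to a robust linear regression of a sample's beta values on the reference profiles (here via rlm from the MASS package), followed by setting negative coefficients to zero and rescaling so the fractions sum to one. The reference matrix, the true fractions, and the noise level are all hypothetical.

```r
library(MASS)  # provides rlm(); MASS ships with standard R distributions

set.seed(1)

# Hypothetical reference: 200 probes x 3 cell-types (NOT the package's centEpiFibIC.m)
n.probes <- 200
ref.m <- matrix(runif(n.probes * 3), nrow = n.probes,
                dimnames = list(paste0("cg", seq_len(n.probes)),
                                c("Epi", "Fib", "IC")))

# Simulate one mixed sample with known fractions plus small Gaussian noise
true.frac <- c(Epi = 0.2, Fib = 0.3, IC = 0.5)
beta.v <- as.vector(ref.m %*% true.frac) + rnorm(n.probes, sd = 0.01)

# Robust regression of the sample profile on the reference profiles
fit <- rlm(beta.v ~ ref.m, maxit = 50)
coef.v <- coef(fit)[-1]            # drop the intercept term
names(coef.v) <- colnames(ref.m)

# Assumed post-processing: clip negative weights, then normalise to sum to 1
coef.v[coef.v < 0] <- 0
est.frac <- coef.v / sum(coef.v)

print(round(est.frac, 3))
```

In practice the epidish function handles probe matching and all samples at once, so this sketch is only meant to illustrate the kind of computation behind method = "RPC".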

2 How to estimate cell-type fractions using DNAm data

To show how to use our package, we include a dummy beta value matrix, DummyBeta.m, which contains 2000 CpGs and 10 samples.

We first load the EpiDISH package, DummyBeta.m and the EpiFibIC reference.

library(EpiDISH)
data(centEpiFibIC.m)
data(DummyBeta.m)

Notice that centEpiFibIC.m has 3 columns, named Epi, Fib and IC. We now use the epidish function in RPC mode to infer the cell-type fractions.

out.l <- epidish(beta.m = DummyBeta.m, ref.m = centEpiFibIC.m, method = "RPC") 

Then, we check the output list. estF is the matrix of estimated cell-type fractions. ref is the reference centroid matrix used, and dataREF is the subset of the input data matrix over the probes defined in the reference matrix.

out.l$estF
##            Epi        Fib           IC
## S1  0.08836819 0.06109607 0.8505357378
## S2  0.07652115 0.57326994 0.3502089007
## S3  0.15417391 0.75663136 0.0891947251
## S4  0.77082647 0.04171941 0.1874541181
## S5  0.03960599 0.31921224 0.6411817742
## S6  0.12751711 0.79642919 0.0760537000
## S7  0.18144315 0.72889883 0.0896580171
## S8  0.20220823 0.40929344 0.3884983293
## S9  0.19398079 0.80540932 0.0006098973
## S10 0.27976647 0.23671333 0.4835201992
dim(out.l$ref)
## [1] 599   3
dim(out.l$dataREF)
## [1] 599  10

In the quality control step of DNAm data preprocessing, we might remove bad probes from the full set of probes on the 450k or 850k array; consequently, not all probes in the reference may be present in the given dataset. By checking ref and dataREF, we can extract the probes actually used to build the model and infer the cell-type fractions. If the majority of the probes in the reference cannot be found, the estimated fractions might be compromised.
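As a rough illustration of this check, the sketch below computes which reference probes are present in a data matrix and what fraction of the reference they cover. The two matrices are hypothetical toy inputs; with real output one would presumably compare rownames(out.l$dataREF) against the row names of the full reference, assuming the returned matrices keep their probe names as row names.

```r
set.seed(1)

# Hypothetical reference with 5 probes and 3 cell-types
ref.m <- matrix(runif(15), nrow = 5,
                dimnames = list(paste0("cg", 1:5), c("Epi", "Fib", "IC")))

# Hypothetical data matrix after QC: cg3 and cg5 were removed, cg9 is not in the reference
beta.m <- matrix(runif(12), nrow = 4,
                 dimnames = list(c("cg1", "cg2", "cg4", "cg9"), paste0("S", 1:3)))

# Probes shared by reference and data are the ones a model could use
used.probes <- intersect(rownames(ref.m), rownames(beta.m))
missing.probes <- setdiff(rownames(ref.m), rownames(beta.m))

# Fraction of reference probes that survived QC (3 of 5 here)
coverage <- length(used.probes) / nrow(ref.m)
print(coverage)
```

A low coverage value would be a warning sign that the estimated fractions rest on too few reference probes.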

We now show an example of using our package to estimate cell-type fractions in whole blood. We use a subset of the beta value matrix of GSE42861 (see the manual page of LiuDataSub.m for a detailed description).

data(LiuDataSub.m)
data(centDHSbloodDMC.m)
BloodFrac.m <- epidish(beta.m = LiuDataSub.m, ref.m = centDHSbloodDMC.m, method = "RPC")$estF

We can easily check the inferred fractions with boxplots. As expected, the boxplots show that neutrophils are the major cell-type in whole blood.

boxplot(BloodFrac.m)
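The same check can be done numerically. The sketch below uses a small hypothetical fraction matrix (the values and column names are made up for illustration, not taken from BloodFrac.m) to verify that each sample's fractions sum to one and to rank cell-types by their mean fraction.

```r
# Hypothetical fraction matrix: rows are samples, columns are cell-types
frac.m <- rbind(S1 = c(B = 0.05, CD4T = 0.15, Mono = 0.10, Neutro = 0.70),
                S2 = c(B = 0.10, CD4T = 0.20, Mono = 0.05, Neutro = 0.65))

# Each sample's fractions should sum to 1 (up to floating-point error)
stopifnot(all(abs(rowSums(frac.m) - 1) < 1e-8))

# Rank cell-types by mean fraction across samples
mean.frac <- sort(colMeans(frac.m), decreasing = TRUE)
dominant <- names(mean.frac)[1]
print(mean.frac)
print(dominant)
```

Applied to a real fraction matrix such as BloodFrac.m, the same two lines would give a quick summary to read alongside the boxplots.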

3 How to estimate cell-type fractions in the two-step framework

HEpiDISH is an iterative hierarchical extension of EpiDISH. It uses two distinct DNAm references: a primary reference to estimate the fractions of several cell-types, and a separate, non-overlapping secondary reference to estimate the fractions of the underlying subtypes of one of the cell-types in the primary reference.
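The arithmetic behind such a two-step scheme can be sketched as follows. This is an illustration under a stated assumption, not a description of the package internals: it assumes that the subtype fractions from the secondary step are expressed relative to the parent cell-type and are rescaled by the parent's fraction from the primary step. All values and subtype names below are hypothetical.

```r
# Hypothetical primary-step fractions (Epi, Fib, and total immune cells IC)
primary.m <- rbind(S1 = c(Epi = 0.6, Fib = 0.1, IC = 0.3),
                   S2 = c(Epi = 0.2, Fib = 0.3, IC = 0.5))

# Hypothetical secondary-step fractions: immune subtypes relative to total IC
sub.m <- rbind(S1 = c(B = 0.2, CD4T = 0.3, Neutro = 0.5),
               S2 = c(B = 0.1, CD4T = 0.5, Neutro = 0.4))

# Assumed combination rule: scale each sample's subtype fractions by its IC
# fraction, then replace the IC column with the rescaled subtypes
scaled.sub.m <- sub.m * primary.m[, "IC"]
combined.m <- cbind(primary.m[, c("Epi", "Fib")], scaled.sub.m)

print(combined.m)
print(rowSums(combined.m))
```

Under this assumption the combined fractions still sum to one per sample, and the immune subtype columns add up to the original IC fraction.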