htmltools::img(
src=knitr::image_uri("cypress_official.png"),
alt="logo",
style="position:absolute; top:0; right:0; padding:10px; height:280px"
)Bulk RNA-sequencing technology has now been routinely adopted in clinical studies to investigate transcriptome profiles alterations associated with phenotypes-of-interest. In the past several years, computational deconvolution methods were developed to solve for cell mixture proportions. More recently, identifying csDE genes has been made possible by methodology extensions of cell-type deconvolution. Currently, due to the lack of experimental design and statistical power assessment tool for csDE analysis, clinical researchers are facing difficulties in using csDE gene detection methods in practice. One of the major difficulties is to determine a proper sample sizes required for a well-powered experimental designs. In a real RNA-sequencing study, effect size, gene expression level, biological and technical variation all impact statistical power determinations. Rigorous experimental design requires the statistical power assessment method that considers these factors. However, such a method is yet to be developed.
Here we present a statistical framework and tool cypress (cell-type-specific differential expression power assessment) to guide experimental design, assess statistical power, and optimize sample size requirements for csDE analysis. We adopt a rigorous statistical framework to reliably model purified cell-type-specific profiles, cell type compositions, biological and technical variations; provide thorough assessment using multiple statistical metrics through simulations; and provide a user-friendly tool to researchers for their own data simulation and power evaluation.
To use the cypress package, users have the option to provide their own bulk-level RNA-sequencing data, allowing the package to estimate distribution parameters. Alternatively, the package includes three predefined distributional parameter sets from which users can choose from, should the users have no pilot datasets. Specifically, the RNA-sequencing data simulation framework is based on a Gamma-Poisson compound that aligns with our previous benchmark study. The csDE genes calling process is implemented by TOAST, a leading method in this domain. We will stratify genes into distinct strata based on their expression levels and evaluate the power within each stratum. We particularly focus on examining the influence and the impact of sample size and sequencing depth, giving special consideration to genes with a low signal-to-noise ratio in sequencing data.
From Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("cypress")To view the package vignette in HTML format, run the following lines in R:
cypress offers the function simFromData when users want
to provide their own bulk RNA-seq data for parameter estimation and
simulation. simFromData needs the input file organized into
SummarizedExperiment
objects including a count matrix, study design, and an optional cell
type proportion matrix. An example data ASD_prop_se is
included:
library(cypress)
library(BiocParallel)
data(ASD_prop_se)
result1 <- simFromData(INPUTdata=ASD_prop, #SummarizedExperiment object.
CT_index=(seq_len(6) + 2), #Column index for cell types proportion matrix.
CT_unk=FALSE, #CT_unk should be True if no cell type proportion matrix.
n_sim=2, #Total number of iterations users wish to conduct.
n_gene=1000, #Total number of genetic features users with to conduct.
DE_pct=0.05, #Percentage of DEG on each cell type.
ss_group_set=c(8,10), #Sample sizes per group users wish to simulate.
lfc_set=c(1,1.5), #Effect sizes users wish to simulate
DEmethod = "TOAST" # DE methods we used
)Alternatively, the simFromData function needs to provide
real data, quickPower() outputs power calculation results
without the need to input pilot RNA-seq data. This function is designed
to provide quick access to pre-calculated results.
library(cypress)
data(quickPowerGSE60424)
result2<-quickPower(data="IAD") # Options include 'IAD', 'IBD', or 'ASD'.The following function produces basic line plots of power evaluation
results conducted by quickPower() for visualization:
### plot statistical power results
plotPower(simulation_results=result2,# Simulation results generated by quickPower() or simFromData()
effect.size=1,# A numerical value indicating which effect size is to be fixed.
sample_size=10 )# A numerical value indicating which sample size to be fixed.There are 3 options for users to utilize cypress functions for parameter estimation and simulation:
Option 1: Users wish to use their own real data for
estimating parameters. This situation fits well to research project
where a small pilot RNA-seq dataset has been collected, and the users
want to use the information embedded in the pilot dataset to conduct
rigorous experimental design. The input file needs to be organized into
a SummarizedExperiment object. The object should contain a
feature by sample count matrix stored in the counts slot.
It should also use the first column in the colData slot
with a column name of disease to store the group status
(i.e. case/control), where the controls are indicated by 1 and the cases
are indicated by 2. The second column in the colData slot
to stores the subject IDs mapping to each sample. The remaining columns
in the colData slot should store the cell type proportions.
If provided, the cell type proportions should sum up to 1 for each
sample. This cell type proportion matrix is helpful but optional.
## class: SummarizedExperiment
## dim: 3000 48
## metadata(0):
## assays(1): counts
## rownames(3000): ENSG00000214727 ENSG00000214941 ... ENSG00000150471
## ENSG00000198759
## rowData names(0):
## colnames(48): SRR1576466 SRR1576467 ... SRR1576512 SRR1576513
## colData names(8): disease sample_id ... Celltype5 Celltype6
Option 2: Users do not have pilot RNA-seq data for csDE analysis, and wish to use existing studies for parameter estimation and conduct customized downstream simulation. cypress provides 3 real datasets for estimating the parameters:
IAD: Immune-related disease (IAD) study (GSE60424).
Whole transcriptome signatures of 6 immune cell types, including
neutrophils, monocytes, B cells, CD4 T cells, CD8 T cells, and natural
killer cells.
ASD: The bulk RNA-seq data in a large autism
spectrum disorder (ASD) study. The study includes 251 samples of
frontal/temporal cortex and cerebellum brain regions. Samples are from
24 ASD subjects versus 24 controls.
IBD: The ileal transcriptome in pediatric
inflammatory bowel disease(IBD) study (GSE57945). This dataset includes
a cohort of 359 treatment-naïve pediatric patients with Crohn’s disease
(CD, n=213), ulcerative colitis (UC, n=60) and healthy controls
(n=41).
Option 3: Users have no pilot data to provide but wish to see results generative based on our own simulation, and have some quick experimental design references to check.
cypress offers 3 functions for simulation and power evaluation:
simFromData() ,simFromParam() and
quickPower(). If users have their own bulk RNA-seq count
data, they can use the simFromData() function, otherwise,
they can use simFromParam() estimating parameters from 3
existing studies to perform power evaluation on customized the
simulation settings. If users prefer to quickly examine the power
evaluation results using the built-in datasets, they can use the
quickPower() function. The output of these 3 functions is a
S4 object with a list of power measurements under various experimental
settings, such as Statistical Power, TDR, and FDC.
simFromData() enables users to provide their own data or
use the example data data(ASD_prop_se) , they can also
specify the number of simulations (nsim), number of
genes(n_gene), percentage of differential expressed genes
for each cell type(DE_pct), sample sizes
(ss_group_set), and log fold change
(lfc_set).
Unlike simFromData() which needs user to provide their
own count matrix and often takes a while to run,
simFromParam() produces results faster by extracting
parameters from three existing studies. simFromParam() has
3 pre-estimated datasets for users to choose, they are IAD,
ASD and IBD. Additionally, users can also
define their own simulation settings.
quickPower() runs even faster than
simFromData() and simFromParam(), as it
outputs the pre-evaluated results directly. Users only need to choose
which study result they want the program to output in the
data argument.
cypress uses power evaluation metrics, for each experimental scenario, of the following:
Power: Statistical power.
TDR: The ratio of the number of true positives to the number of significant discoveries.
FDC: The ratio of the number of false positives to the number of true positives, among significant discoveries.
Once users have obtained an S4 object with a list of power
measurements using either simFromData,
simFromParam or quickPower(), they can use the
following functions to generate basic line plots:
plotPower(), plotTDR() and
plotFDC().
plotPower(): Generates 6 plots in a 2x3
panel. The illustration of each plot from left to right and from up to
bottom is as follows:
1: Statistical power by effect size, each line represents sample size. Statistical power was the average value across cell types.
2: Statistical power by effect size, each line represents cell type.
Statistical power was calculated under the scenario of sample size to be
fixed at 10 if sample_size=10.
3: Statistical power by sample size, each line represents cell type.
Statistical power was calculated under the scenario of effect size to be
fixed at 1 if effect.size=1.
4: Statistical power by strata, each line represents cell type.
Statistical power was calculated under the scenario of sample size to be
fixed at 10 and effect size to be fixed at 1 if
sample_size=10 and effect.size=1.
5: Statistical power by strata, each line represents sample size.
Statistical power was the average value across cell types and under the
scenario of effect size to be fixed at 1 if
effect.size=1.
6: Statistical power by strata, each line represents effect size.
Statistical power was the average value across cell types and under the
scenario of sample size to be fixed at 10 if
sample_size=10.
### plot statistical power results
plotPower(simulation_results=quickPower,effect.size=1,sample_size=10 )plotTDR(): Generates 4 plots in a 2x2
panel. The illustration of each plot from left to right and from up to
bottom is as follow:
1: True discovery rate(TDR) by top-rank genes, each line represents
one cell type. TDR was calculated under the scenario of sample size to
be fixed at 10 and effect size to be fixed at 1 if
sample_size=10 and effect.size=1.
2: True discovery rate(TDR) by top-rank genes, each line represents
one effect size. TDR was the average value across cell types and under
the scenario of sample size to be fixed at 10 if
sample_size=10.
3: True discovery rate(TDR) by top-rank genes, each line represents
one sample size. TDR was the average value across cell types and under
the scenario of effect size to be fixed at 1 if
effect.size=1.
4: True discovery rate(TDR) by effect size, each line represents one sample size. TDR was calculated under the scenario of top rank gene equals 350.
plotFDC(): Generates 2 plots in a 1x2
panel. The illustration of each plot from left to right is as
follow:
1: False discovery cost(FDC) by effect size, each line represents one
cell type. FDC was calculated under the scenario of sample size to be
fixed at 10 if sample_size=10.
2: False discovery cost(FDC) by top effect size, each line represent one sample size. FDC was the average value across cell types.
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.45.0 cypress_1.7.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] pbapply_1.7-4 formatR_1.14
## [3] CDM_8.3-14 rlang_1.1.6
## [5] magrittr_2.0.4 matrixStats_1.5.0
## [7] e1071_1.7-16 compiler_4.5.1
## [9] gdata_3.0.1 vctrs_0.6.5
## [11] quadprog_1.5-8 stringr_1.5.2
## [13] pkgconfig_2.0.3 fastmap_1.2.0
## [15] backports_1.5.0 XVector_0.51.0
## [17] rmarkdown_2.30 pracma_2.4.6
## [19] nloptr_2.2.1 preprocessCore_1.73.0
## [21] purrr_1.1.0 xfun_0.54
## [23] cachem_1.1.0 jsonlite_2.0.0
## [25] sirt_4.2-133 EpiDISH_2.27.0
## [27] DelayedArray_0.37.0 gmodels_2.19.1
## [29] parallel_4.5.1 R6_2.6.1
## [31] bslib_0.9.0 stringi_1.8.7
## [33] RColorBrewer_1.1-3 limma_3.67.0
## [35] GGally_2.4.0 GenomicRanges_1.63.0
## [37] jquerylib_0.1.4 Rcpp_1.1.0
## [39] Seqinfo_1.1.0 SummarizedExperiment_1.41.0
## [41] iterators_1.0.14 knitr_1.50
## [43] IRanges_2.45.0 polycor_0.8-1
## [45] Matrix_1.7-4 nnls_1.6
## [47] splines_4.5.1 tidyselect_1.2.1
## [49] abind_1.4-8 yaml_2.3.10
## [51] doParallel_1.0.17 codetools_0.2-20
## [53] admisc_0.39 lattice_0.22-7
## [55] tibble_3.3.0 Biobase_2.71.0
## [57] S7_0.2.0 evaluate_1.0.5
## [59] lambda.r_1.2.4 futile.logger_1.4.3
## [61] ggstats_0.11.0 proxy_0.4-27
## [63] pillar_1.11.1 BiocManager_1.30.26
## [65] MatrixGenerics_1.23.0 checkmate_2.3.3
## [67] foreach_1.5.2 stats4_4.5.1
## [69] generics_0.1.4 S4Vectors_0.49.0
## [71] ggplot2_4.0.0 TOAST_1.25.0
## [73] scales_1.4.0 gtools_3.9.5
## [75] class_7.3-23 glue_1.8.0
## [77] maketools_1.3.2 tools_4.5.1
## [79] sys_3.4.3 data.table_1.17.8
## [81] locfit_1.5-9.12 buildtools_1.0.0
## [83] mvtnorm_1.3-3 grid_4.5.1
## [85] tidyr_1.3.1 matrixcalc_1.0-6
## [87] edgeR_4.9.0 PROPER_1.43.0
## [89] TAM_4.3-25 config_0.3.2
## [91] cli_3.6.5 rsvd_1.0.5
## [93] futile.options_1.0.1 S4Arrays_1.11.0
## [95] dplyr_1.1.4 corpcor_1.6.10
## [97] gtable_0.3.6 DESeq2_1.51.0
## [99] sass_0.4.10 digest_0.6.37
## [101] BiocGenerics_0.57.0 SparseArray_1.11.1
## [103] farver_2.1.2 htmltools_0.5.8.1
## [105] lifecycle_1.0.4 TCA_1.2.1
## [107] locfdr_1.1-8 statmod_1.5.1
## [109] mime_0.13 MASS_7.3-65