augere.screen 0.99.2
augere.screen implements an augere pipeline to detect differentially abundant barcodes in a functional screen.
Given a matrix of barcode-by-sample counts, this uses voom() from the limma package to test for differences between conditions.
It then consolidates barcode-level statistics into per-gene inferences with a variety of statistical methods.
The pipeline also creates a self-contained Rmarkdown report that documents each step in the analysis.
To install this package, follow the usual instructions for Bioconductor:
install.packages("BiocManager") # if not already available
BiocManager::install("augere.screen")
Now let’s mock up a dataset from a functional screen. This contains a count matrix of barcode abundances, where each row is a barcode and each column is a sample. Each barcode is associated to a target gene via a 1:many mapping. Some barcodes are also negative controls, i.e., they do not target any gene.
make_mock_dataset <- function() {
set.seed(100)
mu <- 2^runif(1000, 3, 6)
grouping <- rep(c("A", "B", "ctrl"), each=3)
counts <- matrix(rnbinom(length(mu)* length(grouping), mu=mu, size=20), ncol=length(grouping))
rownames(counts) <- sprintf("BARCODE-%s", seq_along(mu))
library(SummarizedExperiment)
se <- SummarizedExperiment(list(counts=counts), colData=DataFrame(group=grouping))
rowData(se)$type <- "other"
rowData(se)$gene <- paste0("GENE-", sample(LETTERS, length(mu), replace=TRUE))
rowData(se)$type[1:50] <- "NEG"
rowData(se)$gene[1:50] <- "NEG"
se
}
se <- make_mock_dataset()
se
## class: SummarizedExperiment
## dim: 1000 9
## metadata(0):
## assays(1): counts
## rownames(1000): BARCODE-1 BARCODE-2 ... BARCODE-999 BARCODE-1000
## rowData names(2): type gene
## colnames: NULL
## colData names(1): group
The runScreen() function uses voom() to test for differential abundance across a comparison of interest.
Let’s say we want to compare A versus B:
library(augere.screen)
output.dir <- tempfile()
res <- runScreen(
se,
group="group",
comparison=c("A", "B"),
gene.field="gene",
filter.reference.factor="group",
filter.reference.levels="ctrl",
norm.control.type.field="type",
norm.control.types="NEG",
output.dir=output.dir
)
# List of tables of barcode-level differential results.
names(res$barcode)
## [1] "Increase in `A` over `B`"
res$barcode[[1]]
## DataFrame with 1000 rows and 7 columns
## gene LogFC AveAb t PValue FDR
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## BARCODE-1 NEG -0.445506 9.14812 -1.09034 0.2756355 0.979159
## BARCODE-2 NEG -0.556312 9.02579 -1.31320 0.1891996 0.979159
## BARCODE-3 NEG 0.593803 9.97191 1.73677 0.0825132 0.979159
## BARCODE-4 NEG NA NA NA NA NA
## BARCODE-5 NEG 0.251991 9.84108 0.72858 0.4663056 0.979159
## ... ... ... ... ... ... ...
## BARCODE-996 GENE-H 0.0682531 10.4678 0.2127503 0.831534 0.979159
## BARCODE-997 GENE-N NA NA NA NA NA
## BARCODE-998 GENE-X NA NA NA NA NA
## BARCODE-999 GENE-D -0.3067536 9.1611 -0.7524626 0.451822 0.979159
## BARCODE-1000 GENE-W 0.0178532 10.4276 0.0555378 0.955713 0.990915
## B
## <numeric>
## BARCODE-1 -4.59061
## BARCODE-2 -4.57672
## BARCODE-3 -4.51731
## BARCODE-4 NA
## BARCODE-5 -4.61479
## ... ...
## BARCODE-996 -4.63948
## BARCODE-997 NA
## BARCODE-998 NA
## BARCODE-999 -4.60827
## BARCODE-1000 -4.64117
# List of tables of gene-level statistics.
names(res$gene)
## [1] "simes" "fry" "holm-min"
# Copy of 'se' with normalized abundance values.
res$normalized
## class: SummarizedExperiment
## dim: 1000 6
## metadata(0):
## assays(2): counts logcounts
## rownames(1000): BARCODE-1 BARCODE-2 ... BARCODE-999 BARCODE-1000
## rowData names(3): type gene retained
## colnames: NULL
## colData names(3): group lib.size norm.factors
The function generates an Rmarkdown report containing (almost) all of the steps required to reproduce the analysis. Let’s have a look at it:
fname <- file.path(output.dir, "report.Rmd")
cat(head(readLines(fname), 50), sep="\n")
## ---
## title: Differential abundance analysis of functional screens
## author:
## - biocbuild
## date: "`r Sys.Date()`"
## output: BiocStyle::html_document
## ---
##
## ```{r, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
##
## # Data loading
##
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a barcode while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## original <- se # making a copy for downstream use.
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
## This ensures that our variance estimates are not affected by irrelevant variability in other groups.
##
## ```{r}
## se <- se[,SummarizedExperiment::colData(se)[,"group"] %in% c("A", "B")]
## ncol(se)
## ```
##
## We extract the information into a `DGEList` object in preparation for analysis.
##
## ```{r}
## y <- edgeR::DGEList(SummarizedExperiment::assay(se, 1))
##
## # Also tracking the original row index from the SummarizedExperiment,
## # so that we can match them up after filtering out low-abundance barcodes.
## y$genes <- data.frame(origin=seq_len(nrow(se)))
## ```
##
## Finally, we build the design matrix.
##
## ```{r}
## model.data <- list()
## cd <- SummarizedExperiment::colData(se)
## model.data$group. <- factor(cd[,"group"])
## design <- model.matrix(~ 0 + group., data=model.data)
The function will also save the results to file inside a results subdirectory, which can be loaded back into the session using the readResult() function:
reloaded <- readResult(file.path(output.dir, "results", "barcode-1"))
reloaded$x # the result itself
## DataFrame with 1000 rows and 7 columns
## gene LogFC AveAb t PValue FDR
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## BARCODE-1 NEG -0.445506 9.14812 -1.09034 0.2756355 0.979159
## BARCODE-2 NEG -0.556312 9.02579 -1.31320 0.1891996 0.979159
## BARCODE-3 NEG 0.593803 9.97191 1.73677 0.0825132 0.979159
## BARCODE-4 NEG NA NA NA NA NA
## BARCODE-5 NEG 0.251991 9.84108 0.72858 0.4663056 0.979159
## ... ... ... ... ... ... ...
## BARCODE-996 GENE-H 0.0682531 10.4678 0.2127503 0.831534 0.979159
## BARCODE-997 GENE-N NA NA NA NA NA
## BARCODE-998 GENE-X NA NA NA NA NA
## BARCODE-999 GENE-D -0.3067536 9.1611 -0.7524626 0.451822 0.979159
## BARCODE-1000 GENE-W 0.0178532 10.4276 0.0555378 0.955713 0.990915
## B
## <numeric>
## BARCODE-1 -4.59061
## BARCODE-2 -4.57672
## BARCODE-3 -4.51731
## BARCODE-4 NA
## BARCODE-5 -4.61479
## ... ...
## BARCODE-996 -4.63948
## BARCODE-997 NA
## BARCODE-998 NA
## BARCODE-999 -4.60827
## BARCODE-1000 -4.64117
str(reloaded$metadata) # along with some metadata
## List of 4
## $ title : chr "Increase in `A` over `B` (barcode)"
## $ authors :List of 1
## ..$ : chr "biocbuild"
## $ type : chr "differential_barcode_abundance"
## $ differential_barcode_abundance:List of 3
## ..$ contrast :List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "A"
## .. ..$ right:List of 1
## .. .. ..$ : chr "B"
## ..$ normalization: chr "TMM on controls (NEG)"
## ..$ method : chr "voom-limma"
As one might expect, the group and comparisons arguments specify to the grouping factor in colData(se) and the comparison(s) to be performed, respectively.
Many of the arguments to runScreen() are shared with those in the runVoom() function in the augere.de package for differential expression,
so readers are referred to the latter’s vignette for more details on the parametrization of the design and contrasts.
The gene.field argument specifies the column of rowData(se) that defines the gene associated with each barcode.
This is used to consolidate the barcode-level statistics into per-gene results.
Doing so is usually desirable as genes are the features of scientific interest, not barcodes.
If this argument is not supplied, no consolidation is performed.
The filter.reference.factor and filter.reference.levels arguments specify the reference samples in the dataset.
These are used to quantify the abundances in the original barcode library.
We then remove barcodes with unusually low abundances in the references, e.g., due to defects in library manufacture.
If no reference samples are available, runScreen() will still use the default filtering from edgeR to remove low-abundance barcodes.
The norm.control.type.field and norm.control.types arguments specify the control barcodes that should not change in abundance during the experiment.
Thus, any differences in abundance are technical and can be used to quantify the scaling bias for normalization.
In this case, our negative controls are marked as targeting NEGs, i.e., non-essential genes, though one could also use non-targeting controls (NTC).
If no control barcodes are present, runScreen() will fall back to TMM normalization on all barcodes, assuming that the majority of barcodes do not change between conditions.
When consolidating barcode-level information into gene-level results,
the joint null hypothesis for each gene is that none of its barcodes are differentially abundant.
Different methods are used by runScreen() to test this joint null, each of which has different statistical properties.
The simplest approach involves applying Simes’ method to the per-barcode \(p\)-values. This is the most sensitive approach as it can reject the joint null if a small number of barcodes (possibly just 1) are differentially abundant. It is most suited for CRISPRa where few guides are expected to succeed for each gene, and is also useful for studying sporadic off-target effects in CRISPR knockout or CRISPRi.
names(res$gene$simes)
## [1] "Increase in `A` over `B`"
res$gene$simes[[1]]
## DataFrame with 27 rows and 8 columns
## NumBarcodes PValue FDR Direction NumUp NumDown
## <integer> <numeric> <numeric> <character> <integer> <integer>
## GENE-A 31 1.368298 1.000000 up 0 0
## GENE-B 28 0.165247 0.727021 down 0 0
## GENE-C 31 1.062718 1.000000 down 0 0
## GENE-D 33 0.190504 0.727021 up 0 0
## GENE-E 22 0.323121 0.727021 down 0 0
## ... ... ... ... ... ... ...
## GENE-W 31 0.4680670 0.843643 down 0 0
## GENE-X 24 0.0664626 0.727021 up 0 0
## GENE-Y 31 0.2946920 0.727021 down 0 0
## GENE-Z 39 0.0475706 0.727021 up 0 0
## NEG 49 1.0107865 1.000000 up 0 0
## LogFC AveAb
## <numeric> <numeric>
## GENE-A -0.508967 10.96911
## GENE-B -0.925260 10.21210
## GENE-C -1.022414 8.35278
## GENE-D 0.978704 9.89606
## GENE-E -0.969849 9.31064
## ... ... ...
## GENE-W -1.049680 8.98693
## GENE-X 1.382716 8.76149
## GENE-Y -0.954461 9.23219
## GENE-Z 0.999664 10.87339
## NEG 0.648934 10.50059
A more stringent approach uses an adaptation of the Holm method on the per-barcode \(p\)-values. This only rejects the joint null if a user-specified minimum number/proportion of barcodes for a gene are differentially abundant. It is suitable for all CRISPR modalities but tends to be more conservative than the other methods.
res$gene$`holm-min`[[1]]
## DataFrame with 27 rows and 8 columns
## NumBarcodes PValue FDR Direction NumUp NumDown
## <integer> <numeric> <numeric> <character> <integer> <integer>
## GENE-A 31 1 1 down 0 0
## GENE-B 28 1 1 down 0 0
## GENE-C 31 1 1 down 0 0
## GENE-D 33 1 1 down 0 0
## GENE-E 22 1 1 down 0 0
## ... ... ... ... ... ... ...
## GENE-W 31 1 1 down 0 0
## GENE-X 24 1 1 down 0 0
## GENE-Y 31 1 1 down 0 0
## GENE-Z 39 1 1 down 0 0
## NEG 49 1 1 down 0 0
## LogFC AveAb
## <numeric> <numeric>
## GENE-A -0.508967 10.96911
## GENE-B -0.925260 10.21210
## GENE-C -1.022414 8.35278
## GENE-D 0.978704 9.89606
## GENE-E -0.969849 9.31064
## ... ... ...
## GENE-W -1.049680 8.98693
## GENE-X 1.382716 8.76149
## GENE-Y -0.954461 9.23219
## GENE-Z 0.999664 10.87339
## NEG 0.648934 10.50059
Finally, we can use the fry() method from limma to perform a “barcode set test” on the barcode abundances for each gene.
This favors consistent differences across the majority of barcodes for a given gene.
It is most suited to CRISPR knockout or CRISPRi screens.
res$gene$fry[[1]]
## DataFrame with 27 rows and 10 columns
## NumBarcodes Direction PValue FDR PValue.Mixed FDR.Mixed
## <integer> <character> <numeric> <numeric> <numeric> <numeric>
## GENE-A 31 up 0.809955 0.949216 0.960329 0.966116
## GENE-B 28 up 0.901654 0.949216 0.583051 0.874576
## GENE-C 31 down 0.936767 0.949216 0.704725 0.921348
## GENE-D 33 up 0.420938 0.947110 0.559190 0.874576
## GENE-E 22 down 0.847173 0.949216 0.430066 0.874576
## ... ... ... ... ... ... ...
## GENE-W 31 up 0.917329 0.949216 0.442895 0.874576
## GENE-X 24 up 0.140628 0.931903 0.100161 0.874576
## GENE-Y 31 down 0.786387 0.949216 0.227278 0.874576
## GENE-Z 39 up 0.295041 0.931903 0.146459 0.874576
## NEG 49 up 0.617632 0.949216 0.836563 0.966116
## NumUp NumDown LogFC AveAb
## <integer> <integer> <numeric> <numeric>
## GENE-A 0 0 -0.508967 10.96911
## GENE-B 0 0 -0.925260 10.21210
## GENE-C 0 0 -1.022414 8.35278
## GENE-D 0 0 0.978704 9.89606
## GENE-E 0 0 -0.969849 9.31064
## ... ... ... ... ...
## GENE-W 0 0 -1.049680 8.98693
## GENE-X 0 0 1.382716 8.76149
## GENE-Y 0 0 -0.954461 9.23219
## GENE-Z 0 0 0.999664 10.87339
## NEG 0 0 0.648934 10.50059
For each gene, we also report the number of barcodes changing in each direction, the net direction of change, and the average abundance and log-fold change of the most significant barcode. This can be helpful for interpreting the gene’s consolidated statistics.
We can set save.results=FALSE to instruct runEdgeR() to not write results to disk.
This is more efficient if we only need the results in the current R session.
Setting dry.run=TRUE will instruct runEdgeR() to only generate the Rmarkdown report but not evaluate it.
This can be helpful for generating a draft report for further refinement.
Careful readers might notice that the Rmarkdown report generated by runScreen() is not quite self-contained as the pipeline doesn’t know how se was constructed.
In the absence of this information, the report contains a placeholder stop() to remind the user to insert the relevant commands:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a barcode while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## original <- se # making a copy for downstream use.
## se
## ```
If we know how the object was generated, we can pass this information to runScreen():
output.dry.dir <- tempfile()
runScreen(
wrapInput(se, commands=as.character(body(make_mock_dataset))[-1]),
group="group",
comparison=c("A", "B"),
gene.field="gene",
filter.reference.factor="group",
filter.reference.levels="ctrl",
norm.control.type.field="type",
norm.control.types="NEG",
output.dir=output.dry.dir,
dry.run=TRUE
)
## NULL
This instructs runScreen() to insert the relevant commands into the report for full reproducibility:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a barcode while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## set.seed(100)
## mu <- 2^runif(1000, 3, 6)
## grouping <- rep(c("A", "B", "ctrl"), each = 3)
## counts <- matrix(rnbinom(length(mu) * length(grouping), mu = mu, size = 20), ncol = length(grouping))
## rownames(counts) <- sprintf("BARCODE-%s", seq_along(mu))
## library(SummarizedExperiment)
We can also attach rowData() columns to the result tables or add custom fields to the result metadata:
output.custom.dir <- tempfile()
res <- runScreen(
se,
group="group",
comparison=c("A", "B"),
gene.field="gene",
filter.reference.factor="group",
filter.reference.levels="ctrl",
norm.control.type.field="type",
norm.control.types="NEG",
row.data=c("gene", "type"),
gene.data="type",
metadata=list(
project="my_phd_project",
collaborators=list("anne", "bob", "charlie"),
data_source="https://foo.bar.com"
),
output.dir=output.custom.dir
)
# rowData columns are now included in the barcode- and gene-level DFss.
res$barcode[[1]]
## DataFrame with 1000 rows and 8 columns
## gene type LogFC AveAb t PValue
## <character> <character> <numeric> <numeric> <numeric> <numeric>
## BARCODE-1 NEG NEG -0.445506 9.14812 -1.09034 0.2756355
## BARCODE-2 NEG NEG -0.556312 9.02579 -1.31320 0.1891996
## BARCODE-3 NEG NEG 0.593803 9.97191 1.73677 0.0825132
## BARCODE-4 NEG NEG NA NA NA NA
## BARCODE-5 NEG NEG 0.251991 9.84108 0.72858 0.4663056
## ... ... ... ... ... ... ...
## BARCODE-996 GENE-H other 0.0682531 10.4678 0.2127503 0.831534
## BARCODE-997 GENE-N other NA NA NA NA
## BARCODE-998 GENE-X other NA NA NA NA
## BARCODE-999 GENE-D other -0.3067536 9.1611 -0.7524626 0.451822
## BARCODE-1000 GENE-W other 0.0178532 10.4276 0.0555378 0.955713
## FDR B
## <numeric> <numeric>
## BARCODE-1 0.979159 -4.59061
## BARCODE-2 0.979159 -4.57672
## BARCODE-3 0.979159 -4.51731
## BARCODE-4 NA NA
## BARCODE-5 0.979159 -4.61479
## ... ... ...
## BARCODE-996 0.979159 -4.63948
## BARCODE-997 NA NA
## BARCODE-998 NA NA
## BARCODE-999 0.979159 -4.60827
## BARCODE-1000 0.990915 -4.64117
res$gene$simes[[1]]
## DataFrame with 27 rows and 9 columns
## type NumBarcodes PValue FDR Direction NumUp
## <character> <integer> <numeric> <numeric> <character> <integer>
## GENE-A other 31 1.368298 1.000000 up 0
## GENE-B other 28 0.165247 0.727021 down 0
## GENE-C other 31 1.062718 1.000000 down 0
## GENE-D other 33 0.190504 0.727021 up 0
## GENE-E other 22 0.323121 0.727021 down 0
## ... ... ... ... ... ... ...
## GENE-W other 31 0.4680670 0.843643 down 0
## GENE-X other 24 0.0664626 0.727021 up 0
## GENE-Y other 31 0.2946920 0.727021 down 0
## GENE-Z other 39 0.0475706 0.727021 up 0
## NEG NEG 49 1.0107865 1.000000 up 0
## NumDown LogFC AveAb
## <integer> <numeric> <numeric>
## GENE-A 0 -0.508967 10.96911
## GENE-B 0 -0.925260 10.21210
## GENE-C 0 -1.022414 8.35278
## GENE-D 0 0.978704 9.89606
## GENE-E 0 -0.969849 9.31064
## ... ... ... ...
## GENE-W 0 -1.049680 8.98693
## GENE-X 0 1.382716 8.76149
## GENE-Y 0 -0.954461 9.23219
## GENE-Z 0 0.999664 10.87339
## NEG 0 0.648934 10.50059
# The custom metadata is also saved to disk.
str(readResult(file.path(output.custom.dir, "results", "barcode-1"))$metadata)
## List of 7
## $ project : chr "my_phd_project"
## $ collaborators :List of 3
## ..$ : chr "anne"
## ..$ : chr "bob"
## ..$ : chr "charlie"
## $ data_source : chr "https://foo.bar.com"
## $ title : chr "Increase in `A` over `B` (barcode)"
## $ authors :List of 1
## ..$ : chr "biocbuild"
## $ type : chr "differential_barcode_abundance"
## $ differential_barcode_abundance:List of 3
## ..$ contrast :List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "A"
## .. ..$ right:List of 1
## .. .. ..$ : chr "B"
## ..$ normalization: chr "TMM on controls (NEG)"
## ..$ method : chr "voom-limma"
sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] limma_3.69.0 augere.screen_0.99.2
## [3] SummarizedExperiment_1.43.0 Biobase_2.73.1
## [5] GenomicRanges_1.65.0 Seqinfo_1.3.0
## [7] IRanges_2.47.0 S4Vectors_0.51.1
## [9] BiocGenerics_0.59.0 generics_0.1.4
## [11] MatrixGenerics_1.25.0 matrixStats_1.5.0
## [13] BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 SparseArray_1.13.2 lattice_0.22-9
## [4] h5mread_1.5.0 alabaster.se_1.13.0 magrittr_2.0.5
## [7] digest_0.6.39 evaluate_1.0.5 grid_4.6.0
## [10] augere.de_0.99.2 augere.core_0.99.3 bookdown_0.46
## [13] fastmap_1.2.0 jsonlite_2.0.0 Matrix_1.7-5
## [16] alabaster.schemas_1.13.0 tinytex_0.59 BiocManager_1.30.27
## [19] HDF5Array_1.41.0 jquerylib_0.1.4 abind_1.4-8
## [22] cli_3.6.6 rlang_1.2.0 XVector_0.53.0
## [25] cachem_1.1.0 DelayedArray_0.39.1 yaml_2.3.12
## [28] metapod_1.21.0 otel_0.2.0 S4Arrays_1.13.0
## [31] tools_4.6.0 Rhdf5lib_2.1.0 locfit_1.5-9.12
## [34] alabaster.ranges_1.13.0 alabaster.matrix_1.13.0 alabaster.base_1.13.0
## [37] R6_2.6.1 magick_2.9.1 lifecycle_1.0.5
## [40] rhdf5_2.57.0 edgeR_4.11.0 bslib_0.10.0
## [43] Rcpp_1.1.1-1.1 statmod_1.5.1 xfun_0.57
## [46] rhdf5filters_1.25.0 knitr_1.51 htmltools_0.5.9
## [49] rmarkdown_2.31 compiler_4.6.0