if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("lemur")The goal of lemur is to simplify the analysis of multi-condition single-cell data. If you have collected a single-cell RNA-seq dataset with more than one condition, lemur predicts for each cell and gene what the expression data would be in all of the other conditions. Furthermore, lemur finds neighborhoods of cells that show consistent differential expression. The results are statistically validated using a pseudo-bulk differential expression test on hold-out data using glmGamPoi or edgeR.
lemur implements a novel framework to disentangle the effects of known covariates, latent cell states, and their interactions. At the core, is a combination of matrix factorization and regression analysis implemented as geodesic regression on Grassmann manifolds. We call this latent embedding multivariate regression. For more details see our preprint.
You can install lemur directly from Bioconductor. Just paste the following snippet into your R console:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("lemur")Alternatively, you can install the package from Github using devtools:
devtools::install_github("const-ae/lemur")We continue working on improvements to this package. We are delighted if you decide to try out the package. Please use the Bioconductor forum or open an issue in the GitHub repository if you think you found a bug, have an idea for a cool feature, or have any questions about how LEMUR works.
A basic lemur workflow is as easy as the following.
# ... sce is a SingleCellExperiment object with your data
fit <- lemur(sce, design = ~ patient_id + condition, n_embedding = 15)
fit <- align_harmony(fit) # This step is optional
fit <- test_de(fit, contrast = cond(condition = "ctrl") - cond(condition = "panobinostat"))
nei <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))We will now go through these steps one by one.
We demonstrate lemur using data by Zhao et al. (2021). The data consist of tumor biopsies from five glioblastomas, each of which was treated with the drug panobinostat and with a control. Accordingly, we look at ten samples in a paired experimental design.
We start by loading required packages.
library("tidyverse")
library("SingleCellExperiment")
library("lemur")
set.seed(42)The lemur package ships with a reduced-size version of the glioblastoma data, which we use in the following.
data("glioblastoma_example_data", package = "lemur")
glioblastoma_example_dataclass: SingleCellExperiment
dim: 300 5000
metadata(0):
assays(2): counts logcounts
rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468 ENSG00000139289
rowData names(6): gene_id symbol ... strand. source
colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
colData names(10): patient_id treatment_id ... sample_id id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
As is, the data separated by the known covariates patient_id and condition.
orig_umap <- uwot::umap(as.matrix(t(logcounts(glioblastoma_example_data))))
as_tibble(colData(glioblastoma_example_data)) |>
mutate(umap = orig_umap) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = patient_id, shape = condition), size = 0.5) +
labs(title = "UMAP of logcounts") + coord_fixed()We fit the LEMUR model by calling the function lemur. We provide the experimental design using a formula. The elements of the formula can refer to columns of the colData of the SingleCellExperiment object.
We also set the number of latent dimensions, n_embedding, which has a similar interpretation as the number of dimensions in PCA. The test_fraction argument sets the fraction of cells which are exclusively used to test for differential expression and not for inferring the LEMUR parameters. It balances the sensitivity to detect subtle patterns in the latent space against the power to detect differentially expressed genes.
fit <- lemur(glioblastoma_example_data, design = ~ patient_id + condition,
n_embedding = 15, test_fraction = 0.5)
fitclass: lemur_fit
dim: 300 5000
metadata(9): n_embedding design ... use_assay row_mask
assays(2): counts logcounts
rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468 ENSG00000139289
rowData names(6): gene_id symbol ... strand. source
colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
colData names(10): patient_id treatment_id ... sample_id id
reducedDimNames(2): linearFit embedding
mainExpName: NULL
altExpNames(0):
The lemur function returns an object of class lemur_fit, which extends the SingleCellExperiment class. It supports subsetting and all the usual accessor methods (e.g., nrow, assay, colData, rowData). In addition, lemur overloads the $ operator to allow easy access to additional fields produced by the LEMUR model. For example, the low-dimensional embedding can be accessed using fit$embedding:
fit$embedding |> str() num [1:15, 1:5000] 8.077 0.208 2.317 -4.313 0.868 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5000] "CGCCAGAGCGCA" "AGCTTTACTGCG" "AGGATGACCGCA" "AGAACTATTTTT" ...
Optionally, we can further align corresponding cells using manually annotated cell types (align_by_grouping) or an automated alignment procedure (e.g., align_harmony). This ensures that corresponding cells are close to each other in the fit$embedding.
fit <- align_harmony(fit)Select cells that are considered close with 'harmony'
Figure 2 shows a UMAP of fit$embedding. This is similar to working on the integrated PCA space in a traditional single-cell analysis.
umap <- uwot::umap(t(fit$embedding))
as_tibble(fit$colData) |>
mutate(umap = umap) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = patient_id), size = 0.5) +
facet_wrap(vars(condition)) + coord_fixed()Next, let’s predict the effect of the panobinostat treatment for each cell and each gene – even for the cells that were observed in the control condition. The test_de function takes a lemur_fit object and returns the object with a new slot (in SummarizedExperiment parlance: assay) called DE. This slot contains the predicted logarithmic fold changes between the two conditions specified in contrast. Note that lemur implements a special notation for contrasts. Instead of providing a contrast vector or design matrix column names, you provide for each condition the levels, and lemur automatically forms the contrast vector. This is intended to make the notation more readable.
fit <- test_de(fit, contrast = cond(condition = "panobinostat") - cond(condition = "ctrl"))We can pick any gene, say GAP43, which in our data is represented by its Ensembl gene ID ENSG00000172020, and show its differential expression pattern on the UMAP plot:
df <- tibble(umap = umap) |>
mutate(de = assay(fit, "DE")["ENSG00000172020", ])
ggplot(df, aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de)) +
scale_color_gradient2(low = "#FFD800", high= "#0056B9") + coord_fixed()
ggplot(df, aes(x = de)) + geom_histogram(bins = 100)More systematically, we can now search through all the genes, and use their expression values (assay(fit, "DE")) to search for cell neighborhoods (sets of cells that are close together in latent space) that show consistent differential expression. The function find_de_neighborhoods validates the results of such a search with a pseudobulked diferential expression test. For that, it uses the test data (fit$test_data) that was put aside in the first call to lemur(). In addition, find_de_neighborhoods assesses if the difference between the conditions is significantly larger for the cells inside the neighborhood than the cells outside the neighborhood (see columns starting with did, short for difference-in-difference).
The group_by argument determines how the pseudobulk samples are formed. It specifies the columns in the fit$colData that are used to define a sample and is inspired by the group_by function in dplyr. Typically, you provide the covariates that were used for the experimental design plus the sample id (in this case patient_id).
neighborhoods <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))
as_tibble(neighborhoods) |>
left_join(as_tibble(rowData(fit)[,1:2]), by = c("name" = "gene_id")) |>
relocate(symbol, .before = "name") |>
arrange(pval) |>
head(5)# A tibble: 5 × 14
symbol name neighborhood n_cells sel_statistic pval adj_pval f_statistic df1 df2 lfc
<chr> <chr> <I<list>> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 KNG1 ENSG0000… <chr> 4005 65.6 2.97e-5 0.00892 96.9 1 6.76 2.81
2 POLR2L ENSG0000… <chr> 4161 92.5 1.31e-4 0.0139 60.4 1 6.76 1.41
3 PMP2 ENSG0000… <chr> 4828 -68.3 1.61e-4 0.0139 56.5 1 6.76 -1.45
4 MT2A ENSG0000… <chr> 3025 43.0 2.12e-4 0.0139 51.6 1 6.76 1.81
5 NUCKS1 ENSG0000… <chr> 3490 -42.5 2.32e-4 0.0139 50.1 1 6.76 -1.25
# ℹ 3 more variables: did_pval <dbl>, did_adj_pval <dbl>, did_lfc <dbl>
To continue, we investigate one gene for which the neighborhood shows a significant differential expression pattern: here we choose a CXCL8 (also known as interleukin 8), an important inflammation signalling molecule. We see that it is upregulated by panobinostat in a subset of cells (blue). We chose this gene because it (1) had a significant change between panobinostat and negative control condition (adj_pval column) and (2) showed much larger differential expression for the cells inside the neighborhood than for the cells outside (did_lfc column).
sel_gene <- "ENSG00000169429" # is CXCL8
p <- tibble(umap = umap) |>
mutate(de = assay(fit, "DE")[sel_gene,]) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de)) +
scale_color_gradient2(low = "#FFD800", high= "#0056B9") +
coord_fixed()
pNext, we are going to try something ambitious: in the LEMUR model, the cells in a neighborhood are separated from the rest of the cells by a (k − 1)-dimensional hyperplane in the k-dimensional latent space (k being the same as n_embedding from above, i.e., k= 15). We can try to approximate this separation as a line in the two-dimensional UMAP plot.
To this end, we create a helper dataframe and use the geom_density2d function from ggplot2. To avoid the cutting of the boundary to the extremes of the cell coordinates, add lims to the plot with an appropriately large limit.
neighborhood_coordinates <- neighborhoods |>
dplyr::filter(name == sel_gene) |>
unnest(c(neighborhood)) |>
dplyr::rename(cell_id = neighborhood) |>
left_join(tibble(cell_id = rownames(umap), umap), by = "cell_id") |>
dplyr::select(name, cell_id, umap)
p + geom_density2d(data = neighborhood_coordinates, breaks = seq(0.2, 0.6, by = 0.1),
contour_var = "ndensity", color = "#808080") To summarize our results, we can make a volcano plot of the differential expression results to better understand the expression differences across all genes.
neighborhoods |>
drop_na() |>
ggplot(aes(x = lfc, y = -log10(pval))) +
geom_point(aes(col = adj_pval < 0.1)) neighborhoods |>
drop_na() |>
ggplot(aes(x = n_cells, y = -log10(pval))) +
geom_point(aes(color = adj_pval < 0.1)) The analyses up to here were conducted without using any cell type information. Often, such additional cell type information is available or can be obtained from the data by other means. For instance, here, we can distinguish the tumor cells from non-malignment other cell, using the fact that the tumor cells had a deletion of Chromosome 10 and a duplication of Chromosome 7. We build a simple classifier to distinguish the cells accordingly. (This is just to illustrate the process; for a real analysis, we would use more sophisticated methods.)
tumor_label_df <- tibble(cell_id = colnames(fit),
chr7_total_expr = colMeans(logcounts(fit)[rowData(fit)$chromosome == "7",]),
chr10_total_expr = colMeans(logcounts(fit)[rowData(fit)$chromosome == "10",])) |>
mutate(is_tumor = chr7_total_expr > 0.8 & chr10_total_expr < 2.5)
ggplot(tumor_label_df, aes(x = chr10_total_expr, y = chr7_total_expr)) +
geom_point(aes(color = is_tumor), size = 0.5) +
geom_hline(yintercept = 0.8) +
geom_vline(xintercept = 2.5) tibble(umap = umap) |>
mutate(is_tumor = tumor_label_df$is_tumor) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = is_tumor), size = 0.5) +
facet_wrap(vars(is_tumor)) + coord_fixed()We use this cell annotation to focus our neighborhood finding within the tumor cells, to find tumor subpopulations.
tumor_fit <- fit[, tumor_label_df$is_tumor]
tum_nei <- find_de_neighborhoods(tumor_fit, group_by = vars(patient_id, condition), verbose = FALSE)
as_tibble(tum_nei) |>
left_join(as_tibble(rowData(fit)[,1:2]), by = c("name" = "gene_id")) |>
dplyr::relocate(symbol, .before = "name") |>
filter(adj_pval < 0.1) |>
arrange(did_pval) |>
dplyr::select(symbol, name, neighborhood, n_cells, adj_pval, lfc, did_pval, did_lfc) |>
print(n = 10)# A tibble: 31 × 8
symbol name neighborhood n_cells adj_pval lfc did_pval did_lfc
<chr> <chr> <I<list>> <int> <dbl> <dbl> <dbl> <dbl>
1 SERPINE1 ENSG00000106366 <chr [2,838]> 2838 0.0597 1.06 0.00627 -1.15
2 CXCL8 ENSG00000169429 <chr [2,187]> 2187 0.0547 1.43 0.00757 -1.93
3 CLU ENSG00000120885 <chr [2,375]> 2375 0.0793 0.808 0.0121 -1.18
4 NCL ENSG00000115053 <chr [2,513]> 2513 0.0682 -0.538 0.0403 0.574
5 MT2A ENSG00000125148 <chr [1,812]> 1812 0.00929 2.04 0.0506 -1.05
6 TUBA1B ENSG00000123416 <chr [2,899]> 2899 0.0790 -0.529 0.0688 0.649
7 NEAT1 ENSG00000245532 <chr [2,247]> 2247 0.0206 1.80 0.0785 -0.713
8 TUBA1A ENSG00000167552 <chr [1,521]> 1521 0.0328 -0.809 0.108 0.290
9 RPL10 ENSG00000147403 <chr [2,831]> 2831 0.0791 -0.456 0.154 0.319
10 TNFRSF12A ENSG00000006327 <chr [1,138]> 1138 0.0328 -0.795 0.172 0.271
# ℹ 21 more rows
Focusing on RPS11, we see that panobinostat mostly has no effect on its expression, except for a subpopulation of tumor cells where RPS11 was originally upregulated and panobinostat downregulates the expression. A small caveat: this analysis is conducted on a subset of all cells and should be interpreted carefully. Yet, this section demonstrates how lemur can be used to find tumor subpopulations which show differential responses to treatments.
sel_gene <- "ENSG00000142534" # is RPS11
as_tibble(colData(fit)) |>
mutate(expr = assay(fit, "logcounts")[sel_gene,]) |>
mutate(is_tumor = tumor_label_df$is_tumor) |>
mutate(in_neighborhood = id %in% filter(tum_nei, name == sel_gene)$neighborhood[[1]]) |>
ggplot(aes(x = condition, y = expr)) +
geom_jitter(size = 0.3, stroke = 0) +
geom_point(data = . %>% summarize(expr = mean(expr), .by = c(condition, patient_id, is_tumor, in_neighborhood)),
aes(color = patient_id), size = 2) +
stat_summary(fun.data = mean_se, geom = "crossbar", color = "red") +
facet_wrap(vars(is_tumor, in_neighborhood), labeller = label_both) lemur directly with the aligned data?No. You need to call lemur with the unaligned data so that it can learn how much the expression of each gene changes between conditions.
Yes. You can call lemur with any variance stabilized count matrix. Based on a previous project, I recommend to use log-transformation, but other methods will work just fine.
lemur() than before. What is happening?!This is a known issue and can be caused if the data has large compositional shifts (for example, if one cell type disappears). The problem is that the initial linear regression step, which centers the conditions relative to each other, overcorrects and introduces a consistent shift in the latent space. You can either use align_by_grouping / align_harmony to correct for this effect or manually fix the regression coefficient to zero:
fit <- lemur(sce, design = ~ patient_id + condition, n_embedding = 15, linear_coefficient_estimator = "zero")align_harmony / align_neighbors. What should I do?You can try to increase n_embedding. If this still does not help, there is little use in inferring differential expression neighborhoods. But as I haven’t encountered such a dataset yet, I would like to try it out myself. If you can share the data publicly, please open an issue.
lemur faster?Several parameters influence the duration to fit the LEMUR model and find differentially expressed neighborhoods:
DelayedArray) either as a sparse dgCMatrix or dense matrix.test_fraction means fewer cells are used to fit the model (and more cells are used for the DE test), which speeds up many steps.n_embedding reduces the latent dimensions of the fit, which makes the model less flexible, but speeds up the lemur() call.align_grouping is faster than align_harmony.selection_procedure = "contrast" in find_de_neighborhoods often produces better neighborhoods, but is a lot slower than selection_procedure = "zscore".size_factor_method = "ratio" in find_de_neighborhoods makes the DE more powerful, but is a lot slower than size_factor_method = "normed_sum".sessionInfo()R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.1 Biobase_2.71.0
[4] GenomicRanges_1.63.1 Seqinfo_1.1.0 IRanges_2.45.0
[7] S4Vectors_0.49.0 BiocGenerics_0.57.0 generics_0.1.4
[10] MatrixGenerics_1.23.0 matrixStats_1.5.0 lubridate_1.9.5
[13] forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
[16] purrr_1.2.1 readr_2.1.6 tidyr_1.3.2
[19] tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
[22] lemur_1.9.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.56 lattice_0.22-9
[4] tzdb_0.5.0 vctrs_0.7.1 tools_4.5.2
[7] pkgconfig_2.0.3 Matrix_1.7-4 RColorBrewer_1.1-3
[10] S7_0.2.1 sparseMatrixStats_1.23.0 lifecycle_1.0.5
[13] compiler_4.5.2 farver_2.1.2 RhpcBLASctl_0.23-42
[16] codetools_0.2-20 glmGamPoi_1.23.0 htmltools_0.5.9
[19] sys_3.4.3 buildtools_1.0.0 yaml_2.3.12
[22] pillar_1.11.1 MASS_7.3-65 uwot_0.2.4
[25] DelayedArray_0.37.0 abind_1.4-8 RSpectra_0.16-2
[28] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
[31] splines_4.5.2 labeling_0.4.3 maketools_1.3.2
[34] cowplot_1.2.0 fastmap_1.2.0 grid_4.5.2
[37] cli_3.6.5 harmony_1.2.4 SparseArray_1.11.10
[40] magrittr_2.0.4 S4Arrays_1.11.1 utf8_1.2.6
[43] withr_3.0.2 DelayedMatrixStats_1.33.0 scales_1.4.0
[46] timechange_0.4.0 rmarkdown_2.30 XVector_0.51.0
[49] hms_1.1.4 beachmat_2.27.2 evaluate_1.0.5
[52] knitr_1.51 RcppAnnoy_0.0.23 irlba_2.3.7
[55] rlang_1.1.7 isoband_0.3.0 Rcpp_1.1.1
[58] glue_1.8.0 jsonlite_2.0.0 R6_2.6.1