--- title: "Side-by-side comparison with standard interfaces" author: "Stefano Mangiola" date: "`r Sys.Date()`" package: tidybulk output: BiocStyle::html_document: toc_float: true bibliography: references.bib link-citations: true keywords: "transcriptomics, RNA-seq, differential expression, data analysis, tidyverse, SummarizedExperiment, bioinformatics, genomics, gene expression, clustering, dimensionality reduction, cellularity analysis, gene enrichment, R package" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Side-by-side comparison with standard interfaces} %\VignetteEncoding{UTF-8} --- [![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) [![R build status](https://github.com/stemangiola/tidybulk/workflows/R-CMD-check/badge.svg)](https://github.com/stemangiola/tidybulk/actions/) [![Bioconductor status](https://bioconductor.org/shields/build/release/bioc/tidybulk.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/tidybulk/) **tidybulk** is a powerful R package designed for modular transcriptomic data analysis that brings transcriptomics to the tidyverse. ## Why tidybulk? Tidybulk provides a unified interface for comprehensive transcriptomic data analysis with seamless integration of SummarizedExperiment objects and tidyverse principles. It streamlines the entire workflow from raw data to biological insights. ### Scientific Citation Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T. Papenfuss. 2021. "Tidybulk: An R tidy framework for modular transcriptomic data analysis." Genome Biology 22 (42). https://doi.org/10.1186/s13059-020-02233-7 [Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7) ```{r setup-libraries-and-theme, echo=FALSE, include=FALSE} library(knitr) library(dplyr) library(tidyr) library(tibble) library(purrr) library(magrittr) library(forcats) library(ggplot2) library(ggrepel) library(SummarizedExperiment) library(tidybulk) library(tidySummarizedExperiment) library(airway) # Define theme my_theme = theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(linewidth = 0.2), panel.grid.minor = element_line(linewidth = 0.1), text = element_text(size=12), legend.position="bottom", aspect.ratio=1, strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)) ) ``` In this vignette we will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the [airway](https://bioconductor.org/packages/airway/) package. ```{r load airway} library(airway) data(airway) ``` This workflow, will use the [tidySummarizedExperiment](https://bioconductor.org/packages/tidySummarizedExperiment/) package to manipulate the data in a `tidyverse` fashion. This approach streamlines the data manipulation and analysis process, making it more efficient and easier to understand. ```{r load tidySummarizedExperiment} library(tidySummarizedExperiment) ``` Here we will add a gene symbol column to the `airway` object. This will be used to interpret the differential expression analysis, and to deconvolve the cellularity. ```{r add-gene-symbol} library(org.Hs.eg.db) library(AnnotationDbi) # Add gene symbol and entrez airway <- airway |> mutate(entrezid = mapIds(org.Hs.eg.db, keys = gene_name, keytype = "SYMBOL", column = "ENTREZID", multiVals = "first" )) detach("package:org.Hs.eg.db", unload = TRUE) detach("package:AnnotationDbi", unload = TRUE) ``` # Side-by-side Comparison with Standard Interfaces This vignette demonstrates how tidybulk compares to standard R/Bioconductor approaches for transcriptomic data analysis. We'll show the same analysis performed using both tidybulk (tidyverse-style) and traditional methods side by side. ## Data Overview We will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells: ```{r data-overview} airway ``` Loading `tidySummarizedExperiment` makes the `SummarizedExperiment` objects compatible with tidyverse tools while maintaining its `SummarizedExperiment` nature. This is useful because it allows us to use the `tidyverse` tools to manipulate the data. ```{r check-se-class} class(airway) ``` ### Prepare Data for Analysis Before analysis, we need to ensure our variables are in the correct format: ```{r convert-condition-to-factor} # Convert dex to factor for proper differential expression analysis airway = airway |> mutate(dex = as.factor(dex)) ``` ## Step 1: Aggregate Duplicated Transcripts tidybulk provides the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble with transcripts with the same name aggregated. All the rest of the columns are appended, and factors and boolean are appended as characters. > Transcript aggregation is a standard bioinformatics approach for gene-level summarization.

TidyTranscriptomics ```{r aggregate, message=FALSE, warning=FALSE, results='hide', class.source='yellow', eval=FALSE} rowData(airway)$gene_name = rownames(airway) airway.aggr = airway |> aggregate_duplicates(.transcript = gene_name) ```

Standard procedure (comparative purpose) ```{r aggregate long, eval=FALSE} temp = data.frame( symbol = dge_list$genes$symbol, dge_list$counts ) dge_list.nr <- by(temp, temp$symbol, function(df) if(length(df[1,1])>0) matrixStats:::colSums(as.matrix(df[,-1])) ) dge_list.nr <- do.call("rbind", dge_list.nr) colnames(dge_list.nr) <- colnames(dge_list) ```

## Step 2: Scale Abundance We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). `scale_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method as arguments and returns a tibble with additional columns with scaled data as `_scaled`. > Normalization is crucial for comparing expression levels across samples with different library sizes.

TidyTranscriptomics ```{r normalise} airway.norm = airway |> identify_abundant(factor_of_interest = dex) |> scale_abundance() ```

Standard procedure (comparative purpose) ```{r normalise long, eval=FALSE} library(edgeR) dgList <- DGEList(count_m=x,group=group) keep <- filterByExpr(dgList) dgList <- dgList[keep,,keep.lib.sizes=FALSE] # ... additional processing steps ... dgList <- calcNormFactors(dgList, method="TMM") norm_counts.table <- cpm(dgList) ```

```{r, include=FALSE} airway.norm |> dplyr::select(`counts`, counts_scaled, .abundant, everything()) ``` We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by treatment. ```{r plot_normalise, fig.width = 10, fig.height = 10} airway.norm |> ggplot(aes(counts_scaled + 1, group=.sample, color=`dex`)) + geom_density() + scale_x_log10() + my_theme ``` ## Step 3: Filter Variable Transcripts We may want to identify and filter variable transcripts to focus on the most informative features. > Variable transcript filtering helps reduce noise and focuses analysis on the most informative features.

TidyTranscriptomics ```{r filter variable} airway.norm.variable = airway.norm |> keep_variable() ```

Standard procedure (comparative purpose) ```{r filter variable long, eval=FALSE} library(edgeR) x = norm_counts.table s <- rowMeans((x-rowMeans(x))^2) o <- order(s,decreasing=TRUE) x <- x[o[1L:top],,drop=FALSE] norm_counts.table = norm_counts.table[rownames(x)] norm_counts.table$cell_type = tibble_counts[ match( tibble_counts$sample, rownames(norm_counts.table) ), "cell" ] ```

## Step 4: Reduce Dimensions We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. `reduce_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions. > Dimensionality reduction helps visualize high-dimensional data and identify patterns. **MDS** (Robinson et al., 10.1093/bioinformatics/btp616)

TidyTranscriptomics ```{r mds} airway.norm.MDS = airway.norm |> reduce_dimensions(method="MDS", .dims = 2) ```

Standard procedure (comparative purpose) ```{r, eval = FALSE} library(limma) count_m_log = log(count_m + 1) cmds = limma::plotMDS(count_m_log, ndim = 3, plot = FALSE) cmds = cmds %$% cmdscale.out |> setNames(sprintf("Dim%s", 1:6)) cmds$cell_type = tibble_counts[ match(tibble_counts$sample, rownames(cmds)), "cell" ] ```

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by treatment. ```{r plot_mds, fig.width = 10, fig.height = 10, eval=FALSE} airway.norm.MDS |> pivot_sample() |> dplyr::select(contains("Dim"), everything()) airway.norm.MDS |> pivot_sample() |> GGally::ggpairs(columns = 9:11, ggplot2::aes(colour=`dex`)) ``` **PCA**

TidyTranscriptomics ```{r pca, message=FALSE, warning=FALSE, results='hide'} airway.norm.PCA = airway.norm |> reduce_dimensions(method="PCA", .dims = 2) ```

Standard procedure (comparative purpose) ```{r,eval=FALSE} count_m_log = log(count_m + 1) pc = count_m_log |> prcomp(scale = TRUE) variance = pc$sdev^2 variance = (variance / sum(variance))[1:6] pc$cell_type = counts[ match(counts$sample, rownames(pc)), "cell" ] ```

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by treatment. ```{r plot_pca, fig.width = 10, fig.height = 10, eval=FALSE} airway.norm.PCA |> pivot_sample() |> dplyr::select(contains("PC"), everything()) airway.norm.PCA |> pivot_sample() |> GGally::ggpairs(columns = 11:13, ggplot2::aes(colour=`dex`)) ``` ## Step 5: Rotate Dimensions We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. `rotate_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as ` rotated ` by default, or as specified in the input arguments. > Dimension rotation can help align data with biological axes of interest.

TidyTranscriptomics ```{r rotate} airway.norm.MDS.rotated = airway.norm.MDS |> rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45) ```

Standard procedure (comparative purpose) ```{r, eval=FALSE} rotation = function(m, d) { r = d * pi / 180 ((bind_rows( c(`1` = cos(r), `2` = -sin(r)), c(`1` = sin(r), `2` = cos(r)) ) |> as_matrix()) %*% m) } mds_r = pca |> rotation(rotation_degrees) mds_r$cell_type = counts[ match(counts$sample, rownames(mds_r)), "cell" ] ```

**Original** On the x and y axes axis we have the first two reduced dimensions, data is coloured by treatment. ```{r plot_rotate_1, fig.width = 10, fig.height = 10} airway.norm.MDS.rotated |> ggplot(aes(x=`Dim1`, y=`Dim2`, color=`dex` )) + geom_point() + my_theme ``` **Rotated** On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by treatment. ```{r plot_rotate_2, fig.width = 10, fig.height = 10} airway.norm.MDS.rotated |> pivot_sample() |> ggplot(aes(x=`Dim1_rotated_45`, y=`Dim2_rotated_45`, color=`dex` )) + geom_point() + my_theme ``` ## Step 8: Test Differential Abundance We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). `test_differential_expression` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate). > Differential expression analysis identifies genes that are significantly different between conditions.

TidyTranscriptomics ```{r de, message=FALSE, warning=FALSE, results='hide'} airway.de = airway |> test_differential_expression( ~ dex, method = "edgeR_quasi_likelihood") |> pivot_transcript() airway.de ```

Standard procedure (comparative purpose) ```{r, eval=FALSE} library(edgeR) dgList <- DGEList(counts=counts_m,group=group) keep <- filterByExpr(dgList) dgList <- dgList[keep,,keep.lib.sizes=FALSE] dgList <- calcNormFactors(dgList) design <- model.matrix(~group) dgList <- estimateDisp(dgList,design) fit <- glmQLFit(dgList,design) qlf <- glmQLFTest(fit,coef=2) topTags(qlf, n=Inf) ```

The functon `test_differential_expression` operated with contrasts too. The constrasts hve the name of the design matrix (generally ) ```{r de contrast, message=FALSE, warning=FALSE, results='hide', eval=FALSE} airway.de = airway |> identify_abundant(factor_of_interest = dex) |> test_differential_expression( ~ 0 + dex, .contrasts = c( "dexuntrt - dextrt"), method = "edgeR_quasi_likelihood" ) |> pivot_transcript() ``` ## Step 6: Adjust for Unwanted Variation We may want to adjust `counts` for (known) unwanted variation. `adjust_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as `_adjusted`. At the moment just an unwanted covariates is allowed at a time. > Batch effect correction is important for removing technical variation that could confound biological signals.

TidyTranscriptomics ```{r adjust, message=FALSE, warning=FALSE, results='hide'} airway.norm.adj = airway.norm |> adjust_abundance( .factor_unwanted = cell, .factor_of_interest = dex, method="combat") ```

Standard procedure (comparative purpose) ```{r, eval=FALSE} library(sva) count_m_log = log(count_m + 1) design = model.matrix( object = ~ factor_of_interest + batch, data = annotation ) count_m_log.sva = ComBat( batch = design[,2], mod = design, ... ) count_m_log.sva = ceiling(exp(count_m_log.sva) -1) count_m_log.sva$cell_type = counts[ match(counts$sample, rownames(count_m_log.sva)), "cell" ] ```

## Deconvolve `Cell type composition` We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). `deconvolve_cellularity` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the adjusted cell type proportions. *Note: Cellularity deconvolution requires gene symbols that match the reference data. The airway dataset uses Ensembl IDs, so this example is commented out.*

TidyTranscriptomics ```{r cibersort, eval=FALSE} # Requires gene symbols that match reference data # airway.cibersort = # airway |> # deconvolve_cellularity( cores=1, prefix = "cibersort__") |> # pivot_sample() ```

Standard procedure (comparative purpose) ```{r, eval=FALSE} source("CIBERSORT.R") count_m |> write.table("mixture_file.txt") results <- CIBERSORT( "sig_matrix_file.txt", "mixture_file.txt", perm=100, QN=TRUE ) results$cell_type = tibble_counts[ match(tibble_counts$sample, rownames(results)), "cell" ] ```

*Note: The plotting code is commented out as the deconvolution step is not executed.* ```{r plot_cibersort, eval=FALSE} # airway.cibersort |> # pivot_longer( # names_to= "Cell_type_inferred", # values_to = "proportion", # names_prefix ="cibersort__", # cols=contains("cibersort__") # ) |> # ggplot(aes(x=Cell_type_inferred, y=proportion, fill=cell)) + # geom_boxplot() + # facet_wrap(~cell) + # my_theme + # theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5) ``` ## Step 7: Cluster Samples We may want to cluster our samples based on the transcriptomic profiles. `cluster_elements` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the cluster labels. > Clustering helps identify groups of samples with similar expression profiles.

TidyTranscriptomics ```{r cluster} airway.norm.cluster = airway.norm |> cluster_elements(method="kmeans", centers = 2) ```

Standard procedure (comparative purpose) ```{r, eval=FALSE} library(stats) count_m_log = log(count_m + 1) count_m_log_centered = count_m_log - rowMeans(count_m_log) kmeans_result = kmeans(t(count_m_log_centered), centers = 2) cluster_labels = kmeans_result$cluster ```

## Step 9: Test Differential Abundance (Alternative Method) We may want to test for differential abundance between conditions. `test_differential_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model, and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate). > This demonstrates an alternative approach to differential expression analysis.

TidyTranscriptomics ```{r differential} airway.norm.de = airway.norm |> test_differential_abundance(~ dex, method="edgeR_quasi_likelihood") ```

Standard procedure (comparative purpose) ```{r, eval=FALSE} library(edgeR) count_m_log = log(count_m + 1) design = model.matrix(~ dex, data = annotation) dge = DGEList(counts = count_m) dge = calcNormFactors(dge) dge = estimateDisp(dge, design) fit = glmQLFit(dge, design) qlf = glmQLFTest(fit, coef=2) results = topTags(qlf, n = Inf) ```

## Conclusion Tidybulk provides a unified interface for comprehensive transcriptomic data analysis with seamless integration of SummarizedExperiment objects and tidyverse principles. It streamlines the entire workflow from raw data to biological insights while maintaining compatibility with standard Bioconductor practices. ```{r session-info} sessionInfo() ```