--- title: "Introduction to dominatR" author: "Simon Lizarazo, Ethan Chen, Rajendra K C, Kevin Van Bortle" output: rmarkdown::html_document vignette: > %\VignetteIndexEntry{Introduction to dominatR} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include=FALSE} library(knitr) opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = 'center', warning = FALSE, message = FALSE ) library(dominatR) library(dominatRData) library(airway) library(ggplot2) ``` ## Overview ```{r dominatr-logo, echo=FALSE, fig.align='right', out.width='220px'} logo <- system.file("images", "dominatR_logo.png", package = "dominatR") knitr::include_graphics(logo) ``` `dominatR` is a genomic data visualization package that applies concepts drawn from physics - such as *center of mass* from classical mechanics and *Shannon's entropy* from statistical mechanics - to effectively visualize features (e.g. genes) that are present within a specific context or condition (e.g. tissue-specific gene expression). `dominatR` is able to integrate `dataframes`, `matrices` and `SummarizedExperiment` objects, perform a number of common genomic normalization methods, compute center of mass, entropy, and categorical entropy values, and generate customizable plots that serve to highlight context-relevant feature dominance. In all examples, dominance visualization is linked to the coordinate spatial localization of each data point, such that full feature dominance is projected at the radial extreme. `dominatoR` package functions can be subgrouped into the following categories: - **Normalization Functions:** Contains basic normalization methods, such as: - `cpm_normalization` - `minmax_normalization` - `quantile_normalization` - `rpkm_normalization` - `tpm_normalization` - **Calculation Functions:** Contains feature calculation methods, such as: - `centmass` - `entropy` - `Qentropy` - **Visualization Functions:** Contains three different visualization tools, ideal for assessing dominance in 2, 3 and N Dimension: - `plot_rope` - `plot_triangle` - `plot_circle` This vignette serves as a quick tutorial for using `dominatR`. Further descriptions and examples of customization and the additional features present in the package can be explored in the companion articles. ## Installation To install this package, start R and enter ```{r installation, eval = FALSE} library(BiocManager) BiocManager::install('dominatR') ``` ## Data normalization; computing and plotting feature dominance The data included in this vignette can be found in the `airway` dataset and the supplementary package `dominatRData` ### Normalization functions (e.g. quantile normalization) #### Example 1: `SummarizedExperiment` object ```{r se_norm, eval = TRUE, echo = TRUE, message = FALSE} data("airway") se = quantile_normalization(airway, new_assay_name = 'quantile_norma') head(se) head(assay(se, 'quantile_norma')) ``` #### Example 2: `dataframe` object ```{r df_norm, eval = TRUE, echo = TRUE, message = FALSE} data("atac_tissue_counts") ## selecting the numeric columns norm = atac_tissue_counts[,8:26] norm1 = quantile_normalization(norm) head(norm1[,1:10]) ``` ### Calculation functions (e.g. entropy) #### Example 1: `SummarizedExperiment` object ```{r se_entr, eval = TRUE, echo = TRUE, message = FALSE} se = entropy(se) head(se) ### It creates a new column in the rowData dataframe head(rowData(se)) ``` #### Example 2: `dataframe` object ```{r df_entr, eval = TRUE, echo = TRUE, message = FALSE} norm1 = entropy(norm) ### It creates a new column in the dataframe head(norm1[,10:20]) ``` ### Visualization functions (2-, 3-, and N-dimensions) All `dominatR` package functions are compatible with `SummarizedExperiment` and `data.frame` objects. This quick introduction will show examples using `data.frame` objects. Each function produces a graphic and a `data.frame` with the dominance results; setting the attribute `output_table = FALSE` will restrict the dataframe from being displayed. For more details and examples on usage and aesthetics manipulation, refer to the articles in the visualization section. #### Example 1: Two Dimensions with `plot_rope()` `plot_rope` is useful to visualize feature dominance between two variables. In this example the function is comparing dominance over gene occupancy for RNA Polymerase II and RNA Polymerase III. By default the function plots all data points, but users can adjust `entropy_range` values to subset on observations within a specified range. ```{r rope_pol3, eval = TRUE, echo = TRUE, message = FALSE} data("rnapol_score") data1 <- rnapol_score[,6:7] plot_rope( data1, rope_color = 'white', pch = c(21, 21), push_text = 1.2, col = c('#7eb1d5', '#fa8451'), output_table = FALSE, rope_width = 1 ) title(main = 'Gene Occupancy RNA Pol II vs RNA Pol III', font.main = 1, cex = 0.5) ``` \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Below, let's filter for genes with **low entropy** and high significance scores to visualize genes that are specifically occupied by either RNA Pol II or RNA Pol III ```{r rope_pol3_filt, eval = TRUE, echo = TRUE, message = FALSE} plot_rope( data1, rope_color = 'white', pch = c(21, 21), col = c('#7eb1d5', '#fa8451'), push_text = 1.2, output_table = FALSE, rope_width = 1, entropyrange = c(0, 0.5), maxvaluerange = c(1.3, Inf) ) title(main = 'Gene Occupancy \n RNA Pol II vs RNA Pol III - Specific', font.main = 1, cex = 0.5) ``` \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Below, let's filter for genes with **high entropy** and high significance scores to visualize genes that are specifically occupied by both RNA Pol II and RNA Pol III ```{r rope_pol3_filt1, eval = TRUE, echo = TRUE, message = FALSE} plot_rope( data1, pch = c(21,21), push_text = 1.2, rope_color = 'white', col = c('#7eb1d5', '#fa8451'), output_table = FALSE, rope_width = 1, entropyrange = c(0.8, 1), maxvaluerange = c(1.3, Inf) ) title(main = 'Gene Occupancy \n RNA Pol II vs RNA Pol III - Shared', font.main = 1, cex = 0.5) ``` #### Example 2: Three Dimensions with `plot_triangle()` `plot_triange` is useful to visualize feature dominance between three variables. In this example, the function is comparing dominance over gene occupancy for RNA Polymerase I, RNA Polymerase II and RNA Polymerase III. By default the function plots every data point, but users can again adjust the `entropy_range` attribute values to subset on observations within a specified range. ```{r trin_pol3, eval = TRUE, echo = TRUE, message = FALSE} data1 <- rnapol_score[,5:7] plot_triangle(data1, output_table = FALSE, col = c('#ff80e3', '#7eb1d5', '#fa8451'), label = TRUE, pch = 21, push_text = 1.3) title(main = 'Gene Occupancy \n Pol I vs Pol II vs Pol III', font.main = 1, cex = 0.5) ``` \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Below, let's filter for genes with **low entropy** and high significance scores to visualize genes that are specifically occupied by either RNA Pol I, RNA Pol II, or RNA Pol III ```{r trin_pol3_filt, eval = TRUE, echo = TRUE, message = FALSE} data1 <- rnapol_score[,5:7] plot_triangle(data1, output_table = FALSE, col = c('#ff80e3', '#7eb1d5', '#fa8451'), entropyrange = c(0,0.5), label = TRUE, pch = 21, push_text = 1.3) title(main = 'Gene Occupancy \n Pol I vs Pol II vs Pol III - Specific', font.main = 1, cex = 0.5) ``` \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Below, let's filter for genes with **high entropy** and high significance scores to visualize genes that are specifically occupied by all three RNA Pols or combinations of any 2 RNA Pols (e.g. RNA Pol I and RNA Pol II) ```{r trin_pol3_filt1, eval = TRUE, echo = TRUE, message = FALSE} data1 <- rnapol_score[,5:7] plot_triangle(data1, output_table = FALSE, col = c('#ff80e3', '#7eb1d5', '#fa8451'), entropyrange = c(1.5, Inf), label = TRUE, pch = 21, push_text = 1.3) title(main = 'Gene Occupancy \n Pol I vs Pol II vs Pol III - Shared', font.main = 1, cex = 0.5) ``` Notice that as each data point gets closer to the center (and moving away from the vertices), the observation is characterized by higher entropy and is not dominated by a specific variable (in this case, by a specific RNA polymerase). #### Example 3: N-Dimensions with `plot_circle()` `plot_circle` is useful to visualize feature dominance between N-Numbers of variables. In this example, the function is comparing RNA-Pol III transcribed genes accessibility across tissues. By default the function plots all the points, but the user can adjust `entropy_range` values for subsetting specific type of observations. Briefly A total of P-1 (P = Number of Variables) circles are plotted and they represent degrees of dominance. - The outermost circle represents observations that are dominated only by one variable. - The second outermost circle represents observations that are dominated by two variables. - The innermost circle represents observations that are uniform across all variables. ```{r circ_tiss, eval = TRUE, echo = TRUE, message = FALSE} data("atac_tissue_score") ### subsetting only a set of numerical columns data1 = atac_tissue_score[,8:26] plot_circle(data1, point_line_colors = rep('black', 19), magnituderange = c(1, Inf), n = 19, output_table = FALSE, point_size = 3) + ggtitle('Pol III transcribed genes \n accessibility across tissues') ``` ```{r circ_tiss_filt1, eval = TRUE, echo = TRUE, message = FALSE} data("atac_tissue_score") plot_circle(data1, point_line_colors = rep('black', 19), entropyrange = c(0,1), magnituderange = c(1, Inf), straight_points = FALSE, n = 19, output_table = FALSE, point_size = 3) + ggtitle('Pol III transcribed genes \n accessibility across tissues - Unique') ``` ```{r circ_tiss_filt2, eval = TRUE, echo = TRUE, message = FALSE} data("atac_tissue_score") plot_circle(data1, point_line_colors = rep('black', 19), entropyrange = c(4, Inf), magnituderange = c(1, Inf), n = 19, output_table = FALSE, point_size = 3) + ggtitle('Pol III transcribed genes \n accessibility across tissues - Shared') ``` ### Session Info ```{r session-info, echo=FALSE} sessionInfo() ```