--- title: "The transcriptogramer user's guide" output: BiocStyle::html_document bibliography: cite.bib date: '`r NULL`' vignette: > %\VignetteIndexEntry{The transcriptogramer user's guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- **Authors**: Diego Morais [aut, cre], Rodrigo Dalmolin [aut]
**Version**: `r packageDescription("transcriptogramer")[["Version"]]`
# Overview The transcriptogramer package is designed for transcriptional analysis based on transcriptograms, a method to analyze transcriptomes that projects expression values on a set of ordered proteins, arranged such that the probability that gene products participate in the same metabolic pathway exponentially decreases with the increase of the distance between two proteins of the ordering. Transcriptograms are, hence, genome wide gene expression profiles that provide a global view for the cellular metabolism, while indicating gene sets whose expression are altered [@Silva2014; @Filho2011]. Methods are provided to analyze topological properties of an interactome, to generate transcriptograms, to detect and to display differentially expressed gene clusters, and to perform a functional enrichment analysis on these clusters. As a set of ordered proteins is required in order to run the methods, datasets are available for four species (*Homo sapiens*, *Mus musculus*, *Saccharomyces cerevisiae* and *Rattus norvegicus*). Each species has three datasets, originated from [STRINGdb](https://string-db.org/) release 10.5 protein network data, with combined scores greater than or equal to 700, 800 and 900 (see **Hs900**, **Hs800**, **Hs700**, **Mm900**, **Mm800**, **Mm700**, **Sc900**, **Sc800**, **Sc700**, **Rn900**, **Rn800** and **Rn700** datasets). Custom sets of ordered proteins can be generated from protein network data using [The transcriptogramer](https://lief.if.ufrgs.br/pub/biosoftwares/transcriptogramer/) on Windows, or the [Seriation package](https://github.com/amamory/seriation) on Mac and Linux. # Quick start The first step is to create a Transcriptogram object by running the **transcriptogramPreprocess()** function. This example uses a subset of the *Homo sapiens* protein network data, from STRINGdb release 10.5, containing only associations of proteins of combined score greater than or equal to 900 (see **Hs900** and **association** datasets). ```{r message = FALSE} library(transcriptogramer) t <- transcriptogramPreprocess(association = association, ordering = Hs900) ``` ## Topological analysis There are two methods to perform topological analysis, **connectivityProperties()** calculates average graph properties as a function of the node connectivity, and **orderingProperties()** calculates graph properties projected on the ordered proteins. Some methods, such as orderingProperties(), uses a window, region of n (radius * 2 + 1) proteins centered at a protein, whose radius changes the output. The Transcriptogram object has a **radius** slot that can be setted during, or after, its preprocessing (see **Transcriptogram-class** documentation). ```{r message = FALSE} ## during the preprocessing ## creating the object and setting the radius as 0 t <- transcriptogramPreprocess(association = association, ordering = Hs900) ## creating the object and setting the radius as 50 t <- transcriptogramPreprocess(association = association, ordering = Hs900, radius = 50) ``` ```{r message = FALSE} ## after the preprocessing ## modifying the radius of an existing Transcriptogram object radius(object = t) <- 25 ## getting the radius of an existing Transcriptogram object r <- radius(object = t) ``` The output of the orderingProperties() method is partially affected by the radius slot. ```{r message = FALSE} oPropertiesR25 <- orderingProperties(object = t, nCores = 1) ## slight change of radius radius(object = t) <- 30 ## this output is partially different comparing to oPropertiesR25 oPropertiesR30 <- orderingProperties(object = t, nCores = 1) ``` As the connectivityProperties() method does not uses a window, its output is not affected by the radius slot. ```{r message = FALSE} cProperties <- connectivityProperties(object = t) ``` ## Transcriptogram A transcriptogram is generated in two steps and requires expression values, from microarray or RNA-Seq assays, and a dictionary. This example uses the datasets **GSE9988**, which contains normalized expression values of 3 cases and 3 controls referring to the innate immune responses to TREM-1 activation, and **GPL570**, a mapping between ENSEMBL Peptide ID and Affymetrix Human Genome U133 Plus 2.0 Array probe identifier. The methods to generate a transcriptogram are **transcriptogramStep1()** and **transcriptogramStep2()**. The transcriptogramStep1() assigns to each protein, of each transcriptome sample, the average of the expression values of all the identifiers related to it. ```{r message = FALSE} t <- transcriptogramStep1(object = t, expression = GSE9988, dictionary = GPL570, nCores = 1) ``` To each position of the ordering, the transcriptogramStep2() method assigns a value equal to the average of the expression values inside a window, which considers periodic boundary conditions to deal with proteins near the ends of the ordering, in order to reduce random noise. ```{r message = FALSE} t <- transcriptogramStep2(object = t, nCores = 1) ``` The Transcriptogram object has slots to store the outputs of the transcriptogramStep1() and transcriptogramStep2() methods, called transcriptogramS1 and transcriptogramS2 respectively. As the output of some methods are affected by the content of the transcriptogramS2 slot, it can be recalculated using the content of the transcriptogramS1 slot. ```{r message = FALSE} radius(object = t) <- 50 t <- transcriptogramStep2(object = t, nCores = 1) ``` ## Functional enrichment analysis As nearby genes of a transcriptogram have a high probability to interact with each other, gene sets whose expression are altered can be identified using the `r Biocpkg("limma")` package. The **differentiallyExpressed()** method uses the limma package to identify differentially expressed genes, for the contrast "case-control", grouping as a cluster a set of genes which positions are within a radius range specified by the content of the radius slot. For this example, the p-value threshold for false discovery rate will be setted as 0.005. If the name of a species is provided on the input, the `r Biocpkg("biomaRt")` package will be used to translate the ENSEMBL Peptide ID to Symbol (Gene Name), alternatively, a data frame can be provided and used instead. The levels argument classify the columns of the transcriptogramS2 slot referring to samples, as there are 6 columns (see dataset **GSE9988**), is created a logical vector that uses TRUE to label the columns referring to controls samples, and FALSE to label the columns referring to case samples. ```{r message = FALSE, fig.show = "hide"} levels <- c(rep(FALSE, 3), rep(TRUE, 3)) t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.005) ``` ```{r eval = FALSE} ## translating ENSEMBL Peptide IDs to Symbols using the biomaRt package ## Internet connection is required for this command t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.005, species = "Homo sapiens") ## translating ENSEMBL Peptide IDs to Symbols using the DEsymbols dataset t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.005, species = DEsymbols) ``` This method also produces a plot referring to its output. In this case, four clusters were detected, and each one is represented by a color. It is important to mention that not all the colored genes were detected as differentially expressed, but, as they were within the radius specified by the content of the radius slot, they were included in a cluster. The genes that are above the horizontal blue line are upregulated, and the genes that are below are downregulated. ![](Rplot.png) The differentially expressed genes identified by this method are stored in the DE slot of the Transcriptogram object, its content can be obtained using the DE method. By default, the p-values are adjusted by the Benjamini-Hochberg procedure. ```{r message = FALSE} DE <- DE(object = t) ``` The **clusterVisualization()** method uses the `r Biocpkg("RedeR")` package to display graphs of the differentially expressed clusters and returns an object of the RedPort Class, allowing interactions through functions of the RedeR package. This method may take some time depending on the number of clusters, and nodes per cluster, and requires the Java Runtime Environment (>= 6). If the DE slot of the Transcriptogram object has a column named Symbol, its contents will be used as node alias. ```{r eval = FALSE} rdp <- clusterVisualization(object = t) ``` ![](RedeR.png) The **clusterEnrichment()** method perform a functional enrichment analysis using the `r Biocpkg("topGO")` package. By default, the universe is composed by all the proteins present in the transcriptogramS2 slot, the ontology is setted to biological process, the algorithm is setted to classic, the statistic is setted to fisher, and the p-values are adjusted by the Benjamini-Hochberg procedure. For this example, the p-value threshold for false discovery rate will be setted as 0.005. This method uses the biomaRt package to build a gene2GO list if the name of a species is provided on the input, alternatively, a data frame can be provided and used instead. ```{r message = FALSE} ## using the HsBPTerms dataset to create the gene2GO list terms <- clusterEnrichment(object = t, species = HsBPTerms, pValue = 0.005, nCores = 1) ``` ```{r eval = FALSE} ## using the biomaRt package to create the gene2GO list ## Internet connection is required for this command terms <- clusterEnrichment(object = t, species = "Homo sapiens", pValue = 0.005, nCores = 1) ``` ```{r echo = FALSE} load("terms.RData") ``` ```{r} head(terms) ``` # Session info ```{r} sessionInfo() ``` ```{r} warnings() ``` # References