Version: 1.22.0
The transcriptogramer package (Morais 2019) is designed for transcriptional analysis based on transcriptograms, a method to analyze transcriptomes that projects expression values on a set of ordered proteins, arranged such that the probability that gene products participate in the same metabolic pathway exponentially decreases with the increase of the distance between two proteins of the ordering. Transcriptograms are, hence, genome wide gene expression profiles that provide a global view for the cellular metabolism, while indicating gene sets whose expression are altered (da Silva 2014; Rybarczyk-Filho 2011; de Almeida 2016; Ferrareze 2017; Xavier 2017).
Methods are provided to analyze topological properties of a Protein-Protein Interaction (PPI) network, to generate transcriptograms, to detect and to display differentially expressed gene clusters, and to perform a Gene Ontology Enrichment Analysis on these clusters.
As a set of ordered proteins is required in order to run the methods, datasets are available for four species (Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Rattus norvegicus). Each species has three datasets, originated from STRINGdb release 11.0 protein network data, with combined scores greater than or equal to 700, 800 and 900 (see Hs900, Hs800, Hs700, Mm900, Mm800, Mm700, Sc900, Sc800, Sc700, Rn900, Rn800 and Rn700 datasets). Custom sets of ordered proteins can be generated from protein network data using The transcriptogramer on Windows, or an implementation of the seriation algorithm on Linux.
The first step is to create a Transcriptogram object by running the
transcriptogramPreprocess()
method. This example uses a subset of the
Homo sapiens protein network data, from STRINGdb release 11.0, containing only
associations of proteins of combined score greater than or equal to 900
(see Hs900 and association datasets).
library(transcriptogramer)
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
t <- transcriptogramPreprocess(association = association, ordering = Hs900)
There are two methods to perform topological analysis,
connectivityProperties()
calculates average graph properties as function
of node connectivity, and orderingProperties()
calculates graph
properties projected on the ordered proteins. Some methods, such as
orderingProperties(), uses a window, region of n (radius * 2 + 1) proteins
centered at a protein, whose radius changes the output. The Transcriptogram
object has a radius slot that can be setted during, or after, its
preprocessing (see Transcriptogram-class documentation).
## during the preprocessing
## creating the object and setting the radius as 0
t <- transcriptogramPreprocess(association = association, ordering = Hs900)
## creating the object and setting the radius as 80
t <- transcriptogramPreprocess(association = association, ordering = Hs900,
radius = 80)
## after the preprocessing
## modifying the radius of an existing Transcriptogram object
radius(object = t) <- 50
## getting the radius of an existing Transcriptogram object
r <- radius(object = t)
As window related metrics are affected by the radius, the output of the orderingProperties() method changes depending on the content of the radius slot. A window modularity value close to 1 indicates dense connections between the genes inside the window, as well as sparse connections between these genes and the other genes in the network. Note that the sum of the window modularity increased using the radius 80.
oPropertiesR50 <- orderingProperties(object = t, nCores = 1)
## slight change of radius
radius(object = t) <- 80
## this output is partially different comparing to oPropertiesR50
oPropertiesR80 <- orderingProperties(object = t, nCores = 1)
sum(oPropertiesR50$windowModularity)
[1] 3346.246
sum(oPropertiesR80$windowModularity)
[1] 4249.471
However, the connectivityProperties() method does not use a window, thus, its output is not affected by the radius slot.
cProperties <- connectivityProperties(object = t)
A transcriptogram is generated in two steps and requires expression values, from microarray or RNA-Seq assays (log2-counts-per-million), and a dictionary. This example uses the datasets GSE9988, which contains normalized expression values of 3 cases and 3 controls (GSM252443, GSM252444, GSM252445, GSM252465, GSM252466 and GSM252467 respectively), and GPL570, a mapping between ENSEMBL Peptide ID and Affymetrix Human Genome U133 Plus 2.0 Array probe identifier.
The methods to generate a transcriptogram are transcriptogramStep1()
and
transcriptogramStep2()
. The transcriptogramStep1() assigns to each protein,
of each transcriptome sample, the average of the expression values of all the
identifiers related to it.
t <- transcriptogramStep1(object = t, expression = GSE9988,
dictionary = GPL570, nCores = 1)
t2 <- t
To each position of the ordering, the transcriptogramStep2() method assigns a value equal to the average of the expression values inside a window, which considers periodic boundary conditions to deal with proteins near the ends of the ordering, in order to reduce random noise.
t <- transcriptogramStep2(object = t, nCores = 1)
The Transcriptogram object has slots to store the outputs of the transcriptogramStep1() and transcriptogramStep2() methods, called transcriptogramS1 and transcriptogramS2 respectively. As the output of some methods are affected by the content of the transcriptogramS2 slot, it can be recalculated using the content of the transcriptogramS1 slot.
radius(object = t2) <- 50
t2 <- transcriptogramStep2(object = t2, nCores = 1)
As nearby genes of a transcriptogram have a high probability to interact with
each other, gene sets whose expression are altered can be identified using the
limma package. The differentiallyExpressed()
method uses the
limma package to identify differentially expressed genes (the approaches voom and trend are supported for RNA-Seq), for the contrast
“case-control”, grouping as a cluster
a set of genes which positions are within a radius range specified by the
content of the radius slot.
For this example, the p-value threshold for false discovery rate will be set as 0.01. If a species name is provided, the biomaRt package is used to translate the ENSEMBL Peptide ID to Symbol (Gene Name), alternatively, a data.frame with such content can be provided. The levels argument classify the columns of the transcriptogramS2 slot referring to samples, as there are 6 columns (see dataset GSE9988), is created a logical vector that uses TRUE to label the columns referring to controls samples, and FALSE to label the columns referring to case samples.
## trend = FALSE for microarray data or voom log2-counts-per-million
## the default value for trend is FALSE
levels <- c(rep(FALSE, 3), rep(TRUE, 3))
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
trend = FALSE, title = "radius 80")
## the radius 50 will affect the output significantly
t2 <- differentiallyExpressed(object = t2, levels = levels, pValue = 0.01,
species = DEsymbols, title = "radius 50")
## using the species argument to translate ENSEMBL Peptide IDs to Symbols
## Internet connection is required for this command
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
species = "Homo sapiens", title = "radius 80")
## translating ENSEMBL Peptide IDs to Symbols using the DEsymbols dataset
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
species = DEsymbols, title = "radius 80")
This method also produces a plot referring to its output. Each cluster detected is represented by a color. The genes that are above the horizontal black line are upregulated, and the genes that are below are downregulated.
The differentially expressed genes identified by this method are stored in the DE slot of the Transcriptogram object, its content can be obtained using the DE method. By default, the p-values are adjusted by the Benjamini-Hochberg procedure. Note that the differential expression on the object of radius 80 detected less clusters, but there are more windows centers significantly altered on it, thus, the clusters are more consistent. Therefore, the next methods will be performed only on the object of radius 80.
DE <- DE(object = t)
DE2 <- DE(object = t2)
nrow(DE)
[1] 393
nrow(DE)/length(unique(DE$ClusterNumber))
[1] 78.6
nrow(DE2)
[1] 421
nrow(DE2)/length(unique(DE2$ClusterNumber))
[1] 30.07143
The clusterVisualization()
method uses the RedeR package to
display graphs of the differentially expressed clusters and returns an
object of the RedPort Class, allowing interactions through methods
of the RedeR package. This method requires the Java Runtime Environment
(>= 6).
rdp <- clusterVisualization(object = t)
The clusterEnrichment()
method performs a Gene Ontology enrichment analysis
using the topGO package. By default, the universe is composed by
all the proteins present in the ordering slot, the ontology is setted
to biological process, the algorithm is setted to classic, the statistic is
setted to fisher, and the p-values are adjusted by the Benjamini-Hochberg
procedure. For this example, the p-value threshold for false
discovery rate will be set as 0.005. This method uses the biomaRt package to
build a Protein2GO data.frame using the given species name, or data.frame.
## using the HsBPTerms dataset to create the Protein2GO data.frame
t <- clusterEnrichment(object = t, species = HsBPTerms,
pValue = 0.005, nCores = 1)
## using the species argument to create the Protein2GO data.frame
## Internet connection is required for this command
t <- clusterEnrichment(object = t, species = "Homo sapiens",
pValue = 0.005, nCores = 1)
head(Terms(t))
GO.ID Term Annotated Significant
1 GO:0006357 regulation of transcription by RNA polym... 1809 315
2 GO:0006366 transcription by RNA polymerase II 1937 315
3 GO:0006355 regulation of transcription, DNA-templat... 2427 344
4 GO:1903506 regulation of nucleic acid-templated tra... 2472 345
5 GO:2001141 regulation of RNA biosynthetic process 2480 345
6 GO:0006351 transcription, DNA-templated 2575 345
Expected pValue pAdj ClusterNumber
1 81.52 1e-30 4.076053e-28 1
2 87.29 1e-30 4.076053e-28 1
3 109.37 1e-30 4.076053e-28 1
4 111.40 1e-30 4.076053e-28 1
5 111.76 1e-30 4.076053e-28 1
6 116.04 1e-30 4.076053e-28 1
enrichmentPlot(t)
How does transcriptogramer deal in cases where Protein-Protein Interactions (PPIs) for expressed proteins are missing?
The transcriptogramStep1()
method removes rows from the expression data which are not present in the set of ordered proteins. As the association and ordering arguments of the transcriptogramPreprocess()
method must contain the same proteins, if a row of the expression data cannot be mapped to one of these proteins, it will be discarded.
How does transcriptogramer deal in cases where there’s no expression data for some PPIs in the network?
When there is no expression data related to a given protein of the set of ordered proteins, this protein expression value is treated as a missing value. In this case, the sliding window will discard missing values, sum the remaining ones, and divide this value by radius * 2 + 1
while running the transcriptogramStep2()
method.
Does the transcriptogramer accept/convert protein/gene IDs other than the ones used in the vignette?
This is not possible in the current transcriptogramer version. The transcriptogramer is only able to automatically convert ENSEMBL Peptide IDs to Symbols. As this can be done by the biomaRt package, arguments will be added in the future to allow the automatic conversion of other IDs, such as ENTREZ to ENSEMBL Gene ID. This change will be reported here and in the package NEWS.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] transcriptogramer_1.22.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 farver_2.1.1 dplyr_1.1.2
## [4] blob_1.2.4 filelock_1.0.2 Biostrings_2.68.0
## [7] bitops_1.0-7 fastmap_1.1.1 RCurl_1.98-1.12
## [10] BiocFileCache_2.8.0 XML_3.99-0.14 digest_0.6.31
## [13] lifecycle_1.0.3 topGO_2.52.0 KEGGREST_1.40.0
## [16] RSQLite_2.3.1 magrittr_2.0.3 compiler_4.3.0
## [19] rlang_1.1.0 sass_0.4.5 progress_1.2.2
## [22] tools_4.3.0 doSNOW_1.0.20 igraph_1.4.2
## [25] utf8_1.2.3 yaml_2.3.7 data.table_1.14.8
## [28] knitr_1.42 prettyunits_1.1.1 bit_4.0.5
## [31] curl_5.0.0 xml2_1.3.3 withr_2.5.0
## [34] purrr_1.0.1 BiocGenerics_0.46.0 grid_4.3.0
## [37] stats4_4.3.0 fansi_1.0.4 colorspace_2.1-0
## [40] GO.db_3.17.0 ggplot2_3.4.2 scales_1.2.1
## [43] iterators_1.0.14 biomaRt_2.56.0 cli_3.6.1
## [46] rmarkdown_2.21 crayon_1.5.2 generics_0.1.3
## [49] httr_1.4.5 DBI_1.1.3 cachem_1.0.7
## [52] stringr_1.5.0 zlibbioc_1.46.0 parallel_4.3.0
## [55] AnnotationDbi_1.62.0 BiocManager_1.30.20 XVector_0.40.0
## [58] matrixStats_0.63.0 vctrs_0.6.2 SparseM_1.81
## [61] jsonlite_1.8.4 bookdown_0.33 IRanges_2.34.0
## [64] hms_1.1.3 S4Vectors_0.38.0 bit64_4.0.5
## [67] magick_2.7.4 foreach_1.5.2 limma_3.56.0
## [70] jquerylib_0.1.4 tidyr_1.3.0 snow_0.4-4
## [73] glue_1.6.2 codetools_0.2-19 stringi_1.7.12
## [76] gtable_0.3.3 GenomeInfoDb_1.36.0 munsell_0.5.0
## [79] tibble_3.2.1 pillar_1.9.0 rappdirs_0.3.3
## [82] htmltools_0.5.5 graph_1.78.0 GenomeInfoDbData_1.2.10
## [85] R6_2.5.1 dbplyr_2.3.2 lattice_0.21-8
## [88] evaluate_0.20 Biobase_2.60.0 png_0.1-8
## [91] memoise_2.0.1 RedeR_2.4.0 bslib_0.4.2
## [94] Rcpp_1.0.10 xfun_0.39 pkgconfig_2.0.3
warnings()
da Silva, S. R. M. et al. 2014. “Reproducibility Enhancement and Differential Expression of Non Predefined Functional Gene Sets in Human Genome.” BMC Genomics. https://doi.org/10.1186/1471-2164-15-1181.
de Almeida, R. M. C. et al. 2016. “Transcriptome Analysis Reveals Manifold Mechanisms of Cyst Development in Adpkd.” Human Genomics 10 (1): 1–24. https://doi.org/10.1186/s40246-016-0095-x.
Ferrareze, P. A. G. et al. 2017. “Transcriptional Analysis Allows Genome Reannotation and Reveals That Cryptococcus Gattii Vgii Undergoes Nutrient Restriction During Infection.” Microorganisms 5 (3). https://doi.org/10.3390/microorganisms5030049.
Morais, D. A. A. et al. 2019. “Transcriptogramer: An R/Bioconductor Package for Transcriptional Analysis Based on Protein–Protein Interaction.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btz007.
Rybarczyk-Filho, J. L. et al. 2011. “Towards a Genome-Wide Transcriptogram: The Saccharomyces Cerevisiae Case.” Nucleic Acids Research 39 (8): 3005–16. https://doi.org/10.1093/nar/gkq1269.
Xavier, S. R. M. et al. 2017. “Analysis of Genome Instability Biomarkers in Children with Non-Syndromic Orofacial Clefts.” Mutagenesis. https://doi.org/10.1093/mutage/gew068.