signifinder is an R package for computing and exploring a compendium of tumor signatures. It computes a variety of signature scores from gene expression values and supports their exploration by providing functions to visualize single or multiple signatures. Currently, signifinder contains 46 distinct signatures collected from the literature, relating to multiple tumors and multiple cancer processes.
signifinder 1.2.1
In cancer studies, transcriptional signatures are regarded as good indicators of cancer phenotypes: they can reveal the ongoing activities of a tumor and can be used for patient stratification. For these reasons, they are considered potentially useful for guiding therapeutic decisions and monitoring interventions. Moreover, transcriptional signatures from RNA-seq experiments are also used to assess the complex relations between the tumor and its microenvironment. In recent years, new technologies for transcriptome profiling (single-cell RNA-seq and spatial transcriptomics) have highlighted the highly heterogeneous behaviour of this disease and, as a result, the need to dissect its complexity. To this end, the combined analysis of multiple signatures may reveal possible correlations between different tumor processes and allow patients (or cells or spots) to be stratified at a broader level of information.
Transcriptional signatures are based upon a specific gene set (and possibly a set of coefficients that weight the gene contributions differently) whose expression levels are combined into a score designed to provide a single-sample (-cell, -spot) prediction. Hence, signatures consist not only of a list of genes but also of an algorithm that defines the computation of the single-sample prediction score. Despite much evidence that computational implementations improve data reproducibility, applicability and dissemination, the vast majority of signatures are not published along with their computational code, and only a few of them have been implemented in software. Notable examples are the R package consensusOV, dedicated to the TCGA ovarian cancer signature, and the R package genefu, which hosts some of the most popular breast cancer signatures.
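To make the idea of "a gene list plus an algorithm" concrete, the sketch below computes one score per sample as a weighted sum of gene expression values. The gene names, coefficients and the weighted-sum rule are illustrative assumptions; they do not correspond to any published signature or to the internal code of signifinder.

```r
# Toy expression matrix: genes in rows, samples in columns (made-up values).
expr <- matrix(
    c(2, 4, 6,
      1, 3, 5,
      8, 6, 4),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                    c("sample1", "sample2", "sample3"))
)

# Hypothetical signature: a gene list with one coefficient per gene.
signature_weights <- c(GENE_A = 0.5, GENE_B = 1.0, GENE_C = -0.25)

# Keep only the signature genes that are present in the dataset.
common_genes <- intersect(names(signature_weights), rownames(expr))

# One score per sample: the weighted sum of the signature gene expressions.
# The weight vector has one entry per row, so it is recycled down each column.
scores <- colSums(expr[common_genes, , drop = FALSE] *
                  signature_weights[common_genes])
scores
```

Other signatures replace the weighted sum with a different rule (for example a mean, a rank-based enrichment score or a trained classifier), which is why the algorithm has to be published together with the gene list.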
signifinder has been developed to provide easy and fast computation of several published signatures. Thanks to its compatibility with Bioconductor data structures and procedures, signifinder can easily be integrated with the most popular expression data analysis packages to complement the results and improve data interpretation.
Several visualization functions are also implemented to display the scores obtained from signatures. These help with interpreting the results: users can not only browse single signatures independently but also compare them with each other.
To install this package:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("signifinder")
Stringent criteria for the inclusion of the signatures were established: (i) signatures should address cancer topics, and be developed and used on cancer samples; (ii) signatures should exclusively use transcriptomic data, though exceptions have been made for combinations of gene expression and signature-related gene weights; (iii) signatures must release a clear gene list used for the signature definition, where all genes have an official gene symbol (HUGO) or an unambiguous translation (genes without an official gene symbol are removed); (iv) the method to calculate expression-based scores should be unambiguously described; (v) additional clarity about the type of expression in the input (e.g., counts, log counts, FPKM, or others) may also be required.
In the current release of signifinder, all the included signatures rely on bulk tumor expression experiments, even though the package infrastructure could potentially store and manage signatures derived from single-cell and spatial transcriptomics. Further, while it may never be possible to include all cancer signatures proposed in the literature, our package makes it easy to add new signatures (by us or by others via “pull requests”, see Adding new signatures).
The input expression dataset must contain normalized RNA-seq counts (or a normalized data matrix from microarrays) of bulk, single-cell or spatial transcriptomics data. It should be provided as a matrix, a data frame or a SummarizedExperiment (or, for single-cell and spatial data respectively, a SingleCellExperiment or SpatialExperiment). In the latter case, the assay containing the normalized values must be named “norm_expr”. Regardless of the input type, the output is a SummarizedExperiment (SingleCellExperiment/SpatialExperiment) in which the computed scores are stored in the colData section.
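As a minimal sketch of the expected input container, the following builds a SummarizedExperiment whose only assay is named “norm_expr”. The matrix values and the gene and sample names are made up for illustration.

```r
library(SummarizedExperiment)

# Toy normalized expression matrix: genes in rows, samples in columns.
set.seed(42)
norm_mat <- matrix(
    rpois(20, lambda = 50),
    nrow = 5,
    dimnames = list(paste0("GENE", 1:5), paste0("sample", 1:4))
)

# The assay holding the normalized values is named "norm_expr".
se <- SummarizedExperiment(assays = list(norm_expr = norm_mat))

assayNames(se)   # names of the stored assays
colData(se)      # empty for now; computed signature scores would be stored here
```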
Gene lists of signatures reported in the literature are typically given as gene symbols, but signifinder can use gene symbols, NCBI Entrez IDs or Ensembl gene IDs. Users specify which of the three identifiers their data use (SYMBOL, ENTREZID or ENSEMBL) through the nametype argument of the signature functions, so that the package can convert the signature gene lists to match the gene identifiers in the data.
When a signature is computed, a message reports the percentage of genes used for the calculation relative to the original signature gene list. There is no minimum threshold of genes for a signature to be computed, but a warning is issued if fewer than 30% of the signature genes are available. After a signature has been calculated, the expression of its genes can be inspected visually using geneHeatmapSignPlot (see Gene Expression Heatmap).
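The coverage check described above can be sketched in a few lines of base R. This is only an illustration of the logic, with a hypothetical gene list and hypothetical message texts; it is not the code used inside signifinder.

```r
# Hypothetical signature gene list and dataset gene identifiers.
signature_genes <- c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5",
                     "GENE6", "GENE7", "GENE8", "GENE9", "GENE10")
dataset_genes <- c("GENE1", "GENE4", "GENE42", "GENE99")

# Percentage of signature genes that are present in the dataset.
found <- signature_genes %in% dataset_genes
pct_used <- round(100 * mean(found))
message("signature is using ", pct_used, "% of signature genes")

# No hard minimum: the score would still be computed, but low coverage is flagged.
low_coverage <- pct_used < 30
if (low_coverage) {
    warning("fewer than 30% of signature genes are available")
}
```

A score based on a small fraction of the original genes may not reflect the published signature, so it is worth checking these percentages before interpreting the results.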
Furthermore, the original works that provide the signatures also specify the type of expression value (e.g. normalized values, TPM (transcripts per million), log(TPM), etc.) that should be used to compute the signature. Therefore, during signature computation, the data should be converted, where necessary, as reported in the original work. When using signifinder, users must supply the input data as normalized counts (or normalized arrays) and, for the signatures that require it, a data transformation step is performed automatically. The transformed data matrix is included in the output as an additional assay named after the conversion (i.e. “TPM”, “CPM” or “FPKM”). Alternatively, if the input is a SummarizedExperiment object that already contains, in addition to the normalized counts, an assay of the transformed data, this assay is used directly. Note that, to be used, such assays must be named “TPM”, “CPM” or “FPKM”. Finally, the included signatures were developed from both array and RNA-seq data, so it is crucial for users to specify the type of data used, “microarray” or “rnaseq”, through the inputType argument of the signature functions. In signifinder, signatures for microarray can be applied to RNA-seq data but not vice versa, due to input type conversions.
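For readers who prefer to supply a pre-transformed assay, the sketch below computes counts per million (CPM) with base R and pairs it with the normalized counts under the assay names mentioned above. It assumes the plain definition of CPM (each column scaled so that it sums to one million); the conversion performed by the package itself may differ in its details, and the count values are made up.

```r
# Toy count matrix: genes in rows, samples in columns (made-up values).
counts <- matrix(
    c(10,  20,
      30,  60,
      60, 120),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("sample1", "sample2"))
)

# Counts per million: divide each column by its total and scale to 1e6.
cpm <- sweep(counts, 2, colSums(counts), FUN = "/") * 1e6

# Assays named as described above; a list like this could be passed to
# SummarizedExperiment(assays = ...) so that the "CPM" assay is found by name.
assay_list <- list(norm_expr = counts, CPM = cpm)
names(assay_list)
```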
In the following section, we use an example bulk expression dataset of ovarian cancer to show how to use signifinder in a standard workflow.
# loading packages
library(SummarizedExperiment)
library(signifinder)
library(dplyr)
data(ovse)
ovse
## class: SummarizedExperiment
## dim: 1456 40
## metadata(0):
## assays(4): norm_expr TPM CPM FPKM
## rownames(1456): ACOT7 ADORA3 ... TMSB4Y USP9Y
## rowData names(0):
## colnames(40): sample1 sample2 ... sample39 sample40
## colData names(40): OV_subtype os ... DNArep_Kang IPSOV_Shen
We can check all the signatures available in the package with the function availableSignatures.
availSigns <- availableSignatures()
The function returns a data frame with all the signatures included in the package and for each signature the following information:
 | 1 |
---|---|
signature | EMT_Miow |
scoreLabel | EMT_Miow_Epithelial, EMT_Miow_Mesenchymal |
functionName | EMTSign |
topic | epithelial to mesenchymal |
tumor | ovarian cancer |
tissue | ovary |
requiredInput | microarray, rnaseq |
transformationStep | normArray, normCounts |
author | Miow |
reference | Miow Q. et al. Oncogene (2015) |
description | Double score obtained with ssGSEA to establish the epithelial- and the mesenchymal-like status in ovarian cancer patients. |
We can also query the table for the signatures available for a specific tissue (e.g. ovary).
ovary_signatures <- availableSignatures(tissue = "ovary",
                                        description = FALSE)
 | signature | scoreLabel | functionName | topic | tumor | tissue | requiredInput | transformationStep | author | reference |
---|---|---|---|---|---|---|---|---|---|---|
1 | EMT_Miow | EMT_Miow_Epithelial, EMT_Miow_Mesenchymal | EMTSign | epithelial to mesenchymal | ovarian cancer | ovary | microarray, rnaseq | normArray, normCounts | Miow | Miow Q. et al. Oncogene (2015) |
4 | Pyroptosis_Ye | Pyroptosis_Ye | pyroptosisSign | pyroptosis | ovarian cancer | ovary | rnaseq | FPKM | Ye | Ye Y. et al. Cell Death Discov. (2021) |
8 | Ferroptosis_Ye | Ferroptosis_Ye | ferroptosisSign | ferroptosis | ovarian cancer | ovary | microarray, rnaseq | normArray, FPKM | Ye | Ye Y. et al. Front. Mol. Biosci. (2021) |
12 | LipidMetabolism_Zheng | LipidMetabolism_Zheng | lipidMetabolismSign | metabolism | epithelial ovarian cancer | ovary | rnaseq | normCounts | Zheng | Zheng M. et al. Int. J. Mol. Sci. (2020) |
14 | ImmunoScore_Hao | ImmunoScore_Hao | immunoScoreSign | immune system | epithelial ovarian cancer | ovary | microarray, rnaseq | normArray, log2(FPKM+0.01) | Hao | Hao D. et al. Clin Cancer Res (2018) |
16 | ConsensusOV_Chen | ConsensusOV_Chen_IMR, ConsensusOV_Chen_DIF, ConsensusOV_Chen_PRO, ConsensusOV_Chen_MES | consensusOVSign | ovarian subtypes | high-grade serous ovarian carcinoma | ovary | microarray, rnaseq | normArray, normCounts | Chen | Chen G.M. et al. Clin Cancer Res (2018) |
18 | Matrisome_Yuzhalin | Matrisome_Yuzhalin | matrisomeSign | extracellular matrix | ovarian cystadenocarcinoma, gastric adenocarcinoma, colorectal adenocarcinoma, lung adenocarcinoma | ovary, lung, stomach, colon | microarray, rnaseq | normArray, normCounts | Yuzhalin | Yuzhalin A. et al. Br J Cancer (2018) |
43 | HRDS_Lu | HRDS_Lu | HRDSSign | chromosomal instability | ovarian cancer, breast cancer | ovary, breast | microarray, rnaseq | normArray, normCounts | Lu | Lu J. et al. J Mol Med (2014) |
45 | DNArep_Kang | DNArep_Kang | DNArepSign | chromosomal instability | serous ovarian cystadenocarcinoma | ovary | microarray, rnaseq | normArray, log2(normCount+1) | Kang | Kang J. et al. JNCI (2012) |
46 | IPSOV_Shen | IPSOV_Shen | IPSOVSign | immune system | ovarian cancer | ovary | microarray, rnaseq | normArray, log2(normCount+1) | Shen | Shen S. et al. EBiomed (2019) |
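Because the result is an ordinary data frame, it can be filtered further with base R. The sketch below keeps the ovary signatures whose requiredInput field mentions microarray data; the column names are those shown in the table above, while the choice of filter and of displayed columns is only an example.

```r
library(signifinder)

# Table of ovary signatures, without the description column.
ovary_signatures <- availableSignatures(tissue = "ovary",
                                        description = FALSE)

# Keep only the signatures whose requiredInput includes microarray data.
array_ok <- grepl("microarray", ovary_signatures$requiredInput)
ovary_signatures[array_ok, c("signature", "functionName", "author")]
```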
Once we have found a signature of interest, we can compute it with the corresponding function (indicated in the functionName field of the availableSignatures table). All the signature functions require the expression data and the type of input data (inputType equal to “rnaseq” or “microarray”). The data are expected to be normalized expression values in the form of a data frame or a matrix with genes in rows and samples in columns. Alternatively, they can be supplied as a SummarizedExperiment object containing an assay called “norm_expr”, where rows correspond to genes and columns correspond to samples.
ovse <- ferroptosisSign(dataset = ovse,
                        inputType = "rnaseq")
## ferroptosisSignYe is using 100% of signature genes
Signatures are often grouped in the same function by cancer topic, even if they deal with different cancer types and computation approaches. We can unambiguously choose the one we are interested in by stating the first author of the signature (indicated in the author field of the availableSignatures table). For example, there are currently three different epithelial-to-mesenchymal transition (EMT) signatures implemented inside the EMTSign function (“Miow”, “Mak” or “Cheng”). We can choose which one to compute through the author argument:
ovse <- EMTSign(dataset = ovse,
                inputType = "rnaseq",
                author = "Miow")
## EMTSignMiow is using 96% of epithelial signature genes
## EMTSignMiow is using 91% of mesenchymal signature genes
## Warning in .filterFeatures(expr, method): 1 genes with constant expression
## values throuhgout the samples.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
In this way, “EMT_Miow” is computed. Regardless of the expression input type, the output of all the signature functions is a SummarizedExperiment with the original expression data in the assay and the computed signature scores in the colData. Thus, the returned object can be resubmitted as input to another signature function, which returns it with the new signature added to the colData.
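The chaining described above can be sketched as follows: two signature functions are applied in sequence to the example dataset and the resulting score columns are read back from the colData. The nametype argument is stated explicitly here because the row names of ovse are gene symbols; the column names are taken from the scoreLabel field of the availableSignatures table.

```r
library(SummarizedExperiment)
library(signifinder)

data(ovse)

# First signature: scores are stored in the colData of the returned object.
ovse <- ferroptosisSign(dataset = ovse,
                        nametype = "SYMBOL",   # row names of ovse are gene symbols
                        inputType = "rnaseq")

# The returned object is passed straight to a second signature function.
ovse <- EMTSign(dataset = ovse,
                nametype = "SYMBOL",
                inputType = "rnaseq",
                author = "Miow")

# Score columns are named after the scoreLabel field of availableSignatures().
score_cols <- c("Ferroptosis_Ye", "EMT_Miow_Epithelial", "EMT_Miow_Mesenchymal")
head(as.data.frame(colData(ovse)[, score_cols]))
```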
We can also compute multiple signatures at once with the function multipleSign. If we supply only the expression dataset and the input type, all the signatures are computed. Otherwise, we can select a subgroup of signatures through the arguments tissue, tumor and/or topic, which define signature attributes that narrow the signature list. Alternatively, we can name exactly the signatures to compute using the whichSign argument. For example, here we compute all the available signatures for ovary and pan-tissue:
ovse <- multipleSign(dataset = ovse,
                     inputType = "rnaseq",
                     tissue = c("ovary", "pan-tissue"))
## EMTSignMiow is using 96% of epithelial signature genes
## EMTSignMiow is using 91% of mesenchymal signature genes
## Warning in .filterFeatures(expr, method): 1 genes with constant expression
## values throuhgout the samples.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
## EMTSignMak is using 96% of epithelial signature genes
## EMTSignMak is using 100% of mesenchymal signature genes
## pyroptosisSignYe is using 86% of signature genes
## ferroptosisSignYe is using 100% of signature genes
## lipidMetabolismSign is using 100% of signature genes
## hypoxiaSign is using 92% of signature genes
## immunoScoreSignHao is using 100% of signature genes
## immunoScoreSignRoh is using 100% of signature genes
## 'select()' returned 1:1 mapping between keys and columns
## Loading training data
## Training Random Forest...
## IPSSign is using 98% of signature genes
## matrisomeSign is using 100% of signature genes
## mitoticIndexSign is using 100% of signature genes
## ImmuneCytSignRooney is using 100% of signature genes
## IFNSign is using 100% of signature genes
## expandedImmuneSign is using 100% of signature genes
## TinflamSign is using 100% of signature genes
## CINSign is using 96% of signature genes
## CINSign is using 94% of signature genes
## cellCycleSignLundberg is using 93% of signature genes
## cellCycleSignDavoli is using 100% of signature genes
## ASCSign is using 92% of signature genes
## ImmuneCytSignDavoli is using 100% of signature genes
## ChemokineSign is using 100% of signature genes
## ECMSign is using 100% of up signature genes
## ECMSign is using 93% of down signature genes
## Warning in .filterFeatures(expr, method): 1 genes with constant expression
## values throuhgout the samples.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
## HRDSSign is using 89% of signature genes
## VEGFSign is using 100% of signature genes
## DNArepSign is using 87% of signature genes
## IPSOVSign is using 100% of signature genes
## Warning in .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking,
## : Some gene sets have size one. Consider setting 'min.sz > 1'.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
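As a complement to selecting by tissue, the sketch below uses the whichSign argument to compute two signatures by name. The names are assumed to match the signature column of the availableSignatures table; the two signatures chosen are only an example.

```r
library(SummarizedExperiment)
library(signifinder)

data(ovse)

# Compute only the named signatures (names as in the "signature" column
# of the availableSignatures() table).
ovse <- multipleSign(dataset = ovse,
                     inputType = "rnaseq",
                     whichSign = c("Ferroptosis_Ye", "Pyroptosis_Ye"))

# Both scores can be read from the colData of the returned object.
summary(ovse$Ferroptosis_Ye)
summary(ovse$Pyroptosis_Ye)
```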
As a first step, we can visualize some technical parameters of the signatures to evaluate their reliability for our analysis. The evaluationSignPlot function returns a multipanel plot that shows, for each signature: (i) the percentage of genes from the signature gene list that are actually available in the dataset; (ii) the log2 average expression of these genes; (iii) the percentage of zero values in them; (iv) the correlation between scores and total read counts; (v) the correlation between scores and the percentage of total zero values.
evaluationSignPlot(data = ovse)