Please, use the following cite to reference KnowSeq R package within your own manuscripts or researches:
Castillo-Secilla, D., Galvez, J. M., Carrillo-Perez, F., Verona-Almeida, M., Redondo-Sanchez, D., Ortuno, F. M., … and Rojas, I. (2021). KnowSeq R-Bioc Package: The Automatic Smart Gene Expression Tool For Retrieving Relevant Biological Knowledge. Computers in Biology and Medicine, 104387.
To install and load KnowSeq package in R, it is necessary the previous installation of BiocManager from Bioconductor. The next code shows how this installation can be performed:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("KnowSeq")
library(KnowSeq)
KnowSeq is now available also on Docker by running the next command, allowing the use of KnowSeq without a previous installation:
KnowSeq proposes a novel methodology that comprises the most relevant steps in the Transcriptomic gene expression analysis. KnowSeq expects to serve as an integrative tool that allows to process and extract relevant biomarkers, as well as to assess them through a Machine Learning approaches. Finally, the last objective of KnowSeq is the biological knowledge extraction from the biomarkers (Gene Ontology enrichment, Pathway listing and Evidences related to the addressed disease). Although the package allows analyzing all the data manually, the main strength of KnowSeq is the possibility of carrying out an automatic and intelligent HTML report that collect all the involved steps in one document. Nowadays, there is no package that only from the information of the samples to align -included in a text file-, automatically performs the download and alignment of all of the samples. Furthermore, KnowSeq is the only package that allows applying both a machine learning and biomarkers enrichment processes just after the biomarkers extraction. It is important to highlight that the pipeline is totally modular and flexible, hence it can be started from whichever of the different steps. This pipeline has been used in our previous publications for processing raw RNA-seq data and to perform the biomarkers extraction along with the machine learning classifier design steps, also for their integration with microarray data [1,2,3,4].
The whole pipeline included in KnowSeq has been designed carefully with the purpose of achieving a great quality and robustness in each of the steps that conform the pipeline. For that, the pipeline has four fundamental processes:
The first process is focused on the Transcriptomic RAW data treatment. This step has the purpose of extracting a set of count files from raw files stored in the repositories supported by our package (NCBI/GEO [5] ArrayExpress [6] and GDC-Portal). The second one englobes the Differential Expressed Genes (DEGs) identification and extraction by using a novel parameter (Specifically for multiclass studies) defined as Coverage [3], and the assessment of those DEGs by applying advanced machine learning techniques (feature selection process and supervised classification). Once the DEGs are assessed, the next step is the DEGs enrichment methodology which allows retrieving biological information from the DEGs. In this process, relevant information (such as related diseases, biological processes associated and pathways) about the DEGs is retrieved by using very well-known tools and databases. The three types of enrichment are the Gene Ontology (GO) study, the pathways visualization taking into account the gene expression, and the Evidences related to the addressed disease from the final set of DEGs. Finally, all of this information can be displayed on an automatic and intelligent HTML report that contains the results of the complete study for the faced disease or diseases.
In order to avoid version incompatibilities with hisat2 aligner and the installation of the required tools, pre-compiled versions will be used to run the R functions. Consequently, all the tools were compressed and stored in an external server to be downloaded whenever it is required (http://iwbbio.ugr.es/utils/unixUtils.tar.gz). If the tools are directly downloaded from the link, the compressed files must be decompressed in the current project folder in R or RStudio. The name of the resultant folder must be “utils”. Nevertheless, this file can be downloaded automatically by just calling the function rawAlignment, in case the folder utils is not detected in the project folder. This is all needed to run hisat2 through the function rawAlignment. It is not possible to run the alignment without the utils folder. It must be mentioned too that the different files included in the compressed .tar.gz are not only the aligner but also functions needed in the raw alignment process. The tools included are the following:
The rawAlignment function allows running hisat2 aligner. The function takes as single input a CSV from GEO or ArrayExpress loaded in R. There is the possibility to process data from GDC-portal, but a previous authorization (token file) from this platform is required. Furthermore, there is a set of logical parameters to edit the default pipeline followed for the function. With the parameters the user can select if the BAM/SAM/Count files are created. The user can choose if wants to download the reference genome, the GTF, and which version. Even if the user has custom FASTA and GTF files, this can be specified by setting the parameter referenceGenome to “custom” and using the parameters customFA and customGTF to indicates the paths to the custom files. Other functionality is the possibility to process BAM files from the GDC Portal database by setting to TRUE the parameter fromGDC. Then the function will download the specific genome reference of GDC and process the BAM files to Count files. Furthermore, if the user has access to the controlled data, with the token and the manifest acquired from GDC Portal web platform, the samples can be downloaded automatically. An example to run the function with hisat2 aligner is showed below:
# Downloading one series from NCBI/GEO and one series from ArrayExpress
downloadPublicSeries(c("GSE74251"))
# Using read.csv for NCBI/GEO files (read.csv2 for ArrayExpress files)
GSE74251csv <- read.csv("ReferenceFiles/GSE74251.csv")
# Performing the alignment of the samples by using hisat2 aligner
rawAlignment(GSE74251csv,downloadRef=TRUE,downloadSamples=TRUE,BAMfiles = TRUE,
SAMfiles = TRUE,countFiles = TRUE,referenceGenome = 38, fromGDC = FALSE, customFA = "",
customGTF = "", tokenPath = "", manifest = "")
RawAlignment function creates a folder structure in the current project folder which will store all the downloaded and created files. The main folder of this structure is the folder ReferenceFiles but inside of it there are more folders that allows storing the different files used by the process in an organized way.
Another important requirement to take into account is the format of the csv file used to launch the function. It could be from three repositories, two publics (NCBI/GEO and ArrayExpress) and one controlled (GDC Portal). Each of these repositories has its own format in the csv file that contains the information to download and process the desired samples. The necessary format for each repository is explained below.
Series belonging to RNA-seq have a SRA identifier. If this identifier is clicked, a list with the samples that conform this series is showed. Then, the desired samples of the series can be checked and the CSV is automatically generated by clicking the button shown in the image below:
The previous selection generates a csv files that contains a number of columns with information about the samples. However, running the rawAlignment function only needs the three columns shown below in the csv (although the rest of the columns can be kept):
Run | download_path | LibraryLayout |
---|---|---|
SRR2753177 | sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… | SINGLE |
SRR2753178 | sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… | SINGLE |
SRR2753179 | sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… | SINGLE |
There is another way to obtain this csv automatically by calling the function downloadPublicSeries with the NCBI/GEO GSE ID of the wanted series, but this option does not let the user to choose the wanted samples and downloads all the samples of each selected series.
The process for ArrayExpress is the very similar to that for NCBI/GEO. It changes the way to download the csv and the name of the columns in the file. To download the csv there is a file finished as .sdrf.txt inside the RNA-seq series in ArrayExpress, as can be seen in the example below:
As with the NCBI/GEO csv, the csv of ArrayExpress requires only three columns as is shown below:
Comment[ENA_RUN] | Comment[FASTQ_URI] | Comment[LIBRARY_LAYOUT] |
---|---|---|
ERR1654640 | ftp.sra.ebi.ac.uk/vol1/fastq/ERR165/000/ERR16… | PAIRED |
ERR1654640 | ftp.sra.ebi.ac.uk/vol1/fastq/ERR165/000/ERR16… | PAIRED |
There is another way to achieve this csv automatically by calling the function downloadPublicSeries with the ArrayExpress MTAB ID of the wanted series, but this option does not let the user to choose the wanted samples, and therefore and downloads all the samples of each selected series.
GDC portal has the BAM files access restricted or controlled for the user who has access to them. However, the count files are open and can be used directly in this package as input of the function countsToMatrix. If there exist the possibility to download the controlled BAM files, the tsv file that this package uses to convert them into count files is the tsv file generated when the button Sample Sheet is clicked in the cart:
As in the other two repositories, there are a lot of columns inside the tsv files but this package only needs two of them. Furthermore, if the BAM download is carried out by the gdc-client or the web browser, the BAM has to be moved to the path ReferenceFiles/Samples/RNAseq/BAMFiles/Sample.ID/File.Name/ where Sample.ID and File.Name are the columns with the samples information in the tsv file. This folder is created automatically in the current project folder when the rawAlignment function is called, but it can be created manually. However, GDC portal has public access to count files that can be used in a posterior step of the KnowSeq pipeline to merge and analyze them.
It exists the possibility to download automatically the raw data from GDC portal by using the rawAlignment function. In order to carry this out, the function needs the parameters downloadSamples and fromGDC set to TRUE, the path to the token in order to obtain the authentication to download the controlled data and the path to the manifest that contains the information to download the samples. This step needs the permission of GDC portal to the controlled data.
From now on, the data that will be used for the documentation are real count files, but with a limited number of genes (around 1000). Furthermore, to reduce the computational cost of this example, only 5 samples from each of the two selected series will be taken into account. Showed in the code snippet below, two RNA-seq series from NCBI/GEO are downloaded automatically and the existing count files prepared to be merged in one matrix with the purpose of preparing the data for further steps:
suppressMessages(library(KnowSeq))
dir <- system.file("extdata", package="KnowSeq")
# Using read.csv for NCBI/GEO files and read.csv2 for ArrayExpress files
GSE74251 <- read.csv(paste(dir,"GSE74251.csv",sep = "/"))
GSE81593 <- read.csv(paste(dir,"GSE81593.csv",sep = "/"))
# Creating the csv file with the information about the counts files location and the labels
Run <- GSE74251$Run
Path <- paste(dir,"/countFiles/",GSE74251$Run,sep = "")
Class <- rep("Tumor", length(GSE74251$Run))
GSE74251CountsInfo <- data.frame(Run = Run, Path = Path, Class = Class)
Run <- GSE81593$Run
Path <- paste(dir,"/countFiles/",GSE81593$Run,sep = "")
Class <- rep("Control", length(GSE81593$Run))
GSE81593CountsInfo <- data.frame(Run = Run, Path = Path, Class = Class)
mergedCountsInfo <- rbind(GSE74251CountsInfo, GSE81593CountsInfo)
write.csv(mergedCountsInfo, file = "mergedCountsInfo.csv")
However, the user can run a complete example by executing the following code:
After the raw alignment step, a list of count files of the samples is available at ReferenceFiles/Samples/RNAseq/CountFiles. The next step in the pipeline implemented in this package is the processing of those count files in order to obtain a gene expression matrix by merging all of them.
After the alignment, as many count files as samples in the CSV used for the alignment have been created. In order to prepare the data for the DEGs analysis, it is important to merge all these files in one matrix that contains the genes Ensembl ID (or other IDs) in the rows and the name of the samples in the columns. To carry this out, the function countsToMatrix is available. This function reads all count files and joints them in one matrix by using edgeR package [15]. To call the function it is only necessary a CSV with the information about the count files paths. The required CSV has to have the following format:
Run | Path | Class |
---|---|---|
SRR2753159 | ~/ReferenceFile/Count/SRR2753159/ | Tumor |
SRR2753162 | ~/ReferenceFile/Count/SRR2753162/ | Tumor |
SRR2827426 | ~/ReferenceFile/Count/SRR2827426/ | Healthy |
SRR2827427 | ~/ReferenceFile/Count/SRR2827427/ | Healthy |
The column Run is the name of the sample without .count, the column Path is the Path to the count file and the Class column is the labels of the samples. Furthermore, an example of this function is shown below:
# Merging in one matrix all the count files indicated inside the CSV file
countsInformation <- countsToMatrix("mergedCountsInfo.csv", extension = "count")
##
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR2753159/SRR2753159.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR2753160/SRR2753160.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR2753161/SRR2753161.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR2753162/SRR2753162.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR2753163/SRR2753163.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR3541296/SRR3541296.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR3541297/SRR3541297.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR3541298/SRR3541298.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR3541299/SRR3541299.count
## /tmp/RtmpVk6CqE/Rinst345e2f5ef2438f/KnowSeq/extdata/countFiles/SRR3541300/SRR3541300.count
## Merging 10 counts files...
# Exporting to independent variables the counts matrix and the labels
countsMatrix <- countsInformation$countsMatrix
labels <- countsInformation$labels
The function returns a list that contains the matrix with the merged counts and the labels of the samples. It is very important to store the labels in a new variable because as it will be required in several functions of KnowSeq.
This step is only required if the user wants to get the gene names and the annotation is retrieved with the information given by the ensembl webpage [16]. Normally, the counts matrix has the Ensembl Ids as gene identifier, but with this step, the Ensembl Ids are change by the gene names. However, the user can decide to keep its own annotation or the Ensembl Ids. For example, to achieve the gene names the function needs the current Ensembl Ids, and the reference Genome used would be the number 38. If the user wants a different annotation than the human annotation, the parameter notHSapiens has to be set to TRUE and the desired specie dataset from ensembl indicated in the parameter notHumandataset (i.e. “mm129s1svimj_gene_ensembl”). An example can be seen below:
## Getting annotation of the Homo Sapiens...
## Using reference genome 38.
# Downloading mus musculus annotation
myAnnotationMusMusculus <- getGenesAnnotation(rownames(countsMatrix),
notHSapiens = TRUE,notHumandataset = "mm129s1svimj_gene_ensembl")
## Downloading annotation mm129s1svimj_gene_ensembl...
##
## Connection error, trying again...
##
## Connection error, trying again...
Finally, once both the countsMatrix and the annotation are ready, it is time to convert those counts into gene expression values. For that, the function calculateGeneExpressionValues uses the cqn package to calculates the equivalent gene expression [17]. This function performs a conversion of counts into gene expression values, and changes the Ensembl Ids by the gene names if the parameter geneNames is equal to TRUE. An example of the use of this function is showed below:
# Calculating gene expression values matrix using the counts matrix
expressionMatrix <- calculateGeneExpressionValues(countsMatrix,myAnnotation,
genesNames = TRUE)
## Calculating gene expression values...
## RQ fit ..........
## SQN .
At this time of the pipeline, a function that plots the expression data and allows verifying if the data is well normalized can be used. This function has the purpose of joining all the important graphical representation of the pipeline in the same function and is called dataPlot. It is very easy to use because just by changing the parameter method many different representations can be achieved. In this case, in order to see the expression boxplot of each sample, the function has to be called with the parameter mode equal to “boxplot”. The labels are necessary to colour the different samples depending on the class of the samples. These colours can be selected by the user, by introducing in the parameter colours a vector with the name of the desired colours. The function also allows exporting the plots as PNG and PDF files.
# Plotting the boxplot of the expression of each samples for all the genes
dataPlot(expressionMatrix,labels,mode = "boxplot", toPNG = TRUE,
toPDF = TRUE)
## Creating PNG...
## Creating PDF...
A crucial step in this pipeline is the batch effect treatment. It is widely known that this is a crucial step in the omics data processing due to the intrinsic deviations that the data can present due to its origin, sequencing design, etc… Besides, when working with public data it is very difficult to know if exists a real batch effect among the selected datasets. This package allows removing batch effect if the batch groups are known by calling the function batchEffectRemoval, that makes use of sva package [18], with the parameter mode equal to “combat” [19]. This step allows obtaining an expression matrix with the batch effect treated by combat method. An example to do this is below:
# Removing batch effect by using combat and known batch groups
batchGroups <- c(1,1,1,1,2,2,1,2,1,2)
expressionMatrixCorrected <- batchEffectRemoval(expressionMatrix, labels,
batchGroups = batchGroups, method = "combat")
## Correcting batch effect by using combat method...
## Using the 'mean only' version of ComBat
## Found2batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
There is another method in the function that removes the batch effect that uses surrogate variable analysis or sva. The only requirement to use it is to set the parameter method equal to “sva”. This method returns a matrix with the batch effect corrected which has to be used as input of the function DEGsExtraction.
# Calculating the surrogate variable analysis to remove batch effect
expressionMatrixCorrected <- batchEffectRemoval(expressionMatrix, labels, method = "sva")
## Calculating sva model to batch effect correction...
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
There is a long way between the raw data and the DEGs extraction, for that in this step the samples have to have had a strong pre-processing step applied. At this point of the pipeline the DEGs existing among two or more classes will be extracted by using the novel parameter coverage (cov) along with limma R-bioc package [20]. The parameter cov represents the number of different pathologies that a certain gen is able to discern. By default, the parameter is set to 1, so all genes that has the capability to discern among the comparison of two classes would be selected as DEGs. To understand better this parameter, our multiclass study applied to different leukemia sub-types introduces it, and it’s publicly available [3].
The function DEGsExtraction receives an expression matrix, the labels of the samples and the restriction imposed for considering a gene as differential expressed gene. The function returns a list containing the table with statistical values of each DEGs and the expression matrix of the DEGs instead all of the genes. The call to the function is listed below:
# Extracting DEGs that pass the imposed restrictions
DEGsInformation <- DEGsExtraction(expressionMatrixCorrected, labels,
lfc = 1.0, pvalue = 0.01, number = 100, cov = 1)
## Two classes detected, applying limma biclass
topTable <- DEGsInformation$DEG_Results$DEGs_Table
DEGsMatrix <- DEGsInformation$DEG_Results$DEGs_Matrix
DEGs are genes that have a truly different expression among the studied classes, for that it is important to try to see graphically if those DEGs comply with this requirement. In order to provide a tool to perform this task, the function dataPlot encapsulate a set of graphs that allows plotting in different ways the expression of the DEGs.
dataPlot function also allows representing an ordered boxplot that internally orders the samples by class and plots a boxplot for each samples and for the first top 12 DEGs in this example. With this plot, the difference at gene expression level between the classes can be seen graphically. The code to reproduce this plot is the following:
# Plotting the expression of the first 12 DEGs for each of the samples in an ordered way
dataPlot(DEGsMatrix[1:12,],labels,mode = "orderedBoxplot",toPNG = FALSE,toPDF = FALSE)
In the previous boxplot the expression of a set of DEGs for each sample its showed, however it is interesting to see the differentiation at gene expression level for each of the top 12 genes used before separately. It is recommended to use this function with a low number of genes, because with a larger number the plot it is difficult to distinguish the information provided and R would not have enough memory to calculate the plot. For that, the function dataPlot with the mode genesBoxplot allows to do that by executing the next code:
# Plotting the expression of the first 12 DEGs separatelly for all the samples
dataPlot(DEGsMatrix[1:12,],labels,mode = "genesBoxplot",toPNG = FALSE,toPDF = FALSE)
Finally, it is possible to plot one of the most widespread visualization methods in the literature, the heatmap. By setting the parameter method to heatmap, the function calculates the heatmap for the given samples and classes. The code to do this is the same than for the previous boxplot but changing the method parameter:
Normally, in the literature, the last step in the pipeline for differential gene expression analysis is the DEGs extraction step. However, in this package a novel machine learning step is implemented with the purpose of giving to the user an automatic tool to assess the DEGs, and evaluate their robustness in the discernment among the studied pathologies. This library has three possible classification methodologies to take into account. These options are k-NN [21], SVM [22] and Random Forest [23], three of the most popular classifiers in the literature. Furthermore, it includes two different working procedures for each of them. The first one implements a cross-validation process, in order to assess the expected accuracy with different models and samples the DEGs with a specific number of folds. These functions return a list with 4 objects that contain the confusion matrices, the accuracy, the sensitivity and the specificity.
The second one is to assess a specific test dataset by using a classifier trained using the training dataset separately. Moreover, the function featureSelection allows performing a feature selection process by using, mRMR [24], Random Forest (as feature selector instead of classifier) or Disease Association based algorithms with the purpose of finding the best DEGs order to assess the data. Da-FS is a new novel method designed in KnowSeq with the purpose of giving to the expert a biological based feature selection method. This method makes use of targetValidation webplatform to acquire an association score for each DEGs with the required genetic disease, breast cancer in the example. This score takes values between 1 and 0, meaning 1 a total association and 0 no association. Therefore, the DEGs are sorted by this score, achieving a ranking in which in the first positions those DEGs with more biological relation to breast cancer are placed.
Moreover, targetValidation webplatform allows acquiring evidences that tie a gene with a certain disease. DA-RED-FS is a novel iterative method based on DA-FS that use these evidences to calculate the redundance between genes based on biological information. This redundances take values between 1 and 0. For example, if gene A has a redundance of 1 with gene B, means that all found evidences for gene A and a certain disease are also found in gene B and the disease. Likewise, if this redundance is 0, means that there are not any gene A evidences in gene B evidences. DA-RED-FS starts with an empty set of selected genes, \(S_G\), and a set of possible genes \(G\). In the first step, the gene with the highest DA score is selected. In the following steps, genes that verify the following equation are added to selected genes set.
\[ max_{g \in G - S_G} DA(g) - \frac{\sum_{g_i \in S_g} RED(g,g_i)}{|S_G|} \times DA(g) \]
Where \(RED(g,g_i)\) is the redundance between gene \(g\) and gen \(g_i\) and \(DA(g)\) is the DA score of gene \(g\). The algorithms ends when a certain number of genes are selected, this number can be fixed by the maxGenes parameter.
To invoke these functions, it is necessary an expression matrix with the samples in the rows and the genes in the columns and the labels of the samples, the genes that will be assessed and the number of fold in the case of the cross-validation function. In the case of the test functions, it is necessary the matrix and the labels for both the training and the test datasets:
DEGsMatrixML <- t(DEGsMatrix)
# Feature selection process with mRMR and RF
mrmrRanking <- featureSelection(DEGsMatrixML,labels,colnames(DEGsMatrixML), mode = "mrmr")
## Calculating the ranking of the most relevant genes by using mRMR algorithm...
## mRMR ranking: BCAS1 ATP2B4 RPS17P5 CXorf56 GALC VIM SCIN ROS1 SH2D2A ETV1 AGPS SNAI2 LAMC2 GNA15 MAGEC2 EHD3 GYG2 PRSS3 FSTL4 BIRC3 ADAM22 COL17A1 CCN5 CALCR DNASE1L1 ARHGAP44 TNC DLX3 FUT8 CCDC85A CDH1 EHD2 HOXC8 PRSS21 MATK ABCB4 DKK3 YBX2 COL9A2 RIPOR3 VCAN NFIX LY75 SLC7A2 CCT8L1P ERBB3 ARSD BARX2 APPBP2 DCN CNTN1 NLRP2 APBA2 TYMP MPPED2 TENM1 PRKCQ FUZ CHDH NEXMIF PLAUR CYP26B1 E2F2 DLEC1 FAS RAI14 EPHA3 AGPAT4 RCN1 ATP2C2 GAS7 CPS1 LAMA3 ARHGAP31 PRSS8 TIMP2 DGKA PLEKHG6 TRAF1 BTN3A1 EPN3 CP PRSS22 ABCC2 ME1 NGEF SYT13 SEZ6 PLEKHB1 CYFIP2 ABCC8 ZMYND12 SPATA20 CEACAM7 DAPK2 GRAMD1B IYD PHF21B SARM1 NDC1
## Calculating the ranking of the most relevant genes by using Random Forest algorithm...
## Random Forest ranking: FSTL4 CCT8L1P ATP2C2 ZMYND12 ARHGAP31 NEXMIF GYG2 COL17A1 DCN CDH1 VCAN SEZ6 FUZ TYMP DGKA MAGEC2 EPHA3 DLEC1 PLEKHB1 TENM1 PRKCQ NLRP2 DAPK2 SPATA20 BTN3A1 GALC CYFIP2 SNAI2 SH2D2A CALCR EHD2 DKK3 GAS7 RAI14 EPN3 BIRC3 ARHGAP44 ABCB4 SLC7A2 CPS1 LAMA3 PHF21B PRSS8 PRSS3 LY75 DNASE1L1 DLX3 PRSS21 E2F2 IYD FUT8 GNA15 CCDC85A ARSD BARX2 CYP26B1 CP GRAMD1B YBX2 SARM1 TIMP2 COL9A2 ERBB3 ME1 BCAS1 SCIN ETV1 EHD3 TNC ATP2B4 ABCC8 APPBP2 PRSS22 ABCC2 NGEF VIM RCN1 ROS1 NFIX CEACAM7 AGPS LAMC2 ADAM22 MATK MPPED2 PLAUR AGPAT4 NDC1 PLEKHG6 RPS17P5 CCN5 HOXC8 CXorf56 CHDH FAS RIPOR3 CNTN1 APBA2 SYT13 TRAF1
## Calculating ranking of biological relevant genes by using DA implementation...
## Getting annotation of the Homo Sapiens...
## Using reference genome 38.
## Obtaining scores for breast...
## Disease scores acquired successfully!
## Disease Association ranking: BCAS1 RPS17P5 GALC VIM SCIN ROS1 SH2D2A ETV1
## 0 0 0 0 0 0 0 0
## AGPS SNAI2 LAMC2 GNA15 MAGEC2 EHD3 GYG2 PRSS3
## 0 0 0 0 0 0 0 0
## FSTL4 BIRC3 ADAM22 COL17A1 CCN5 CALCR DNASE1L1 ARHGAP44
## 0 0 0 0 0 0 0 0
## TNC DLX3 ATP2B4 PLEKHB1 FUT8 CCDC85A CDH1 EHD2
## 0 0 0 0 0 0 0 0
## HOXC8 PRSS21 MATK ABCB4 DKK3 YBX2 COL9A2 RIPOR3
## 0 0 0 0 0 0 0 0
## VCAN NFIX ABCC8 LY75 SLC7A2 CCT8L1P ERBB3 ARSD
## 0 0 0 0 0 0 0 0
## BARX2 APPBP2 DCN CNTN1 NLRP2 CXorf56 APBA2 TYMP
## 0 0 0 0 0 0 0 0
## SARM1 MPPED2 TENM1 PRKCQ FUZ CHDH NEXMIF PLAUR
## 0 0 0 0 0 0 0 0
## CYP26B1 E2F2 DLEC1 FAS RAI14 EPHA3 AGPAT4 RCN1
## 0 0 0 0 0 0 0 0
## ATP2C2 GAS7 CPS1 LAMA3 NDC1 CEACAM7 DAPK2 SPATA20
## 0 0 0 0 0 0 0 0
## ARHGAP31 PRSS8 TIMP2 ZMYND12 SYT13 DGKA PLEKHG6 TRAF1
## 0 0 0 0 0 0 0 0
## CYFIP2 BTN3A1 SEZ6 EPN3 IYD CP PRSS22 ABCC2
## 0 0 0 0 0 0 0 0
## GRAMD1B PHF21B ME1 NGEF
## 0 0 0 0
# CV functions with k-NN, SVM and RF
results_cv_knn <- knn_trn(DEGsMatrixML,labels,names(mrmrRanking)[1:10],5)
## Tuning the optimal K...
## Loading required package: ggplot2
## Loading required package: lattice
## Optimal K: 7
## Running K-Fold Cross-Validation...
## Training fold 1...
## Running K-Fold Cross-Validation...
## Training fold 2...
## Running K-Fold Cross-Validation...
## Training fold 3...
## Running K-Fold Cross-Validation...
## Training fold 4...
## Running K-Fold Cross-Validation...
## Training fold 5...
## Classification done successfully!
## Tuning the optimal C and G...
## Optimal cost: 0.25
## Optimal gamma: 0.9
## Training fold 1...
## Training fold 2...
## Training fold 3...
## Training fold 4...
## Training fold 5...
## Classification done successfully!
## Tuning the optimal mtry...
## Training fold 1...
## Training fold 2...
## Training fold 3...
## Training fold 4...
## Training fold 5...
## Classification done successfully!
It is important to show graphically the results of the classifiers and for that purpose, the function dataPlot implements some methods. Concretely, to plot the accuracy, the sensitivity or the specificity reached by the classifiers, the function dataPlot has to be run with the parameter method equal to classResults. This method generated as many random colors as folds or simulations in the rows of the matrix passed to the function but, through the parameter colors a vector of desired colors can be specified. For the legend, the function uses the rownames of the input matrix but these names can be changed with the parameter legend. An example of this method is showed below:
# Plotting the accuracy of all the folds evaluated in the CV process
dataPlot(rbind(results_cv_knn$accuracy$meanAccuracy,results_cv_knn$accuracy$standardDeviation),mode = "classResults", legend = c("Mean Acc","Standard Deviation"),
main = "Mean Accuracy with k-NN", xlab = "Genes", ylab = "Accuracy")
# Plotting the sensitivity of all the folds evaluated in the CV process
dataPlot(rbind(results_cv_knn$sensitivity$meanSensitivity,results_cv_knn$sensitivity$standardDeviation),mode = "classResults", legend = c("Mean Sens","Standard Deviation"),
main = "Mean Sensitivity with k-NN", xlab = "Genes", ylab = "Sensitivity")
# Plotting the specificity of all the folds evaluated in the CV process
dataPlot(rbind(results_cv_knn$specificity$meanSpecificity,results_cv_knn$specificity$standardDeviation),mode = "classResults", legend = c("Mean Spec","Standard Deviation"),
main = "Mean Specificity with k-NN", xlab = "Genes", ylab = "Specificity")
# Plotting all the metrics depending on the number of used DEGs in the CV process
dataPlot(results_cv_knn,labels,mode = "heatmapResults")
Furthermore, the function dataPlot counts with another similar mode to the previous but this time to represents confusion matrices. This mode is called confusionMatrix and allows creating graphically a confusion matrix with the most important statistical measures. The following code allows doing this:
# Plotting the confusion matrix with the sum of the confusion matrices
# of each folds evaluated in the CV process
allCfMats <- results_cv_knn$cfMats[[1]]$table + results_cv_knn$cfMats[[2]]$table +
results_cv_knn$cfMats[[3]]$table + results_cv_knn$cfMats[[4]]$table +
results_cv_knn$cfMats[[5]]$table
dataPlot(allCfMats,labels,mode = "confusionMatrix")
# Test functions with k-NN, SVM and RF
trainingMatrix <- DEGsMatrixML[c(1:4,6:9),]
trainingLabels <- labels[c(1:4,6:9)]
testMatrix <- DEGsMatrixML[c(5,10),]
testLabels <- labels[c(5,10)]
results_test_knn <- knn_test(trainingMatrix, trainingLabels, testMatrix,
testLabels, names(mrmrRanking)[1:10], bestK = results_cv_knn$bestK)
## Testing with 1 variables...
## Testing with 2 variables...
## Testing with 3 variables...
## Testing with 4 variables...
## Testing with 5 variables...
## Testing with 6 variables...
## Testing with 7 variables...
## Testing with 8 variables...
## Testing with 9 variables...
## Testing with 10 variables...
## Classification done successfully!
results_test_svm <- svm_test(trainingMatrix, trainingLabels, testMatrix,
testLabels, rfRanking[1:10], bestParameters = results_cv_svm$bestParameters)
## Testing with 1 variables...
## Testing with 2 variables...
## Testing with 3 variables...
## Testing with 4 variables...
## Testing with 5 variables...
## Testing with 6 variables...
## Testing with 7 variables...
## Testing with 8 variables...
## Testing with 9 variables...
## Testing with 10 variables...
## Classification done successfully!
results_test_rf <- rf_test(trainingMatrix, trainingLabels, testMatrix,
testLabels, colnames(DEGsMatrixML)[1:10], results_cv_rf$bestParameters)
## Testing with 1 variables...
## Testing with 2 variables...
## Testing with 3 variables...
## Testing with 4 variables...
## Testing with 5 variables...
## Testing with 6 variables...
## Testing with 7 variables...
## Testing with 8 variables...
## Testing with 9 variables...
## Testing with 10 variables...
## Classification done successfully!
# Plotting the accuracy achieved in the test process
dataPlot(results_test_knn$accVector,mode = "classResults",
main = "Accuracy with k-NN", xlab = "Genes", ylab = "Accuracy")
The main goal of the previous pipeline is the extraction of biological relevant information from the DEGs. For that, this package provides a set of tools that allows doing it. The last step of the pipeline englobes all the available tools in KnowSeq for DEGs enrichment, where three different approaches can be taken. The gene ontology information, the pathway visualization and the relationship between the DEGs and diseases related to the studied pathologies.
Gene ontology (GO) provides information about the biological functions of the genes. In order to complete this pipeline, it is important to know if the DEGs have functions related with the studied pathologies. In this sense, this package brings the possibility to know the GOs from the three different ontologies (BP, MF and CC) by using the function geneOntologyEnrichment that internally uses information from DAVID web-platform [20]. The function returns a list that contains a matrix for each ontology and a matrix with the GOs of the three ontologies together. Moreover, the matrices have different statistical measures and the description of the functionality of each GO.
# Retrieving the GO information from the three different ontologies
GOsList <- geneOntologyEnrichment(names(mrmrRanking)[1:10], geneType='GENE_SYMBOL', pvalCutOff=0.1)
## Getting gene symbols...Getting annotation of the Homo Sapiens...
## Using reference genome 38.
## Retrieving Gene Ontology terms related to the list of DEGs...
## [1] "Empty GOTERM_CC_ALL"
For example, in this example, the top 10 GOs from the BP ontology for the extracted DEGs are shown in the following image.
Another important step in the enrichment methodology in this pipeline is the related pathway information. The function uses the DEGs to retrieve pathways with any relation with DEGs. For that, DEGsToPathways makes use of the information gathered from KEGG database in order to supply as more details as possible .
## Retrieving information about KEGG pathways...
## KEGG_Path Name
## 1 map04020 Calcium signaling pathway
## 2 map04022 cGMP-PKG signaling pathway
## 3 map04024 cAMP signaling pathway
## 4 map04261 Adrenergic signaling in cardiomyocytes
## 5 map04925 Aldosterone synthesis and secretion
## 6 map04961 Endocrine and other factor-regulated calcium reabsorption
## 7 map04970 Salivary secretion
## 8 map04972 Pancreatic secretion
## 9 map04978 Mineral absorption
## 10 map00600 nothing
## 11 map01100 nothing
## 12 map04142 Lysosome
## 13 map05169 Epstein-Barr virus infection
## 14 map05206 MicroRNAs in cancer
## 15 map04666 Fc gamma R-mediated phagocytosis
## 16 map04810 nothing
## 17 map05203 Viral carcinogenesis
## 18 map04370 VEGF signaling pathway
## 19 map05202 Transcriptional misregulation in cancer
## Description
## 1 Ca2+ that enters the cell from the outside is a principal source of signal Ca2+. Entry of Ca2+ is driven by the presence of a large electrochemical gradient across the plasma membrane. Cells use this external source of signal Ca2+ by activating various entry channels with widely different properties. The voltage-operated channels (VOCs) are found in excitable cells and generate the rapid Ca2+ fluxes that control fast cellular processes. There are many other Ca2+-entry channels, such as the receptor-operated channels (ROCs), for example the NMDA (N-methyl-D-aspartate) receptors (NMDARs) that respond to glutamate. There also are second-messenger-operated channels (SMOCs) and store-operated channels (SOCs).\n The other principal source of Ca2+ for signalling is the internal stores that are located primarily in the endoplasmic/sarcoplasmic reticulum (ER/SR), in which inositol-1,4,5-trisphosphate receptors (IP3Rs) or ryanodine receptors (RYRs) regulate the release of Ca2+. The principal activator of these channels is Ca2+ itself and this process of Ca2+-induced Ca2+ release is central to the mechanism of Ca2+ signalling. Various second messengers or modulators also control the release of Ca2+. IP3, which is generated by pathways using different isoforms of phospholipase C (PLCbeta, delta, epsilon, gamma and zeta), regulates the IP3Rs. Cyclic ADP-ribose (cADPR) releases Ca2+ via RYRs. Nicotinic acid adenine dinucleotide phosphate (NAADP) may activate a distinct Ca2+ release mechanism on separate acidic Ca2+ stores. Ca2+ release via the NAADP-sensitive mechanism may also feedback onto either RYRs or IP3Rs. cADPR and NAADP are generated by CD38. This enzyme might be sensitive to the cellular metabolism, as ATP and NADH inhibit it.\n The influx of Ca2+ from the environment or release from internal stores causes a very rapid and dramatic increase in cytoplasmic calcium concentration, which has been widely exploited for signal transduction. Some proteins, such as troponin C (TnC) involved in muscle contraction, directly bind to and sense Ca2+. However, in other cases Ca2+ is sensed through intermediate calcium sensors such as calmodulin (CALM).
## 2 Cyclic GMP (cGMP) is the intracellular second messenger that mediates the action of nitric oxide (NO) and natriuretic peptides (NPs), regulating a broad array of physiologic processes. The elevated intracellular cGMP level exerts its physiological action through two forms of cGMP-dependent protein kinase (PKG), cGMP-regulated phosphodiesterases (PDE2, PDE3) and cGMP-gated cation channels, among which PKGs might be the primary mediator. PKG1 isoform-specific activation of established substrates leads to reduction of cytosolic calcium concentration and/or decrease in the sensitivity of myofilaments to Ca2+ (Ca2+-desensitization), resulting in smooth muscle relaxation. In cardiac myocyte, PKG directly phosphorylates a member of the transient potential receptor canonical channel family, TRPC6, suppressing this nonselective ion channel's Ca2+ conductance, G-alpha-q agonist-induced NFAT activation, and myocyte hypertrophic responses. PKG also opens mitochondrial ATP-sensitive K+ (mitoKATP) channels and subsequent release of ROS triggers cardioprotection.
## 3 cAMP is one of the most common and universal second messengers, and its formation is promoted by adenylyl cyclase (AC) activation after ligation of G protein-coupled receptors (GPCRs) by ligands including hormones, neurotransmitters, and other signaling molecules. cAMP regulates pivotal physiologic processes including metabolism, secretion, calcium homeostasis, muscle contraction, cell fate, and gene transcription. cAMP acts directly on three main targets: protein kinase A (PKA), the exchange protein activated by cAMP (Epac), and cyclic nucleotide-gated ion channels (CNGCs). PKA modulates, via phosphorylation, a number of cellular substrates, including transcription factors, ion channels, transporters, exchangers, intracellular Ca2+ -handling proteins, and the contractile machinery. Epac proteins function as guanine nucleotide exchange factors (GEFs) for both Rap1 and Rap2. Various effector proteins, including adaptor proteins implicated in modulation of the actin cytoskeleton, regulators of G proteins of the Rho family, and phospholipases, relay signaling downstream from Rap.
## 4 Cardiac myocytes express at least six subtypes of adrenergic receptor (AR) which include three subtypes of beta-AR (beta-1, beta-2, beta-3) and three subtypes of the alpha-1-AR (alpha-1A, alpha-1B, and alpha-1C). In the human heart the beta-1-AR is the pre- dominate receptor. Acute sympathetic stimulation of cardiac beta-1-ARs induces positive inotropic and chronotropic effects, the most effective mechanism to acutely increase output of the heart, by coupling to Gs, formation of cAMP by adenylyl cyclase (AC), and PKA- dependent phosphorylation of various target proteins (e.g., ryanodine receptor [RyR]; phospholamban [PLB], troponin I [TnI], and the L-type Ca2+ channel [LTCC]). Chronic beta-1-AR stimulation is detrimental and induces cardiomyocyte hypertrophy and apoptosis. beta-2-AR coupled to Gs exerts a proapoptotic action as well as beta-1-AR, while beta-2-AR coupled to Gi exerts an antiapoptotic action.
## 5 Aldosterone is a steroid hormone synthesized in and secreted from the outer layer of the adrenal cortex, the zona glomerulosa. Aldosterone plays an important role in the regulation of systemic blood pressure through the absorption of sodium and water. Angiotensin II (Ang II), potassium (K+) and adrenocorticotropin (ACTH) are the main extracellular stimuli which regulate aldosterone secretion. These physiological agonists all converge on two major intracellular signaling pathways: calcium (Ca2+) mobilization and an increase in cAMP production. The increase in cytosolic calcium levels activates calcium/calmodulin- dependent protein kinases (CaMK), and the increased cAMP levels stimulate the activity of cAMP-dependent protein kinase, or protein kinase A (PKA). The activated CaMK, and possibly PKA, activates transcription factors (NURR1 and NGF1B, CREB) to induce StAR and CYP11B2 expression, the early and late rate- limiting steps in aldosterone biosynthesis, respectively, thereby stimulating aldosterone secretion.
## 6 Calcium (Ca2+) is essential for numerous physiological functions including intracellular signalling processes, neuronal excitability, muscle contraction and bone formation. Therefore, its homeostasis is finely maintained through the coordination of intestinal absorption, renal reabsorption, and bone resorption. In kidney, the late part of the distal convoluted tubule (DCT) and the connecting tubule (CNT) are the site of active Ca2+ transport and precisely regulate Ca2+ reabsorption. Following Ca2+ entry through TRPV5, Ca2+ bound to calbindin-D28K diffuses to the basolateral side, where it is extruded into the blood compartment through NCX1 and to a lesser extent PMCA1b. In the urinary compartment, both klotho and tissue kallikrein (TK) increase the apical abundance of TRPV5. In the blood compartment, PTH, 1,25(OH)2D3 and estrogen increase the transcription and protein expression of the luminal Ca2+ channels, calbindins, and the extrusion systems.
## 7 Saliva has manifold functions in maintaining the integrity of the oral tissues, in protecting teeth from caries, in the tasting and ingestion of food, in speech and in the tolerance of tenures, for example. Salivary secretion occurs in response to stimulation by neurotransmitters released from autonomic nerve endings. There are two secretory pathways: protein exocytosis and fluid secretion. Sympathetic stimulation leads to the activation of adenylate cyclase and accumulation of intracellular cAMP. The elevation of cAMP causes the secretion of proteins such as amylase and mucin. In contrast, parasympathetic stimulation activates phospholipase C and causes the elevation of intracellular Ca2+, which leads to fluid secretion; that is, water and ion transport. Ca2+ also induces amylase secretion, but the amount is smaller than that induced by cAMP.
## 8 The pancreas performs both exocrine and endocrine functions. The exocrine pancreas consists of two parts, the acinar and duct cells. The primary functions of pancreatic acinar cells are to synthesize and secrete digestive enzymes. Stimulation of the cell by secretagogues such as acetylcholine (ACh) and cholecystokinin (CCK) causes the generation of an intracellular Ca2+ signal. This signal, in turn, triggers the fusion of the zymogen granules with the apical plasma membrane, leading to the polarised secretion of the enzymes. The major task of pancreatic duct cells is the secretion of fluid and bicarbonate ions (HCO3-), which neutralize the acidity of gastric contents that enter the duodenum. An increase in intracellular cAMP by secretin is one of the major signals of pancreatic HCO3- secretion. Activation of the CFTR Cl- channel and the CFTR-dependent Cl-/HCO3- exchange activities is responsible for cAMP-induced HCO3- secretion.
## 9 Minerals are one of the five fundamental groups of nutrients needed to sustain life. Of the minerals, calcium plays innumerable roles in our bodies, serving as a main component of bone as well as an intracellular messenger in muscle contraction/relaxation, neural networks, the immune system, and endocrine/exocrine cells. Iron, copper, and other metals are required for redox reactions (as cofactors) and for oxygen transport and binding (in hemoglobin and myoglobin). Many enzymes require specific metal atoms to complete their catalytic functions. Animal tissues need moderate quantities of some elements (Ca, P, K, Na, Mg, S, and Cl) and trace amounts of others (Mn, Fe, I, Co, Cr, Cu, Zn, and Se). The minerals are absorbed by either passive or active transport systems through the intestinal mucosa, often using specialized transport proteins, such as ferritin for Fe3+ and vitamin D-induced protein for calcium.
## 10 nothing
## 11 nothing
## 12 Lysosomes are membrane-delimited organelles in animal cells serving as the cell's main digestive compartment to which all sorts of macromolecules are delivered for degradation. They contain more than 40 hydrolases in an acidic environment (pH of about 5). After synthesis in the ER, lysosomal enzymes are decorated with mannose-6-phosphate residues, which are recognized by mannose-6-phosphate receptors in the trans-Golgi network. They are packaged into clathrin-coated vesicles and are transported to late endosomes. Substances for digestion are acquired by the lysosomes via a series of processes including endocytosis, phagocytosis, and autophagy.
## 13 Epstein-Barr virus (EBV) is a gamma-herpes virus that widely infects human populations predominantly at an early age but remains mostly asymptomatic. EBV has been linked to a wide spectrum of human malignancies, including nasopharyngeal carcinoma and other hematologic cancers, like Hodgkin's lymphoma, Burkitt's lymphoma (BL), B-cell immunoblastic lymphoma in HIV patients, and posttransplant-associated lymphoproliferative diseases. EBV has the unique ability to establish life-long latent infection in primary human B lymphocytes. During latent infection, EBV expresses a small subset of genes, including 6 nuclear antigens (EBNA-1, -2, -3A, -3B, -3C, and -LP), 3 latent membrane proteins (LMP-1, -2A, and -2B), 2 small noncoding RNAs (EBER-1 and 2). On the basis of these latent gene expression, three different latency patterns associated with the types of cancers are recognized.
## 14 MicroRNA (miRNA) is a cluster of small non-encoding RNA molecules of 21 - 23 nucleotides in length, which controls gene expression post-transcriptionally either via the degradation of target mRNAs or the inhibition of protein translation. Using high-throughput profiling, dysregulation of miRNAs has been widely observed in different stages of cancer. The upregulation (overexpression) of specific miRNAs could lead to the repression of tumor suppressor gene expression, and conversely the downregulation of specific miRNAs could result in an increase of oncogene expression; both these situations induce subsequent malignant effects on cell proliferation, differentiation, and apoptosis that lead to tumor growth and progress. The miRNA signatures of cancer observed in various studies differ significantly. These inconsistencies occur due to the differences in the study populations and methodologies used. This pathway map shows the summarized results from various studies in 9 cancers, each of which is presented in a review article.
## 15 Phagocytosis plays an essential role in host-defense mechanisms through the uptake and destruction of infectious pathogens. Specialized cell types including macrophages, neutrophils, and monocytes take part in this process in higher organisms. After opsonization with antibodies (IgG), foreign extracellular materials are recognized by Fc gamma receptors. Cross-linking of Fc gamma receptors initiates a variety of signals mediated by tyrosine phosphorylation of multiple proteins, which lead through the actin cytoskeleton rearrangements and membrane remodeling to the formation of phagosomes. Nascent phagosomes undergo a process of maturation that involves fusion with lysosomes. The acquisition of lysosomal proteases and release of reactive oxygen species are crucial for digestion of engulfed materials in phagosomes.
## 16 nothing
## 17 There is a strong association between viruses and the development of human malignancies. We now know that at least six human viruses, Epstein-Barr virus (EBV), hepatitis B virus (HBV), hepatitis C virus (HCV), human papilloma virus (HPV), human T-cell lymphotropic virus (HTLV-1) and Kaposi's associated sarcoma virus (KSHV) contribute to 10-15% of the cancers worldwide. Via expression of many potent oncoproteins, these tumor viruses promote an aberrant cell-proliferation via modulating cellular cell-signaling pathways and escape from cellular defense system such as blocking apoptosis. Human tumor virus oncoproteins can also disrupt pathways that are necessary for the maintenance of the integrity of host cellular genome. Viruses that encode such activities can contribute to initiation as well as progression of human cancers.
## 18 There is now much evidence that VEGFR-2 is the major mediator of VEGF-driven responses in endothelial cells and it is considered to be a crucial signal transducer in both physiologic and pathologic angiogenesis. The binding of VEGF to VEGFR-2 leads to a cascade of different signaling pathways, resulting in the up-regulation of genes involved in mediating the proliferation and migration of endothelial cells and promoting their survival and vascular permeability. For example, the binding of VEGF to VEGFR-2 leads to dimerization of the receptor, followed by intracellular activation of the PLCgamma;PKC-Raf kinase-MEK-mitogen-activated protein kinase (MAPK) pathway and subsequent initiation of DNA synthesis and cell growth, whereas activation of the phosphatidylinositol 3' -kinase (PI3K)-Akt pathway leads to increased endothelial-cell survival. Activation of PI3K, FAK, and p38 MAPK is implicated in cell migration signaling.
## 19 In tumor cells, genes encoding transcription factors (TFs) are often amplified, deleted, rearranged via chromosomal translocation and inversion, or subjected to point mutations that result in a gain- or loss-of- function. In hematopoietic cancers and solid tumors, the translocations and inversions increase or deregulate transcription of the oncogene. Recurrent chromosome translocations generate novel fusion oncoproteins, which are common in myeloid cancers and soft-tissue sarcomas. The fusion proteins have aberrant transcriptional function compared to their wild-type counterparts. These fusion transcription factors alter expression of target genes, and thereby result in a variety of altered cellular properties that contribute to the tumourigenic process.
## Class Genes
## 1 Environmental Information Processing; Signal transduction ATP2B4
## 2 Environmental Information Processing; Signal transduction ATP2B4
## 3 Environmental Information Processing; Signal transduction ATP2B4
## 4 Organismal Systems; Circulatory system ATP2B4
## 5 Organismal Systems; Endocrine system ATP2B4
## 6 Organismal Systems; Excretory system ATP2B4
## 7 Organismal Systems; Digestive system ATP2B4
## 8 Organismal Systems; Digestive system ATP2B4
## 9 Organismal Systems; Digestive system ATP2B4
## 10 Metabolism; Lipid metabolism GALC
## 11 nothing GALC
## 12 Cellular Processes; Transport and catabolism GALC
## 13 Human Diseases; Infectious disease: viral VIM
## 14 Human Diseases; Cancer: overview VIM
## 15 Organismal Systems; Immune system SCIN
## 16 Cellular Processes; Cell motility SCIN
## 17 Human Diseases; Cancer: overview SCIN
## 18 Environmental Information Processing; Signal transduction SH2D2A
## 19 Human Diseases; Cancer: overview ETV1
Most of these steps can be displayed using a single function, knowseqReport, that starting from an expression matrix (or counts matrix) and the labels of the samples, will follow part of the pipeline explained above and will display the result in a report in html format.
For the obtention of this report a DEGs extraction using limma package will be performed. Then, a feature selection process will be carried out, where the used algorithm can be set by the featureSelectionMode parameter, along with the visualizations steps that have been described previously.
Following, the machine learning process starts, where the classification algorithms and the classification metrics to be displayed can be chossen by the user by clasifAlgs and metrics parameters.
Finally DEGs enrichment is obtained, showing Gene Ontologies and found evidences for related diseases for each gene. It is possible to obtain evidences for a certain disease, specifying it by the disease parameter.
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] caret_6.0-92 lattice_0.20-45 ggplot2_3.3.6
## [4] KnowSeq_1.10.2 cqn_1.42.0 quantreg_5.93
## [7] SparseM_1.81 preprocessCore_1.58.0 nor1mix_1.3-0
## [10] mclust_5.4.10
##
## loaded via a namespace (and not attached):
## [1] backports_1.4.1 Hmisc_4.7-0 plyr_1.8.7
## [4] BiocParallel_1.30.3 listenv_0.8.0 GenomeInfoDb_1.32.2
## [7] sva_3.44.0 digest_0.6.29 foreach_1.5.2
## [10] htmltools_0.5.2 fansi_1.0.3 checkmate_2.1.0
## [13] magrittr_2.0.3 memoise_2.0.1 cluster_2.1.3
## [16] limma_3.52.2 recipes_1.0.0 globals_0.15.1
## [19] Biostrings_2.64.0 annotate_1.74.0 gower_1.0.0
## [22] matrixStats_0.62.0 R.utils_2.12.0 hardhat_1.2.0
## [25] jpeg_0.1-9 colorspace_2.0-3 blob_1.2.3
## [28] xfun_0.31 dplyr_1.0.9 crayon_1.5.1
## [31] RCurl_1.98-1.7 jsonlite_1.8.0 genefilter_1.78.0
## [34] survival_3.3-1 iterators_1.0.14 glue_1.6.2
## [37] gtable_0.3.0 ipred_0.9-13 zlibbioc_1.42.0
## [40] XVector_0.36.0 MatrixModels_0.5-0 kernlab_0.9-31
## [43] future.apply_1.9.0 BiocGenerics_0.42.0 scales_1.2.0
## [46] DBI_1.1.3 edgeR_3.38.1 Rcpp_1.0.8.3
## [49] htmlTable_2.4.0 xtable_1.8-4 foreign_0.8-82
## [52] bit_4.0.4 proxy_0.4-27 Formula_1.2-4
## [55] stats4_4.2.0 lava_1.6.10 prodlim_2019.11.13
## [58] htmlwidgets_1.5.4 httr_1.4.3 RColorBrewer_1.1-3
## [61] ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3
## [64] XML_3.99-0.10 R.methodsS3_1.8.2 nnet_7.3-17
## [67] sass_0.4.1 locfit_1.5-9.5 utf8_1.2.2
## [70] labeling_0.4.2 tidyselect_1.1.2 rlang_1.0.3
## [73] reshape2_1.4.4 AnnotationDbi_1.58.0 munsell_0.5.0
## [76] tools_4.2.0 cachem_1.0.6 cli_3.3.0
## [79] generics_0.1.2 RSQLite_2.2.14 evaluate_0.15
## [82] stringr_1.4.0 fastmap_1.1.0 yaml_2.3.5
## [85] ModelMetrics_1.2.2.2 knitr_1.39 bit64_4.0.5
## [88] purrr_0.3.4 randomForest_4.7-1.1 KEGGREST_1.36.2
## [91] future_1.26.1 nlme_3.1-158 praznik_11.0.0
## [94] R.oo_1.25.0 rstudioapi_0.13 compiler_4.2.0
## [97] curl_4.3.2 png_0.1-7 e1071_1.7-11
## [100] tibble_3.1.7 bslib_0.3.1 stringi_1.7.6
## [103] highr_0.9 Matrix_1.4-1 vctrs_0.4.1
## [106] pillar_1.7.0 lifecycle_1.0.1 jquerylib_0.1.4
## [109] data.table_1.14.2 bitops_1.0-7 R6_2.5.1
## [112] latticeExtra_0.6-29 gridExtra_2.3 IRanges_2.30.0
## [115] parallelly_1.32.0 codetools_0.2-18 MASS_7.3-57
## [118] assertthat_0.2.1 withr_2.5.0 S4Vectors_0.34.0
## [121] GenomeInfoDbData_1.2.8 rlist_0.4.6.2 mgcv_1.8-40
## [124] parallel_4.2.0 grid_4.2.0 rpart_4.1.16
## [127] timeDate_3043.102 class_7.3-20 rmarkdown_2.14
## [130] pROC_1.18.0 Biobase_2.56.0 lubridate_1.8.0
## [133] base64enc_0.1-3
Castillo, D., Gálvez, J. M., Herrera, L. J., San Román, B., Rojas, F., & Rojas, I. (2017). Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC bioinformatics, 18(1), 506.
Gálvez, J. M., Castillo, D., Herrera, L. J., San Roman, B., Valenzuela, O., Ortuno, F. M., & Rojas, I. (2018). Multiclass classification for skin cancer profiling based on the integration of heterogeneous gene expression series. PloS one, 13(5), e0196836.
Castillo, D., Galvez, J. M., Herrera, L. J., Rojas, F., Valenzuela, O., Caba, O., … & Rojas, I. (2019). Leukemia multiclass assessment and classification from Microarray and RNA-seq technologies integration at gene expression level. PloS one, 14(2), e0212127.
Gálvez, J. M., Castillo, D., Herrera, L. J., Valenzuela, O., Caba, O., Prados, J. C., … & Rojas, I. (2019). Towards Improving Skin Cancer Diagnosis by Integrating Microarray and RNA-seq Datasets. IEEE Journal of Biomedical and Health Informatics.
Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., … & Yefanov, A. (2012). NCBI GEO: archive for functional genomics data sets—update. Nucleic acids research, 41(D1), D991-D995.
Kolesnikov, N., Hastings, E., Keays, M., Melnichuk, O., Tang, Y. A., Williams, E., … & Megy, K. (2014). ArrayExpress update—simplifying data submissions. Nucleic acids research, 43(D1), D1113-D1116.
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357.
Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357.
Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), 166-169.
Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4), 417.
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
Sherry, S., & Xiao, C. (2012, January). Ncbi sra toolkit technology for next generation sequence data. In Plant and Animal Genome XX Conference (January 14-18, 2012). Plant and Animal Genome.
Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.
Cunningham, F., Achuthan, P., Akanni, W., Allen, J., … , & Huber, W. (2018). Ensembl 2019. Nucleic Acids Research, 47(D1), D745–D751.
Hansen, K. D., Irizarry, R. A., & Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, 13(2), 204-216.
Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE, Storey JD, Zhang Y, Torres LC (2019). sva: Surrogate Variable Analysis. R package version 3.34.0.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research, 43(7), e47-e47.
X Jiao, BT Sherman, R Stephens, MW Baseler, HC Lane, RA Lempicki. DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics (2012) 28 (13): 1805-1806. doi: 10.1093/bioinformatics/bts251
Luo, W., & Brouwer, C. (2013). Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 29(14), 1830-1831.
Goh, W. W. B., Wang, W., & Wong, L. (2017). Why batch effects matter in omics data, and how to avoid them. Trends in biotechnology, 35(6), 498-507.
Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02), 185-205.
Noble, W. S. (2006). What is a support vector machine?. Nature biotechnology, 24(12), 1565.
Parry, R. M., Jones, W., Stokes, T. H., Phan, J. H., Moffitt, R. A., Fang, H., … & Wang, M. D. (2010). k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. The pharmacogenomics journal, 10(4), 292.
Koscielny, G., An, P., Carvalho-Silva, D., Cham, J. A., Fumis, L., Gasparyan, R., … & Pierleoni, A. (2016). Open Targets: a platform for therapeutic target identification and validation. Nucleic acids research, 45(D1), D985-D994.