Contents

1 Introduction

Self-training Gene Clustering Pipeline (SGCP) is a framework for gene co-expression network construction and analysis. The goal in these networks is to group the genes in a way that those with similar expression pattern fall within the same network cluster, commonly called module. SGCP consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology (GO) information in a semi-supervised clustering method that further improves the quality of the computed modules. SGCP results in highly enriched modules. Preprint of manuscript describing SGCP in details is available in here.

1.1 SGCP installation

To install the package SGCP use the following. For more information please visit here). In this manual guide, SGCP also relies on two more packages; SummarizedExperiment and org.Hs.eg.db.

library(BiocManager)
BiocManager::install('SGCP')

Let’s start.

library(SGCP)

1.2 SGCP General Input

SGCP has three main input; gene expression, geneID, and genome wide annotation database.

  • expData is a matrix or a dataframe of size m*n where m and n are the number of genes and samples respectively and it can be either DNA-microarray or RNA-seq . In another words, in expData, rows and columns correspond to genes and samples respectively and the entry i,j is an expression value for gene i in sample j. SGCP does not perform any normalization or correction for batch effects and it is assumed that these preprocessing steps have been already performed.

  • geneID is a vector of size m where entry at index i denotes the gene identifier for gene i. Note that there is one-to-one correspondence between the rows in expData and geneID vector where index i in geneId indicates the gene identifier for row i in expData.

  • annotation_db the name of a genome wide annotation package of the organism of interest for gene ontology (GO) enrichment step. annotation_db must be installed by user prior using SGCP.

Here, are some important annotation_db along with its corresponding identifiers.

organism annotation_db gene identifier
Homo sapiens (Hs) org.Hs.eg.db Entrez Gene identifiers
Drosophila melanogaster (Dm) org.Dm.eg.db Entrez Gene identifiers
Rattus norvegicus (Rn) org.Rn.eg.db Entrez Gene identifiers
Mus musculus (Mm) org.Mm.eg.db Entrez Gene identifiers
Arabidopsis thaliana (At) org.At.tair.db TAIR identifiers

Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.

Note that SGCP depends on GOstats for GO enrichment, thus (geneID) and (annotation_db) must be compatible to this package standard.

1.2.1 SGCP Input Example

For illustrative purposes we will use an example of a gene expression provided with SGCP. This data originally represent 5000 genes in 57 samples for Homo sapiens, for more information see (Cheng et al.) Here, to ease the computation, 1000 genes that have the most variance selected with 5 samples have been selected.

For this input example we need to install the following packages.

The input example is organized as an object of SummarizedExperiment. The assay field shows the expression matrix in which rows and columns correspond to gene and samples respectively. rowData denotes the gene Entrez ids for the gene. Note that there is 1-to-1 correspondence between rows in assays and rowData fields, thus rowData at index i shows the gene Entrez id for gene at row i in assays field. We call this object cheng. annotation_db is org.Hs.eg.db, and is required to be installed if is not, see link. Let’s look at the input example.

library(SummarizedExperiment)
data(cheng)
cheng
## class: SummarizedExperiment 
## dim: 1500 10 
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
print("gene expression...")
## [1] "gene expression..."
print("rownames and colnames correspond to gene Entrez ids and sample names")
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
head(assay(cheng))
##           SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503        4.608516   4.822920  11.477016  11.164324  10.631023   5.372338
## 100462977   6.566498  11.170815  10.721761   7.120300  10.991193  12.889641
## 4878        6.925341   9.098948   9.550381   7.742856   7.539139   6.818301
## 4879        6.020677  10.545173   6.398164   8.415020   7.936110   5.492805
## 6192        9.256628   9.511192   4.276300   3.705296   3.392204   9.581542
## 9086        9.098932   9.110217   3.858846   3.077543   2.923131   9.426101
##           SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503        4.618841   4.628351   4.398288   4.686936
## 100462977  12.024684  12.295130   7.481746   6.661691
## 4878        6.181450   6.083118   9.985733   7.622827
## 4879        5.255798   6.266800   8.812797   8.678046
## 6192        9.211812   9.109689   9.309379   9.158489
## 9086        9.567030   9.557452   9.316588   9.059718
print(" \n gene ids...")
## [1] " \n gene ids..."
print("rownames are the gene Entrez ids")
## [1] "rownames are the gene Entrez ids"
head(rowData(cheng))
## DataFrame with 6 rows and 2 columns
##                SYMBOL    ENTREZID
##           <character> <character>
## 7503             XIST        7503
## 100462977    MTRNR2L1   100462977
## 4878             NPPA        4878
## 4879             NPPB        4879
## 6192           RPS4Y1        6192
## 9086           EIF1AY        9086

This data has the dimension of 1500 genes and 10 samples.

Now, we are ready to create the three main inputs. Using assay and rowData functions we can have access to expression matrix and gene Entrez ids respectively.

message("expData")
## expData
expData <- assay(cheng)
head(expData)
##           SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503        4.608516   4.822920  11.477016  11.164324  10.631023   5.372338
## 100462977   6.566498  11.170815  10.721761   7.120300  10.991193  12.889641
## 4878        6.925341   9.098948   9.550381   7.742856   7.539139   6.818301
## 4879        6.020677  10.545173   6.398164   8.415020   7.936110   5.492805
## 6192        9.256628   9.511192   4.276300   3.705296   3.392204   9.581542
## 9086        9.098932   9.110217   3.858846   3.077543   2.923131   9.426101
##           SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503        4.618841   4.628351   4.398288   4.686936
## 100462977  12.024684  12.295130   7.481746   6.661691
## 4878        6.181450   6.083118   9.985733   7.622827
## 4879        5.255798   6.266800   8.812797   8.678046
## 6192        9.211812   9.109689   9.309379   9.158489
## 9086        9.567030   9.557452   9.316588   9.059718
dim(expData)
## [1] 1500   10
message(" \n geneID")
##  
##  geneID
geneID <- rowData(cheng)
geneID <- geneID$ENTREZID
head(geneID)
## [1] "7503"      "100462977" "4878"      "4879"      "6192"      "9086"
length(geneID)
## [1] 1500
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
annotation_db <- "org.Hs.eg.db"

1.3 SGCP Pipeline Parameters and Workflow

SGCP is based on 5 main steps to produce the final modules. Each step offers parameters that can be adjusted by the user as follow. Each step is implemented in a single function.

  1. Network Construction step (adjacencyMatrix function)
    • calibration: boolean, default FALSE, if TRUE SGCP performs calibration step for more information see link.
    • norm: boolean, default TRUE, if TRUE SGCP divides each genes (rows) by its norm2.
    • tom: boolean, default TRUE, if TRUE SGCP adds TOM to the network.
    • saveAdja: boolean, default FALSE, if TRUE, the adjacency matrix will be saved .
    • adjaNameFile: string indicates the name of file for saving adjacency matrix.
    • hm: string indicates the name of file for saving adjacency matrix heat map.
  2. Network Clustering step (clustering function)
    • kopt: an integer, default NULL, denotes the optimal number of clusters k by the user.
    • method: either “relativeGap”, “secondOrderGap”, “additiveGap”, or NULL, default NULL. Defines the method for number of cluster.
    • func.GO: a function for gene ontology validation, default is sum.
    • func.conduct: a function for conductance validation, default is min.
    • maxIter: an integer, identifies the maximum number of iteration in kmeans.
    • numStart: an integer, identifies the number of start in kmeans.
    • saveOrig: boolean, default TRUE, if TRUE, keeps the transformed matrix.
    • n_egvec: either “all” or an integer, default = 100, indicates the number of columns of transformed matrix to be kept.
    • sil: boolean, default FALSE, if TRUE, calculates silhouette index per gene. at the end of this step, initial clusters are produced.
  3. Gene Ontology Enrichment step (geneOntology function)
    • direction: test direction, default c(“over”, “under”), for over-represented, or under-represented GO terms
    • ontology: GO ontologies, default c(“BP”, “CC”, “MF”), BP: Biological Process, CC: Cellular Component, MF: Molecular Function.
    • hgCutoff: a numeric value in (0,1) as the p-value baseline, default 0.05, GO terms smaller than hgCutoff value are kept.
    • cond: boolean, default TRUE, if TRUE conditional hypergeometric test is performed.
  4. Gene Semi-labeling step (semiLabeling function)
    • cutoff: a numeric in (0, 1) default NULL, is a base line for GO term significancy to identify remarkable and unremarkable genes.
    • percent: a number in (0,1) default 0.1, indicate the percentile for finding top GO terms.
    • stp: a number in (0,1) default 0.01, indicates increasing value to be added to percent parameter.
  5. Semi-supervised Classification step (semiSupevised function)
    • model: either “knn” or “lr” for classification model, knn: k nearest neighbors, lr: logistic regression.
    • kn: an integer default NULL indicating the number of neighbors in knn, if kn is NULL, then kn = 20 : (20 + 2 * k) if 2 * k < 30 otherwise 20 : 30, at the end of this step, final modules are produced.

At the end, SGCP performs on more step of Gene Ontology Enrichment on the final modules.

Detailed of the steps are available in the manuscript in SGCP. In Network Construction, user can identify of any of steps to be added to the network. In Network Clustering, If kopt is not null, SGCP will find clusters based on kopt. Otherwise, if method is not null, SGCP will pick k based on the selected method. Otherwise, if geneID and annotation_db is null, SGCP will determine the optimal method and its corresponding number of cluster based on conductance validation. It picks a method that func.conduct (default min) on its cluster is minimum. Otherwise, SGCP will use gene ontology validation (by default) to find the optimal method and its corresponding number of clusters. To this end, it performs gene ontology enrichment on the cluster with minimum conductance index per method and pick the one that has the maximum func.GO (default sum) over -log10 of p-values. In Gene Semi-labeling step, if cutoff is not NULL, SGCP considers genes associated to GO terms more significant than\(~\)cutoff\(~\) as remarkable. Otherwise, SGCP collects all GO terms from all clusters and picks the percent (default 0.1) mot significant GO terms among them. If Genes associated to these significant terms come from more than a single cluster, SGCP takes these genes as remarkable. Otherwise, it adds\(~\)stp\(~\)to\(~\)percent\(~\)and repeat this process until remarkable genes come from at least two clusters. In Semi-supervised Classification remarkable clusters are the clusters that have at least one remarkable gene.

2 Automatic Run, ezSGCP Function

ezSGCP function implements the SGCP pipeline in one function. Parameters are the same as here except calib corresponds to calibration parameter in Network Construction method_k, f.GO, f.conduct, maxIteration, numberStart parameters correspond to method , func.GO, func.conduct, maxIter, numStart in Network Clustering respectively. dir, onto, hgCut, condTest correspond to direction, ontology, hgCutoff, and cond in Gene Ontology Enrichment respectively. semilabel also is a boolean parameter, that if is FALSE, SGCP will stop after initial clusters. For full parameter description, see the help document.

Below shows how to call this function. Because, this may take up to 15 minutes to be completed, I have already stored the result as sgcp and commented the function. For your practice, uncomment the function and run it. In this call, I set the sil parameter to TRUE to get the gene silhouette index.

# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db, 
#               eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm= NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
##                Length Class      Mode   
## semilabel       1     -none-     logical
## clusterLabels   3     data.frame list   
## clustering     13     -none-     list   
## initial.GO      2     -none-     list   
## semiLabeling    2     -none-     list   
## semiSupervised  3     -none-     list   
## final.GO        2     -none-     list

If you run the function, you will see the following during the execution, which tells you the current state of ezSGCP function.

##  starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
##  it may take time...
## network is created, done!...

##  starting network clustering step...
## calculating normalized Laplacian 
##  it may take time...
## calculating eigenvalues/vectors 
##  it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3

##  Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method secondOrderGap...
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method additiveGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index 
##  it may take time...
## network clustering is done...

## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..

## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
##  it may take time
## Loading required package: ggplot2
## Loading required package: lattice
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      4      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list     
## 24-nearest neighbor model
## Training set outcome distribution:

##   1   2 
## 498 667 

## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
##   1   2 
## 130 205 
## semi-supervised done!..

## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

## ezSGCP done!..

As it is seen, SGCP print out information of each step. In this example, there are three potential number of clusters, 2, 3, and 5, and based on gene ontology validation, 2 is selected as the optimal number of cluster.

We found out SGCP performs well with default parameters in most of cases. However, users have option to change the setting according to their need. For instance, there are cases that calib increases the final module enrichment. Similarly, adding TOM is not always helpful. By default SGCP assumes that user does not know the exact number of cluster, nor does it know the method for it. Therefore, it uses gene ontology validation to identify the initial clusters. We highly recommend users to perform SGCP on the three potential metod “relativeGap”, “secondOrderGap”, “additiveGap”.

SGCP allows user to visualize the result. SGCP_ezPLOT takes sgcp result from ezSGCP function along with expData and geneID. It returns two files; excel and pdf depending on the initial call. User also can keep the plotting object by setting keep = TRUE. Here, we set silhouette_in to TRUE to see the silhouette index plot.

message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)

Note that in order to store files in xlsx format, excel must be installed on your system.

For the detailed parameter explanation visit the help document.

3 Step-By-Step Run, SGCP In Detail

The ezSGCP function explained above is a wrapper that calls several other SGCP function in the following order. Now, we will go through each step in detail using example data.

  1. adjacencyMatrix function, performs Network Construction step
  2. clustering function, performs Network Clustering step
  3. geneOntology function, performs Gene Ontology Enrichment step (produces initial clusters)
  4. semiLabeling function, performs Semi-labeling step
  5. semiSupervised function, performs Semi-supervised step
  6. geneOntology function, performs Gene Ontology Enrichment step (produces final modules)

Lets remove sgcp for space efficiency.

rm(sgcp)

Here, we will skip the detail of function’s input and output as it is described in the help document. SGCP allows user to visualize the PCA of the input gene expression data using SGCP_plot_pca function.

message("Plotting PCA of expression data")
## Plotting PCA of expression data
pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot", ps = .5)
print(pl)

3.1 adjacencyMatrix function

Network Construction step is implemented using adjacencyMatrix function. By default, this function divides each vector of genes by its norm 2 and used Gaussian kernel metric for similarity function. It then adds the information of the second order of the genes’ neighborhood in the form of TOM to the network. Code below shows how to use this function.

resAdja <- adjacencyMatrix(expData = expData, hm = NULL)
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
##  it may take time...
## network is created, done!...
resAdja[0:10, 0:5]
##               1            2            3            4            5
## 1  0.000000e+00 1.561704e-23 1.684533e-18 5.096042e-21 2.304447e-34
## 2  1.561704e-23 0.000000e+00 3.073427e-08 9.883634e-11 1.560804e-14
## 3  1.684533e-18 3.073427e-08 0.000000e+00 2.990119e-04 3.639579e-13
## 4  5.096042e-21 9.883634e-11 2.990119e-04 0.000000e+00 2.664746e-14
## 5  2.304447e-34 1.560804e-14 3.639579e-13 2.664746e-14 0.000000e+00
## 6  1.880926e-37 1.859059e-16 1.732464e-15 6.205653e-17 6.756836e-01
## 7  7.700277e-16 3.234158e-06 1.891155e-02 3.050233e-04 7.223951e-11
## 8  4.549591e-34 2.956936e-14 5.344484e-13 1.850700e-14 6.616417e-01
## 9  2.052374e-34 1.424767e-14 2.918285e-13 2.113249e-14 7.306804e-01
## 10 5.767117e-34 3.294000e-14 6.190725e-13 3.546475e-14 7.471861e-01

resAdja is the adjacency matrix of the gene co-expression network. We can visualize the heatmap of the adjacency matrix using\(~\)SGCP_plot_heatMap function as follow.To see the heatmap, uncomment the following line of codes.

#message("Plotting adjacency heatmap")
#pl <- SGCP_plot_heatMap(m = resAdja, tit = "Adjacency Heatmap", 
#                    xname = "genes", yname = "genes")
#print(pl)

3.2 clustering function

Network Clustering step is implemented using clustering function. This function takes the adjacency matrix , geneID, and annotation_db, and returns a list of following, depending on the initial call. I set the sil parameter to TRUE to get the silhouette index per gene.

  • dropped.indices: a vector of indices of dropped genes. Some genes may be noise, and therefore dropped throughout this function.

  • geneID: a vector of geneID.

  • method: indicates the selected method for number of cluster.

  • k: number of clusters.

  • Y: transformed matrix with 2*k columns.

  • X: eigenvalues correspond to 2*k columns in Y.

  • cluster: an object of class “kmeans”.

  • clusterLabels: a vector containing the cluster label per gene, there is a 1-to-1 correspondence between geneID and clusterLabes.

  • conductance: a list containing mean and median, and individual cluster conductance index for clusters per method. Index in clusterConductance field denotes the cluster label and the value shows the conductance index.

  • cvGOdf: a dataframe used for gene ontology validation. For each method, it returns the gene ontology enrichment result on the cluster with minimum conductance index.

  • cv: an string indicates the validation method for number of cluster, “cvGO”: if gene ontology validation used, “cvConductance”: if conductance validation used, “userMethod”: if user defined the method, “userkopt”: if user defines the kopt.

  • clusterNumberPlot: plotting object for relativeGap"“,”secondOrderGap“, and”additiveGap".

  • silhouette: a dataframe that indicates the silhouette indices for genes.

  • original: a list with matrix transformation and corresponding eigenvalues and n_egvec, where n_egvec top columns of transformation is kept.

Code below shows how to use this function. Again, since it takes time, I have commented the code, and use the storing resClus as the output of this function. For your practice uncomment and run it.

# resClus <- clustering(adjaMat = resAdja, geneID = geneID, annotation_db = annotation_db, 
#                       eff.egs = FALSE , saveOrig = FALSE, sil = TRUE)

data(resClus)
summary(resClus)
##                   Length Class      Mode     
## dropped.indices      0   -none-     numeric  
## geneID            1500   -none-     character
## method               1   -none-     character
## k                    1   -none-     numeric  
## Y                 6000   -none-     numeric  
## X                    4   -none-     numeric  
## cluster              9   kmeans     list     
## clusterLabels     1500   -none-     character
## conductance          3   -none-     list     
## cvGOdf               9   data.frame list     
## cv                   1   -none-     character
## clusterNumberPlot    3   -none-     list     
## silhouette           3   data.frame list
# lets drop noisy genes from expData
geneID <- resClus$geneID
if(length(resClus$dropped.indices)>0 ){ expData <- expData[-resClus$dropped.indices, ]}

# removing the adjacency matrix
rm(resAdja)

If you run the code, you will see the following will be printout for you.

## calculating normalized Laplacian 
##  it may take time...
## calculating eigenvalues/vectors 
##  it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3

##  Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method secondOrderGap...
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method additiveGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index 
##  it may take time...
## network clustering is done...

We saw that three methods (“relativeGap”, “secondOrderGap”, “additiveGap”) resulted in three distinct potential number of clusters. SGCP picked “secondOrderGap” after gene ontology validation which is 2. cv field is “cvGO”, which indicates that k is found based on gene ontology validation. In original field, you can have the n_egvec first columns ( eigenvectors) and eigenvalues of the transformation matrix. This might be useful for further investigation.

Now we can see the plot of PCA on the expression and transformed data with and without the labels using SGCP_plot_pca. For practice, uncomment the following and see the resulting PCAs.

message("Plotting PCA of expression data with label")
## Plotting PCA of expression data with label
# pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
message("Plptting PCA of transformed data without label")
## Plptting PCA of transformed data without label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
message("Plotting PCA of transformed data with label")
## Plotting PCA of transformed data with label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = resClus$clusterLabels, tit = "PCA plot with label", ps = .5)
# print(pl)

In resClus the plotting objects of methods of number of clusters is available.

if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
    message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
    print(resClus$clusterNumberPlot$relativeGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...

if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
    message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
    print(resClus$clusterNumberPlot$secondOrderGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...

if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
    message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
    print(resClus$clusterNumberPlot$additiveGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...

In SGCP, we can illustrate the conductance index per cluster using SGCP_plot_conductance function. Conductance index is calculated if both kopt and method are NULL. In other words, when one of these parameter is known, SGCP does not need perform validation to find the cluster and therefore will skip the conductance computation.

# checking if conductance index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
    message("plotting cluster conductance index...")
    pl <- SGCP_plot_conductance(conduct = resClus$conductance, 
                        tit = "Cluster Conductance Index", 
                        xname = "clusterLabel", yname = "conductance index")
    print(pl)}
## plotting cluster conductance index...

In here we obsereved that cluster with lower conductance index tend to have higher enrichment. This analysis confirms this observation. In clustering function, the cluster with minimum conductance for method relativeGap (cluster rg1), secondOrderGap (cluster sg4), and additiveGap (cluster ag1) are compared using their gene ontology enrichment. And among them cluster rg1 has higher enrichment. Interestingly, this cluster also has lower conductance index in compare with ag1, and sg4.

SGCP also can plot the silhouette index per gene if sil parameter is TRUE in clustering function.

# checking if silhouette index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
    message("plotting gene silhouette index...")
    pl <- SGCP_plot_silhouette(resClus$silhouette, tit = "Gene Silhouette Index",
                            xname = "genes", yname = "silhouette index")
    print(pl)}
## plotting gene silhouette index...

Silhouette index ranges between -1 and 1. Closer to 1, the better the gene is explained by the cluster. As it is seen, majority of genes have Silhouette index very close to 1 which means that genes are well-described by the clusters based on Silhouette index perspective. Interestingly, in worse case scenario, genes have zero silhouette index, and there is no negative index for any gene.

3.3 geneOntology function

Gene Ontology Enrichment step is implemented using geneOntology function. This function returns the gene ontology enrichment on the clusters as GOresults. It also returns the FinalGOTermGenes list which identifies the gene IDs correspond to the GO terms found GOresults. Code below shows how to use this function. Again, since it takes time, I have commented the code, and use the storing resInitialGO as the output of this function. For your practice uncomment and run it.

# resInitialGO <- geneOntology(geneUniv = geneID, clusLab = resClus$clusterLabels, 
#                              annotation_db = annotation_db)

data(resInitialGO)

head(resInitialGO$GOresults)
##   clusterNum  GOtype       GOID       Pvalue OddsRatio  ExpCount Count Size
## 1          1 underMF GO:0005201 5.280629e-15 0.1084677  41.36170    10   72
## 2          1 underMF GO:0005198 2.081705e-11 0.2448824  62.04255    29  108
## 3          1 underCC GO:0031012 1.276834e-10 0.3605085 107.53125    67  186
## 4          1 underCC GO:0030312 2.159404e-10 0.3663565 108.10938    68  187
## 5          1 underCC GO:0005576 1.076738e-09 0.5227524 350.34375   294  606
## 6          1 underCC GO:0071944 1.653199e-09 0.5292955 446.31250   390  772
##                                          Term
## 1 extracellular matrix structural constituent
## 2                structural molecule activity
## 3                        extracellular matrix
## 4            external encapsulating structure
## 5                        extracellular region
## 6                              cell periphery

If you run the code, you will see the following will be printout for you.

## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

Above shows the enrichment dataframe returned from GOstats (link)[https://bioconductor.org/packages/release/bioc/html/GOstats.html] package. clusterNum indicates the cluster label.

SGCP has four functions that illustrate the gene ontology enrichment result; SGCP_plot_jitter, SGCP_plot_density, SGCP_plot_bar, and SGCP_plot_pie. Note that p-values are log-transformed, and therefore, The higher the point, the more significant the GO term is.

message("plotting initial GO p-values jitters...")
## plotting initial GO p-values jitters...
pl <- SGCP_plot_jitter(df = resInitialGO$GOresults, 
                    tit = "Initial GO p-values",
                    xname = "cluster", yname = "-log10 p-value", ps = 3)
print(pl)

message("plotting initial GO p-values density")
## plotting initial GO p-values density
pl <- SGCP_plot_density(df = resInitialGO$GOresults, 
                    tit = "Initial GO p-values Density",
                    xname = "cluster", yname = "-log10 p-value")

print(pl)
## Picking joint bandwidth of 0.199

message("plotting initial GO p-values mean")
## plotting initial GO p-values mean
pl <- SGCP_plot_bar(df = resInitialGO$GOresults, tit = "Cluster Performance",
                    xname = "cluster", yname = "mean -log10 p-value")
print(pl)

message("plotting initial GO p-values pie chart...")
## plotting initial GO p-values pie chart...
pl <- SGCP_plot_pie(df = resInitialGO$GOresults, tit = "Initial GO Analysis",
                xname = "cluster", yname = "count", posx = 1.8)
print(pl)

Label beside each segment indicates the log-transformed p-value of the most significant term found in that segment.

3.4 semiLabeling function

Gene Semi-labeling step is implemented using semiLabeling function. This function takes the result from geneOntology function and returns a dataframe that identify remarkable and unremarkable genes along with the cutoff value. Code below shows how to use this function.

resSemiLabel <- semiLabeling(geneID = geneID, 
                            df_GO = resInitialGO$GOresults, 
                            GOgenes = resInitialGO$FinalGOTermGenes)
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
print(head(resSemiLabel$geneLabel))
##      geneID label
## 1      7503     1
## 2 100462977  <NA>
## 3      4878     2
## 4      4879     2
## 5      6192     1
## 6      9086  <NA>
table(resSemiLabel$geneLabel$label)
## 
##   1   2 
## 667 498

Above table shows the genes and their corresponding cluster label if is remarkable otherwise NA. we say a cluster is remarkable if it has at least one remarkable gene. Therefore, both clusters are remarkable here. In here, we discussed that if number of clusters is larger than 2, then SGCP will wipe out unremarkable clusters through the Semi-labeling and Semi-supervised steps. Gene in these clusters will be placed in remarkable clusters.. In other words, number of modules may drop but the module enrichment will increase. These steps, in fact, change the original clusters produced by clustering function and produce a different set of clusters. Therefore, SGCP returns to set of clusters, (i) immediately as the clustering function output, and (ii) after semiSupervised function. To distinguish these two sets, we call (i) and (ii) as clusters and modules respectively. In the manuscript, we also distinguish them using pSGCP and SGCP.

3.5 semiSupervised function

Gene Semi-supervised step is implemented using semiSupervised function. Final module labeling is available in FinalLabeling which is a dataframe of geneID with its corresponding semi-label and final label. Code below shows how to use this function.

resSemiSupervised <- semiSupervised(specExp = resClus$Y, 
                                geneLab = resSemiLabel$geneLabel,
                                model = "knn", kn = NULL)
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
##  it may take time
## Loading required package: ggplot2
## Loading required package: lattice
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      4      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list     
## 23-nearest neighbor model
## Training set outcome distribution:
## 
##   1   2 
## 667 498
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
##   1   2 
## 205 130
## semi-supervised done!..
print(head(resSemiSupervised$FinalLabeling))
##      geneID semiLabel FinalLabel
## 1      7503         1          1
## 2 100462977      <NA>          2
## 3      4878         2          2
## 4      4879         2          2
## 5      6192         1          1
## 6      9086      <NA>          1
print(table(resSemiSupervised$FinalLabeling$FinalLabel))
## 
##   1   2 
## 872 628

In this step, k-nearest neighbor model is selected. It hyper-parameter is selected based on cross validation on accuracy metric. Table above shows the gene semi label and final labeling.

Now you can see the PCA plot with the final labeling using SGCP_plot_pca function. For practice uncomment the following to see the resulting PCAs.

# message("Plotting PCA of transformed data with label")
# pl <- SGCP_plot_pca(m = expData, 
#                clusLabs = resSemiSupervised$FinalLabeling$FinalLabel, 
#                tit = "PCA plot with label", ps = .5)
print(pl)

# message("Plotting PCA of transformed data with label")
# pl <- SGCP_plot_pca(m = resClus$Y, 
#                clusLabs = resSemiSupervised$FinalLabeling$FinalLabel, 
#                tit = "PCA plot with label", ps = .5)
#print(pl)

3.6 geneOntology function

Finally, Gene Ontology Enrichment step is performed one more time on the final modules for module enrichment.A gain, since it takes time, I have commented the code, and use the storing resFinalGO as the output of this function. For your practice uncomment and run it.

# resFinalGO <- geneOntology(geneUniv = geneID, clusLab = resSemiSupervised$FinalLabeling$FinalLabel, 
#                              annotation_db = annotation_db)

data(resFinalGO)
head(resFinalGO$GOresults)
##   clusterNum  GOtype       GOID       Pvalue OddsRatio  ExpCount Count Size
## 1          1 underMF GO:0005201 4.792355e-15 0.1081310  41.41277    10   72
## 2          1 underMF GO:0005198 1.872816e-11 0.2440998  62.11915    29  108
## 3          1 underCC GO:0031012 1.118311e-10 0.3593320 107.65761    67  186
## 4          1 underCC GO:0030312 1.893638e-10 0.3651603 108.23641    68  187
## 5          1 underCC GO:0005576 8.138271e-10 0.5201268 350.75543   294  606
## 6          1 underCC GO:0071944 1.165513e-09 0.5259400 446.83696   390  772
##                                          Term
## 1 extracellular matrix structural constituent
## 2                structural molecule activity
## 3                        extracellular matrix
## 4            external encapsulating structure
## 5                        extracellular region
## 6                              cell periphery

If you run the code, you will see the following will be printout for you.

## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

Now let see the corresponding plots for final module enrichment.

message("plotting final GO p-values jitters...")
## plotting final GO p-values jitters...
pl <- SGCP_plot_jitter(df = resFinalGO$GOresults, tit = "Final GO p-values",
                    xname = "module", yname = "-log10 p-value", ps = 3)
print(pl)

message("plotting final GO p-values density...")
## plotting final GO p-values density...
pl <- SGCP_plot_density(df = resFinalGO$GOresults, 
                    tit = "Final GO p-values Density",
                    xname = "module", yname = "-log10 p-value")
print(pl)
## Picking joint bandwidth of 0.197

rm(pl)
message("plotting final GO p-values mean...")
## plotting final GO p-values mean...
pl <- SGCP_plot_bar(df = resFinalGO$GOresults, tit = "Module Performance",
                xname = "module", yname = "mean -log10 p-value")
print(pl)

rm(pl)
message("plotting final GO p-values pie xhart...")
## plotting final GO p-values pie xhart...
pl <- SGCP_plot_pie(df = resFinalGO$GOresults, tit = "Final GO Analysis",
                xname = "module", yname = "count", posx = 1.9)
print(pl)

rm(pl)

Comparing this result with initial clusters, it is observed that gene ontology enrichment has been increased slightly. We found out, if all initial clusters are remarkable, like here, Semi-labeling and Semi-supervised steps does not change the genes cluster labels fundamentally, and initial clusters converges to final clusters. Otherwise, SGCP will wiped out the non-significant clusters and increases the enrichment of remarkable clusters. And, final modules are different from initial clusters. You can see the detail in manuscript SGCP manuscript.

rm(resClus, resInitialGO, resSemiLabel, resSemiSupervised, resFinalGO, pl)
sI <- sessionInfo()
sI
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] caret_6.0-94                lattice_0.21-8             
##  [3] ggplot2_3.4.2               org.Hs.eg.db_3.17.0        
##  [5] AnnotationDbi_1.62.0        SummarizedExperiment_1.30.0
##  [7] Biobase_2.60.0              GenomicRanges_1.52.0       
##  [9] GenomeInfoDb_1.36.0         IRanges_2.34.0             
## [11] S4Vectors_0.38.0            BiocGenerics_0.46.0        
## [13] MatrixGenerics_1.12.0       matrixStats_0.63.0         
## [15] SGCP_1.0.0                  knitr_1.42                 
## [17] BiocStyle_2.28.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.14         jsonlite_1.8.4         
##   [4] magrittr_2.0.3          magick_2.7.4            farver_2.1.1           
##   [7] rmarkdown_2.21          zlibbioc_1.46.0         vctrs_0.6.2            
##  [10] memoise_2.0.1           RCurl_1.98-1.12         htmltools_0.5.5        
##  [13] cellranger_1.1.0        pROC_1.18.0             sass_0.4.5             
##  [16] parallelly_1.35.0       bslib_0.4.2             plyr_1.8.8             
##  [19] rootSolve_1.8.2.3       lubridate_1.9.2         cachem_1.0.7           
##  [22] lifecycle_1.0.3         iterators_1.0.14        pkgconfig_2.0.3        
##  [25] Matrix_1.5-4            R6_2.5.1                fastmap_1.1.1          
##  [28] GenomeInfoDbData_1.2.10 future_1.32.0           digest_0.6.31          
##  [31] Exact_3.2               colorspace_2.1-0        RSpectra_0.16-1        
##  [34] RSQLite_2.3.1           labeling_0.4.2          timechange_0.2.0       
##  [37] fansi_1.0.4             httr_1.4.5              compiler_4.3.0         
##  [40] proxy_0.4-27            withr_2.5.0             bit64_4.0.5            
##  [43] DBI_1.1.3               highr_0.10              MASS_7.3-59            
##  [46] lava_1.7.2.1            DelayedArray_0.26.0     gld_2.6.6              
##  [49] ModelMetrics_1.2.2.2    Category_2.66.0         tools_4.3.0            
##  [52] zip_2.3.0               future.apply_1.10.0     nnet_7.3-18            
##  [55] glue_1.6.2              nlme_3.1-162            grid_4.3.0             
##  [58] reshape2_1.4.4          generics_0.1.3          recipes_1.0.6          
##  [61] gtable_0.3.3            class_7.3-21            data.table_1.14.8      
##  [64] lmom_2.9                utf8_1.2.3              XVector_0.40.0         
##  [67] foreach_1.5.2           pillar_1.9.0            stringr_1.5.0          
##  [70] genefilter_1.82.0       splines_4.3.0           dplyr_1.1.2            
##  [73] survival_3.5-5          bit_4.0.5               annotate_1.78.0        
##  [76] tidyselect_1.2.0        RBGL_1.76.0             GO.db_3.17.0           
##  [79] Biostrings_2.68.0       bookdown_0.33           xfun_0.39              
##  [82] expm_0.999-7            hardhat_1.3.0           timeDate_4022.108      
##  [85] stringi_1.7.12          yaml_2.3.7              boot_1.3-28.1          
##  [88] evaluate_0.20           codetools_0.2-19        tibble_3.2.1           
##  [91] Rgraphviz_2.44.0        BiocManager_1.30.20     graph_1.78.0           
##  [94] cli_3.6.1               rpart_4.1.19            xtable_1.8-4           
##  [97] DescTools_0.99.48       munsell_0.5.0           jquerylib_0.1.4        
## [100] Rcpp_1.0.10             readxl_1.4.2            globals_0.16.2         
## [103] png_0.1-8               XML_3.99-0.14           parallel_4.3.0         
## [106] gower_1.0.1             blob_1.2.4              GOstats_2.66.0         
## [109] AnnotationForge_1.42.0  bitops_1.0-7            listenv_0.9.0          
## [112] mvtnorm_1.1-3           GSEABase_1.62.0         ipred_0.9-14           
## [115] scales_1.2.1            prodlim_2023.03.31      e1071_1.7-13           
## [118] ggridges_0.5.4          purrr_1.0.1             openxlsx_4.2.5.2       
## [121] crayon_1.5.2            rlang_1.1.0             KEGGREST_1.40.0