Whole transcriptomic profiles are useful for studying the expression levels of thousands of genes across samples. Clustering algorithms are used to identify patterns in these profiles to determine clinically relevant subgroups. Feature selection is a critical integral part of the process. Currently, there are many feature selection and clustering methods to identify the relevant genes and perform clustering of samples. However, choosing the appropriate methods is difficult as recent work demonstrates that no method is the clear winner. Hence, we present an R-package called multiClust
that allows researchers to experiment with the choice of combination of methods for gene selection and clustering with ease. In addition, using multiClust, we present the merit of gene selection and clustering methods in the context of clinical relevance of clustering, specifically clinical outcome.
Our integrative R-package contains:
A function to read in gene expression data and format appropriately for analysis in R.
Four different ways to select the number of genes a. Fixed b. Percent c. Poly d. GMM
Four gene ranking options that order genes based on different statistical criteria a. CV_Rank b. CV_Guided c. SD_Rank d. Poly
Two ways to determine the cluster number a. Fixed b. Gap Statistic
Two clustering algorithms a. Hierarchical clustering b. K-means clustering
A function to correlate sample clusters with clinical outcome
Function Workflow
The seven functions listed below should be used in the following order:
input_file
, a function to read-in a text file containing the matrix of gene probes and samples to analyze.
number_probes
, a function to determine the number of gene probes to select for in the feature selection process.
probe_ranking
, a function to rank and select for probes using one of the available ranking options.
number_clusters
, a function to determine the number of clusters to be used to cluster gene probes and samples.
cluster_analysis
, a function to perform Kmeans or Hierarchical clustering analysis of the selected gene probe expression data.
avg_probe_exp
, a function to produce a matrix containing the average expression of each gene probe within each sample cluster.
surv_analysis
, a function to produce Kaplan-Meier Survival Plots after the discretization of samples into different clusters.
Other Functions Included
WriteMatrixToFile
, a function to write a data matrix to a text file.
nor.min.max
, a function to that uses feature scaling to normalize values in a matrix between 0 and 1.
In order to use multiClust
, the user will need two text files.
The first file is a gene probe expression dataset. This file should be a matrix with columns being the samples and the rows being genes or probes.
The second file contains the patient clinical parameters. This file should be a matrix consisting of two columns. The first column contains the patient survival time and the second column contains the patient survival event occurrence. Survival time should be recorded in months. Survival event occurrence should be indicated by a 0 or 1. A 0 indicates censorship and a 1 indicates an event has occurred.
In this example, a gene expression dataset, GSE2034, will be obtained from the Gene Expression Omnibus (GEO) website http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse2034.
The following code can be used to extract the gene expression data and clinical information from the series matrix text zip file. When using this code, be sure that you have a strong internet connection. The GSE datasets are very large!
# Load GEO and biobase libraries
library(GEOquery)
library(Biobase)
library(multiClust)
library(preprocessCore)
library(ctc)
library(gplots)
library(dendextend)
library(graphics)
library(grDevices)
library(amap)
library(survival)
# Obtain GSE series matrix file from GEO website using getGEO function
gse <- getGEO(GEO="GSE2034")
# Save the gene expression matrix as an object
data.gse <- exprs(gse[[1]])
# Save the patient clinical data to an object
pheno <- pData(phenoData(gse[[1]]))
# Write the gene expression and clinical data to text files
WriteMatrixToFile(tmpMatrix=data.gse, tmpFileName="GSE2034.expression.txt",
blnRowNames=TRUE, blnColNames=TRUE)
WriteMatrixToFile(tmpMatrix=pheno, tmpFileName="GSE2034.clinical.txt",
blnRowNames=TRUE, blnColNames=TRUE)
In the event that an error is produced when trying to obtain GSE datasets from the NCBI GEO FTP site, please use this alternative method:
After downloading the GSE series matrix file and moving it to a directory the following code can be used.
# Obtain GSE series matrix file from GEO website using getGEO function
gse <- getGEO(filename="GSE2034_series_matrix.txt.gz")
# Save the gene expression matrix as an object
data.gse <- exprs(gse[[1]])
# Save the patient clinical data to an object
pheno <- pData(phenoData(gse[[1]]))
# Write the gene expression and clinical data to text files
WriteMatrixToFile(tmpMatrix=data.gse, tmpFileName="GSE2034.expression.txt",
blnRowNames=TRUE, blnColNames=TRUE)
WriteMatrixToFile(tmpMatrix=pheno, tmpFileName="GSE2034.clinical.txt",
blnRowNames=TRUE, blnColNames=TRUE)
Some datasets on GEO are already normalized using Robust Multichip Average (RMA) procedure available in the ‘affy’ R pacakge or quantile normalization and log2 scaling using the ‘preprocessCore’ R package.
For datasets that are not normalized, the following code can be used to extract the gene expression matrix from the series matrix text file, quantile normalize the data, log2 scale the data, and lastly write the data to a text file.
# Obtain GSE series matrix file from GEO website using getGEo function
gse <- getGEO(GEO="GSE2034")
# Save the gene expression matrix as an object
data.gse <- exprs(gse)
# Quantile normalization of the dataset
data.norm <- normalize.quantiles(data.gse, copy=FALSE)
# shift data before log scaling to prevent errors from log scaling
# negative numbers
if (min(data.norm)> 0) {
}
else {
mindata.norm=abs(min(data.norm)) + .001
data.norm=data.norm + mindata.norm
}
# Log2 scaling of the dataset
data.log <- t(apply(data.norm, 1, log2))
# Write the gene expression and clinical data to text files
WriteMatrixToFile(tmpMatrix=data.log,
tmpFileName="GSE2034.normalized.expression.txt",
blnRowNames=TRUE, blnColNames=TRUE)
As mentioned in the beginning of this section, the patient survival information file should be a matrix consisting of two columns. However, when the patient clinical information is written to a text file, it is generally a matrix of more than two columns with all of the clinical parameters.
It is recommended that this text file be loaded in R or Microsoft Excel to obtain the two columns of survival information needed. Below is an example of what the patient survival information file should look like:
# Obtain clinical outcome file
clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt",
package="multiClust")
# Load in survival information file
clinical <- read.delim2(file=clin_file, header=TRUE)
# Display first few rows of the file
clinical[1:5, 1:2]
## RFS_time RFS
## 1 77.88 0
## 2 49.32 1
## 3 130.2 0
## 4 82.8 0
## 5 144.96 0
# Column one represents the survival time in months
# Column two represents the relapse free survival (RFS) event
The first function of multiClust
is used to load a gene expression dataset into R and format the matrix so that the probe or gene names are the rownames of the matrix.
In this example, the GSE2034 dataset is loaded into the R console using the input_file
function.
Function Arguments
Example
# Obtain gene expression matrix
exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package= "multiClust")
# Load the gene expression matrix
data.exprs <- input_file(input=exp_file)
## [1] "The gene expression matrix has been loaded"
# View the first few rows and columns of the matrix
data.exprs[1:4,1:4]
## GSM36777 GSM36778 GSM36779 GSM36780
## 1007_s_at 10.65626945 10.82617907 11.12609748 10.47785521
## 1053_at 6.293524052 6.105409137 6.239483636 6.628974931
## 117_at 6.785154455 6.446397235 6.779564325 6.673216272
## 121_at 8.147451246 8.757777784 8.530315085 8.499970427
In this example, the gene expression matrix is stored into the object “data.exprs”.
The second function of this package, number_probes
is a function used to specify the number of desired probes or genes the user would like to select for in their gene selection analysis.
Function Arguments
This function has multiple arguments. The first argument “input” represents the string name of the expression matrix text file to be loaded into R.
The second argument “data.exp” is the object containing the expression matrix
The last four arguments “Fixed”, “Percent”, “Poly”, and “Adaptive” are the four different methods that determine how the number of genes are selected. The default option in the function is set to “Fixed”.
# Example 1: Choosing a fixed gene or probe number
# Obtain gene expression matrix
exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt",
package="multiClust")
gene_num <- number_probes(input=exp_file,
data.exp=data.exprs, Fixed=300,
Percent=NULL, Poly=NULL, Adaptive=NULL)
## [1] "The fixed gene probe number is: 300"
As a result, 300 genes from the dataset will be selected during the gene or probe selection process.
# Example 2: Choosing 50% of the total selected gene probes in a dataset
# Obtain gene expression matrix
exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt",
package="multiClust")
gene_num <- number_probes(input=exp_file,
data.exp=data.exprs, Fixed=NULL,
Percent=50, Poly=NULL, Adaptive=NULL)
## [1] "The percent gene probe number is: 150"
As a result, 50% of the total dataset probes, or 300 probes, will be selected.
# Example 3: Choosing the polynomial selected gene probes in a dataset
# Obtain gene expression matrix
exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt",
package="multiClust")
gene_num <- number_probes(input=exp_file,
data.exp=data.exprs, Fixed=NULL,
Percent=NULL, Poly=TRUE, Adaptive=NULL)
## [1] "The poly gene probe number is: 19"
# Example 4: Choosing the Adaptive Gaussian Mixture Modeling method
gene_num <- number_probes(input=exp_file,
data.exp=data.exprs, Fixed=NULL,
Percent=NULL, Poly=NULL, Adaptive=TRUE)
The “Mclust” function from the package mclust https://cran.r-project.org/web/packages/mclust/mclust.pdf is used to apply Gaussian mixture modeling to the inputted gene expression matrix. In addition, files containing information about the dataset’s mean, variance, mixing proportion, and gaussian assignment are also outputted.
It should be noted that when choosing one of the three methods of gene or probe number selection, the other methods should be set equal to “NULL”. Furthermore, the Adaptive option has a very long computational time.
Function Output
This function returns an object containing the number of genes/probes to select for during the feature selection process.
The third function of this package, probe_ranking
is used to select the most informative gene probes within a dataset. This function allows the user choose from one of four different probe ranking algorithms.
Function Arguments
This function has multiple arguments. The first argument “input” represents the string name of the expression matrix text file to be loaded into R.
The second argument “probe_number” is an output of the number probes
function. For this argument, the user specifies the number of genes to select for.
The third argument “probe_num selection” is a string specifying which type of probe number selection method was used for the number_probes
function.
The fourth argument “data.exp” is the object containing the expression matrix.
The last argument “method” is a string that specifies which of the four probe ranking methods to use.
List of the Five Probe Ranking Methods
CV_Rank is a gene probe ranking method that selects for probes using the coefficient of variation of the entire dataset (CV).
CV_Guided is a gene probe ranking method that uses the coefficient of variation (CV) for each probe to select for probes.
SD_Rank is a gene probe ranking method that selects for probes with the highest standard deviation within the dataset.
Poly is a ranking method that fits three second degree polynomial functions of mean and standard deviation to the dataset to select the most variable probes in the dataset. This is the only method that does not use a probe number input from the number_probes
function. When using this ranking method “probe_num” can be set to NULL.
Example
Below is an example of how to use the probe_ranking
function:
# Obtain gene expression matrix
exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt",
package="multiClust")
# Load the gene expression matrix
data.exprs <- input_file(input=exp_file)
## [1] "The gene expression matrix has been loaded"
# Call probe_ranking function
# Select for 500 probes
# Choose genes using the SD_Rank method
ranked.exprs <- probe_ranking(input=exp_file,
probe_number=300,
probe_num_selection="Fixed_Probe_Num",
data.exp=data.exprs,
method="SD_Rank")
## [1] "The selected SD_Rank Gene Expression text file has been written"
# Display the first few columns and rows of the matrix
ranked.exprs[1:4,1:4]
## GSM36777 GSM36778 GSM36779 GSM36780
## 1405_i_at 2.663816 2.992651 1.833091 3.793433
## 200656_s_at 5.616488 2.372672 7.134110 5.639419
## 200641_s_at 4.165681 2.922955 4.397136 4.934420
## 200670_at 7.606198 6.223078 8.604945 7.279471
Function Outputs
The output of this function is a text file containing a selected gene or probe expression matrix. In addition, the selected expression matrix is stored into the chosen variable “ranked.exprs”.
The fourth function is this package, number_clusters
, is a function used to designate the number of clusters that samples will be assigned to.
Function Arguments
The “data.exp” argument represents an object containing the original gene expression matrix to be used for clustering of genes and samples. This object is an output of the input_file
function.
The “Fixed” argument is a positive integer used to represent the number of clusters the samples and probes will be divided into.
The “gap_statistic” argument is a logical indicating whether to calculate the optimal number of clusters to divide the samples into using a gap statistic function clus_Gap
from the package cluster
.
Note
The user should only choose either the “Fixed” or “gap_statistic” option, not both. When using the gap_statistic option, change the argument to TRUE and “Fixed” to NULL.
Fixed Cluster Number Example
Below is an example of how to use the number_clusters
function with a fixed number of clusters:
# Call the number_clusters function
# data.exp is the original expression matrix object outputted from
# the input_file function
# User specifies that samples will be separated into 3 clusters
# with the "Fixed" argument
cluster_num <- number_clusters(data.exp=data.exprs, Fixed=3,
gap_statistic=NULL)
## [1] "The fixed cluster number is: 3"
The output from this example will be the object “cluster_num” with a value of 3.
Gap statistic Example
The number_clusters
function can be called in a similar way to use the gap_statistic option:
# Call the number_clusters function
# data.exp is the original expression matrix object ouputted from
# the input_file function
# User chooses the gap_statistic option by making gap_statistic equal TRUE
# The Fixed argument is also set to NULL
cluster_num <- number_clusters(data.exp=data.exprs, Fixed=NULL,
gap_statistic=TRUE)
The gap_statistic option has a very long computational time and can take up to several hours. The output of this function will be a positive integer indicating the optimal number of clusters to divide samples into.
The fifth function in this package, cluster_analysis
is a function used to perform Kmeans or Hierarchical clustering analysis of genes/probes and samples after undergoing a feature selection process in the probe_ranking
function.
Function Arguments:
The first argument “sel.exp”, is an object containing the numeric selected gene expression matrix. This object is an output of the probe_ranking
function.
The second argument “cluster_type” is a string indicating the type of clustering method to use. Kmeans or HClust are the two options.
The third argument “distance” is a string describing the distance metric to use for Hierarchical clustering via the dist
function. dist
uses a default distance metric of Euclidean distance. Options include one of “euclidean”, “maximum”, manhattan“,”canberra“,”binary“, or”minkowski". Kmeans does not use a distance metric.
The fourth argument, “linkage_type” is a string describing the linkage metric to be used for hierarchical clustering in the hclust
function. The default is “ward.D2”, however other options include “average”, “complete”, “median”, “centroid”, “single”, and “mcquitty”. Kmeans does not use a linkage metric.
The fifth argument, “gene_distance” is a string describing the distance measure to be used for the Dist
function when performing hierarchical clustering of genes. Options include one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “pearson”, “abspearson”, “correlation”, “abscorrelation”, “spearman” or “kendall”. The default of gene_distance is set to “correlation”. The argument can be set to NULL when Kmeans clustering is used.
The sixth argument “num_clusters” is a positive integer to specify the number of clusters samples will be divided into. This number is determined by the number_clusters
function.
The seventh argument “data_name” is a string indicating the cancer type and/or name of the dataset being analyzed. This name will be used to label the sample dendrograms and heatmap files.
The eighth argument “probe_rank” is a string indicating the feature selection method used in the probe_ranking
function. Options include “CV_Rank”, “CV_Guided”, “SD_Rank”, and “Poly”.
The ninth argument “probe_num_selection” is a string indicating the way in which probes were selected in the number_probes
function. Options include “Fixed_Probe_Num”, “Percent_Probe_Num”, “Poly_Probe_Num”, and “Adaptive_Probe_Num”.
The tenth argument, “cluster_num_selection” is a string indicating how the number of clusters were determined in the number_clusters function. Options include “Fixed_Clust_Num” and “Gap_Statistic”.
Function Outputs
A CSV file containing the sample names and their respective cluster. An object containing a vector of the sample names and their cluster number is returned.
When hierarchical clustering is chosen as the cluster method, a pdf file of the sample dendrogram as well as atr, gtr, and cdt files for viewing in Java TreeView are outputted. In the Java TreeView heatmap, the samples are clustered by the method indicated by the “linkage_type” argument while genes are clustered by the method indicated by the “gene_distance” argument. The default of sample clustering for the Java TreeView heatmap is “ward.D2” and the default for gene clustering is centered pearson “correlation”.
Hierarchical Clustering Example
Below is an example of how to use the cluster_analysis
function using the Hierarchical clustering option:
# Call the cluster_analysis function
hclust_analysis <- cluster_analysis(sel.exp=ranked.exprs,
cluster_type="HClust",
distance="euclidean", linkage_type="ward.D2",
gene_distance="correlation",
num_clusters=3, data_name="GSE2034 Breast",
probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num",
cluster_num_selection="Fixed_Clust_Num")
## [1] "Your HClust Sample Dendrogram has been outputted"
## [1] "Your atr, gtr, and cdt files have been outputted for viewing in Java TreeView"
## [1] "A CSV file has been produced containing your sample and cluster assignment information"
# Display the first few columns and rows of the object
head(hclust_analysis)
## GSM36777 GSM36778 GSM36779 GSM36780 GSM36781 GSM36782
## 1 3 1 1 1 2
Function Outputs
The outputs from this example would be an object “hclust_analysis” containing a vector of the sample names and cluster number. In addition, a sample dendrogram pdf file would be written. Lastly, the atr, gtr, and cdt files are outputted to view a heatmap of the genes and samples in Java TreeView.
Note
The cluster_analysis
function does not display the sample dendrogram and heatmap in the console. To show what these images would look like, the following examples are provided below.
Java TreeView Heatmap Example
Below is an example of a heatmap produced in Java TreeView from the atr, gtr, abnd cdt files produced by this function. In this heatmap, rows represent genes and columns represent samples.