structToolbox 1.0.1
The ‘structToolbox’ includes an extensive set of data (pre-)processing and analysis tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning. The methods and tools have been implemented using class-based templates available via the struct
(Statistics in R Using Class-based Templates) package. The aim of this vignette is to introduce the reader to basic and more advanced structToolbox-based operations and implementations, such as the use of struct
objects, getting/setting methods/parameters, and building workflows for the analysis of mass spectrometry (MS) and nuclear magnetic resonance (NMR)-based Metabolomics and proteomics datasets. The workflows demonstrated here include a wide range of methods and tools including pre-processing such as filtering, normalisation and scaling, followed by univariate and/or multivariate statistics, and machine learning approaches.
The latest version of structToolbox
compatible with your current R version can be installed using BiocManager
.
# install BiocManager if not present
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# install structToolbox and dependencies
BiocManager::install("structToolbox")
A number of additional packages are needed for this vignette.
## install additional bioc packages for vignette if needed
#BiocManager::install(c('pmp', 'ropls', 'BiocFileCache'))
## install additional CRAN packages if needed
#install.packages(c('cowplot', 'openxlsx'))
suppressPackageStartupMessages({
# Bioconductor packages
library(structToolbox)
library(pmp)
library(ropls)
library(BiocFileCache)
# CRAN libraries
library(ggplot2)
library(gridExtra)
library(cowplot)
library(openxlsx)
})
# use the BiocFileCache
bfc <- BiocFileCache(ask = FALSE)
struct
objects, including models, model sequences, model charts and ontology.PCA (Principal Component Analysis) and PLS (Partial Least Squares) are commonly applied methods for exploring and analysing multivariate datasets. Here we use these two statistical methods to demonstrate the different types of struct
(STatistics in R Using Class Templates) objects that are available as part of the structToolbox
and how these objects (i.e. class templates) can be used to conduct unsupervised and supervised multivariate statistical analysis.
For demonstration purposes we will use the “Iris” dataset. This famous (Fisher’s or Anderson’s) dataset contains measurements of sepal length and width and petal length and width, in centimeters, for 50 flowers from each of 3 class of Iris. The class are Iris setosa, versicolor, and virginica. See here (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html) for more information.
Note: this vignette is also compatible with the Direct infusion mass spectrometry metabolomics “benchmark” dataset described in Kirwan et al., Sci Data 1, 140012 (2014) (https://doi.org/10.1038/sdata.2014.12).
Both datasets are available as part of the structToolbox package and already prepared as a DatasetExperiment
object.
## Iris dataset (comment if using MTBLS79 benchmark data)
D = iris_DatasetExperiment()
D$sample_meta$class = D$sample_meta$Species
## MTBLS (comment if using Iris data)
# D = MTBLS79_DatasetExperiment(filtered=TRUE)
# M = pqn_norm(qc_label='QC',factor_name='sample_type') +
# knn_impute(neighbours=5) +
# glog_transform(qc_label='QC',factor_name='sample_type') +
# filter_smeta(mode='exclude',levels='QC',factor_name='sample_type')
# M = model_apply(M,D)
# D = predicted(M)
# show info
D
## A "DatasetExperiment" object
## ----------------------------
## name: Fisher's Iris dataset
## description: This famous (Fisher's or Anderson's) iris data set gives the
## measurements in centimeters of the variables
## sepal length and width and petal length and
## width, respectively, for 50 flowers from each of
## 3 species of iris. The species are Iris setosa,
## versicolor, and virginica.
## data: 150 rows x 4 columns
## sample_meta: 150 rows x 2 columns
## variable_meta: 4 rows x 1 columns
The DatasetExperiment
object is an extension of the SummarizedExperiment
class used by the Bioconductor community. It contains three main parts:
data
A data frame containing the measured data for each sample.sample_meta
A data frame of additional information related to the samples e.g. group labels.variable_meta
A data frame of additional information related to the variables (features) e.g. annotationsLike all struct
objects it also contains name
and description
fields (called “slots” is R language).
A key difference between DatasetExperiment
and SummarizedExperiment
objects is that the data is transposed. i.e. for DatasetExperiment
objects the samples are in rows and the features are in columns, while the opposite is true for SummarizedExperiment
objects.
All slots are accessible using dollar notation.
# show some data
head(D$data[,1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
struct
model objectsBefore we can apply e.g. PCA we first need to create a PCA object. This object contains all the inputs, outputs and methods needed to apply PCA. We can set parameters such as the number of components when the PCA model is created, but we can also use dollar notation to change/view it later.
P = PCA(number_components=15)
P$number_components=5
P$number_components
## [1] 5
The inputs for a model can be listed using param_ids(object)
:
param_ids(P)
## [1] "number_components"
Or a summary of the object can be printed to the console:
P
## A "PCA" object
## --------------
## name: Principal Component Analysis (PCA)
## description: PCA is a multivariate data reduction technique. It summarises
## the data in a smaller number of Principal
## Components that describe the maximum variation
## present in the DatasetExperiment.
## input params: number_components
## outputs: scores, loadings, eigenvalues, ssx, correlation, that
## predicted: that
## seq_in: data
Unless you have good reason not to, it is usually sensible to mean centre the columns of the data before PCA. Using the STRUCT
framework we can create a model sequence that will mean centre and then apply PCA to the mean centred data.
M = mean_centre() + PCA(number_components = 4)
In structToolbox
mean centring and PCA are both model objects, and joining them using “+” creates a model_sequence object. In a model_sequence the outputs of the first object (mean centring) are automatically passed to the inputs of the second object (PCA), which allows you chain together modelling steps in order to build a workflow.
The objects in the model_sequence can be accessed by indexing, and we can combine this with dollar notation. For example, the PCA object is the second object in our sequence and we can access the number of components as follows:
M[2]$number_components
## [1] 4
Model and model_sequence objects need to be trained using data in the form of a DatasetExperiment
object. For example, the PCA model sequence we created (M
) can be trained using the iris DatasetExperiment
object (‘D’).
M = model_train(M,D)
This model sequence has now mean centred the original data and calculated the PCA scores and loadings.
Model objects can be used to generate predictions for test datasets. For the PCA model sequence this involves mean centring the test data using the mean of training data, and the projecting the centred test data onto the PCA model using the loadings. The outputs are all stored in the model sequence and can be accessed using dollar notation. For this example we will just use the training data again (sometimes called autoprediction), which for PCA allows us to explore the training data in more detail.
M = model_predict(M,D)
Sometimes models don’t make use the training/test approach e.g. univariate statsitics, filtering etc. For these models the model_apply
method can be used instead. For models that do provide training/test methods, model_apply
applies autoprediction by default i.e. it is a short-cut for applying model_train
and model_predict
to the same data.
M = model_apply(M,D)
The available outputs for an object can be listed and accessed like input params, using dollar notation:
output_ids(M[2])
## [1] "scores" "loadings" "eigenvalues" "ssx" "correlation"
## [6] "that"
M[2]$scores
## A "DatasetExperiment" object
## ----------------------------
## name:
## description:
## data: 150 rows x 4 columns
## sample_meta: 150 rows x 2 columns
## variable_meta: 4 rows x 1 columns
The struct
framework includes chart objects. Charts associated with a model object can be listed.
chart_names(M[2])
## [1] "pca_biplot_plot" "pca_correlation_plot" "pca_dstat_plot"
## [4] "pca_loadings_plot" "pca_scores_plot" "pca_scree_plot"
Like model objects, chart objects need to be created before they can be used. Here we will plot the PCA scores plot for our mean centred PCA model.
C = pca_scores_plot(factor_name='class') # colour by class
chart_plot(C,M[2])
Note that indexing the PCA model is required because the pca_scores_plot
object requires a PCA object as input, not a model_sequence.
If we make changes to the input parameters of the chart, chart_plot
must be called again to see the effects.
# add petal width to meta data of pca scores
M[2]$scores$sample_meta$example=D$data[,1]
# update plot
C$factor_name='example'
chart_plot(C,M[2])
The chart_plot
method returns a ggplot object so that you can easily combine it with other plots using the gridExtra
or cowplot
packages for example.
# scores plot
C1 = pca_scores_plot(factor_name='class') # colour by class
g1 = chart_plot(C1,M[2])
# scree plot
C2 = pca_scree_plot()
g2 = chart_plot(C2,M[2])
# arange in grid
grid.arrange(grobs=list(g1,g2),nrow=1)
Within the struct
framework (and structToolbox
) we use STATO objects to provide standardardised definitions for objects and its inputs and outputs.
STATO is a general purpose STATistics Ontology (http://stato-ontology.org). From the webpage:
Its aim is to provide coverage for processes such as statistical tests, their conditions of application, and information needed or resulting from statistical methods, such as probability distributions, variables, spread and variation metrics. STATO also covers aspects of experimental design and description of plots and graphical representations commonly used to provide visual cues of data distribution or layout and to assist review of the results.
The STATistics Ontology for the PCA model is available via the following methods. We can verify that an object has STATO definitions attached by checking that the object is derived from the stat
class:
is(PCA(),'stato')
## [1] TRUE
STATO definitions can be extracted from the object using the stato_id
, stato_name
and stato_definition
methods.
# this is the stato id for PCA
stato_id(M[2])
## [1] "OBI:0200051"
# this is the stato name
stato_name(M[2])
## [1] "principal components analysis dimensionality reduction"
# this is the stato definition
stato_definition(M[2])
## [1] "A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones. "
This information is more succinctly displayed using stato_summary
. This method also scans over all input params and outputs for those with STATO definitions and displays those as well.
For PCA the “number of components” input is listed with definitions because it is a STATO object, but none of the outputs (e.g. PCA scores) are STATO objects at this time and therefore no definition is provided.
stato_summary(M[2])
## OBI:0200051
## principal components analysis dimensionality reduction
## A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones.
##
## Inputs:
## STATO:0000555
## number of predictive components
## number of predictive components is a count used as input to the principle component analysis (PCA)
##
##
## Outputs:
Validation is an important aspect of chemometric modelling. The struct
framework enables this kind of iterative model testing through iterator
objects.
Cross validation is a common technique for assessing the performance of classification models. For this example we will use a Partial least squares-discriminant analysis (PLS-DA) model. Data should be mean centred prior to PLS, so we will build a model sequence first.
M = mean_centre() + PLSDA(number_components=2,factor_name='class')
M
## A model_seq object containing:
##
## [1]
## A "mean_centre" object
## ----------------------
## name: Mean centre
## description:
## input params: mode
## outputs: centred, mean_data, mean_sample_meta
## predicted: centred
## seq_in: data
##
## [2]
## A "PLSDA" object
## ----------------
## name: Partial least squares discriminant analysis
## description:
## input params: number_components, factor_name
## outputs: scores, loadings, yhat, design_matrix, y, reg_coeff, probability, vip, pls_model, pred, threshold
## predicted: pred
## seq_in: data
iterator
objects like the k-fold cross-validation object (kfold_xval
) can be created just like any other struct
object. Parameters can be set at creation using the equals sign, and accessed or changed later using dollar notation.
# create object
XCV = kfold_xval(folds=5,factor_name='class')
# change the number of folds
XCV$folds=10
XCV$folds
## [1] 10
The model to be cross-validated can be set/accessed using the models
method.
models(XCV)=M
models(XCV)
## A model_seq object containing:
##
## [1]
## A "mean_centre" object
## ----------------------
## name: Mean centre
## description:
## input params: mode
## outputs: centred, mean_data, mean_sample_meta
## predicted: centred
## seq_in: data
##
## [2]
## A "PLSDA" object
## ----------------
## name: Partial least squares discriminant analysis
## description:
## input params: number_components, factor_name
## outputs: scores, loadings, yhat, design_matrix, y, reg_coeff, probability, vip, pls_model, pred, threshold
## predicted: pred
## seq_in: data
Alternatively, iterators can be combined with models using the multiplication symbol was shorthand for the models
assignement method:
# cross validation of a mean centred PLSDA model
XCV = kfold_xval(
folds=5,
method='venetian',
factor_name='class') *
(mean_centre() + PLSDA(factor_name='class'))
The run
method can be used with any iterator
object. The iterator will then run the set model or model sequence multiple times.
In this case we run cross-validation 5 times, splitting the data into different training and test sets each time.
The run
method also needs a metric
to be specified, which is another type of struct
object. This metric may be calculated once after all iterations, or after each iteration, depending on the iterator type (resampling, permutation etc). For cross-validation we will calculate “balanced accuracy” after all iterations.
XCV = run(XCV,D,balanced_accuracy())
XCV$metric
## metric mean sd
## 1 balanced_accuracy 0.23 NA
Note The balanced_accuracy
metric actually reports 1-accuracy, so a value of 0 indicates perfect performance. The standard deviation “sd” is NA in this example because there is only one permutation.
Like other struct
objects, iterators can have chart objects associated with them. The chart_names
function will list them for an object.
chart_names(XCV)
## [1] "kfoldxcv_grid" "kfoldxcv_metric"
Charts for iterator
objects can be plotted in the same way as charts for any other object.
C = kfoldxcv_grid(
factor_name='class',
level=levels(D$sample_meta$class)[2]) # first level
chart_plot(C,XCV)
It is possible to combine multiple iterators by using the multiplication symbol. This is equivalent to nesting one iterator inside the other. For example, we can repeat our cross-validation multiple times by permuting the sample order.
# permute sample order 10 times and run cross-validation
P = permute_sample_order(number_of_permutations = 10) *
kfold_xval(folds=5,factor_name='class')*
(mean_centre() + PLSDA(factor_name='class',number_components=2))
P = run(P,D,balanced_accuracy())
P$metric
## metric mean sd
## 1 balanced_accuracy 0.2155 0.01403369
This vignette provides an overview of a structToolbox
workflow implemented to process (e.g. filter features, signal drift and batch correction, normalise and missing value imputation) mass spectrometry data. The workflow exists of methods that are part of the Peak Matrix Processing (pmp) package, including a range of additional filters that are described in Kirwan et al., 2013, 2014.
Some packages are required for this vignette in addition structToolbox
:
For demonstration purposes we will process and analyse the MTBLS79 dataset (‘Dataset 7:SFPM’ Kirwan et al., 2014. This dataset represents a systematic evaluation of the reproducibility of a multi-batch direct-infusion mass spectrometry (DIMS)-based metabolomics study of cardiac tissue extracts. It comprises twenty biological samples (cow vs. sheep) that were analysed repeatedly, in 8 batches across 7 days, together with a concurrent set of quality control (QC) samples. Data are presented from each step of the data processing workflow and are available through MetaboLights (https://www.ebi.ac.uk/metabolights/MTBLS79).
The MTBLS79_DatasetExperiment
object included in the structToolbox
package is a processed version of the MTBLS79 dataset available in peak matrix processing (pmp) package. This vignette describes step by step how the structToolbox
version was created from the pmp
version (i.e. ‘Dataset 7:SFPM’ from the Scientific Data publication - https://doi.org/10.1038/sdata.2014.12).
The SummarizedExperiment
object from the pmp
package needs to be converted to a DatasetExperiment
object for use with structToolbox
.
# the pmp SE object
SE = MTBLS79
# convert to DE
DE = as.DatasetExperiment(SE)
DE$name = 'MTBLS79'
DE$description = 'Converted from SE provided by the pmp package'
# add a column indicating the order the samples were measured in
DE$sample_meta$run_order = 1:nrow(DE)
# add a column indicating if the sample is biological or a QC
Type=as.character(DE$sample_meta$Class)
Type[Type != 'QC'] = 'Sample'
DE$sample_meta$Type = factor(Type)
# convert to factors
DE$sample_meta$Batch = factor(DE$sample_meta$Batch)
DE$sample_meta$Class = factor(DE$sample_meta$Class)
# print summary
DE
## A "DatasetExperiment" object
## ----------------------------
## name: MTBLS79
## description: Converted from SE provided by the pmp package
## data: 172 rows x 2488 columns
## sample_meta: 172 rows x 6 columns
## variable_meta: 2488 rows x 0 columns
Full processing of the data set requires a number of steps. These will be applied using a single struct
model sequence (model.seq
).
A batch correction algorithm is applied to reduce intra- and inter- batch variations in the dataset.
Quality Control-Robust Spline Correction (QC-RSC) is provided in the pmp
package, and it has been
wrapped into a structToolbox
object called sb_corr
.
M = # batch correction
sb_corr(
order_col='run_order',
batch_col='Batch',
qc_col='Type',
qc_label='QC'
)
M = model_apply(M,DE)
The figure below shows a plot of a feature vs run order, before and after the correction. It can be seen that the correction has removed instrument drift within and between batches.
C = feature_profile(
run_order='run_order',
qc_label='QC',
qc_column='Type',
colour_by='Batch',
feature_to_plot='200.03196'
)
# plot and modify using ggplot2
chart_plot(C,DE)+ylab('Peak area')+ggtitle('Before')
chart_plot(C,predicted(M))+ylab('Peak area')+ggtitle('After')
An additional step is added to the published workflow to remove any feature not corrected by QCRCMS. This can occur if there are not enough measured QC values within a batch. QCRMS
in the pmp
package currently returns NA for all samples in the feature where this occurs. Features where this occurs will be excluded.
M2 = filter_na_count(
threshold=3,
factor_name='Batch'
)
M2 = model_apply(M2,predicted(M))
# calculate number of features removed
nc = ncol(DE) - ncol(predicted(M2))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 425
The output of this step is the output of MTBLS79_DatasetExperiment(filtered=FALSE)
.
In the journal article three spectral cleaning steps are applied. In the first filter a Kruskal-Wallis test is used to identify features not reliably detected in the QC samples (p < 0.0001) of all batches. We follow the same parameters as the original article and do not use multiple test correction (mtc = 'false'
).
M3 = kw_rank_sum(
alpha=0.0001,
mtc='none',
factor_names='Batch',
predicted='significant'
) +
filter_by_name(
mode='exclude',
dimension = 'variable',
seq_in = 'names',
names='seq_fcn', # this is a placeholder and will be replaced by seq_fcn
seq_fcn=function(x){return(x[,1])}
)
M3 = model_apply(M3, predicted(M2))
nc = ncol(predicted(M2)) - ncol(predicted(M3))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 262
To make use of univariate tests such as kw_rank_sum
as a filter some advanced features of struct
are needed. Slots predicted
, and seq_in
are used to ensure the correct output of the univariate test is connected to the correct input of a feature filter using filter_by_name
. Another slot seq_fcn
is used to extract the relevant column of the predicted
output so that it is compatible with the seq_in
input. A placeholder is used for the “names” parameter (names = 'place_holder'
) as this input will be replaced by the output from seq_fcn
.
The second filter is a Wilcoxon Signed-Rank test. It is used to identify features that are not representative of the average of the biological samples (p < 1e-14). Again we make use of seq_in
and seq_fcn
.
M4 = wilcox_test(
alpha=1e-14,
factor_names='Type',
mtc='none',
predicted = 'significant'
) +
filter_by_name(
mode='exclude',
dimension='variable',
seq_in='names',
names='place_holder'
)
M4 = model_apply(M4, predicted(M3))
nc = ncol(predicted(M3)) - ncol(predicted(M4))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 162
Finally, an RSD filter is used to remove features with high analytical variation (QC RSD > 20 removed)
M5 = rsd_filter(
rsd_threshold=20,
factor_name='Type'
)
M5 = model_apply(M5,predicted(M4))
nc = ncol(predicted(M4)) - ncol(predicted(M5))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 22
The output of this filter is the output of MTBLS79_DatasetExperiment(filtered=TRUE)
.
We will apply a number of common pre-processing steps to the filtered peak matrix that are identical to steps applied in are described in Kirwan et al. 2013, 2014.
These steps prepare the data for multivariate analysis by accounting for sample concentration differences, imputing missing values and scaling the data.
# peak matrix processing
M6 = pqn_norm(qc_label='QC',factor_name='Type') +
knn_impute(neighbours=5) +
glog_transform(qc_label='QC',factor_name='Type')
M6 = model_apply(M6,predicted(M5))
Principal Component Analysis (PCA) can be used to visualise high-dimensional data. It is an unsupervised method that maximises variance in a reduced number of latent variables, or principal components.
# PCA
M7 = mean_centre() + PCA(number_components = 2)
# apply model sequence to data
M7 = model_apply(M7,predicted(M6))
# plot pca scores
C = pca_scores_plot(factor_name=c('Sample_Rep','Class'),ellipse='none')
chart_plot(C,M7[2]) + coord_fixed() +guides(colour=FALSE)
This plot is very similar to Figure 3b of the original publication link. Sample replicates are represented by colours and samples groups (C = cow and S = Sheep) by different shapes.
Plotting the scores and colouring by Batch indicates that the signal/batch correction was effective as all batches are overlapping.
# chart object
C = pca_scores_plot(factor_name=c('Batch'),ellipse='none')
# plot
chart_plot(C,M7[2]) + coord_fixed()
The aim of this vignette is to demonstrate how to 1) apply and validate Partial Least Squares (PLS) analysis using the structToolbox, 2) reproduce statistical analysis in Thevenot et al. (2015) and 3. compare different implementations of PLS.
The objective of the original study was to: > …study the influence of age, body mass index (bmi), and gender on metabolite concentrations in urine, by analysing 183 samples from a cohort of adults with liquid chromatography coupled to high-resolution mass spectrometry.
The “Sacurine” dataset needs to be converted to a DatasetExperiment
object. The ropls package provides the data as a list containing a dataMatrix
, sampleMetadata
and variableMetadata
.
data('sacurine',package = 'ropls')
# the 'sacurine' list should now be available
# move the annotations to a new column and rename the features by index to avoid issues
# later when data.frames get transposed and names get checked/changed
sacurine$variableMetadata$annotation=rownames(sacurine$variableMetadata)
rownames(sacurine$variableMetadata)=1:nrow(sacurine$variableMetadata)
colnames(sacurine$dataMatrix)=1:ncol(sacurine$dataMatrix)
# create DatasetExperiment
DE = DatasetExperiment(data = data.frame(sacurine$dataMatrix),
sample_meta = sacurine$sampleMetadata,
variable_meta = sacurine$variableMetadata,
name = 'Sacurine data',
description = 'See ropls package documentation for details')
# print summary
DE
## A "DatasetExperiment" object
## ----------------------------
## name: Sacurine data
## description: See ropls package documentation for details
## data: 183 rows x 109 columns
## sample_meta: 183 rows x 3 columns
## variable_meta: 109 rows x 4 columns
The Sacurine dataset used within this vignette has already been pre-processed:
After signal drift and batch effect correction of intensities, each urine profile was normalized to the osmolality of the sample. Finally, the data were log10 transformed.
Since the data has already been processed the data can be visualised using Principal Component Analysis (PCA) without further pre-processing. The ropls
package automatically applies unit variance scaling (autoscaling) by default. The same approach is applied here.
# prepare model sequence
M = autoscale() + PCA(number_components = 5)
# apply model sequence to dataset
M = model_apply(M,DE)
# pca scores plots
g=list()
for (k in colnames(DE$sample_meta)) {
C = pca_scores_plot(factor_name = k)
g[[k]] = chart_plot(C,M[2])
}
# plot using cowplot
plot_grid(plotlist=g, nrow=1, align='vh', labels=c('A','B','C'))
The third plot coloured by gender (C) is identical to Figure 2 of the ropls package vignette. The structToolbox
package provides a range of PCA-related diagnostic plots, including D-statistic, scree, and loadings plots. These plots can be used to further explore the variance of the data.
C = pca_scree_plot()
g1 = chart_plot(C,M[2])
C = pca_loadings_plot()
g2 = chart_plot(C,M[2])
C = pca_dstat_plot(alpha=0.95)
g3 = chart_plot(C,M[2])
p1=plot_grid(plotlist = list(g1,g2),align='h',nrow=1,axis='b')
p2=plot_grid(plotlist = list(g3),nrow=1)
plot_grid(p1,p2,nrow=2)
The ropls
package uses its own implementation of the (O)PLS algorithms. structToolbox
uses the pls
package, so it is interesting to compare the outputs from both approaches. For simplicity only the scores
plots are compared.
# prepare model sequence
M = autoscale() + PLSDA(factor_name='gender')
M = model_apply(M,DE)
C = plsda_scores_plot(factor_name = 'gender')
chart_plot(C,M[2])
The plot is similar to fig.3 of the ropls vignette. Differences are due to inverted LV axes, a common occurrence with the NIPALS algorithm (used by both structToolbox
and ropls
) which depends on how the algorithm is initialised.
To compare the R2 values for the model in structToolbox we have to use a regression model, instead of a discriminant model. For this we convert the gender factor to a numeric variable before applying the model.
# convert gender to numeric
DE$sample_meta$gender=as.numeric(DE$sample_meta$gender)
# models sequence
M = autoscale(mode='both') + PLSR(factor_name='gender',number_components=3)
M = model_apply(M,DE)
# some diagnostic charts
C = plsr_cook_dist()
g1 = chart_plot(C,M[2])
C = plsr_prediction_plot()
g2 = chart_plot(C,M[2])
C = plsr_qq_plot()
g3 = chart_plot(C,M[2])
C = plsr_residual_hist()
g4 = chart_plot(C,M[2])
plot_grid(plotlist = list(g1,g2,g3,g4), nrow=2,align='vh')
The ropls
package automatically applies cross-validation to asses the performance of the PLSDA model. In structToolbox
this is applied separately to give more control over the approach used if desired. The default cross-validation used by the ropls
package is 7-fold cross-validation and we replicate that here.
# model sequence
M = kfold_xval(folds=7, factor_name='gender') *
(autoscale(mode='both') + PLSR(factor_name='gender'))
M = run(M,DE,r_squared())
Training set R2: 0.6975706 0.6798415 0.646671 0.6532914 0.7109769 0.670777 0.6935344
Test set Q2: 0.5460723
The validity of the model can further be assessed using permutation testing. For this we will return to a discriminant model.
# reset gender to original factor
DE$sample_meta$gender=sacurine$sampleMetadata$gender
# model sequence
M = permutation_test(number_of_permutations = 10, factor_name='gender') *
kfold_xval(folds=7,factor_name='gender') *
(autoscale() + PLSDA(factor_name='gender',number_components = 3))
M = run(M,DE,balanced_accuracy())
C = permutation_test_plot(style='boxplot')
chart_plot(C,M)+ylab('1 - balanced accuracy')
The permuted models have a balanced accuracy of around 50%, which is to be expected for a dataset with two groups. The unpermuted models have a balanced accuracy of around 90% and is therefore much better than might be expected to occur by chance.
The purpose of this vignette is to demonstrate the different functionalities and methods that are available as part of the structToolbox and reproduce the data analysis reported in Mendez et al., (2020) and Chan et al., (2016).
The 1H-NMR dataset used and described in Mendez et al., (2020) and in this vignette contains processed spectra of urine samples obtained from gastric cancer and healthy patients Chan et al., (2016). The experimental raw data is available through Metabolomics Workbench (PR000699) and the processed version is available from here as an Excel data file.
As a first step we need to reorganise and convert the Excel data file into a DatasetExperiment object. Using the openxlsx
package the file can be read directly into an R data.frame
and then manipulated as required.
url = 'https://github.com/CIMCB/MetabWorkflowTutorial/raw/master/GastricCancer_NMR.xlsx'
# read in file directly from github...
# X=read.xlsx(url)
# ...or use BiocFileCache
path = bfcrpath(bfc,url)
X = read.xlsx(path)
# sample meta data
SM=X[,1:4]
rownames(SM)=SM$SampleID
# convert to factors
SM$SampleType=factor(SM$SampleType)
SM$Class=factor(SM$Class)
# keep a numeric version of class for regression
SM$Class_num = as.numeric(SM$Class)
## data matrix
# remove meta data
X[,1:4]=NULL
rownames(X)=SM$SampleID
# feature meta data
VM=data.frame(idx=1:ncol(X))
rownames(VM)=colnames(X)
# prepare DatasetExperiment
DE = DatasetExperiment(
data=X,
sample_meta=SM,
variable_meta=VM,
description='1H-NMR urinary metabolomic profiling for diagnosis of gastric cancer',
name='Gastric cancer (NMR)')
DE
## A "DatasetExperiment" object
## ----------------------------
## name: Gastric cancer (NMR)
## description: 1H-NMR urinary metabolomic profiling for diagnosis of gastric
## cancer
## data: 140 rows x 149 columns
## sample_meta: 140 rows x 5 columns
## variable_meta: 149 rows x 1 columns
It is good practice to remove any features that may be of low quality, and to assess the quality of the data in general. In the Tutorial features with QC-RSD > 20% and where more than 10% of the features are missing are retained.
# prepare model sequence
M = rsd_filter(rsd_threshold=20,qc_label='QC',factor_name='Class') +
mv_feature_filter(threshold = 10,method='across',factor_name='Class')
# apply model
M = model_apply(M,DE)
# get the model output
filtered = predicted(M)
# summary of filtered data
filtered
## A "DatasetExperiment" object
## ----------------------------
## name: Gastric cancer (NMR)
## description: 1H-NMR urinary metabolomic profiling for diagnosis of gastric
## cancer
## data: 140 rows x 53 columns
## sample_meta: 140 rows x 5 columns
## variable_meta: 53 rows x 1 columns
Note there is an additional feature vs the the processing reported by Mendez et. al. because the filters here use >= or <= instead of > and <.
After suitable scaling and transformation PCA can be used to assess data quality. It is expected that the biological variance (samples) will be larger than the technical variance (QCs). In the workflow that we are reproducing (link) the following steps were applied:
The transformed and scaled matrix in then used as input to PCA. Using struct
we can chain all of these steps into a single model sequence.
# prepare the model sequence
M = log_transform(base = 10) +
autoscale() +
knn_impute(neighbours = 3) +
PCA(number_components = 10)
# apply model sequence to data
M = model_apply(M,filtered)
# get the transformed, scaled and imputed matrix
TSI = predicted(M[3])
# scores plot
C = pca_scores_plot(factor_name = 'SampleType')
g1 = chart_plot(C,M[4])
# loadings plot
C = pca_loadings_plot()
g2 = chart_plot(C,M[4])
plot_grid(g1,g2,align='hv',nrow=1,axis='tblr')
structToolbox
provides a number of objects for ttest, counting numbers of features etc. For brevity only the ttest is calculated for comparison with the workflow we are following (link). The QC samples need to be excluded, and the data reduced to only the GC and HE groups.
# prepare model
TT = filter_smeta(mode='include',factor_name='Class',levels=c('GC','HE')) +
ttest(alpha=0.05,mtc='fdr',factor_names='Class')
# apply model
TT = model_apply(TT,filtered)
# keep the data filtered by group for later
filtered = predicted(TT[1])
# convert to data frame
out=as_data_frame(TT[2])
# show first few features
head(out)
## t_statistic t_p_value t_significant estimate.mean.GC estimate.mean.HE
## M4 -3.5392652 0.008421042 TRUE 26.47778 51.73947
## M5 1.4296604 0.410396437 FALSE 265.11860 169.91500
## M7 2.7456506 0.051494976 FALSE 118.52558 53.98718
## M8 -2.1294198 0.178392032 FALSE 54.39535 79.26750
## M11 0.5106536 0.776939682 FALSE 201.34390 171.27949
## M14 -1.4786810 0.403091881 FALSE 61.53171 83.90250
## lower upper
## M4 -39.56162 -10.961769
## M5 -38.04747 228.454679
## M7 17.60818 111.468619
## M8 -48.20069 -1.543611
## M11 -87.30604 147.434869
## M14 -52.57754 7.835950
Splitting data into training and test sets is an important aspect of machine learning. In structToolbox
this is implemented using the split_data
object for random subsampling across the whole dataset, and stratified_split
for splitting based on group sizes, which is the approach used by Mendez et al.
# prepare model
M = stratified_split(p_train=0.75,factor_name='Class')
# apply to filtered data
M = model_apply(M,filtered)
# get data from object
train = M$training
train
## A "DatasetExperiment" object
## ----------------------------
## name: Gastric cancer (NMR)(Training set)
## description: 1H-NMR urinary metabolomic profiling for diagnosis of gastric
## cancer
## A subset of the data has been selected as a training set
## data: 62 rows x 53 columns
## sample_meta: 62 rows x 5 columns
## variable_meta: 53 rows x 1 columns
cat('\n')
test = M$testing
test
## A "DatasetExperiment" object
## ----------------------------
## name: Gastric cancer (NMR)(Testing set)
## description: 1H-NMR urinary metabolomic profiling for diagnosis of gastric
## cancer
## A subset of the data has been selected as a test set
## data: 21 rows x 53 columns
## sample_meta: 21 rows x 5 columns
## variable_meta: 53 rows x 1 columns
In Mendez et al a k-fold cross-validation is used to determine the optimal number of PLS components. 100 bootstrap iterations are used to generate confidence intervals. In strucToolbox
these are implemented using “iterator” objects, that can be combined with model objects. R2 is used as the metric for optimisation, so the PLSR
model in structToolbox will be used. For speed only 10 bootstrap iterations are used here.
# scale/transform training data
M = log_transform(base = 10) +
autoscale() +
knn_impute(neighbours = 3,by='samples')
# apply model
M = model_apply(M,train)
# get scaled/transformed training data
train_st = predicted(M)
# prepare model sequence
MS = grid_search_1d(
param_to_optimise = 'number_components',
search_values = as.numeric(c(1:6)),
model_index = 2,
factor_name = 'Class_num',
max_min = 'max') *
permute_sample_order(
number_of_permutations = 10) *
kfold_xval(
folds = 5,
factor_name = 'Class_num') *
(mean_centre(mode='sample_meta')+
PLSR(factor_name='Class_num'))
# run the validation
MS = struct::run(MS,train_st,r_squared())
#
C = gs_line()
chart_plot(C,MS)
The chart plotted shows Q2, which is comparable with Figure 13 of Mendez et al . Two components were selected by Mendez et al, so we will use that here.
To evaluate the model for discriminant analysis in structToolbox the PLSDA
model is appropriate.
# prepare the discriminant model
P = PLSDA(number_components = 2, factor_name='Class')
# apply the model
P = model_apply(P,train_st)
# charts
C = plsda_predicted_plot(factor_name='Class',style='boxplot')
g1 = chart_plot(C,P)
C = plsda_predicted_plot(factor_name='Class',style='density')
g2 = chart_plot(C,P)+xlim(c(-2,2))
C = plsda_roc_plot(factor_name='Class')
g3 = chart_plot(C,P)
plot_grid(g1,g2,g3,align='vh',axis='tblr',nrow=1, labels=c('A','B','C'))
# AUC for comparison with Mendez et al
MET = calculate(AUC(),P$y$Class,P$yhat[,1])
MET
## A "AUC" object
## --------------
## name: Area under ROC
## description:
## value: 0.9739583
Note that the default cutoff in A and B of the figure above for the PLS models in structToolbox
is 0, because groups are encoded as +/-1. This has no impact on the overall performance of the model.
A permutation test can be used to assess how likely the observed result is to have occurred by chance. In structToolbox permutation_test
is an iterator object that can be combined with other iterators and models.
# model sequence
MS = permutation_test(number_of_permutations = 20,factor_name = 'Class_num') *
kfold_xval(folds = 5,factor_name = 'Class_num') *
(mean_centre(mode='sample_meta') + PLSR(factor_name='Class_num', number_components = 2))
# run iterator
MS = struct::run(MS,train_st,r_squared())
# chart
C = permutation_test_plot(style = 'density')
chart_plot(C,MS) + xlim(c(-1,1)) + xlab('R Squared')
This plot is comparable to the bottom half of Figure 17 in Mendez et. al.. The unpermuted (true) Q2 values are consistently better than the permuted (null) models. i.e. the model is reliable.
PLS can also be used to visualise the model and interpret the latent variables.
# prepare the discriminant model
P = PLSDA(number_components = 2, factor_name='Class')
# apply the model
P = model_apply(P,train_st)
C = plsda_scores_plot(components=c(1,2),factor_name = 'Class')
chart_plot(C,P)
Regression coefficients and VIP scores can be used to estimate the importance of individual features to the PLS model. In Mendez et al bootstrapping is used to estimate the confidence intervals, but for brevity here we will skip this.
# prepare chart
C = plsda_vip_plot(level='HE')
g1 = chart_plot(C,P)
C = plsda_regcoeff_plot(level='HE')
g2 = chart_plot(C,P)
plot_grid(g1,g2,align='hv',axis='tblr',nrow=2)
The aim of this vignette is to illustrate how to apply SVM analysis for Classifying Metabolomics data.
Support vector Machines (SVM) are a commonly used method in Machine Learning. For classification tasks they are used to generate a boundary between groups of samples in the training set. As well as generating linear boundaries, SVM can be extended to exploit the use of kernels and generate complex non-linear boundaries between groups if required.
For the structToolbox
package, SVM functionality provided by the e1071 package has been incorporated into a model
object. A chart object (svm_plot_2d
) is also available to plot SVM boundaries for data with two variables.
The 1H-NMR dataset used and described in Mendez et al., (2020) and in this vignette contains processed spectra of urine samples obtained from gastric cancer and healthy patients Chan et al., (2016). The raw experimental data is available through Metabolomics Workbench (PR000699) and the processed version is available from here as an Excel data file.
For simplicity we will use a pre-processed version of the 1H-NMR “Gastric cancer” dataset using the structToolbox
package. Details in regards to pre-processing are reported in the “NMR_clinical_metabolomics” vignette of the `r Biocpkg(“structToolbox”) package.
# summary of DatasetExperiment object
DE
## A "DatasetExperiment" object
## ----------------------------
## name:
## description:
## data: 140 rows x 53 columns
## sample_meta: 140 rows x 5 columns
## variable_meta: 53 rows x 1 columns
For the purposes of illustrating the effect of SVM parameters on the boundary between groups, we reduce the data to include only the GC and HE groups and apply PLS to reduce the data to two components. We then treat the PLS scores as as a two group dataset with only two features.
# model sequence and pls model (NB data already centred)
MS = filter_smeta(mode = 'include', levels = c('GC','HE'), factor_name = 'Class') +
PLSDA(factor_name = 'Class',number_components = 2)
# apply PLS model
MS = model_apply(MS,DE)
# plot the data
C = plsda_scores_plot(factor_name = 'Class')
chart_plot(C,MS[2])
# new DatasetExperiment object from the PLS scores
DE2 = DatasetExperiment(
data = MS[2]$scores,
sample_meta = predicted(MS[1])$sample_meta,
variable_meta = data.frame('LV'=c(1,2),row.names = colnames(MS[2]$scores)),
name = 'Illustrativate SVM dataset',
description = 'Generated by applying PLS to the processed Gastric cancer (NMR) dataset'
)
DE2
## A "DatasetExperiment" object
## ----------------------------
## name: Illustrativate SVM dataset
## description: Generated by applying PLS to the processed Gastric cancer (NMR)
## dataset
## data: 83 rows x 2 columns
## sample_meta: 83 rows x 5 columns
## variable_meta: 2 rows x 1 columns
The simplest SVM model uses a linear kernel. In structToolbox
the SVM
model can be used to train and apply SVM models. A svm_plot_2d
chart object is provided for visualisation of boundaries in two dimensions.
# SVM model
M = SVM(
factor_name = 'Class',
kernel = 'linear'
)
# apply model
M = model_apply(M,DE2)
# plot boundary
C = svm_plot_2d(factor_name = 'Class')
chart_plot(C,M, DE2)
The SVM boundary is plotted in black, the margins in grey and support vectors are indicated by grey circles.
The SVM cost function applies a penalty to samples on the wrong side of the margins. A high penalty results in a narrow margin and tries to force more samples to be on the correct side of the boundary. A low penalty makes for a wider margin and is less strict about samples being misclassified. The optimal cost to use is data dependent.
# low cost
M$cost=0.01
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g1=chart_plot(C,M,DE2)
# medium cost
M$cost=0.05
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g2=chart_plot(C,M,DE2)
# high cost
M$cost=100
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g3=chart_plot(C,M,DE2)
# plot
prow <- plot_grid(
g1 + theme(legend.position="none"),
g2 + theme(legend.position="none"),
g3 + theme(legend.position="none"),
align = 'vh',
labels = c("Low cost", "Medium cost", "High cost"),
hjust = -1,
nrow = 2
)
legend <- get_legend(
# create some space to the left of the legend
g1 + guides(color = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
)
plot_grid(prow, legend, ncol=1, rel_heights = c(1, .1))
A number of different kernels can be used with support vector machines. For the structToolbox
wrapper
‘linear’, ‘polynomial’,‘radial’ and ‘sigmoid’ kernels can be specified. Using kernels allows the boundary to be
more flexible, but often require additional parameters to be specified. The best kernel to use will vary depending on the dataset, but a common choice is the radial kernel as it allows high flexibility with a single parameter.
# set a fixed cost for this comparison
M$cost=1
# linear kernel
M$kernel='linear'
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g1=chart_plot(C,M,DE2)
# polynomial kernel
M$kernel='polynomial'
M$gamma=1
M$coef0=0
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g2=chart_plot(C,M,DE2)
# rbf kernel
M$kernel='radial'
M$gamma=1
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g3=chart_plot(C,M,DE2)
# sigmoid kernel
M$kernel='sigmoid'
M$gamma=1
M$coef0=0
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g4=chart_plot(C,M,DE2)
# plot
prow <- plot_grid(
g1 + theme(legend.position="none"),
g2 + theme(legend.position="none"),
g3 + theme(legend.position="none"),
g4 + theme(legend.position="none"),
align = 'vh',
labels = c("Linear", "Polynomial", "Radial","Sigmoid"),
hjust = -1,
nrow = 2
)
legend <- get_legend(
# create some space to the left of the legend
g1 + guides(color = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
)
plot_grid(prow, legend, ncol = 1, rel_heights = c(1, .1))
The parameters of a kernel can be used to control the complexity of the boundary. Here I show how the radial kernel parameter “gamma” can be used to change the complexity of the boundary. In combination with the cost parameter (which I keep constant here) this allows for highly flexible boundary models.
# rbf kernel and cost
M$kernel = 'radial'
M$cost = 1
# low gamma
M$gamma=0.01
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g1=chart_plot(C,M,DE2)
# medium gamma
M$gamma=0.1
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g2=chart_plot(C,M,DE2)
# high gamma
M$gamma=1
M=model_apply(M,DE2)
C=svm_plot_2d(factor_name='Species')
g3=chart_plot(C,M,DE2)
# plot
prow <- plot_grid(
g1 + theme(legend.position="none"),
g2 + theme(legend.position="none"),
g3 + theme(legend.position="none"),
align = 'vh',
labels = c("Low gamma", "Medium gamma", "High gamma"),
hjust = -1,
nrow = 2
)
legend <- get_legend(
# create some space to the left of the legend
g1 + guides(color = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
)
plot_grid(prow, legend, ncol = 1, rel_heights = c(1, .1))
Note that best practice would be to select the optimal kernel parameter(s) in combination with the cost parameter (e.g. by 2d grid search) so that the best combination of both is identified.
The aim of this vignette is to conduct data preprocessing and exploratory analysis of data from the STATegra project (https://www.nature.com/articles/s41597-019-0202-7). For demonstration purposes we will focus on the Proteomics and Metabolomics datasets that are publicly available as part of the STATegra multi-omics dataset.
…the STATegra multi-omics dataset combines measurements from up to 10 different omics technologies applied to the same biological system, namely the well-studied mouse pre-B-cell differentiation. STATegra includes high-throughput measurements of chromatin structure, gene expression, proteomics and metabolomics, and it is complemented with single-cell data. Gomez-Cabrero et al
STATegra includes high-throughput measurements of chromatin structure, gene expression, proteomics and metabolomics, and it is complemented with single-cell data.
The LC-MS-based proteomics dataset from the STATegra multi-omics dataset (see Introduction) can be found on github and must be extracted from the zip file prior to data analysis.
# path to zip
zipfile = "https://raw.github.com/STATegraData/STATegraData/master/Script_STATegra_Proteomics.zip"
## retrieve from BiocFileCache
path = bfcrpath(bfc,zipfile)
temp = bfccache(bfc)
## ... or download to temp location
# path = tempfile()
# temp = tempdir()
# download.file(zipfile,path)
# unzip
unzip(path, files = "Proteomics_01_uniprot_canonical_normalized.txt", exdir=temp)
# read samples
all_data <- read.delim(file.path(temp,"Proteomics_01_uniprot_canonical_normalized.txt"), as.is = TRUE, header = TRUE, sep = "\t")
The imported data needs to be converted to DatasetExperiment
format for use with structToolbox
.
# extract data matrix
data = all_data[1:2527,51:86]
# shorten sample names
colnames(data) = lapply(colnames(data), function (x) substr(x, 27, nchar(x)))
# replace 0 with NA
data[data == 0] <- NA
# transpose
data=as.data.frame(t(data))
# prepare sample meta
SM = lapply(rownames(data),function(x) {
s=strsplit(x,'_')[[1]] # split at underscore
out=data.frame(
'treatment' = s[[1]],
'time' = substr(s[[2]],1,nchar(s[[2]])-1) ,
'batch' = substr(s[[3]],6,nchar(s[[3]])),
'condition' = substr(x,1,6) # interaction between treatment and time
)
return(out)
})
SM = do.call(rbind,SM)
rownames(SM)=rownames(data)
# convert to factors
SM$treatment=factor(SM$treatment)
SM$time=ordered(SM$time,c("0","2","6","12","18","24"))
SM$batch=ordered(SM$batch,c(1,3,4,5,6,7))
SM$condition=factor(SM$condition)
# variable meta data
VM = all_data[1:2527,c(1,6,7)]
rownames(VM)=colnames(data)
# prepare DatasetExperiment
DS = DatasetExperiment(
data = data,
sample_meta = SM,
variable_meta = VM,
name = 'STATegra Proteomics',
description = 'downloaded from: https://github.com/STATegraData/STATegraData/'
)
DS
## A "DatasetExperiment" object
## ----------------------------
## name: STATegra Proteomics
## description: downloaded from: https://github.com/STATegraData/STATegraData/
## data: 36 rows x 2527 columns
## sample_meta: 36 rows x 4 columns
## variable_meta: 2527 rows x 3 columns
A number of Reporter genes were included in the study. We plot two of them here to illustrate some trends in the data.
# find id of reporters
Ldha = which(DS$variable_meta$Gene.names=='Ldha')
Hk2 = which(DS$variable_meta$Gene.names=='Hk2')
# chart object
C = feature_boxplot(feature_to_plot=Ldha,factor_name='time',label_outliers=FALSE)
g1=chart_plot(C,DS)+ggtitle('Ldha')+ylab('expression')
C = feature_boxplot(feature_to_plot=Hk2,factor_name='time',label_outliers=FALSE)
g2=chart_plot(C,DS)+ggtitle('Hk2')+ylab('expression')
plot_grid(g1,g2,nrow=1,align='vh',axis='tblr')
The data is log2 transformed, then scaled such that the mean of the medians is equal for all conditions. These steps are available in structToolbox
using log_transform
and mean_of_medians
objects.
# prepare model sequence
M = log_transform(
base=2) +
mean_of_medians(
factor_name = 'condition')
# apply model sequence
M = model_apply(M,DS)
# get transformed data
DST = predicted(M)
The Reporter genes are plotted again for comparison.
# chart object
C = feature_boxplot(feature_to_plot=Ldha,factor_name='time',label_outliers=FALSE)
g1=chart_plot(C,DST)+ggtitle('Ldha')+ylab('log2(expression)')
C = feature_boxplot(feature_to_plot=Hk2,factor_name='time',label_outliers=FALSE)
g2=chart_plot(C,DST)+ggtitle('Hk2')+ylab('log2(expression)')
plot_grid(g1,g2,nrow=1,align='vh',axis='tblr')
Missing value filtering involves removing any feature (gene) where there are at least 3 missing values per group in at least 11 samples.
This specific filter is not in structToolbox
at this time, but can be achieved by combining filter_na_count
and filter_by_name
objects.
Specifically, the default output of filter_na_count
is changed to return a matrix of NA counts per class. This output is then connected to the ‘names’ input of filter_by_names
and converted to TRUE/FALSE using the ‘seq_fcn’ input.
The ‘seq_fcn’ function processes the NA counts before they are used as inputs for filter_by_names
. When data is passed along the model sequence it passes unchanged through the filter_na_count
object becuase the default output has been changed, so the filter_na_count
and filter_by_name
objects are working together as a single filter.
# build model sequence
M2 = filter_na_count(
threshold=2,
factor_name='condition',
predicted='na_count') + # override the default output
filter_by_name(
mode='exclude',
dimension='variable',
names='place_holder',
seq_in='names',
seq_fcn=function(x) { # convert NA count pre group to true/false
x=x>2 # more the two missing per group
x=rowSums(x)>10 # in more than 10 groups
return(x)
}
)
# apply to transformed data
M2 = model_apply(M2,DST)
# get the filtered data
DSTF = predicted(M2)
STATegra uses two imputation methods that are not available as struct
objects, so we create temporary STATegra_impute
objects to do this using some functions from the struct
package.
The first imputation method imputes missing values for any treatment where values are missing for all samples using a “random value below discovery”. We create a new struct object using set_struct_obj
in the global environment, and a “method_apply” method that implements the imputation.
# create new imputation object
set_struct_obj(
class_name = 'STATegra_impute1',
struct_obj = 'model',
stato=FALSE,
params=c(factor_sd='character',factor_name='character'),
outputs=c(imputed='DatasetExperiment'),
prototype = list(
name = 'STATegra imputation 1',
description = 'If missing values are present for all one group then they are replaced with min/2 + "random value below discovery".',
predicted = 'imputed'
)
)
# create method_apply for imputation method 1
set_obj_method(
class_name='STATegra_impute1',
method_name='model_apply',
definition=function(M,D) {
# for each feature count NA within each level
na = apply(D$data,2,function(x){
tapply(x,D$sample_meta[[M$factor_name]],function(y){
sum(is.na(y))
})
})
# count number of samples in each group
count=summary(D$sample_meta[[M$factor_name]])
# standard deviation of features within levels of factor_sd
sd = apply(D$data,2,function(x) {tapply(x,D$sample_meta[[M$factor_sd]],sd,na.rm=TRUE)})
sd = median(sd,na.rm=TRUE)
# impute or not
check=na == matrix(count,nrow=2,ncol=ncol(D)) # all missing in one class
# impute matrix
mi = D$data
for (j in 1:nrow(mi)) {
# index of group for this sample
g = which(levels(D$sample_meta[[M$factor_name]])==D$sample_meta[[M$factor_name]][j])
iv=rnorm(ncol(D),min(D$data[j,],na.rm=TRUE)/2,sd)
mi[j,is.na(mi[j,]) & check[g,]] = iv[is.na(mi[j,]) & check[g,]]
}
D$data = mi
M$imputed=D
return(M)
}
)
The second imputation method replacing missing values in any condition with exactly 1 missing value with the mean of the values for that condition. Again we create a new struct object and corresponding method for the the new object to implement the filter.
# create new imputation object
set_struct_obj(
class_name = 'STATegra_impute2',
struct_obj = 'model',
stato=FALSE,
params=c(factor_name='character'),
outputs=c(imputed='DatasetExperiment'),
prototype = list(
name = 'STATegra imputation 2',
description = 'For those conditions with only 1 NA impute with the mean of the condition.',
predicted = 'imputed'
)
)
# create method_apply for imputation method 2
set_obj_method(
class_name='STATegra_impute2',
method_name='model_apply',
definition=function(M,D) {
# levels in condition
L = levels(D$sample_meta[[M$factor_name]])
# for each feature count NA within each level
na = apply(D$data,2,function(x){
tapply(x,D$sample_meta[[M$factor_name]],function(y){
sum(is.na(y))
})
})
# standard deviation of features within levels of factor_sd
sd = apply(D$data,2,function(x) {tapply(x,D$sample_meta[[M$factor_name]],sd,na.rm=TRUE)})
sd = median(sd,na.rm=TRUE)
# impute or not
check=na == 1 # only one missing for a condition
# index of samples for each condition
IDX = list()
for (k in L) {
IDX[[k]]=which(D$sample_meta[[M$factor_name]]==k)
}
## impute
# for each feature
for (k in 1:ncol(D)) {
# for each condition
for (j in 1:length(L)) {
# if passes test
if (check[j,k]) {
# mean of samples in group
m = mean(D$data[IDX[[j]],k],na.rm=TRUE)
# imputed value
im = rnorm(1,m,sd)
# replace NA with imputed
D$data[is.na(D$data[,k]) & D$sample_meta[[M$factor_name]]==L[j],k]=im
}
}
}
M$imputed=D
return(M)
}
)
The new STATegra imputation objects can now be used in model sequences like any other struct
object. A final filter is added to remove any feature that has missing values after imputation.
# model sequence
M3 = STATegra_impute1(factor_name='treatment',factor_sd='condition') +
STATegra_impute2(factor_name = 'condition') +
filter_na_count(threshold = 3, factor_name='condition')
# apply model
M3 = model_apply(M3,DSTF)
# get imputed data
DSTFI = predicted(M3)
DSTFI
## A "DatasetExperiment" object
## ----------------------------
## name: STATegra Proteomics
## description: downloaded from: https://github.com/STATegraData/STATegraData/
## data: 36 rows x 864 columns
## sample_meta: 36 rows x 4 columns
## variable_meta: 864 rows x 3 columns
It is often useful to visualise the distribution of values across samples to verify that the transformations/normalisation/filtering etc have been effective.
The values are no longer skewed and show an approximately normal distribution. The boxplots are comparable in width with very few outliers indicated, so the transformations etc have had an overall positive effect.
PCA is used to provide a graphical representation of the data. For comparison with the outputs from STATegra a filter is included to reduce the data to include only the treated samples (IKA)
# model sequence
P = filter_smeta(mode='include',factor_name='treatment',levels='IKA') +
mean_centre() +
PCA(number_components = 2)
# apply model
P = model_apply(P,DSTFI)
# scores plots coloured by factors
g = list()
for (k in c('batch','time')) {
C = pca_scores_plot(factor_name=k,ellipse='none')
g[[k]]=chart_plot(C,P[3])
}
plot_grid(plotlist = g,nrow=1)
There does not appear to be a strong batch effect. PC1 is dominated by time point “24” and some potentially outlying points from time points “2” and “0”.
The LC-MS-based metabolomics dataset from the STATegra multi-omics dataset (see Introduction) can be found on github and must be extracted from zip file prior to data analysis.
# path to zip
zipfile = "https://raw.github.com/STATegraData/STATegraData/master/Script_STATegra_Metabolomics.zip"
## retrieve from BiocFileCache
path = bfcrpath(bfc,zipfile)
temp = bfccache(bfc)
## ... or download to temp location
# path = tempfile()
# temp = tempdir()
# download.file(zipfile,path)
# unzip
unzip(zipfile=path, files = "LC_MS_raw_data.xlsx", exdir=temp)
# read samples
data <- as.data.frame(read.xlsx(file.path(temp,"LC_MS_raw_data.xlsx"),sheet = 'Data'))
The imported data needs to be converted to DatasetExperiment
format for use with structToolbox
.
# extract sample meta data
SM = data[ ,1:8]
# add coloumn for sample type (QC, blank etc)
blanks=c(1,2,33,34,65,66)
QCs=c(3,4,11,18,25,32,35,36,43,50,57,64)
SM$sample_type='Sample'
SM$sample_type[blanks]='Blank'
SM$sample_type[QCs]='QC'
# put qc/blank labels in all factors for plotting later
SM$biol.batch[SM$sample_type!='Sample']=SM$sample_type[SM$sample_type!='Sample']
SM$time.point[SM$sample_type!='Sample']=SM$sample_type[SM$sample_type!='Sample']
SM$condition[SM$sample_type!='Sample']=SM$sample_type[SM$sample_type!='Sample']
# convert to factors
SM$biol.batch=ordered(SM$biol.batch,c('9','10','11','12','QC','Blank'))
SM$time.point=ordered(SM$time.point,c('0h','2h','6h','12h','18h','24h','QC','Blank'))
SM$condition=factor(SM$condition)
SM$sample_type=factor(SM$sample_type)
# variable meta data
VM = data.frame('annotation'=colnames(data)[9:ncol(data)])
# raw data
X = data[,9:ncol(data)]
# convert 0 to NA
X[X==0]=NA
# force to numeric; any non-numerics will become NA
X=data.frame(lapply(X,as.numeric),check.names = FALSE)
# make sure row/col names match
rownames(X)=data$label
rownames(SM)=data$label
rownames(VM)=colnames(X)
# create DatasetExperiment object
DE = DatasetExperiment(
data = X,
sample_meta = SM,
variable_meta = VM,
name = 'STATegra Metabolomics LCMS',
description = 'https://www.nature.com/articles/s41597-019-0202-7'
)
DE
## A "DatasetExperiment" object
## ----------------------------
## name: STATegra Metabolomics LCMS
## description: https://www.nature.com/articles/s41597-019-0202-7
## data: 66 rows x 152 columns
## sample_meta: 66 rows x 9 columns
## variable_meta: 152 rows x 1 columns
In the STATegra project the LCMS data was combined with GCMS data and multiblock analysis was conducted. Here only the LCMS will be explored, so the data will be processed differently in comparison to Gomez-Cabrero et al. Some basic processing steps will be applied in order to generate a valid PCA plot from the biological and QC samples.
# prepare model sequence
MS = filter_smeta(mode = 'include', levels='QC', factor_name = 'sample_type') +
knn_impute(neighbours=5) +
vec_norm() +
log_transform(base = 10)
# apply model sequence
MS = model_apply(MS, DE)
First we will use PCA to look at the QC samples in order to make an assessment of the data quality.
# pca model sequence
M = mean_centre() +
PCA(number_components = 3)
# apply model
M = model_apply(M,predicted(MS))
# PCA scores plot
C = pca_scores_plot(factor_name = 'sample_type',label_factor = 'order',points_to_label = 'all')
# plot
chart_plot(C,M[2])
The QC labelled “36” is clearly very different to the other QCs. In STATegra this QC was removed, so we will exclude it here as well. This corresponds to QC H1. STATegra also excluded QC samples measured immediately after a blank, which we will also do here.
# prepare mdoel sequence
MS = filter_smeta(
mode = 'include',
levels='QC',
factor_name = 'sample_type') +
filter_by_name(
mode = 'exclude',
dimension='sample',
names = c('1358BZU_0001QC_H1','1358BZU_0001QC_A1','1358BZU_0001QC_G1')) +
knn_impute(
neighbours=5) +
vec_norm() +
log_transform(
base = 10) +
mean_centre() +
PCA(
number_components = 3)
# apply model sequence
MS = model_apply(MS, DE)
# PCA scores plot
C = pca_scores_plot(factor_name = 'sample_type',label_factor = 'order',points_to_label = 'all')
# plot
chart_plot(C,MS[7])
Now we will plot the QC samples in context with the samples. There are several possible approaches, and we will apply the approach of applying PCA to the full dataset including the QCs. We will exclude the blanks as it is likely that they will dominate the plot if not removed. All samples from batch 12 were excluded from STATegra and we will replicate that here.
# prepare model sequence
MS = filter_smeta(
mode = 'exclude',
levels='Blank',
factor_name = 'sample_type') +
filter_smeta(
mode = 'exclude',
levels='12',
factor_name = 'biol.batch') +
filter_by_name(
mode = 'exclude',
dimension='sample',
names = c('1358BZU_0001QC_H1',
'1358BZU_0001QC_A1',
'1358BZU_0001QC_G1')) +
knn_impute(
neighbours=5) +
vec_norm() +
log_transform(
base = 10) +
mean_centre() +
PCA(
number_components = 3)
# apply model sequence
MS = model_apply(MS, DE)
# PCA scores plots
C = pca_scores_plot(factor_name = 'sample_type')
# plot
chart_plot(C,MS[8])
The QCs appear to representative of the samples, but there are strong clusters in the data, including the QC samples which have no biological variation. There is likely to be a number of ‘low quality’ features that should be excluded, so we will do that now, and use more sophisticated normalisation (PQN) and scaling methods (glog).
MS = filter_smeta(
mode = 'exclude',
levels = '12',
factor_name = 'biol.batch') +
filter_by_name(
mode = 'exclude',
dimension='sample',
names = c('1358BZU_0001QC_H1',
'1358BZU_0001QC_A1',
'1358BZU_0001QC_G1')) +
blank_filter(
fold_change = 20,
qc_label = 'QC',
factor_name = 'sample_type') +
filter_smeta(
mode='exclude',
levels='Blank',
factor_name='sample_type') +
mv_feature_filter(
threshold = 80,
qc_label = 'QC',
factor_name = 'sample_type',
method = 'QC') +
mv_feature_filter(
threshold = 50,
factor_name = 'sample_type',
method='across') +
rsd_filter(
rsd_threshold=20,
qc_label='QC',
factor_name='sample_type') +
mv_sample_filter(
mv_threshold = 50) +
pqn_norm(
qc_label='QC',
factor_name='sample_type') +
knn_impute(
neighbours=5,
by='samples') +
glog_transform(
qc_label = 'QC',
factor_name = 'sample_type') +
mean_centre() +
PCA(
number_components = 10)
# apply model sequence
MS = model_apply(MS, DE)
# PCA plots using different factors
g=list()
for (k in c('order','biol.batch','time.point','condition')) {
C = pca_scores_plot(factor_name = k,ellipse='none')
# plot
g[[k]]=chart_plot(C,MS[length(MS)])
}
plot_grid(plotlist = g,align='vh',axis='tblr',nrow=2,labels=c('A','B','C','D'))
We can see now that the QCs are tightly clustered. This indicates that the biological variance of the remaining high quality features is much greater than the technical variance represented by the QCs.
There does not appear to be a trend by measurement order (A), which is an important indicator that instrument drift throughout the run is not a large source of variation in this dataset.
There does not appear to be strong clustering related to biological batch (B).
There does not appear to be a strong trend with time (C) but this is likely to be a more subtle variation and might be masked by other sources of variance at this stage.
There is some clustering related to condition (D) but with overlap.
To further explore any trends with time, we will split the data by the condition factor and only explore the Ikaros group. Removing the condition factor variation will potentially make it easier to spot any more subtle trends. We will extract the glog transformed matrix from the previous model sequence and continue from there.
# get the glog scaled data
GL = predicted(MS[11])
# extract the Ikaros group and apply PCA
IK = filter_smeta(
mode='include',
factor_name='condition',
levels='Ikaros') +
mean_centre() +
PCA(number_components = 5)
# apply the model sequence to glog transformed data
IK = model_apply(IK,GL)
# plot the PCA scores
C = pca_scores_plot(factor_name='time.point',ellipse = 'sample')
g1=chart_plot(C,IK[3])
g2=g1 + scale_color_viridis_d() # add continuous scale colouring
plot_grid(g1,g2,nrow=2,align='vh',axis = 'tblr',labels=c('A','B'))
Colouring by groups (A) makes the time point trend difficult to see, but by adding a ggplot
continuous colour scale “viridis” (B) the trend with time along PC1 becomes much clearer.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] openxlsx_4.1.5 cowplot_1.0.0 gridExtra_2.3
## [4] ggplot2_3.3.0 BiocFileCache_1.12.0 dbplyr_1.4.3
## [7] ropls_1.20.0 Biobase_2.48.0 BiocGenerics_0.34.0
## [10] pmp_1.0.0 structToolbox_1.0.1 struct_1.0.0
## [13] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 viridisLite_0.3.0
## [3] bit64_0.9-7 foreach_1.5.0
## [5] assertthat_0.2.1 BiocManager_1.30.10
## [7] sp_1.4-2 stats4_4.0.0
## [9] blob_1.2.1 GenomeInfoDbData_1.2.3
## [11] yaml_2.2.1 impute_1.62.0
## [13] pillar_1.4.4 RSQLite_2.2.0
## [15] lattice_0.20-41 glue_1.4.1
## [17] digest_0.6.25 GenomicRanges_1.40.0
## [19] XVector_0.28.0 randomForest_4.6-14
## [21] colorspace_1.4-1 htmltools_0.4.0
## [23] Matrix_1.2-18 plyr_1.8.6
## [25] pkgconfig_2.0.3 magick_2.3
## [27] bookdown_0.19 zlibbioc_1.34.0
## [29] purrr_0.3.4 scales_1.1.1
## [31] itertools_0.1-3 tibble_3.0.1
## [33] farver_2.0.3 IRanges_2.22.2
## [35] ellipsis_0.3.1 withr_2.2.0
## [37] SummarizedExperiment_1.18.1 magrittr_1.5
## [39] crayon_1.3.4 memoise_1.1.0
## [41] evaluate_0.14 class_7.3-17
## [43] ggthemes_4.2.0 tools_4.0.0
## [45] lifecycle_0.2.0 matrixStats_0.56.0
## [47] stringr_1.4.0 S4Vectors_0.26.1
## [49] munsell_0.5.0 zip_2.0.4
## [51] DelayedArray_0.14.0 missForest_1.4
## [53] isoband_0.2.1 e1071_1.7-3
## [55] pls_2.7-2 pcaMethods_1.80.0
## [57] compiler_4.0.0 GenomeInfoDb_1.24.0
## [59] rlang_0.4.6 grid_4.0.0
## [61] RCurl_1.98-1.2 iterators_1.0.12
## [63] rappdirs_0.3.1 labeling_0.3
## [65] bitops_1.0-6 rmarkdown_2.1
## [67] gtable_0.3.0 codetools_0.2-16
## [69] DBI_1.1.0 curl_4.3
## [71] reshape2_1.4.4 R6_2.4.1
## [73] knitr_1.28 dplyr_0.8.5
## [75] bit_1.1-15.2 ontologyIndex_2.5
## [77] stringi_1.4.6 Rcpp_1.0.4.6
## [79] vctrs_0.3.0 tidyselect_1.1.0
## [81] xfun_0.14