1 Introduction

MSPrep provides a convenient set of functionalities used in the pre-analytic processing pipeline for mass spectrometry based metabolomics data. Functions are included for the following processes commonly performed prior to analysis of such data:

  1. Summarization of technical replicates (if available)
  2. Filtering of metabolites
  3. Imputation of missing values
  4. Transformation, normalization, and batch correction

The sections which follow provide an explanation of each function contained in MSPrep, those functions’ respective options, and examples of the pre-analysis pipeline using two data sets provided in the MSPrep package, MSQuant and COPD_131.

For further information, please see MSPrep—Summarization, normalization and diagnostics for processing of mass spectrometry–based metabolomic data (Hughes et al., 2014) and Pre-analytic Considerations for Mass Spectrometry-Bas ed Untargeted Metabolomics Data (Reinhold et al., 2019).

## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'

2 Expected Data Format

Data my be input as a Data Frame or SummarizedExperiment.

2.1 Data Frame

When using the functions provided by MSPrep on a data frame, the following format is expected throughout the pipeline.

Most often, two or more columns of the data frame will identify unique compounds. This may include columns which specify the mass-to-charge ratio, the retention time, or the name of each compound. Using the parameter compVars, the names of these columns should be provided to each function as a vector of character strings.

The remainder of the columns in the data frame should specify the respective abundances of each compound for each sample. It is expected that one or more identifying variables for each sample will be specified by the column name (e.g. Sample ID, batch number, or replicate). Each piece of information contained in the column names must be separated by a consistent character not present anywhere else in the column name. Using the parameter sampleVars, the sample variables present in the column names should be provided to each function as a vector of character strings specifying the order the variables appear, and the parameter separator should identify the character which separates each sample variable. Each column name may also include consistent non-identifying text at the beginning of each column name. This text should be provided to each function using the colExtraText parameter.

As an example see the provided data set msquant and its use in the pipeline below.

2.2 SummarizedExperiment

When using the functions provided by MSPrep on a SummarizedExperiment, it is expected that the data will include a single assay of abundances, rowData identifying characteristics of each metabolite, and colData specifying characteristics of each sample. The parameters discussed in the previous section may be ignored.

3 Example One - Technical Data Set

data(msquant)
colnames(msquant)[3]
## [1] "Neutral_Operator_Dif_Pos_1x_O1_A_01"

Above is the third column name in msquant. The first part “Neutral_Operator_Dif_Pos_” will not be used in this analysis, so we will assign it to colExtraText parameter. The next value, “1x”, is the spike-in value. The following value, “O1”, specifies the sample’s batch. The remaining values, “A” and “01”, specify the replicate and subject IDs. We will pass these sample variables to each function with the sampleVars parameter. Finally, note that msquant contains two columns, mz and rt which identify each compounds’ mass-to-charge ratio and retention time, respectively. We will pass these column names to each function using the compVars parameter.

With our data in this format, we can start the pipeline.

3.1 Summarizing

This step summarizes the technical replicates using the following procedure for each compound in each batch.

  1. If there are less than a minimum proportion of the values found among the replicates (usually one or zero), leave the value empty. Otherwise proceed.
  2. Calculate the coefficient of variation between the replicates using \(c_v = \frac{\sigma}{\mu}\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation.
  3. For three replicates, if the coefficient of variation is above a specified level, use the median value for the compound, to correct for the large dispersion.
  4. Otherwise, use the mean value of the replicates for the compound.

If the name of variable specifying replicate in sampleVars for a data frame or the column data of a SummarizedExperiment, specify the name of the variable using the replicate parameter.

The technical replicates in MSQuant are summarized below using a CV maximum of .50 and a minimum proportion present of 1 out of 3 replicates. Note that in the MSQuant dataset, missing values are represented as ‘1’, which is specified in the missingValue argument below. msSummarize() will replace these missing values with ‘0’ in all instances where the summarization algorithm determines the values to be truly missing.

summarizedDF <- msSummarize(msquant,
                            cvMax = 0.50,
                            minPropPresent = 1/3,
                            compVars = c("mz", "rt"),
                            sampleVars = c("spike", "batch", "replicate", 
                                           "subject_id"),
                            colExtraText = "Neutral_Operator_Dif_Pos_",
                            separator = "_",
                            missingValue = 1)
## # A tibble: 10 × 6
##       mz rt        `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
##    <dbl> <chr>          <dbl>      <dbl>      <dbl>      <dbl>
##  1  78.0 21.362432    120937.    121018.    118425.    118527.
##  2  80.0 9.772964      63334.     63415.     69530      69631.
##  3  80.1 0.6517916     78601      78668.    154636     154737.
##  4  83.1 1.3226668     58473.     58554.    298703.    298804.
##  5  84.1 7.864             0          0          0          0 
##  6  85.1 22.307388    348686     348753.    342413     342420.
##  7  85.1 0.7104762         0          0          0          0 
##  8  85.1 1.3228333    335092.    335172.   1753681    1753782.
##  9  86.0 22.587963    226792.    226872.    240137     240238.
## 10  87.0 1.702             0          0     674771.    674872.

3.2 Filtering

Following the summarization of technical replicates, the data can be filtered to only contain compounds present in a specified proportion of samples. To do so, the msFilter() function is provided. By default, msFilter() uses the 80% rule and filters the compounds in the data set leaving only those which are present in 80% of the samples.

filteredDF <- msFilter(summarizedDF,
                       filterPercent = 0.8,
                       compVars = c("mz", "rt"),
                       sampleVars = c("spike", "batch", "subject_id"),
                       separator = "_")
## # A tibble: 10 × 6
##       mz rt        `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
##    <dbl> <chr>          <dbl>      <dbl>      <dbl>      <dbl>
##  1  78.0 21.362432    120937.    121018.    118425.    118527.
##  2  80.0 9.772964      63334.     63415.     69530      69631.
##  3  80.1 0.6517916     78601      78668.    154636     154737.
##  4  86.0 22.587963    226792.    226872.    240137     240238.
##  5  90.0 3.0758798         0          0     148358.    148460.
##  6  99.1 22.379221   6216101    6216182.   6137392    6137493.
##  7  99.9 21.431955    117396     117477.    110735     110836.
##  8 102.  3.076125      92308      92389.         0          0 
##  9 104.  9.770778      78100      78181.     61308      61409.
## 10 107.  22.5569      134960     135039.     83090.     83281.

3.3 Imputation

Next, depending on the downstream analysis, you may need to impute missing data. Three imputation methods are provided:

  1. half-min (half the minimum value)
  2. bpca (Bayesian PCA),
  3. knn (k-nearest neighbors)

Half-min imputes each missing value as one half of the minimum observed value for that compound. Half-min imputation performs faster than other methods, but may introduce bias. The BPCA algorithm, provided by the pcaMethods package, estimates the missing value by a linear combination of principle axis vectors, with the number of principle components specified by the user with the nPcs argument. KNN uses a K-Nearest Neighbors algorithm provided by the VIM package. Users may provide their preferred value of k using the kKnn argument. By default, KNN uses samples as neighbors, but by specifying compoundsAsNeighbors = TRUE, compounds will be used as neighbors instead. Note that this is significantly slower than using samples as neighbors and may take several minutes or more to run depending on the size of your data set.

imputedDF <- msImpute(filteredDF,
                      imputeMethod = "knn",
                      compVars = c("mz", "rt"),
                      sampleVars = c("spike", "batch", "subject_id"),
                      separator = "_",
                      returnToSE = FALSE,
                      missingValue = 0)
## # A tibble: 10 × 6
##       mz rt        `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
##    <dbl> <chr>          <dbl>      <dbl>      <dbl>      <dbl>
##  1  78.0 21.362432    120937.    121018.    118425.    118527.
##  2  80.0 9.772964      63334.     63415.     69530      69631.
##  3  80.1 0.6517916     78601      78668.    154636     154737.
##  4  86.0 22.587963    226792.    226872.    240137     240238.
##  5  90.0 3.0758798    121246     116549.    148358.    148460.
##  6  99.1 22.379221   6216101    6216182.   6137392    6137493.
##  7  99.9 21.431955    117396     117477.    110735     110836.
##  8 102.  3.076125      92308      92389.   1215434.    908742.
##  9 104.  9.770778      78100      78181.     61308      61409.
## 10 107.  22.5569      134960     135039.     83090.     83281.

3.4 Normalization

In order to make comparisons between samples, the data may need to be transformed and normalized This step transforms the data and performs one of eight normalization strategies:

  1. Median
  2. ComBat
  3. Quantile
  4. Quantile + ComBat
  5. Median + ComBat
  6. CRMN
  7. RUV
  8. SVA

msNormalize() also provides options to transform the data using a log base 10 (default), log base 2, or natural log transformation. To select either option, or to forego transformation, use the transform argument to specify "log10", "log2", "ln", or "none" respectively.

Quantile normalization, provided by the preprocessCore package, ensures that the provided samples have the same quantiles. Median normalization subtracts the median abundance of each sample from every compound in that sample, thereby aligning the median abundance of each sample at 0.

ComBat, provided by the sva package, is an empirical Bayes batch effect correction algorithm to remove unwanted batch effects and may be used separately or in conjunction with quantile or median normalization. When using ComBat, a sampleVar called “batch” must be present for data frames, or for SummarizedExperiment “batch” must be present in the columns names of colData. Or, if the sample variable corresponding to batch differs from “batch”, you may specify the batch variables name using the batch parameter.

RUV and SVA normalization each estimate a matrix of unobserved factors of importance using different methods of supervised factor analysis. For both methods, known covariates (e.g. sex, age) should be provided using the covariatesOfInterest parameter, and must correspond to the sample variables specified by sampleVars in the case of a data frame or colData in the case of a SummarizedExperiment. For RUV normalization, the kRUV argument specifies the number of factors on which the data is normalized.

Cross-Contribution Compensating Multiple Standard Normalization (CRMN), provided by the crmn package, normalizes based on internal standards. The sample variable identifying internal standards must be provided using the covariatesOfInterest parameter. For experiments which have control compounds, a vector of the row numbers containing them should be provided in the controls variable. If a vector of control compounds is not provided, data driven controls will be generated.

Below, we apply a log base 10 transformation, quantile normalization, and ComBat batch correction.

normalizedDF <- msNormalize(imputedDF,
                            normalizeMethod = "quantile + ComBat",
                            transform = "log10",
                            compVars = c("mz", "rt"),
                            sampleVars = c("spike", "batch", "subject_id"),
                            covariatesOfInterest = c("spike"),
                            separator = "_")
## # A tibble: 10 × 6
##       mz rt        `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
##    <dbl> <chr>          <dbl>      <dbl>      <dbl>      <dbl>
##  1  78.0 21.362432       5.07       5.08       5.03       5.04
##  2  80.0 9.772964        4.80       4.80       4.82       4.83
##  3  80.1 0.6517916       4.97       4.97       5.08       5.09
##  4  86.0 22.587963       5.39       5.39       5.29       5.31
##  5  90.0 3.0758798       5.12       5.10       5.11       5.12
##  6  99.1 22.379221       6.82       6.82       6.82       6.82
##  7  99.9 21.431955       5.06       5.06       5.01       5.03
##  8 102.  3.076125        5.05       5.05       5.68       5.63
##  9 104.  9.770778        4.90       4.90       4.82       4.83
## 10 107.  22.5569         5.09       5.09       4.95       4.96

3.5 Pipeline

Often, all the above steps will need to be conducted. This can be done in a single statement using the msPrepare() function. Simply provide the function the same arguments that you would provide to the individual functions.

preparedDF <- msPrepare(msquant,
                        minPropPresent = 1/3,
                        missingValue = 1,
                        filterPercent = 0.8,
                        imputeMethod = "knn",
                        normalizeMethod = "quantile + ComBat",
                        transform = "log10",
                        covariatesOfInterest = c("spike"),
                        compVars = c("mz", "rt"),
                        sampleVars = c("spike", "batch", "replicate", 
                                       "subject_id"),
                        colExtraText = "Neutral_Operator_Dif_Pos_",
                        separator = "_")
## # A tibble: 10 × 6
##       mz rt        `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
##    <dbl> <chr>          <dbl>      <dbl>      <dbl>      <dbl>
##  1  78.0 21.362432       5.07       5.08       5.03       5.04
##  2  80.0 9.772964        4.80       4.80       4.82       4.83
##  3  80.1 0.6517916       4.97       4.97       5.08       5.09
##  4  86.0 22.587963       5.39       5.39       5.29       5.31
##  5  90.0 3.0758798       5.12       5.10       5.11       5.12
##  6  99.1 22.379221       6.82       6.82       6.82       6.82
##  7  99.9 21.431955       5.06       5.06       5.01       5.03
##  8 102.  3.076125        5.05       5.05       5.68       5.63
##  9 104.  9.770778        4.90       4.90       4.82       4.83
## 10 107.  22.5569         5.09       5.09       4.95       4.96

4 Example Two - Biological Data Set

Next, the functionality of MSPrep will be demonstrated using the included data COPD_131. The raw data set can be found here, at Metabolomics Workbench. Note that only a portion of the compounds in the original COPD_131 data set are included in this package in order to limit file size and example run time. Generally, the number of compounds in a data set will greatly exceed the number of samples, and the functions included in this package will take more time to process the data.

This data set differs from msquant in several ways. First, it has a column Compound.Name which specifies compound names, and the mass-to-charge ratio and retention-time columns are named Mass and Retention.Time respectively. Second, this data set does not have spike-ins or batches (but it does have technical replicates). Finally, the data has already been transformed, so that step of the pipeline will be excluded.

4.1 Summarizing

Next, the technical replicates in COPD_131 need to be summarized. This process is largely the same as before, but with different column names passed to compVars, so msSummarize() is called as follows:

data(COPD_131)

summarizedSE131 <- msSummarize(COPD_131,
                               cvMax = 0.5,
                               minPropPresent = 1/3,
                               replicate = "replicate",
                               compVars = c("Mass", "Retention.Time", 
                                            "Compound.Name"),
                               sampleVars = c("subject_id", "replicate"),
                               colExtraText = "X",
                               separator = "_",
                               returnToSE = TRUE)

4.2 Filtering

Again, this process is largely the same as before, choosing a filter percentage of 0.8. So, we call msFilter() as follows:

filteredSE131 <- msFilter(summarizedSE131,
                          filterPercent = 0.8)

4.3 Imputing

For this example, msImpute() will be called using Bayesian PCA using three principle components to impute the missing values for the data.

imputedSE131 <- msImpute(filteredSE131,
                         imputeMethod = "bpca",
                         nPcs = 3,
                         missingValue = 0)

4.4 Normalizing

For this example, msNormalize() will be called using median normalization with no transformation applied.

normalizedSe131 <- msNormalize(imputedSE131,
                               normalizeMethod = "median",
                               transform = "none")

4.5 Pipeline

As with the previous example, the above steps can be performed in a pipeline using the msPrepare() function. To skip the transformation step of the pipeline, set the transform parameter to “none” (note that the same can be done for imputationMethod and normalizationMethod).

preparedSE <- msPrepare(COPD_131,
                        cvMax = 0.5,
                        minPropPresent = 1/3,
                        compVars = c("Mass", "Retention.Time", 
                                     "Compound.Name"),
                        sampleVars = c("subject_id", "replicate"),
                        colExtraText = "X",
                        separator = "_",
                        filterPercent = 0.8,
                        imputeMethod = "bpca",
                        normalizeMethod = "median",
                        transform = "none",
                        nPcs = 3,
                        missingValue = 0,
                        returnToSE = TRUE)
## Summarizing
## Filtering
## Imputing
## Normalizing

5 Session Info

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MSPrep_1.10.0    BiocStyle_2.28.0
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.1.3                   bitops_1.0-7               
##   [3] rlang_1.1.0                 magrittr_2.0.3             
##   [5] matrixStats_0.63.0          e1071_1.7-13               
##   [7] compiler_4.3.0              RSQLite_2.3.1              
##   [9] mgcv_1.8-42                 png_0.1-8                  
##  [11] vctrs_0.6.2                 sva_3.48.0                 
##  [13] stringr_1.5.0               crayon_1.5.2               
##  [15] pkgconfig_2.0.3             fastmap_1.1.1              
##  [17] XVector_0.40.0              utf8_1.2.3                 
##  [19] rmarkdown_2.21              preprocessCore_1.62.0      
##  [21] itertools_0.1-3             purrr_1.0.1                
##  [23] bit_4.0.5                   xfun_0.39                  
##  [25] randomForest_4.7-1.1        zlibbioc_1.46.0            
##  [27] cachem_1.0.7                GenomeInfoDb_1.36.0        
##  [29] jsonlite_1.8.4              blob_1.2.4                 
##  [31] DelayedArray_0.26.0         BiocParallel_1.34.0        
##  [33] parallel_4.3.0              R6_2.5.1                   
##  [35] bslib_0.4.2                 stringi_1.7.12             
##  [37] vcd_1.4-11                  limma_3.56.0               
##  [39] ranger_0.15.1               genefilter_1.82.0          
##  [41] car_3.1-2                   boot_1.3-28.1              
##  [43] lmtest_0.9-40               GenomicRanges_1.52.0       
##  [45] jquerylib_0.1.4             Rcpp_1.0.10                
##  [47] bookdown_0.33               SummarizedExperiment_1.30.0
##  [49] iterators_1.0.14            knitr_1.42                 
##  [51] zoo_1.8-12                  IRanges_2.34.0             
##  [53] splines_4.3.0               Matrix_1.5-4               
##  [55] nnet_7.3-18                 tidyselect_1.2.0           
##  [57] abind_1.4-5                 yaml_2.3.7                 
##  [59] codetools_0.2-19            doRNG_1.8.6                
##  [61] lattice_0.21-8              tibble_3.2.1               
##  [63] withr_2.5.0                 Biobase_2.60.0             
##  [65] KEGGREST_1.40.0             evaluate_0.20              
##  [67] survival_3.5-5              proxy_0.4-27               
##  [69] Biostrings_2.68.0           pillar_1.9.0               
##  [71] BiocManager_1.30.20         MatrixGenerics_1.12.0      
##  [73] carData_3.0-5               rngtools_1.5.2             
##  [75] VIM_6.2.2                   foreach_1.5.2              
##  [77] stats4_4.3.0                generics_0.1.3             
##  [79] sp_1.6-0                    RCurl_1.98-1.12            
##  [81] S4Vectors_0.38.0            laeken_0.5.2               
##  [83] xtable_1.8-4                class_7.3-21               
##  [85] glue_1.6.2                  tools_4.3.0                
##  [87] robustbase_0.95-1           data.table_1.14.8          
##  [89] locfit_1.5-9.7              annotate_1.78.0            
##  [91] XML_3.99-0.14               grid_4.3.0                 
##  [93] tidyr_1.3.0                 missForest_1.5             
##  [95] edgeR_3.42.0                AnnotationDbi_1.62.0       
##  [97] colorspace_2.1-0            nlme_3.1-162               
##  [99] GenomeInfoDbData_1.2.10     crmn_0.0.21                
## [101] cli_3.6.1                   fansi_1.0.4                
## [103] dplyr_1.1.2                 pcaMethods_1.92.0          
## [105] DEoptimR_1.0-12             sass_0.4.5                 
## [107] digest_0.6.31               BiocGenerics_0.46.0        
## [109] memoise_2.0.1               htmltools_0.5.5            
## [111] lifecycle_1.0.3             httr_1.4.5                 
## [113] bit64_4.0.5                 MASS_7.3-59