Package: MsBackendSql
Authors: Johannes Rainer [aut, cre] (https://orcid.org/0000-0002-6977-7147),
Chong Tang [ctb],
Laurent Gatto [ctb] (https://orcid.org/0000-0002-1520-2268)
Compiled: Tue Oct 24 18:01:53 2023
The Spectra Bioconductor package provides a flexible and
expandable infrastructure for Mass Spectrometry (MS) data. The package supports
interchangeable use of different backends that provide additional file support
or different ways to store and represent MS data. The
MsBackendSql package provides backends to store data from whole
MS experiments in SQL databases. The data in such databases can be easily (and
efficiently) accessed using Spectra objects that use the MsBackendSql class
as an interface to the data in the database. Such Spectra objects have a
minimal memory footprint and hence allow analysis of very large data sets even
on computers with limited hardware capabilities. For certain operations, the
performance of this data representation is superior to that of other low-memory
(on-disk) data representations such as Spectra’s MsBackendMzR backend.
Finally, MsBackendSql also supports remote data access, e.g. to a central
database server hosting several large MS data sets.
The package can be installed with the BiocManager package. To install
BiocManager use install.packages("BiocManager") and, after that,
BiocManager::install("MsBackendSql") to install this package.
MsBackendSql SQL databases
MsBackendSql SQL databases can be created in two ways: by importing (raw) MS
data from MS data files with the createMsBackendSqlDatabase function, or with
the backendInitialize function by providing, in addition to the database
connection, the full MS data to import as a DataFrame. In the first example we
use the createMsBackendSqlDatabase function, which takes a connection to an
(empty) database and the names of the files from which the data should be
imported as input parameters, creates all necessary database tables and stores
the full data in the database. Below we create an empty SQLite database (in a
temporary file) and fill it with MS data from two mzML files (from the msdata
package).
library(RSQLite)
dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)
library(MsBackendSql)
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
createMsBackendSqlDatabase(con, fls)
By default the m/z and intensity values are stored as BLOB data types in the database. This speeds up the extraction of peaks data from the database, but it does not, for example, allow filtering peaks by m/z value directly in the database. As an alternative, the individual m/z and intensity values can be stored in separate rows of the database table. This long table format results, however, in considerably larger databases (with potentially poorer performance). Note also that the code and backend are optimized for MySQL/MariaDB databases by taking advantage of table partitioning and specialized table storage options. Any other SQL database server is also supported, as are portable, self-contained SQLite databases.
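The difference between the two storage layouts can be sketched in base R. This is only an illustration under stated assumptions: the toy peak values, the table and column names, and the use of writeBin as the binary encoding are all made up here and are not necessarily what MsBackendSql does internally.

```r
# Toy peak data for two spectra (values are made up for illustration).
toy_peaks <- list(
    list(mz = c(105.0347, 105.0362, 105.0376), intensity = c(0, 164, 0)),
    list(mz = c(110.1, 110.2), intensity = c(12, 48))
)

# BLOB-style layout: each numeric vector is packed into one raw vector,
# giving one row per spectrum. writeBin() is only an example encoding.
to_blob <- function(x) writeBin(as.numeric(x), raw())
from_blob <- function(b) readBin(b, what = "numeric", n = length(b) / 8L)

blob_table <- data.frame(spectrum_id = seq_along(toy_peaks))
blob_table$mz <- lapply(toy_peaks, function(p) to_blob(p$mz))
blob_table$intensity <- lapply(toy_peaks, function(p) to_blob(p$intensity))

# Long layout: one row per peak.
long_table <- do.call(rbind, lapply(seq_along(toy_peaks), function(i) {
    data.frame(spectrum_id = i,
               mz = toy_peaks[[i]]$mz,
               intensity = toy_peaks[[i]]$intensity)
}))

nrow(blob_table)    # number of spectra
nrow(long_table)    # number of peaks

# Filtering by m/z works directly on the long table ...
long_table[long_table$mz > 105.035 & long_table$mz < 110.15, ]
# ... whereas a BLOB has to be decoded before its values can be inspected.
from_blob(blob_table$mz[[1]])
```

The long table grows with the number of peaks rather than the number of spectra, which is why it tends to be much larger, while the BLOB layout returns all peaks of a spectrum from a single row.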
The MsBackendSql package provides two backends to interact with such
databases: the (default) MsBackendSql class and the MsBackendOfflineSql class.
The latter inherits all properties and functions from the former, but it does
not store the connection to the database within the object; instead it connects
to (and disconnects from) the database in each function call. This allows the
MsBackendOfflineSql to be used in parallel processing setups and to be
saved and loaded (e.g. using save and saveRDS). Thus, for most applications the
MsBackendOfflineSql is likely the preferred backend to SQL databases.
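The connect-per-call idea can be sketched with plain DBI and RSQLite calls. This is a minimal illustration of the pattern, not the package's implementation: the table toy_spectra, the handle list and the function query_offline are hypothetical names introduced here.

```r
library(DBI)
library(RSQLite)

# Create a small example database (table and column names are made up).
toy_db <- tempfile(fileext = ".sqlite")
toy_con <- dbConnect(SQLite(), toy_db)
dbWriteTable(toy_con, "toy_spectra",
             data.frame(id = 1:3, ms_level = c(1L, 2L, 2L)))
dbDisconnect(toy_con)

# "Offline" handle: only the information needed to connect is stored,
# never an open connection.
handle <- list(dbname = toy_db)

# Each call opens its own connection and closes it again on exit.
query_offline <- function(handle, sql) {
    con <- dbConnect(SQLite(), handle$dbname)
    on.exit(dbDisconnect(con))
    dbGetQuery(con, sql)
}

# Because the handle is plain data it survives a save/load round trip.
rds_file <- tempfile(fileext = ".rds")
saveRDS(handle, rds_file)
handle2 <- readRDS(rds_file)
toy_res <- query_offline(handle2,
                         "SELECT ms_level FROM toy_spectra ORDER BY id")
toy_res
```

An object holding an open connection could not be restored this way, because the connection is only valid within the R process that created it. The price of the pattern is the extra connect and disconnect step in every call.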
To access the data in the database we create below a Spectra object providing
the connection to the database in the constructor call and specifying to use the
MsBackendSql as backend using the source parameter.
sps <- Spectra(con, source = MsBackendSql())
sps
## MSn data (Spectra) with 1862 spectra in a MsBackendSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/RtmpIopHGY/file2ea1fb1043c298
As an alternative, the MsBackendOfflineSql backend could be used, which
supports serializing the data to disk and, if supported by the SQL
database, also parallel processing. Thus, for most use cases the
MsBackendOfflineSql should be used instead. See further below for more
information on that backend.
Spectra objects also allow changing the backend to any other backend
(extending MsBackend) using the setBackend function. Below we use this
function to first load all data into memory by changing from the MsBackendSql
to a MsBackendMemory.
sps_mem <- setBackend(sps, MsBackendMemory())
sps_mem
## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 34 more variables/columns.
## Processing:
##  Switch backend from MsBackendSql to MsBackendMemory [Tue Oct 24 18:02:01 2023]
With this function it is also possible to change from any backend to a
MsBackendSql, in which case a new database is created and all data from the
originating backend is stored in this database. To change the backend to an
MsBackendOfflineSql we need to provide the connection information to the SQL
database as additional parameters. These are the same as those needed to
connect to the database through a dbConnect call and include the database
driver to be used (parameter drv) as well as additional parameters such as the
database name and, if required, the user name, host etc. (see ?dbConnect for more
information). In the simple example below we store the data in a SQLite
database and thus only need to provide the database name, which corresponds to
the SQLite database file. In our example we store the data in a temporary
file. Importantly, we also need to disable parallel processing by specifying
BPPARAM = SerialParam() since (most) SQL databases don’t support parallel data
insertion.
sps2 <- setBackend(sps_mem, MsBackendOfflineSql(), drv = SQLite(),
                   dbname = tempfile(), BPPARAM = SerialParam())
## Warning in .create_from_spectra_data(dbcon, data = data, ...): Replacing
## original column "spectrum_id_"
sps2
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/RtmpIopHGY/file2ea1fb317cdb06
## Processing:
##  Switch backend from MsBackendSql to MsBackendMemory [Tue Oct 24 18:02:01 2023]
##  Switch backend from MsBackendMemory to MsBackendOfflineSql [Tue Oct 24 18:02:02 2023]
As with any other Spectra object, we can retrieve the available spectra
variables using the spectraVariables function.
spectraVariables(sps)
##  [1] "msLevel"                  "rtime"                   
##  [3] "acquisitionNum"           "scanIndex"               
##  [5] "dataStorage"              "dataOrigin"              
##  [7] "centroided"               "smoothed"                
##  [9] "polarity"                 "precScanNum"             
## [11] "precursorMz"              "precursorIntensity"      
## [13] "precursorCharge"          "collisionEnergy"         
## [15] "isolationWindowLowerMz"   "isolationWindowTargetMz" 
## [17] "isolationWindowUpperMz"   "peaksCount"              
## [19] "totIonCurrent"            "basePeakMZ"              
## [21] "basePeakIntensity"        "ionisationEnergy"        
## [23] "lowMZ"                    "highMZ"                  
## [25] "mergedScan"               "mergedResultScanNum"     
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"  
## [29] "injectionTime"            "filterString"            
## [31] "spectrumId"               "ionMobilityDriftTime"    
## [33] "scanWindowLowerLimit"     "scanWindowUpperLimit"    
## [35] "spectrum_id_"
The MS peak data can be accessed using either the mz, intensity or
peaksData functions. Below we extract the peaks matrix of the 5th spectrum and
display the first 6 rows.
peaksData(sps)[[5]] |>
    head()
##            mz intensity
## [1,] 105.0347         0
## [2,] 105.0362       164
## [3,] 105.0376         0
## [4,] 105.0391         0
## [5,] 105.0405       328
## [6,] 105.0420         0
All data (peaks data or spectra variables) are always retrieved on the fly
from the database, which results in a minimal memory footprint for the Spectra
object.
print(object.size(sps), units = "KB")
## 91.4 Kb
The backend also supports adding spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum.
sps$rtime <- sps$rtime + 10
Such operations do not, however, change the data in the database (which is always considered read-only); the changes are cached locally within the backend object (in memory). The in-memory size of the object is thus larger after changing that spectra variable.
print(object.size(sps), units = "KB")
## 106 Kb
Such $<- operations can also be used to cache spectra variables
(temporarily) in memory, which can potentially improve performance. Below we
measure the time it takes to extract the MS level of each spectrum from the
database, then cache the MS levels in memory using $msLevel <- and measure the
time it takes to extract this cached variable.
system.time(msLevel(sps))
##    user  system elapsed 
##   0.013   0.000   0.013
sps$msLevel <- msLevel(sps)
system.time(msLevel(sps))
##    user  system elapsed 
##   0.003   0.000   0.004
We can also use the reset function to reset the data to its original state
(this will cause any local spectra variables to be deleted and the backend to be
initialized with the original data in the database).
sps <- reset(sps)
To use the MsBackendOfflineSql backend we need to provide all information
required to connect to the database along with the database driver to the
Spectra function. Which parameters are required to connect to the database
depends on the SQL database and the driver used. In our example the data is
stored in a SQLite database, hence we use the SQLite() database driver and
only need to provide the database name with the dbname parameter. For a
MySQL/MariaDB database we would use the MariaDB() driver and would have to
provide the database name, user name, password as well as the host name and port
through which the database is accessible.
sps_off <- Spectra(dbfile, drv = SQLite(),
                   source = MsBackendOfflineSql())
sps_off
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/RtmpIopHGY/file2ea1fb1043c298
This backend provides exactly the same functionality as MsBackendSql, with the
difference that the connection to the database is opened and closed for each
function call. While this leads to slightly lower performance, it allows
serializing the object (i.e. saving/loading the object to/from disk) and using
it (and hence the Spectra object) in a parallel processing setup. In contrast,
for the MsBackendSql parallel processing is disabled since it is not possible
to share the active backend connection within the object across different
parallel processes.
Below we compare the performance of the two backends. The performance difference results from opening and closing the database connection for each call. Note that this also depends on the SQL server being used. For SQLite databases there is almost no overhead.
library(microbenchmark)
microbenchmark(msLevel(sps), msLevel(sps_off))
## Unit: milliseconds
##              expr       min       lq     mean   median       uq      max neval
##      msLevel(sps)  9.995385 10.72235 11.37150 10.99907 11.34193 24.72228   100
##  msLevel(sps_off) 12.372394 12.87954 13.64058 13.56651 14.04818 18.59363   100
##  cld
##   a 
##    b
The need to retrieve any spectra data on-the-fly from the database will have an
impact on the performance of the data access functions of Spectra objects using
the MsBackendSql backends. To evaluate this impact we next compare the
performance of the MsBackendSql to other Spectra backends, specifically the
MsBackendMzR, which is the default backend to read and represent raw MS data,
and the MsBackendMemory backend, which keeps all MS data in memory (and is thus
not suggested for larger MS experiments). Like the MsBackendMzR, the
MsBackendSql keeps only a limited amount of data in memory. These
on-disk backends thus need to retrieve spectra and MS peaks data on-the-fly
from either the original raw data files (in the case of the MsBackendMzR) or
from the SQL database (in the case of the MsBackendSql). The in-memory backend
MsBackendMemory is expected to provide the fastest data access since all data
is kept in memory.
Below we thus create Spectra objects from the same data but using the
different backends.
sps <- Spectra(con, source = MsBackendSql())
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_im <- setBackend(sps_mzr, backend = MsBackendMemory())
First we compare the memory footprint of the 3 backends.
print(object.size(sps), units = "KB")
## 91.4 Kb
print(object.size(sps_mzr), units = "KB")
## 386.5 Kb
print(object.size(sps_im), units = "KB")
## 54494.3 Kb
The MsBackendSql has the lowest memory footprint of all 3 backends because it
does not keep any data in memory. The MsBackendMzR keeps all spectra
variables, except the MS peaks data, in memory and has thus a larger size. The
MsBackendMemory keeps all data (including the MS peaks data) in memory and has
thus the largest size in memory.
Next we compare the time needed to extract the MS level of each spectrum from
the 3 different Spectra objects.
library(microbenchmark)
microbenchmark(msLevel(sps),
               msLevel(sps_mzr),
               msLevel(sps_im))
## Unit: microseconds
##              expr       min         lq        mean     median         uq
##      msLevel(sps) 10648.207 11623.8505 12065.04997 11926.5265 12194.3610
##  msLevel(sps_mzr)   631.325   733.4795   764.11065   749.8215   806.2045
##   msLevel(sps_im)    17.918    27.4800    42.31061    44.3885    54.8675
##        max neval cld
##  21803.986   100 a  
##   1040.090   100  b 
##     72.677   100   c
Extracting MS levels is thus slowest for the MsBackendSql, which is not
surprising because both other backends keep this data in memory while the
MsBackendSql needs to retrieve it from the database.
We next compare the performance to access the full peaks data from each
Spectra object.
microbenchmark(peaksData(sps, BPPARAM = SerialParam()),
               peaksData(sps_mzr, BPPARAM = SerialParam()),
               peaksData(sps_im, BPPARAM = SerialParam()), times = 10)
## Unit: milliseconds
##                                         expr        min         lq       mean
##      peaksData(sps, BPPARAM = SerialParam()) 142.335975 169.889121 246.768938
##  peaksData(sps_mzr, BPPARAM = SerialParam()) 765.123873 782.944718 844.694926
##   peaksData(sps_im, BPPARAM = SerialParam())   3.210359   3.314762   6.206017
##     median         uq        max neval cld
##  177.32765 193.921547  580.26182    10 a  
##  801.98875 817.758543 1271.08248    10  b 
##    4.06397   4.684763   25.70204    10   c
As expected, the MsBackendMemory has the fastest access to the full peaks
data. The MsBackendSql, however, outperforms the MsBackendMzR, providing faster
access to the m/z and intensity values.
Performance can be improved for the MsBackendMzR using parallel
processing. Note that the MsBackendSql does not support parallel
processing and thus parallel processing is (silently) disabled in functions such
as peaksData.
m2 <- MulticoreParam(2)
microbenchmark(peaksData(sps, BPPARAM = m2),
               peaksData(sps_mzr, BPPARAM = m2),
               peaksData(sps_im, BPPARAM = m2), times = 10)
## Unit: microseconds
##                              expr        min         lq       mean     median
##      peaksData(sps, BPPARAM = m2) 165463.864 176699.449 273255.760 204437.353
##  peaksData(sps_mzr, BPPARAM = m2) 687955.252 694908.432 867258.072 742413.981
##   peaksData(sps_im, BPPARAM = m2)    713.153   1032.549   1955.433   2030.657
##          uq         max neval cld
##  222395.037  920561.481    10  a 
##  799350.172 2017662.995    10   b
##    2797.983    3166.455    10  a
We next compare the performance of subsetting operations.
microbenchmark(filterRt(sps, rt = c(50, 100)),
               filterRt(sps_mzr, rt = c(50, 100)),
               filterRt(sps_im, rt = c(50, 100)))
## Unit: microseconds
##                                expr      min        lq      mean   median
##      filterRt(sps, rt = c(50, 100)) 4484.013 4757.9315 5124.9582 4964.732
##  filterRt(sps_mzr, rt = c(50, 100)) 3376.626 3688.7035 4069.2083 3906.391
##   filterRt(sps_im, rt = c(50, 100))  717.156  854.7545  928.6941  898.433
##         uq      max neval cld
##  5255.8780 10559.58   100 a  
##  4208.2605 13918.52   100  b 
##   960.3355  2596.38   100   c
The two on-disk backends MsBackendSql and MsBackendMzR show comparable
performance for this operation. This filtering involves access to a spectra
variable (the retention time in this case) which, for the MsBackendSql, first
needs to be retrieved from the database. The MsBackendSql backend, however,
also allows caching spectra variables (i.e. they are stored within the
MsBackendSql object). Any access to such cached spectra variables can
potentially be faster because no dedicated SQL query is needed.
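The effect of such a local cache can be sketched with a toy example in base R. This is only a conceptual illustration, not the MsBackendSql implementation: toy_fetch stands in for an SQL query, and toy_backend, get_var, set_var and reset_backend are hypothetical names introduced here.

```r
# Toy "database": a function standing in for an SQL query. The counter
# only exists to show how often the (slow) source is accessed.
n_queries <- 0L
toy_fetch <- function(variable) {
    n_queries <<- n_queries + 1L
    switch(variable,
           rtime = c(0.280, 0.559, 0.838),
           msLevel = c(1L, 1L, 1L))
}

# Toy backend: a local cache (a list) that is consulted before the source.
toy_backend <- list(cache = list())

get_var <- function(backend, variable) {
    if (!is.null(backend$cache[[variable]]))
        return(backend$cache[[variable]])   # cached: no query needed
    toy_fetch(variable)                     # otherwise query the source
}
set_var <- function(backend, variable, value) {
    backend$cache[[variable]] <- value      # stored locally, source untouched
    backend
}
reset_backend <- function(backend) {
    backend$cache <- list()                 # drop all locally stored values
    backend
}

toy_rt <- get_var(toy_backend, "rtime")                    # first query
toy_backend <- set_var(toy_backend, "rtime", toy_rt + 10)  # cache new values
get_var(toy_backend, "rtime")                              # served from cache
toy_backend <- reset_backend(toy_backend)
get_var(toy_backend, "rtime")                              # second query
n_queries
```

The source is only queried when no local value exists, and resetting the backend discards the local values so that the original data is returned again.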
To evaluate the performance of a pure subsetting operation we first define the
indices of 10 random spectra and subset the Spectra objects to these.
idx <- sample(seq_along(sps), 10)
microbenchmark(sps[idx],
               sps_mzr[idx],
               sps_im[idx])
## Unit: microseconds
##          expr      min        lq      mean   median        uq      max neval
##      sps[idx]  192.819  213.1925  272.2885  252.893  269.8155 3188.434   100
##  sps_mzr[idx] 1055.779 1085.1430 1105.9115 1102.365 1115.3440 1423.962   100
##   sps_im[idx]  308.906  329.1040  378.8860  343.697  376.6340 2784.214   100
##  cld
##  a  
##   b 
##    c
Here the MsBackendSql outperforms the other backends because it does not keep
any data in memory and hence does not need to subset it. The two other
backends need to subset the data they keep in memory, which is in both cases a
data frame with either a reduced set of spectra variables or the full MS data.
Finally, we compare the extraction of the peaks data from these subsetted
Spectra objects.
sps_10 <- sps[idx]
sps_mzr_10 <- sps_mzr[idx]
sps_im_10 <- sps_im[idx]
microbenchmark(peaksData(sps_10),
               peaksData(sps_mzr_10),
               peaksData(sps_im_10),
               times = 10)
## Unit: microseconds
##                   expr       min        lq       mean     median        uq
##      peaksData(sps_10)  4851.784  6215.287  7482.5088  7347.1270  8425.152
##  peaksData(sps_mzr_10) 72976.977 75492.207 80816.9275 76674.7765 86066.039
##   peaksData(sps_im_10)   594.851   743.788   966.1253   999.0015  1089.484
##        max neval cld
##  11175.727    10 a  
##  97702.096    10  b 
##   1474.528    10   c
The MsBackendSql outperforms the MsBackendMzR while, not unexpectedly, the
MsBackendMemory provides the fastest access.
MsBackendSql
The MsBackendSql backend does not support parallel processing since the
database connection cannot be shared across the different (parallel)
processes. Thus, all methods on Spectra objects that use a MsBackendSql will
automatically (and silently) disable parallel processing even if a dedicated
parallel processing setup was passed along with the BPPARAM parameter.
Some functions on Spectra objects require loading the MS peak data (i.e., m/z
and intensity values) into memory. For very large data sets (or computers with
limited hardware resources) such function calls can cause out-of-memory
errors. One example is the lengths function that determines the number of
peaks per spectrum by loading the peak matrix first into memory. Such functions
should ideally be called using the peaksapply function with parameter
chunkSize (e.g., peaksapply(sps, lengths, chunkSize = 5000L)). Instead of
processing the full data set, the data will be first split into chunks of size
chunkSize that are stepwise processed. Hence, only data from chunkSize
spectra is loaded into memory in one iteration.
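The chunk-wise strategy described above can be sketched in base R. This is a conceptual sketch only: toy_load_peaks stands in for loading peak matrices from a database, and chunk_apply and chunk_size are hypothetical names that do not refer to functions or parameters of the Spectra package.

```r
# Toy stand-in for a data source: returns peak matrices for the requested
# spectra only (the number of peaks per spectrum is random).
set.seed(123)
toy_n_peaks <- sample(5:20, 12, replace = TRUE)
toy_load_peaks <- function(i) {
    lapply(i, function(j) cbind(mz = seq_len(toy_n_peaks[j]),
                                intensity = rep(1, toy_n_peaks[j])))
}

# Apply FUN to the peak matrices of n spectra, loading at most chunk_size
# spectra at a time.
chunk_apply <- function(n, FUN, chunk_size = 5L) {
    chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
    res <- lapply(chunks, function(i) {
        pks <- toy_load_peaks(i)    # only this chunk is held in memory
        vapply(pks, FUN, numeric(1))
    })
    unlist(res, use.names = FALSE)
}

# Number of peaks per spectrum, computed in chunks of 5 spectra.
toy_counts <- chunk_apply(length(toy_n_peaks), nrow, chunk_size = 5L)
toy_counts
```

The peak memory use is bounded by the largest chunk instead of the full data set, at the cost of one load step per chunk.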
The MsBackendSql provides an MS data representation and storage mode with a
minimal memory footprint (in R) that is still comparably efficient for standard
processing and subsetting operations. This backend is specifically useful for
very large MS data sets, that could even be hosted on remote (MySQL/MariaDB)
servers. A potential use case for this backend could thus be to set up a central
storage place for MS experiments with data analysts connecting remotely to this
server to perform initial data exploration and filtering. After subsetting to a
smaller data set of interest, users could then retrieve/download this data by
changing the backend to e.g. a MsBackendMemory, which would result in a
download of the full data to the user computer’s memory.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] microbenchmark_1.4.10 RSQLite_2.3.1         MsBackendSql_1.2.0   
## [4] Spectra_1.12.0        ProtGenerics_1.34.0   BiocParallel_1.36.0  
## [7] S4Vectors_0.40.0      BiocGenerics_0.48.0   BiocStyle_2.30.0     
## 
## loaded via a namespace (and not attached):
##  [1] sandwich_3.0-2         sass_0.4.7             MsCoreUtils_1.14.0    
##  [4] lattice_0.22-5         hms_1.1.3              digest_0.6.33         
##  [7] grid_4.3.1             evaluate_0.22          bookdown_0.36         
## [10] mvtnorm_1.2-3          fastmap_1.1.1          blob_1.2.4            
## [13] Matrix_1.6-1.1         jsonlite_1.8.7         progress_1.2.2        
## [16] mzR_2.36.0             DBI_1.1.3              survival_3.5-7        
## [19] multcomp_1.4-25        BiocManager_1.30.22    TH.data_1.1-2         
## [22] codetools_0.2-19       jquerylib_0.1.4        cli_3.6.1             
## [25] rlang_1.1.1            crayon_1.5.2           Biobase_2.62.0        
## [28] splines_4.3.1          bit64_4.0.5            cachem_1.0.8          
## [31] yaml_2.3.7             tools_4.3.1            parallel_4.3.1        
## [34] memoise_2.0.1          ncdf4_1.21             vctrs_0.6.4           
## [37] R6_2.5.1               zoo_1.8-12             lifecycle_1.0.3       
## [40] fs_1.6.3               IRanges_2.36.0         bit_4.0.5             
## [43] clue_0.3-65            MASS_7.3-60            cluster_2.1.4         
## [46] pkgconfig_2.0.3        bslib_0.5.1            data.table_1.14.8     
## [49] Rcpp_1.0.11            xfun_0.40              knitr_1.44            
## [52] htmltools_0.5.6.1      rmarkdown_2.25         compiler_4.3.1        
## [55] prettyunits_1.2.0      MetaboCoreUtils_1.10.0