The sfi package provides features extraction/alignment
tools for single file injections(sfi) mode Gas/liquid
chromatography-mass spectrometry (GC/LC-MS) data. The SFI technique
enables the analysis of numerous samples (e.g., up to 1000 per day) by
acquiring data from multiple injections within a single analytical run,
significantly reducing the time compared to traditional serial LC-MS
injections.
The input of sfi package could be a single mzML file,
which contains the interleaved data from all injected samples. Feature
table from mzML file by other feature extraction tools such as
xcms can also be used as input and the output will be a
feature table compatible with regular metabolomics/non-target analysis.
Such a feature table can be used for downstream analysis, such as
statistical analysis and machine learning. The core function of
sfi package is designed to handle sfi feature table and
recover peaks back to individual samples. Currently, no bioconductor
package is available to analysis such high throughput data.
We also provide a simple local maxim peak picking algorithm for sfi
raw data analysis as regular feature extraction tools such as
xcms package were designed for single injection files and
need to perform feature alignment across multiple files. This is not
suitable for sfi data, where all samples are injected in a single file.
However, xcms package could still serve as a feature
extraction tool for sfi data, but it is still need sfi
package to handle the feature table and recover peaks back to individual
samples. We depend mzR package to read single mzML file and
data.table package to handle large data.
The SFI method (see demo figure) operates by:
Injecting a pooled sample multiple times initially to establish reference chromatographic profiles.
Subsequently injecting individual samples repeatedly at fixed, short time intervals under a continuous isocratic elution. Each sample undergoes a fixed chromatographic separation.
This process generates a single, continuous data file containing the interleaved data from all injected samples.
knitr::include_graphics("https://github.com/yufree/presentation/blob/gh-pages/figure/SFI.png?raw=true")
Schematic representation of the Single File Injection (SFI) workflow.
(Top) A pooled Quality Control (QC) sample is injected multiple times at
the beginning of the run to establish a robust reference for peak
alignment. Individual study samples are then injected sequentially at
short, fixed time intervals (e.g., every 1 min) into a continuous
isocratic flow. This results in a single, long mzML file containing
interleaved chromatograms from all samples, which sfi
separates and aligns. (Bottom) Demonstration of peak recovery algorithms
for SFI data using fixed time intervals
You need to convert the raw data into mzML file. You might try ThermoFlask for Thermo data or ProteoWizard for other vendor file.
The find_2d_peaks() function will perform peak picking
on the mzML file. The input is the mz, rt and intensity of the peaks.
The output is a data frame containing the peak information.
The getsfm() function will extract sample features from
one injections file.
# injection interval
idelta <- 92
# time windows for a full separation
windows <- 632
# sample numbers in the files
n <- 100
# retention time shift in seconds
deltart <- 10
# min peak number in pooled qc samples
minn <- 6
# align peaks from sfi. Note: peaklist is an sfi_peaks object, so we can pass it directly.
se <- getsfm(peaklist, idelta=idelta, windows=windows, n=100, deltart=10, minn=6)
#> 9 peaks found in all blank samples
#> 9 peaks found in at least 6 QC samples
#> 29825 sample peaks found
#> 188 aligned QC peaks found
#> Recover 0.00630343671416597 peaks
#> 218 aligned matrix peaks found
#> Recover 0.00643755238893546 peaks
#> 9 peaks found in samples
#> 9 matrix peaks found in samples
#> 0 isomer peaks found in samples
#> 26 isomer matrix peaks found in samples
# The output is a SummarizedExperiment object
se
#> class: SummarizedExperiment
#> dim: 9 100
#> metadata(0):
#> assays(1): intensities
#> rownames(9): 195.0876@74 196.8651@74 ... 199.1004@74 199.1749@73
#> rowData names(2): mz rt
#> colnames(100): S1 S2 ... S99 S100
#> colData names(1): sampleidxThe primary output of the sfi workflow is a feature
table (matrix) where rows represent mass spectral features (unique m/z
and retention time pairs) and columns represent individual samples. The
values in the cells are the integrated intensities of these
features.
This feature table allows you to answer biological or chemical questions about your samples: - Metabolomics/Lipidomics Profiling: Identify which compounds change in abundance between different experimental groups (e.g., disease vs. control). - Quality Control: Assess the stability of the instrument over the run by checking the intensity of internal standards or features in QC samples. - Sample Classification: Use the feature profiles to cluster samples or build predictive models.
The “Downstream Analysis” section below demonstrates how to format this data for use with other R/Bioconductor tools that specialize in these statistical tasks.
For integration with other Bioconductor packages, the
sfi workflow already provides results as a
SummarizedExperiment object. This class is the standard
container for rectangular data in Bioconductor and facilitates
interoperability with tools like DESeq2,
limma, or SummarizedExperiment-based
metabolomics workflows.
Once the data is in a SummarizedExperiment object, we
can leverage the rich Bioconductor ecosystem for visualization and
quality control.
if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE)) {
library(SummarizedExperiment)
library(ggplot2)
# Basic data access
intensities <- assay(se, "intensities")
# Visualization: Distribution of intensities across samples
# Reshape for ggplot2
plot_df <- data.frame(
Intensity = as.vector(intensities),
Sample = rep(colnames(intensities), each = nrow(intensities))
)
# Filter out zeros for log transformation
plot_df <- plot_df[plot_df$Intensity > 0, ]
ggplot(plot_df, aes(x = Sample, y = Intensity, fill = Sample)) +
geom_boxplot() +
scale_y_log10() +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Intensity Distribution Across Samples",
y = "Log10 Intensity",
x = "Sample Index") +
guides(fill = "none")
}Typical metabolomics workflows require data normalization and
transformation. We can use MsCoreUtils for these tasks
while maintaining the metadata within our object.
if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
requireNamespace("MsCoreUtils", quietly = TRUE)) {
library(SummarizedExperiment)
library(MsCoreUtils)
# Log2 transformation
# We can create a new assay in the SummarizedExperiment
assay(se, "log2") <- log2(assay(se, "intensities") + 1)
# Quantile normalization
# assay(se, "norm") <- MsCoreUtils::normalizeMethods()[["quantile"]](assay(se, "log2"))
# Example: Summarize feature metadata
cat("Total number of features:", nrow(se), "\n")
cat("M/Z range:", paste(range(rowData(se)$mz), collapse = " - "), "\n")
# Accessing sample metadata for group-wise analysis
colData(se)$Group <- rep(c("Control", "Treatment"), length.out = ncol(se))
# You can now pass this object to other tools like limma or DESeq2
# print(se)
}
#> Total number of features: 9
#> M/Z range: 195.087615966797 - 199.174865722656In real data, you might find the input window for separation and
injection interval are not accurate due to the lag in sample injection
process. It’s suggested to filter high intensity peaks and use
get_sfi_params to find the accurate value for accurate
instrumental window for separation and injection interval.
# get windows and delta time
sfmsub <- peaklist[peaklist$intensity>1e4,]
class(sfmsub) <- c("sfi_peaks", "data.frame") # Preserve class for method dispatch
# get windows and delta time
sfi_params <- get_sfi_params(sfmsub, n=158, deltart = 5)
#> Window is 632.11
#> No retention time information!
#> Delta retention time is 92.213125Then you can use those parameters for all data.
windows <- sfi_params['window']
idelta <- sfi_params['idelta']
se <- getsfm(peaklist,
idelta = idelta, windows = windows,
n = 158, deltart = 10, minn = 6)
#> 9 peaks found in all blank samples
#> 9 peaks found in at least 6 QC samples
#> 29819 sample peaks found
#> 366 aligned QC peaks found
#> Recover 0.0122740534558503 peaks
#> 371 aligned matrix peaks found
#> Recover 0.011569804487072 peaks
#> 9 peaks found in samples
#> 9 matrix peaks found in samples
#> 0 isomer peaks found in samples
#> 26 isomer matrix peaks found in samplesThe sfi package streamlines the complex task of
processing Single File Injection data, converting a continuous stream of
interleaved mass spectral data into a structured format ready for
analysis. By correctly identifying and aligning features from rapid,
sequential injections, it unlocks the throughput potential of SFI-MS for
large-scale metabolomics and screening studies.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MsCoreUtils_1.24.0 ggplot2_4.0.3
#> [3] SummarizedExperiment_1.42.0 Biobase_2.72.0
#> [5] GenomicRanges_1.64.0 Seqinfo_1.2.0
#> [7] IRanges_2.46.0 S4Vectors_0.50.0
#> [9] BiocGenerics_0.58.0 generics_0.1.4
#> [11] MatrixGenerics_1.24.0 matrixStats_1.5.0
#> [13] sfi_1.0.0 BiocStyle_2.40.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 SparseArray_1.12.0 enviGCMS_0.8.0
#> [4] lattice_0.22-9 digest_0.6.39 magrittr_2.0.5
#> [7] RColorBrewer_1.1-3 evaluate_1.0.5 grid_4.6.0
#> [10] fastmap_1.2.0 jsonlite_2.0.0 Matrix_1.7-5
#> [13] BiocManager_1.30.27 scales_1.4.0 jquerylib_0.1.4
#> [16] abind_1.4-8 cli_3.6.6 rlang_1.2.0
#> [19] XVector_0.52.0 withr_3.0.2 cachem_1.1.0
#> [22] DelayedArray_0.38.1 yaml_2.3.12 otel_0.2.0
#> [25] S4Arrays_1.12.0 tools_4.6.0 dplyr_1.2.1
#> [28] vctrs_0.7.3 buildtools_1.0.0 R6_2.6.1
#> [31] lifecycle_1.0.5 clue_0.3-68 MASS_7.3-65
#> [34] cluster_2.1.8.2 pkgconfig_2.0.3 pillar_1.11.1
#> [37] bslib_0.10.0 gtable_0.3.6 data.table_1.18.2.1
#> [40] glue_1.8.1 Rcpp_1.1.1-1.1 tidyselect_1.2.1
#> [43] tibble_3.3.1 xfun_0.57 sys_3.4.3
#> [46] knitr_1.51 farver_2.1.2 htmltools_0.5.9
#> [49] igraph_2.3.0 rmarkdown_2.31 maketools_1.3.2
#> [52] compiler_4.6.0 S7_0.2.2