Abstract
The Systems Biology Markup Language (SBML) is an XML-based (eXtensible Markup Language) format used to describe biological pathways in a shareable and detailed standard form. This vignette introduces the workflow of functions used to convert an SBML file into a list of R dataframes with the tidysbml package. The package provides conversion for all SBML levels and versions available so far; in particular, it is designed and tested to work with SBML level 3 version 2 or earlier. Its functions provide either a complete extraction, resulting in a list of at most 4 dataframes (i.e. one for listOfCompartments, one for listOfSpecies and two for the listOfReactions content), or a partial extraction, where the user chooses which of the four dataframes to export.
The aim of this package is to make SBML information easy to extract and manipulate by storing it in tabular data structures. Because of the descriptive nature of SBML documents, the dataframe format is particularly suitable for accessing the data easily and performing subsequent analysis. In particular, this type of conversion enables easy data interrogation by means of tidyverse verbs, which facilitates, for instance, the use of packages such as biomaRt and igraph. Being part of the Bioconductor project establishes a direct and consistent connection with the bioinformatics community, while providing interoperability with tools that are useful in systems biology and, more generally, for the analysis of biological data.
To illustrate how the package works, we use as an example an SBML file (Hucka et al. 2019) extracted from Reactome (Milacic et al. 2023), an open-source, open-access and peer-reviewed biological pathway database. Namely, the pathway is “Aryl hydrocarbon receptor signalling” (R-HSA-8937144 (Jassal 2016)).
After the installation instructions, the first section describes the structure of the dataframes, the following two sections describe the tidysbml steps for performing the SBML conversion, and the last one shows some examples of integrating tidysbml dataframes with other Bioconductor packages (i.e. biomaRt and RCy3). In the following, SBML tag names are set in italic and R commands in teletype font. Also, the terms ‘tag’ and ‘component’ are used interchangeably.
To install tidysbml from Bioconductor, run
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("tidysbml")
The SBML components of main interest for this package are listOfCompartments, listOfSpecies and listOfReactions. Due to the underlying SBML format, the dataframes extracted from them generally consist of the following three sets of columns: (i) Attributes, involving tags such as id, metaid, name, etc., (ii) Notes, consisting only of the notes tag, and (iii) Annotation, whose qualifiers are tags like bqmodel:is, bqbiol:is, bqbiol:hasPart, etc.
Each tag is exported to a separate column. Columns for tags in the Attributes set are named after the tag, the notes column is named ‘notes’, and Annotation columns are prefixed by ‘annotation_’ followed by the part of the tag name after the colon (‘:’); for instance, the column with the bqbiol:is tag content is labelled ‘annotation_is’. If one entity has multiple tags with the same name, the repeated column name is followed by a number (numbering starts from the second copy, with ‘_1’). When one tag contains multiple values (as happens, e.g., for Annotation tags such as bqbiol:hasPart and bqbiol:isDescribedBy, or for the Notes column), the values are separated by a delimiter: “ ” (i.e. a single space character) for Annotation values and “|” (i.e. the pipe character) for Notes. Also, depending on the selected component, the respective dataframe may contain other columns, determined by the xml structure of the underlying component’s class. See, for instance, the df_species_in_reactions dataframe described in the following.
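As a rough illustration of the naming rules above, the following base R sketch derives column names from annotation qualifiers and splits multi-valued cells. The helpers annotation_colname() and number_repeats(), as well as the toy values, are hypothetical and only mimic the conventions described here; they are not the functions tidysbml uses internally.

```r
# Hypothetical helper (not part of tidysbml): build a column name from an
# annotation qualifier such as "bqbiol:is" by keeping the part after the colon.
annotation_colname <- function(qualifier) {
  paste0("annotation_", sub("^[^:]*:", "", qualifier))
}

# Hypothetical helper: number repeated names from the second copy onwards,
# starting at "_1".
number_repeats <- function(col_names) {
  # ave() counts the occurrences within each group of identical names
  occurrence <- ave(seq_along(col_names), col_names, FUN = seq_along)
  ifelse(occurrence == 1, col_names, paste0(col_names, "_", occurrence - 1))
}

qualifiers <- c("bqbiol:is", "bqbiol:hasPart", "bqbiol:is")
col_names <- number_repeats(annotation_colname(qualifiers))
col_names

# Toy row with made-up values: Annotation values are space-separated,
# Notes values are pipe-separated.
toy <- data.frame(
  id = "species_1",
  annotation_hasPart = "uri_a uri_b",
  notes = "first note|second note",
  stringsAsFactors = FALSE
)
has_part_values <- strsplit(toy$annotation_hasPart, " ", fixed = TRUE)[[1]]
notes_values <- strsplit(toy$notes, "|", fixed = TRUE)[[1]]
```

Passing fixed = TRUE to strsplit() matters for the Notes delimiter, because an unescaped “|” would otherwise be interpreted as a regular expression alternation.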
The first step in converting an SBML file into R dataframes is to convert the SBML document into an R list object by means of the sbml_as_list() function. In fact, all the other functions in this package require a list as input. The sole exception is as_dfs(), which incorporates this conversion function and can therefore also receive an SBML file directly as first argument (and not exclusively an SBML document already converted into a list object).
The sbml_as_list() function relies on functions for reading and converting xml files from the xml2 R package (Wickham, Hester, and Ooms 2023) and outputs an appropriate type of list. In the following, such a list is referred to as an SBML-converted list. The first argument is the path of the SBML file; the second one specifies which SBML component the user wants to look at (i.e. listOfSpecies, listOfCompartments or listOfReactions), if any (the default option gets ‘all’ components). Examples for both options are given below.
After running
library(tidysbml)
an example of the default option is
filepath <- system.file("extdata", "R-HSA-8937144.sbml", package = "tidysbml")
sbml_list <- sbml_as_list(filepath)
which returns the full SBML model, starting from the sbml tag, converted into a list of lists nested according to the xml nesting rules.
Instead, an example of SBML-list conversion for only the list of species is given by
list_species <- sbml_as_list(filepath, component = "species")
which yields an SBML-converted list of lists starting from the listOfSpecies tag, which is contained inside the sbml and model tags. This output is required when the user is interested in extracting only one dataframe, e.g. using the as_df() function, as described in the following section.
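To give an idea of what a list of lists nested according to the xml structure looks like, here is a minimal sketch that uses xml2 directly on a toy, made-up SBML-like fragment. It is not the implementation of sbml_as_list(), whose output may be organised differently; it only shows the general pattern in which tags become list elements and tag attributes become R attributes.

```r
library(xml2)

# Toy fragment with made-up ids (not a complete, valid SBML document)
toy_sbml <- '<sbml><model id="toy_model">
  <listOfSpecies>
    <species id="s1" compartment="c1"/>
    <species id="s2" compartment="c1"/>
  </listOfSpecies>
</model></sbml>'

# Parse the string and convert the document into nested R lists
toy_list <- as_list(read_xml(toy_sbml))

# The nesting mirrors the xml hierarchy: sbml > model > listOfSpecies > species
species_nodes <- toy_list$sbml$model$listOfSpecies
species_nodes <- species_nodes[names(species_nodes) == "species"]

# Tag attributes are stored as R attributes of each list element
species_ids <- vapply(species_nodes, attr, character(1),
                      which = "id", exact = TRUE)
unname(species_ids)
```

Working on such a nested list, rather than on the xml document itself, is what allows the later steps to use ordinary R list operations.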
The main function of this package is as_dfs(), which provides the SBML information about Compartments, Species and Reactions in tabular format. It returns a list of at most 4 dataframes, depending on the components reported in the selected SBML document.
The dataframe for the listOfCompartments (listOfSpecies) component, named df_compartments (df_species), has one row for each Compartment (Species) and one column for each Attributes, Notes and Annotation value. Similarly, the first dataframe about Reactions (i.e. df_reactions) contains one row for each Reaction, with its Attributes, Notes and Annotation values as columns, while the second one (i.e. df_species_in_reactions) has one row for each Species involved in each Reaction, with two additional columns: the reaction_id column, holding the identifier of the corresponding Reaction as reported in the SBML document, and the container_list column, holding the name of the listOf element containing that Species (i.e. listOfReactants, listOfProducts or listOfModifiers). The first argument can be the SBML file path
list_of_dfs <- as_dfs(filepath, type = "file")
#> Empty notes' column for 'compartment' elements
or directly the SBML-converted list obtained with sbml_as_list() as described above
list_of_dfs <- as_dfs(sbml_list, type = "list")
#> Empty notes' column for 'compartment' elements
Both calls return the same output, i.e. the list of all the dataframes available from the extraction. Once the SBML-converted list has been created, the second form is preferable, since it avoids repeating the SBML-to-list conversion.
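The layout of df_species_in_reactions (one row per Species per Reaction, plus the reaction_id and container_list columns) can be pictured with a small base R sketch. The reactions and species ids below are made up, and the flattening code is only an illustration of the resulting shape under that assumption, not the way as_dfs() builds the dataframe.

```r
# Toy reactions with made-up ids: each reaction lists its species per container
toy_reactions <- list(
  reaction_1 = list(listOfReactants = c("s1", "s2"), listOfProducts = "s3"),
  reaction_2 = list(listOfReactants = "s3", listOfProducts = "s4",
                    listOfModifiers = "s5")
)

# One small dataframe per reaction, with one row per species
rows <- lapply(names(toy_reactions), function(reaction_id) {
  containers <- toy_reactions[[reaction_id]]
  data.frame(
    reaction_id = reaction_id,
    container_list = rep(names(containers), lengths(containers)),
    species = unlist(containers, use.names = FALSE),
    stringsAsFactors = FALSE
  )
})

# Stack them into a single long dataframe
toy_species_in_reactions <- do.call(rbind, rows)
toy_species_in_reactions

# Example query: which species act as modifiers?
modifiers <- toy_species_in_reactions[
  toy_species_in_reactions$container_list == "listOfModifiers", "species"
]
```

A species that takes part in several reactions (here s3) appears once per reaction, which is why this long format is convenient for building edge lists.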
Another function, as_df(), converts only one dataframe at a time, depending on the SBML component of interest. Here an SBML-converted list starting from the listOfCompartments/listOfSpecies/listOfReactions component is a mandatory input. For instance, first converting the SBML file into a list restricted to the listOfSpecies component
list_species <- sbml_as_list(filepath, component = "species")
df_species <- as_df(list_species)
returns one dataframe containing all the information about species. Only for the listOfReactions component is it possible to obtain two dataframes. Here dfs_about_reactions is a list of 2 dataframes obtained by
list_react <- sbml_as_list(filepath, component = "reactions")
dfs_about_reactions <- as_df(list_react)
whose first component, containing information about reactions, has 15 columns for the 5 reactions of our example
dfs_about_reactions[[1]]
The df_species_in_reactions dataframe, with information about the 14 species involved in the 5 reactions described above, is instead obtained by taking the second component
dfs_about_reactions[[2]]
Each function described in this section checks that the input format is correct. In particular, it returns an error if the input object is an empty list or not a list object, and also if its format is not suitable for extraction (i.e. SBML tags are not properly named or nested). Moreover, the input is accepted by as_df() only if it contains a single type of tag within the first level of the ‘listOf’ component. For instance, if the SBML is restricted to the listOfSpecies/listOfCompartments/listOfReactions tag, the only type of tag within the list should be species/compartment/reaction. One more condition, required only by the as_dfs() function, is that the first two tags in the xml hierarchy are sbml and model, with the former containing the latter. If any of these conditions does not hold, the respective function is not executed.
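The kind of checks listed above can be sketched in a few lines of base R. The function check_listof_children() is hypothetical and is not exported by tidysbml; it assumes that its input holds the children of one ‘listOf’ element as a named list, which may differ from the structure the package actually inspects.

```r
# Hypothetical checker (not a tidysbml function): x is assumed to hold the
# children of one 'listOf' element as a named list.
check_listof_children <- function(x) {
  if (!is.list(x)) stop("input is not a list object")
  if (length(x) == 0) stop("input is an empty list")
  tags <- unique(names(x))
  if (is.null(tags) || length(tags) != 1) {
    stop("expected a single type of tag inside the 'listOf' element")
  }
  invisible(tags)
}

ok_input <- list(species = list(), species = list())
bad_input <- list(species = list(), reaction = list())

# Passes silently and returns the tag name invisibly
ok_result <- check_listof_children(ok_input)

# Mixed tag types raise an error; capture its message instead of stopping
bad_result <- tryCatch(check_listof_children(bad_input),
                       error = function(e) conditionMessage(e))
bad_result
```

Wrapping a call in tryCatch(), as done for bad_input, is one way to handle such errors when processing many SBML files in a loop.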
This section provides R code for integrating tidysbml dataframes with other Bioconductor packages. Examples are shown for the RCy3 (Gustavsen et al. 2019) and biomaRt (Durinck et al. 2005) packages.
The RCy3 package enables communication between R and Cytoscape. After launching Cytoscape, a graph can be imported as an edgelist (i.e. a dataframe with source and target columns) obtained through simple (or more elaborate) manipulation of the dataframes, such as
library(dplyr)
df_species_in_reactions <- dfs_about_reactions[[2]] # second dataframe from as_df(), see above
edgelist <- df_species_in_reactions %>%
  select("reaction_id", "species") %>%
  `colnames<-`(c("source", "target"))
RCy3::createNetworkFromDataFrames(edges = edgelist) # while running Cytoscape
biomaRt, instead, is an annotation package providing access to external public databases. One possible use, for instance, is to retrieve information about the Uniprot ids reported in the SBML for Species, here considering only those composed of multiple entities (i.e. with multiple ids). First, extract the URIs of the species from the Annotation column with the bqbiol:hasPart content
vec_uri <- na.omit(unlist(
  lapply(X = list_of_dfs[[2]]$annotation_hasPart, FUN = function(x) {
    unlist(strsplit(x, "||", fixed = TRUE))
  })
))
then keep only the Uniprot URIs
vec_uniprot <- na.omit(unlist(
  lapply(X = vec_uri, FUN = function(x) {
    if (all(unlist(gregexpr("uniprot", x)) > -1)) {
      x
    } else {
      NA
    }
  })
))
and extract Uniprot ids
vec_ids <- vapply(vec_uniprot, function(x) {
  chr <- "/"
  # position of the last "/" in the URI; the id is everything after it
  first <- max(unlist(gregexpr(chr, x)))
  substr(x, first + 1, nchar(x))
}, FUN.VALUE = character(1))
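The same filtering and extraction can be tried on a small vector without any SBML file. The sketch below uses made-up example URIs (their layout is an assumption and is not taken from the Reactome file) and the vectorised base R functions grepl() and sub() as an alternative to the element-wise code above.

```r
# Made-up example URIs; the layout is assumed for illustration only
toy_uri <- c(
  "http://identifiers.org/uniprot/P35869",
  "http://identifiers.org/chebi/CHEBI:15422",
  "http://identifiers.org/uniprot/P27540",
  NA
)

# Drop missing values
toy_uri <- toy_uri[!is.na(toy_uri)]

# Keep only the URIs that mention "uniprot"
toy_uniprot <- toy_uri[grepl("uniprot", toy_uri, fixed = TRUE)]

# Remove everything up to and including the last "/" to keep the id
toy_ids <- sub("^.*/", "", toy_uniprot)
toy_ids
```

Because the pattern "^.*/" is greedy, sub() strips the text up to the last slash, which matches what the max(gregexpr()) approach does.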
Then, using biomaRt commands, the user can choose which attributes to retrieve
library(biomaRt)
mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
df_mart_uniprot <- getBM(
  attributes = c("uniprot_gn_id", "uniprot_gn_symbol", "description"),
  filters = "uniprot_gn_id",
  values = vec_ids,
  mart = mart
)
df_mart_uniprot
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] biomaRt_2.63.0 tidysbml_1.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 xfun_0.48 bslib_0.8.0
#> [4] httr2_1.0.5 Biobase_2.67.0 vctrs_0.6.5
#> [7] tools_4.5.0 generics_0.1.3 stats4_4.5.0
#> [10] curl_5.2.3 tibble_3.2.1 fansi_1.0.6
#> [13] AnnotationDbi_1.69.0 RSQLite_2.3.7 blob_1.2.4
#> [16] pkgconfig_2.0.3 dbplyr_2.5.0 S4Vectors_0.45.0
#> [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.5.0
#> [22] stringr_1.5.1 Biostrings_2.75.0 progress_1.2.3
#> [25] GenomeInfoDb_1.43.0 htmltools_0.5.8.1 sass_0.4.9
#> [28] yaml_2.3.10 pillar_1.9.0 crayon_1.5.3
#> [31] jquerylib_0.1.4 cachem_1.1.0 tidyselect_1.2.1
#> [34] digest_0.6.37 stringi_1.8.4 dplyr_1.1.4
#> [37] purrr_1.0.2 fastmap_1.2.0 cli_3.6.3
#> [40] magrittr_2.0.3 utf8_1.2.4 withr_3.0.2
#> [43] prettyunits_1.2.0 filelock_1.0.3 UCSC.utils_1.3.0
#> [46] rappdirs_0.3.3 bit64_4.5.2 rmarkdown_2.28
#> [49] XVector_0.47.0 httr_1.4.7 bit_4.5.0
#> [52] png_0.1-8 hms_1.1.3 memoise_2.0.1
#> [55] evaluate_1.0.1 knitr_1.48 IRanges_2.41.0
#> [58] BiocFileCache_2.15.0 rlang_1.1.4 glue_1.8.0
#> [61] DBI_1.2.3 xml2_1.3.6 BiocGenerics_0.53.0
#> [64] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.53.0