Ready to unlock the full potential of glyrepr? This vignette is for those who want to peek under the hood and master the art of efficient glycan computation. If you’re writing custom functions for glycan analysis or building the next great glycomics tool, you’re in the right place!
Fair warning: This guide assumes you’re comfortable with R programming and graph theory concepts. If you’re just getting started, check out our “Getting Started with glyrepr” vignette first.
library(glyrepr)
Before we dive into the smap functions, let’s understand why they exist and why they’re game-changing for glycan analysis.
Working with glycan structures means working with graphs, and graph operations are computationally expensive. When you’re analyzing thousands of glycans from a large-scale study, this becomes a real bottleneck.
glyrepr implements a clever optimization called unique structure storage. Instead of storing thousands of identical graphs, it stores only the unique ones and keeps track of which original positions they belong to.
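As a rough mental model of unique structure storage, the sketch below uses plain character strings as stand-ins for graphs, together with base R’s `unique()` and `match()`. It only illustrates the idea under that assumption and is not glyrepr’s actual internal layout; the names `uniq` and `codes` are made up for this example.

```r
# Stand-ins for glycan graphs: plain strings (an assumption for illustration).
x <- rep(
  c("Gal(b1-3)GalNAc(a1-", "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"),
  times = 1000
)

# Deduplicated representation: each unique value stored once,
# plus one integer code per original position.
uniq  <- unique(x)        # the 2 distinct values
codes <- match(x, uniq)   # 2,000 integers pointing into `uniq`

length(uniq)
length(codes)

# The original vector can be rebuilt from the two pieces without loss.
identical(uniq[codes], x)
```

Any per-structure computation then only has to touch `uniq`; the `codes` vector maps the results back to the original positions.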
Let’s see this in action:
# Our test data: some common glycan structures
iupacs <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
  "Gal(b1-3)GalNAc(a1-", # O-glycan core 1
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-", # O-glycan core 2
  "Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-", # Branched mannose
  "GlcNAc6Ac(b1-4)Glc3Me(a1-" # With decorations
)
struc <- as_glycan_structure(iupacs)
# Now let's create a realistic dataset with lots of repetition
large_struc <- rep(struc, 1000) # 5,000 total structures
large_struc
#> <glycan_structure[5000]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> [6] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [7] Gal(b1-3)GalNAc(a1-
#> [8] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [9] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [10] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> ... (4990 more not shown)
#> # Unique structures: 5
Notice that magical “# Unique structures: 5”? That’s your performance booster right there!
Let’s verify this optimization is real:
# Only 5 unique graphs are stored internally
length(attr(large_struc, "structures"))
#> [1] 5
# But we have 5,000 total elements
length(large_struc)
#> [1] 5000
library(lobstr)
obj_sizes(struc, large_struc)
#> * 14.33 kB
#> * 80.72 kB
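About 81 kB for 5,000 structures, versus roughly 14 MB if each of the 1,000 repeats of struc (14.33 kB) were stored separately? That’s around 180x better memory efficiency! But the real magic happens with computation speed…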
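The memory effect can be reproduced in miniature with base R alone. The sketch below uses numeric vectors as hypothetical stand-ins for graphs and compares a naive list of 5,000 elements with a deduplicated layout (5 unique objects plus integer codes). It relies on `object.size()` not detecting sharing between list elements, so the naive figure approximates fully independent copies; the exact ratio depends on the object sizes chosen here and says nothing about glyrepr’s own numbers.

```r
# Stand-ins for 5 distinct graphs: numeric vectors of 100 values each.
uniq <- lapply(1:5, function(i) as.numeric(seq_len(100)) + i)

# Naive layout: every one of the 5,000 positions holds its own object.
naive <- rep(uniq, times = 1000)

# Deduplicated layout: the 5 unique objects plus one integer code per position.
dedup <- list(
  structures = uniq,
  codes = rep(seq_along(uniq), times = 1000)
)

# object.size() counts each list element separately (no sharing detection),
# so `naive_bytes` approximates 5,000 independent copies.
naive_bytes <- as.numeric(object.size(naive))
dedup_bytes <- as.numeric(object.size(dedup))

naive_bytes / dedup_bytes  # how many times smaller the deduplicated layout is
```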
smap Universe 🌌
Now here’s the problem: if you try to use regular lapply() or purrr::map() functions on glycan structures, you’ll hit a wall:
# This won't work and will throw an error
tryCatch(
  purrr::map_int(large_struc, ~ igraph::vcount(.x)),
  error = function(e) cat("💥 Error:", rlang::cnd_message(e))
)
#> 💥 Error: ℹ In index: 1.
#> Caused by error in `ensure_igraph()`:
#> ! Must provide a graph object (provided wrong object type).
Why does this fail? Because purrr functions don’t understand the internal structure optimization of glycan_structure objects.
smap Family to the Rescue!
The smap functions (think “structure map”) are drop-in replacements for purrr functions that are glycan-aware. They understand the unique structure optimization and work directly with the underlying graph objects.
# This works beautifully!
vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
#> [1] 5 2 3 5 2 5 2 3 5 2
The “s” stands for “structure”: these functions operate on the underlying igraph objects that represent your glycan structures.
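To make the “compute once per unique structure, then expand” idea concrete, here is a minimal base R sketch. `dedup_map_int()` and `count_residues()` are hypothetical helpers written for this illustration, and strings stand in for graphs; this is not how glyrepr implements `smap_int()`, only the general technique it is described as using.

```r
# Hypothetical helper: apply `f` once per unique element, then expand the
# results back to the original positions with match().
dedup_map_int <- function(x, f) {
  uniq <- unique(x)
  per_unique <- vapply(uniq, f, integer(1), USE.NAMES = FALSE)
  per_unique[match(x, uniq)]
}

# Toy per-structure function: count "(" characters in an IUPAC-style string,
# which for these strings equals the number of residues. A counter records
# how often the function is actually called.
calls <- 0L
count_residues <- function(s) {
  calls <<- calls + 1L
  lengths(regmatches(s, gregexpr("(", s, fixed = TRUE)))
}

x <- rep(
  c(
    "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-",
    "Gal(b1-3)GalNAc(a1-",
    "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
  ),
  times = 1000
)

res <- dedup_map_int(x, count_residues)
res[1:6]  # 5 2 3 5 2 3
calls     # 3: one call per unique string, not one per element
```

The result has one entry per input element, but the function body ran only three times.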
smap Toolkit 🛠️
The smap family provides glycan-aware equivalents for virtually all purrr functions:
purrr | smap | purrr | smap |
---|---|---|---|
map() | smap() | map2() | smap2() |
map_lgl() | smap_lgl() | map2_lgl() | smap2_lgl() |
map_int() | smap_int() | map2_int() | smap2_int() |
map_dbl() | smap_dbl() | map2_dbl() | smap2_dbl() |
map_chr() | smap_chr() | map2_chr() | smap2_chr() |
some() | ssome() | pmap() | spmap() |
every() | severy() | pmap_*() | spmap_*() |
none() | snone() | imap() | simap() |
 | | imap_*() | simap_*() |
Simple rule: Replace map with smap, pmap with spmap, and imap with simap. Everything else works exactly like purrr!
Count vertices in each structure:
vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.0 2.0 3.0 3.4 5.0 5.0
Find structures with more than 4 vertices:
has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
#> [1] 2000
Get the degree sequence of each structure:
degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3] # Show first 3
#> [[1]]
#> 1 2 3 4 5
#> 1 1 3 2 1
#>
#> [[2]]
#> 1 2
#> 1 1
#>
#> [[3]]
#> 1 2 3
#> 1 1 2
Check if any structure has isolated vertices:
ssome(large_struc, ~ any(igraph::degree(.x) == 0))
#> [1] FALSE
Verify all structures are connected:
severy(large_struc, ~ igraph::is_connected(.x))
#> [1] TRUE
smap()
Quick examples of the extended family:
# smap2: Apply function with additional parameters
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
large_enough
#> [1] TRUE FALSE FALSE
# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
  paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
#> [1] "#1: 5 vertices" "#2: 2 vertices" "#3: 3 vertices"
⚠️ Performance Warning: simap functions don’t benefit from the unique structure optimization! Since each element has a different index, the combination of (structure, index) is always unique, breaking the deduplication that makes other smap functions fast. Use simap only when you truly need position information.
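A small base R check shows why pairing each element with its index defeats deduplication. Strings again stand in for structures, and the variable names are hypothetical; the point is only that the number of distinct (structure, index) pairs equals the vector length no matter how repetitive the structures are.

```r
# Two distinct stand-in structures, repeated 500 times each.
x <- rep(c("Gal(b1-3)GalNAc(a1-", "GlcNAc6Ac(b1-4)Glc3Me(a1-"), times = 500)
idx <- seq_along(x)

# Deduplicating on the structure alone leaves 2 items to process.
n_unique_structures <- length(unique(x))

# Deduplicating on (structure, index) pairs leaves every element,
# because no two elements share an index.
n_unique_pairs <- nrow(unique(data.frame(structure = x, index = idx)))

n_unique_structures  # 2
n_unique_pairs       # 1000
```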
The beauty of smap functions lies in automatic deduplication:
# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000) # 25,000 structures, only 5 unique
cat("Dataset size:", length(huge_struc), "structures\n")
#> Dataset size: 25000 structures
cat("Unique structures:", length(attr(huge_struc, "structures")), "\n")
#> Unique structures: 5
cat("Redundancy factor:", length(huge_struc) / length(attr(huge_struc, "structures")), "x\n")
#> Redundancy factor: 5000 x
library(tictoc)
# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()
#> smap_int (optimized): 0.001 sec elapsed
# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc) # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()
#> Naive approach (all graphs): 0.089 sec elapsed
# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
#> [1] TRUE
The higher the redundancy, the bigger the performance gain! In the single run above, smap_int() took 0.001 s against 0.089 s for the naive approach, and in real glycoproteomics datasets with repeated structures this optimization can provide about 10x speedups.
The function you pass to smap must accept an igraph object as its first argument. You can use purrr-style lambda notation:
# Calculate clustering coefficient for each structure
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0 0 0 0 0 0 2000
# Create a comprehensive analysis
structure_analysis <- smap(large_struc, function(g) {
  list(
    vertices = igraph::vcount(g),
    edges = igraph::ecount(g),
    diameter = ifelse(igraph::is_connected(g), igraph::diameter(g), NA),
    clustering = igraph::transitivity(g, type = "global")
  )
})
# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
#> vertices edges diameter clustering
#> 1 5 4 3 0
#> 2 2 1 1 NaN
#> 3 3 2 1 0
#> 4 5 4 2 0
#> 5 2 1 1 NaN
#> 6 5 4 3 0
# Find only structures with exactly 5 vertices
has_five_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) == 5)
five_vertex_structures <- large_struc[has_five_vertices]
cat("Found", sum(has_five_vertices), "structures with exactly 5 vertices\n")
#> Found 2000 structures with exactly 5 vertices
smap Functions
Use smap functions when:
igraph-based functions to glycan structures
Stick with regular R functions when:
⚠️ Special note on simap: While simap functions are convenient for position-aware operations, they don’t provide performance benefits over regular imap functions. The inclusion of index information breaks the unique structure optimization, making each (structure, index) pair unique even when structures are identical.
Here’s how you might build a custom glycan analysis pipeline using smap functions:
# Custom motif detector
detect_branching <- function(g) {
  degrees <- igraph::degree(g)
  any(degrees >= 3)
}
# Apply to large dataset - blazingly fast due to unique structure optimization
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")
#> Structures with branching: 2000 out of 5000
# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000) # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
#> Structures meeting complexity threshold: 2000 out of 5000
Congratulations! You now understand the core optimization that makes glyrepr blazingly fast and how to leverage it with the smap family of functions.
Key takeaways:
- 🧠 Unique structure optimization is the secret sauce behind glyrepr’s performance
- 🚀 smap functions are drop-in replacements for purrr that understand glycan structures
- ⚡ Performance gains are dramatic with large datasets containing repeated structures
- 🛠️ Use smap for structures, regular R functions for everything else
You’re now equipped to build the next generation of glycomics analysis tools. Go forth and analyze! 🌟
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] lobstr_1.1.2 dplyr_1.1.4 tibble_3.3.0 tictoc_1.2.1 purrr_1.1.0
#> [6] glyrepr_0.7.4
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_1.8.8 compiler_4.4.1 highr_0.11 tidyselect_1.2.1
#> [5] stringr_1.5.2 jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0
#> [9] R6_2.6.1 generics_0.1.4 igraph_2.1.4 knitr_1.48
#> [13] backports_1.5.0 checkmate_2.3.3 rstackdeque_1.1.1 bslib_0.8.0
#> [17] pillar_1.11.0 rlang_1.1.6 utf8_1.2.6 cachem_1.1.0
#> [21] stringi_1.8.7 xfun_0.46 sass_0.4.9 cli_3.6.5
#> [25] magrittr_2.0.4 digest_0.6.37 lifecycle_1.0.4 prettyunits_1.2.0
#> [29] vctrs_0.6.5 evaluate_1.0.3 glue_1.8.0 rmarkdown_2.27
#> [33] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1