library(FeatSeekR)
library(pheatmap)
library(SummarizedExperiment)

1 Introduction

A fundamental step in many analyses of high-dimensional data is dimension reduction. Feature selection is one approach to dimension reduction whose strengths include interpretability, conceptual simplicity, transferability and modularity. Here, we introduce the FeatSeekR algorithm, which selects features based on the consistency of their signal across replicates and their non-redundancy. It takes a 2 dimensional array (features x samples) of replicated measurements and returns a SummarizedExperiment object storing the selected features ranked by reproducibility.

2 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("FeatSeekR")

3 Feature selection on simulated data

Here we simulate a data set with features generated by orthogonal latent factors. Features derived from the same latent factor are highly redundant and form distinct clusters. The function simulates 10 redundant features per latent factor. Replicates are generated by adding independent Gaussian noise.

set.seed(111)
# simulate data with 500 conditions, 3 replicates and 5 latent factors 
conditions <- 500
latent_factors <- 5
replicates <- 3

# simData generates 10 features per latent_factor, so choosing latent_factors=5
# will generate 50 features.
# we simulate samples from 500 independent conditions per replicate. setting 
# conditions=500 and replicates=3 will generate 1500 samples, leading to 
# final data dimensions of 50 features x 1500 samples
sim <- simData(conditions=conditions, n_latent_factors=latent_factors,
                replicates=replicates)

# show that simulated data dimensions are indeed 50 x 1500 
dim(assay(sim, "data"))
## [1]   50 1500
# calculate the feature correlation for first replicate
data <- t(assay(sim, "data"))
cor <- cor(data, use = "pairwise.complete.obs")

# plot a heatmap of the features and color features according to their 
# generating latent factors
anno <- data.frame(Latent_factor = as.factor(rep(1:5, each=10)))
rownames(anno) <- dimnames(sim)[[1]]
colors        <- c("red", "blue", "darkorange", "darkgreen", "black")
names(colors) <- c("1", "2", "3", "4", "5")
anno_colors <- list(Latent_factor = colors)
range <- max(abs(cor))
pheatmap(cor, treeheight_row = 0 , treeheight_col = 0, 
        show_rownames = FALSE, show_colnames = FALSE,
        breaks = seq(-range, range, length.out = 100), cellwidth = 6, 
        cellheight = 6, annotation_col = anno, annotation_colors = anno_colors, 
        fontsize = 8)

We first plot the correlation matrix of the data to visualize feature redundancy. As intended by the simulation, the features derived from the same latent factor cluster together. This suggests that the true dimension is indeed lower than the number of features. We now run FeatSeekR to rank the features based on their uniqueness and reproducibility.

# select the top 5 features
res <- FeatSeek(sim, max_features=5)
## Input data has: 
## 1500 samples 
## 500 conditions 
## 50 features 
## 3 replicates
## Starting feature ranking!
## Iteration: 1 selected = Latent_factor2_Feature3, replicate F-statistic = 10873.0023316711
## Iteration: 2 selected = Latent_factor5_Feature6, replicate F-statistic = 9364.10884704126
## Iteration: 3 selected = Latent_factor3_Feature6, replicate F-statistic = 6048.5339756004
## Iteration: 4 selected = Latent_factor4_Feature1, replicate F-statistic = 1777.09431418772
## Iteration: 5 selected = Latent_factor1_Feature5, replicate F-statistic = 60.3857000568657
## Finished feature ranking procedure!
## Calculating explained variance of selected feature sets!
# plot a heatmap of the top 5 selected features 
plotSelectedFeatures(res)

We again visualize the selected features by plotting their correlation matrix. As expected, the top 5 selected features are each from a different latent factor and low correlated. This suggests that we were able to obtain a compressed version of the data, while keeping most of the contained information.

4 Session Info

sessionInfo()
## R Under development (unstable) (2025-01-20 r87609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
##  [5] IRanges_2.41.2              S4Vectors_0.45.2           
##  [7] BiocGenerics_0.53.6         generics_0.1.3             
##  [9] MatrixGenerics_1.19.1       matrixStats_1.5.0          
## [11] pheatmap_1.0.12             FeatSeekR_1.7.2            
## [13] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9              SparseArray_1.7.4       lattice_0.22-6         
##  [4] magrittr_2.0.3          pracma_2.4.4            digest_0.6.37          
##  [7] evaluate_1.0.3          grid_4.5.0              RColorBrewer_1.1-3     
## [10] bookdown_0.42           fastmap_1.2.0           jsonlite_1.8.9         
## [13] Matrix_1.7-2            tinytex_0.54            BiocManager_1.30.25    
## [16] httr_1.4.7              UCSC.utils_1.3.1        scales_1.3.0           
## [19] jquerylib_0.1.4         abind_1.4-8             cli_3.6.3              
## [22] rlang_1.1.5             crayon_1.5.3            XVector_0.47.2         
## [25] munsell_0.5.1           cachem_1.1.0            DelayedArray_0.33.4    
## [28] yaml_2.3.10             S4Arrays_1.7.1          tools_4.5.0            
## [31] colorspace_2.1-1        GenomeInfoDbData_1.2.13 R6_2.5.1               
## [34] magick_2.8.5            lifecycle_1.0.4         MASS_7.3-64            
## [37] bslib_0.8.0             gtable_0.3.6            Rcpp_1.0.14            
## [40] glue_1.8.0              xfun_0.50               knitr_1.49             
## [43] farver_2.1.2            htmltools_0.5.8.1       rmarkdown_2.29         
## [46] compiler_4.5.0