The rSWeeP package is an R implementation of the SWeeP method. The main function of this package is to provide a vector representation of biological sequences (nucleotides or amino acids), and thus favor alignment-free phylogenetic studies. Each sequence provided is represented by a compact numerical vector which is easier to analyze. The method is based on k-mers counting and random projection of high-dimensional vectors into low-dimensional vectors. In addition, the package allows general dimensionality reduction of RNAseq data and generic matrices.
rSWeeP 1.19.1
The “Spaced Words Projection (SWeeP)” is a method for representing biological sequences in compact vectors. The sequences are scanned by a reading mask that corresponds to a given k-mer. This reading generates a high-dimensional vector (HDV) which, by random projection, is reduced to a low-dimensional vector (LDV), preserving the relative distances between the high-dimensional vectors (preserving the information).
The SWeeP function is the original implementation of the method as in Pierri et al. (2019), which uses a random orthonormal basis generated by the orthBase
function.
The SWeePlite
function is optimized for processing with larger masks (larger k-mer) and for larger volumes of data (long sequences or large numbers of sequences).
These functions are suitable for making high quality comparisons between sequences allowing analyzes that are not possible due to the computational limitation of the traditional techniques. The MATLAB version of the method is available at sWeeP (PIERRI, 2019). This tool has it’s main speed gain in constanci processing time. The response time grows linear to the number of inputs, while in other methods it grow is exponencial.
Tutorials and more information are available at https://aibialab.github.io/rSWeeP.
The package has six functions:
The SWeeP and SWeePlite functions are responsible for vectorizing the sequences using the referenced SWeeP method (PIERRI, 2019).
The page https://aibialab.github.io/rSWeeP provides basic tutorials on using and parameterizing the rSWeeP package. Below you can find a quickstart of the package.
Consider a set of 13 mitochondrial proteomes (translated CDSs) deposited in the folder at the address path
in FASTA format. These sequences can be vectorized as below:
library(rSWeeP)
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeePlite(path,seqtype='AA',mask=c(4),psz=1000)
## Starting projection. Please wait.
## starting sequence 1 of 14 - complete
## starting sequence 2 of 14 - complete
## starting sequence 3 of 14 - complete
## starting sequence 4 of 14 - complete
## starting sequence 5 of 14 - complete
## starting sequence 6 of 14 - complete
## starting sequence 7 of 14 - complete
## starting sequence 8 of 14 - complete
## starting sequence 9 of 14 - complete
## starting sequence 10 of 14 - complete
## starting sequence 11 of 14 - complete
## starting sequence 12 of 14 - complete
## starting sequence 13 of 14 - complete
## starting sequence 14 of 14 - complete
In sw$proj
you will find the SWeeP vectors in matrix format with 13 rows (each row a sample) and 1000 columns (coordinates). In sw$info
other processing information is stored, which may be important in subsequent steps.
sw$info
## $headers
## [1] "01_Pan_troglodytes" "02_Capra_aegagrus"
## [3] "03_Homo_sapiens" "04_Bos_taurus"
## [5] "05_Ara_ararauna" "06_Mus_musculus"
## [7] "07_Brotogeris_cyanoptera" "08_Homo_sapiens_neanderthalensis"
## [9] "09_Gazella_gazella" "10_Rattus_norvegicus"
## [11] "11_Pan_paniscus" "12_Psittacara_rubritorquis"
## [13] "13_Apodemus_sylvaticus" "14_Rhazya_stricta"
##
## $ProjectionSize
## [1] 1000
##
## $bin
## [1] "counting (FALSE)"
##
## $mask
## [1] 1 1 1 1
##
## $SequenceType
## [1] "AA"
##
## $extension
## [1] ""
##
## $version
## [1] '1.19.1'
##
## $norm
## [1] "none"
##
## $timeElapsed
## [1] 7.605
$ProjectionSize
- length of the output vector (length of the projection)$mask
- mask used (provided in extended binary format)$SequenceType
- data type (amino acid or nucleotide)$concatenate
- if the files were concatenated into one$bin
- whether binary (TRUE) or count (FALSE)$version
- version of rSWeeP used$norm
- type of normalization used (‘none’, ‘log’ or ‘logNeg’)$timeElapsed
- time spent in processing$headers
- name of the vectorized files/samplesWe can obtain the phylogenetic relationship between the vectorized organisms using the Neighbour Joining (NJ) method.
library(ape)
##
## Attaching package: 'ape'
## The following object is masked from 'package:Biostrings':
##
## complement
# get the distance matrix
mdist = dist(sw$proj,method='euclidean')
# use the NJ algorithm to build the tree
tr = nj(mdist)
# root the tree in the plant sample
tr = root(tr,outgroup='14_Rhazya_stricta')
# plot
plot(tr)
To visualize the vectorized data graphically and evaluate the phylogenetic tree, we provide the metadata with the classes at the family taxonomic level.
pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv")
mt = read.csv(pathmetadata,header=TRUE)
We evaluated the phylogenetic tree in terms of taxon groupings with the PCCI metric, and checked the percentage of consistent taxa, mono or paraphyletic with the PMPG metric.
data = data.frame(sp=mt$fileName,family=mt$family)
PCCI(tr,data) # PhyloTaxonomic Consistency Cophenetic Index
## $tab
## taxa cost
## 1 Hominidae 1
## 2 Bovidae 1
## 3 Psittacidae 1
## 4 Muridae 1
## 5 Apocynaceae 1
##
## $mean
## [1] 1
PMPG(tr,data) # Percentage of Mono or Paraphyletic Groups
## $tab
## taxa mono para
## 1 Hominidae TRUE FALSE
## 2 Bovidae TRUE FALSE
## 3 Psittacidae TRUE FALSE
## 4 Muridae TRUE FALSE
## 5 Apocynaceae TRUE FALSE
##
## $percMono
## [1] 100
##
## $percPara
## [1] 0
##
## $metric
## [1] 100
We obtain the PCA and visualize the first components.
pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1],pca_output$x[,2],xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = mt$id)
legend("bottomright",unique(mt$family),col=as.character(c(1:length(unique(mt$family)))),pch=20)
plot(pca_output$x[,3],pca_output$x[,4],xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = mt$id)
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ape_5.8 rSWeeP_1.19.1 Biostrings_2.75.1
## [4] GenomeInfoDb_1.43.1 XVector_0.47.0 IRanges_2.41.1
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.3 generics_0.1.3
## [10] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 compiler_4.5.0 BiocManager_1.30.25
## [4] crayon_1.5.3 tinytex_0.54 Rcpp_1.0.13-1
## [7] magick_2.8.5 jquerylib_0.1.4 yaml_2.3.10
## [10] fastmap_1.2.0 lattice_0.22-6 R6_2.5.1
## [13] knitr_1.49 bookdown_0.41 GenomeInfoDbData_1.2.13
## [16] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
## [19] stringi_1.8.4 xfun_0.49 sass_0.4.9
## [22] cli_3.6.3 magrittr_2.0.3 zlibbioc_1.53.0
## [25] grid_4.5.0 digest_0.6.37 nlme_3.1-166
## [28] lifecycle_1.0.4 evaluate_1.0.1 codetools_0.2-20
## [31] rmarkdown_2.29 httr_1.4.7 tools_4.5.0
## [34] htmltools_0.5.8.1 UCSC.utils_1.3.0