BiocNeighbors 1.14.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 4505 4168 617 8623 4240 6847 2259 5830 4422 1608
## [2,] 1539 6123 9034 8664 9644 2079 6207 5303 2202 8149
## [3,] 2978 6267 4601 1804 5566 1079 2886 5003 8680 9352
## [4,] 2691 9637 5359 8149 5062 5112 5903 274 2143 1410
## [5,] 5874 1426 133 8065 1319 3159 7696 8543 5047 8534
## [6,] 7955 3359 549 5782 4825 6797 2841 1235 6215 7083
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8825449 0.9915909 1.0092245 1.0188426 1.0392540 1.0516332 1.0519289
## [2,] 0.8622014 0.9186255 0.9317672 1.0342002 1.0359803 1.0780544 1.0795301
## [3,] 0.8337301 0.8998619 0.9341834 0.9551983 0.9601048 0.9736210 0.9805866
## [4,] 0.8187793 0.8193409 0.9103844 0.9208554 0.9429098 0.9466539 0.9617505
## [5,] 0.8705071 0.9711886 0.9909576 1.0005188 1.0054835 1.0277821 1.0325969
## [6,] 0.9291477 0.9332741 0.9387993 0.9396378 0.9666235 0.9844171 0.9942077
## [,8] [,9] [,10]
## [1,] 1.0579325 1.058985 1.060058
## [2,] 1.0945244 1.105723 1.114920
## [3,] 1.0039030 1.012025 1.012901
## [4,] 0.9816138 1.012709 1.028672
## [5,] 1.0444629 1.049320 1.053118
## [6,] 1.0305445 1.050729 1.057779
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 7059 7607 2129 2347 6307
## [2,] 6159 8596 6062 8976 9349
## [3,] 5669 9320 7673 6012 7089
## [4,] 4221 4691 3187 7418 4053
## [5,] 8709 1907 4231 4425 4741
## [6,] 8317 1419 849 9364 4506
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9419826 0.9975957 1.0318391 1.0445652 1.0455526
## [2,] 0.8231115 0.8697896 0.9104496 0.9648659 0.9910955
## [3,] 0.9634522 1.0049822 1.0079889 1.0189093 1.0468636
## [4,] 0.8537957 0.8811178 0.9886814 1.0454019 1.0553904
## [5,] 0.8202045 0.9335378 1.0480314 1.0797175 1.0798849
## [6,] 0.8650038 0.9380068 0.9622783 0.9756745 1.0326613
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpayapNL/file1b8bdd5f9f8fb5.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.14.0 knitr_1.38 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 magrittr_2.0.3 BiocGenerics_0.42.0
## [4] BiocParallel_1.30.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.2 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.2.0 parallel_4.2.0 grid_4.2.0
## [13] xfun_0.30 cli_3.3.0 jquerylib_0.1.4
## [16] htmltools_0.5.2 yaml_2.3.5 digest_0.6.29
## [19] bookdown_0.26 Matrix_1.4-1 BiocManager_1.30.17
## [22] S4Vectors_0.34.0 sass_0.4.1 evaluate_0.15
## [25] rmarkdown_2.14 stringi_1.7.6 compiler_4.2.0
## [28] bslib_0.3.1 stats4_4.2.0 jsonlite_1.8.0