The DNABarcodes
package offers a function to create DNA barcode sets capable
of correcting substitution errors or insertion, deletion, and substitution
errors. Existing barcodes can be analysed regarding their minimal, maximal and
average distances between barcodes. Finally, reads that start with a (possibly
mutated) barcode can be demultiplexed, i.e., assigned to their original
reference barcode.
The most common use cases are:
create.dnabarcodes
is used to create sets of DNA barcodes with error correction properties.
demultiplex
is used to demultiplex a set of reads based on the used set of DNA barcodes
analyse.barcodes
is used to analyse the properties of an existing set of DNA barcodes
and to assess its error correction capabilities
Suppose we want to generate a set of 5bp long barcodes with default settings. The default enforces a minimum Hamming distance of 3 between each DNA barcode, which is sufficient to correct at least one substitution error:
library("DNABarcodes")
## Loading required package: Matrix
## Loading required package: parallel
mySet <- create.dnabarcodes(5)
## 1) Creating pool ... of size 592
## 2) Conway closing... done
show(mySet)
## [1] "GAGAA" "AGCAA" "CCTAA" "CAAGA" "ACGGA" "GTCGA" "TGTGA" "GGACA" "CTGCA"
## [10] "TACCA" "CGAAG" "TCGAG" "GTTAG" "ATAGG" "AAGCG" "GAATG" "TGCTG" "ACTTG"
## [19] "ACAAC" "CACAC" "TAGGC" "CTTGC" "TTACC" "GATCC" "AGGTC" "GCCAT" "TCAGT"
## [28] "AACGT" "TGGCT" "CAGTT"
In default mode, the Conway lexicographic algorithm is used to generate the set. Conway’s algorithm is simple and most of the time efficient enough. When sufficient computing power is available, Ashlock’s evolutionary algorithm should be used. The parameter to change the set generation algorithm is named heuristic.
mySetAshlock <- create.dnabarcodes(5, heuristic="ashlock")
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySetAshlock)
## [1] "GGTAA" "TCTGA" "TGGTA" "CAGAA" "ACCAA" "GAAGA" "ATGGA" "AGACA" "TACCA"
## [10] "CTTCA" "CCATA" "GTCTA" "CGAAG" "TCGAG" "GACAG" "ACAGG" "TTCGG" "CATGG"
## [19] "GTACG" "AAGCG" "TGTCG" "CTGTG" "AGCTG" "GCTTG" "GCAAC" "AGGAC" "CTCAC"
## [28] "TGAGC" "AACGC" "GTTGC" "CAACC" "TTGCC" "ACTCC" "GAGTC" "TCCTC" "CGTTC"
## [37] "GTGAT" "TGCAT" "CCTAT" "CTAGT" "TAGGT" "AGTGT" "TCACT" "ATCCT" "GATCT"
## [46] "GGATT" "ACGTT" "CACTT"
Importantly, the set of DNA barcodes is most often larger when Ashlock’s algorithm is used.
Finally, we want to create a set of 5bp long DNA barcodes that support the correction of 2 substitutions. For that, we enforce a minimum distance of 5 using the parameter dist.
mySetDist5 <- create.dnabarcodes(5, dist=5, heuristic="ashlock")
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySetDist5)
## [1] "CGTTA" "GTACG" "ACGAT" "TACGC"
The number of errors \(k_c\) that a set of DNA barcodes can correct is expressed by the following formula. The variable \(dist\) gives the minimum distance of the set.
\[k_c \leq \left\lfloor \frac{dist - 1}{2} \right\rfloor\]
The number of \(k_d\) of detectable errors is given as:
\[k_d \leq dist - 1\]
The following table shows common distances \(dist\) and their error correction (\(k_c\)) and detection (\(k_d\)) properties:
\(dist\) | \(k_c\) | \(k_d\) |
---|---|---|
3 | 1 | 2 |
4 | 1 | 3 |
5 | 2 | 4 |
6 | 2 | 5 |
7 | 3 | 6 |
8 | 3 | 7 |
9 | 4 | 8 |
To obtain a large enough set for the targeted number of samples, it is often necessary to increase barcode length. Here, we want to correct 2 substitution errors and want to target at least 20 samples:
show(length(create.dnabarcodes(5, dist=5, heuristic="ashlock")))
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## [1] 4
show(length(create.dnabarcodes(6, dist=5, heuristic="ashlock")))
## 1) Creating pool ... of size 1160
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## [1] 8
show(length(create.dnabarcodes(7, dist=5, heuristic="ashlock")))
## 1) Creating pool ... of size 7568
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## [1] 21
Therefore, 7bp long DNA barcodes should be used.
To generate sets of DNA barcodes that support the correction of insertions, deletions, and substitutions (e.g., for the PacBio platform), a different distance metric must be used. In a DNA context (i.e., the DNA barcode is surrounded by other DNA nucleotides), the Sequence-Levenshtein distance is the right choice. Here, we generate a set of 5bp long DNA barcodes capable of correcting one indel or substitution error. The parameter to change the distance metric is named metric:
mySeqlevSet <- create.dnabarcodes(5, metric="seqlev", heuristic="ashlock")
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySeqlevSet)
## [1] "GTCAC" "ACTCT" "TCGAA" "AGCAA" "GGAGA" "CATGA" "TTAGG" "AACGG" "TATCC"
## [10] "TGTGT" "CCGTT" "GGCTT"
The requirements of the Sequence-Levenshtein distance are more stringent than those of the Hamming distance. The number of DNA barcodes in the set is therefore smaller, but the set is more robust.
By default, create.dnabarcodes
filters all those DNA sequences that contain
triplets, show a GC bias, or are self-complementary. Some of these filters may
be unnecessary on current or future platforms. For example, with the Illumina’s
Sequencing by Synthesis technology, the triplet filter is unnecessary. Here, we
generate a default set of DNA barcodes of length 5bp without filtering triplets:
mySetTriplets <- create.dnabarcodes(5, heuristic="ashlock", filter.triplets=FALSE)
## 1) Creating pool ... of size 640
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySetTriplets)
## [1] "GCTAA" "TGTGA" "ACTCT" "CGAAA" "GAAGA" "ACGGA" "CTCGA" "TCACA" "CAGCA"
## [10] "AGCCA" "GGGTA" "ACAAG" "GAGAG" "TGCAG" "CTTAG" "TTAGG" "AACGG" "ATGCG"
## [19] "TATCG" "CAATG" "TCGTG" "GTCTG" "AGTTG" "GTAAC" "AGGAC" "CACAC" "TAGGC"
## [28] "ATTGC" "AAACC" "TTCCC" "TGATC" "CTGTC" "ACCTC" "GATTC" "CCGAT" "AGAGT"
## [37] "GTGGT" "TCCGT" "CATGT" "CTACT" "TGGCT" "GACCT" "GCATT" "CGCTT"
Note that the size of the pool of candidate barcodes is larger (640) than for the default setting that includes the triplet filter (592). Potentially, this allows the creation of larger sets, which we observed for longer DNA barcodes.
In the following, we describe methods to generate subsets of existing sets of DNA barcodes. Often, a researcher already has a pre-existing set of DNA barcodes, for example in chemical form as indexes from his supplier. When a smaller number of samples needs to multiplexed, he can choose a more robust subset of the existing indexes to get better error correction capabilities.
The set of candidate barcodes is given to create.dnabarcodes
through the
parameter pool.
For example, the researcher may have a RNASeq library preparation kit that includes 48 indexes. The set can already be used to correct 1 substitution. He creates a robust subset for the correction of 2 substitutions the following way:
data(supplierSet)
myRobustSet <- create.dnabarcodes(7, dist=5, pool=supplierSet, heuristic="ashlock")
## 1) Creating pool ... of size 48
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(myRobustSet)
## [1] "CCGGTTA" "CGATAGC" "TCTTGAG" "GGAAGAA" "GATCACA" "AAGATGG" "AGACCTG"
## [8] "ATCGGAC" "CAAGCAT" "TGCAACT" "CTTCGTT"
Alternatively, we may want to create a subset that is capable of correcting indels in addition to substitutions:
myRobustSetSeqlev <- create.dnabarcodes(7, metric="seqlev", pool=supplierSet, heuristic="ashlock")
## 1) Creating pool ... of size 48
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(myRobustSetSeqlev)
## [1] "TTCGGTA" "GTGGTCA" "GGAAGAA" "CCAGGAA" "CTGCAGA" "GCTTAGA" "CCTCTGA"
## [8] "GATCACA" "ATAGGCA" "CCAATCA" "TGCACTA" "CCGGTTA" "TGTCCAG" "AAGATGG"
## [15] "TTATGCG" "ATCGTCG" "CAGTGTG" "AGACCTG" "CTCCTTG" "CTAGTAC" "CGATAGC"
## [22] "CAACTGC" "TCGATCT" "GGTCTCT"
Demultiplexing is processing step where reads are assigned to their samples.
Demultiplexing is achieved easily with the demultiplex
function. In the
following example we assume to have a file that contains all reads that start
with a DNA barcode. The used set contains 48 7nt long DNA barcodes and was
generated to correct up to one substitution.
data(mutatedReads)
demultiplex(head(mutatedReads), supplierSet)
## read barcode distance
## CCTCAAC CGTCAACCCGTCAACACGTCAACACGTCAACGCGTCAACT CCTCAAC 1
## GGAAGAA CGAAGAACCGAAGAAACGAAGAACCGAAGAAGCGAAGAAA GGAAGAA 1
## GTTAGCA GTTAGCAGGTTAGCATGTTAGCAGGTTAGCATGTTAGCAG GTTAGCA 0
## CCAGGAA TCAGGAACTCAGGAACTCAGGAAGTCAGGAATTCAGGAAT CCAGGAA 1
## CTAGTAC CAAGTACTCAAGTACCCAAGTACCCAAGTACTCAAGTACG CTAGTAC 1
## GGTCTCT GGTCACTGGGTCACTAGGTCACTAGGTCACTAGGTCACTT GGTCTCT 1
It is important to supply the correct distance metric to demultiplex
using
the parameter metric. Otherwise the result will change unintentionally and in
unexpected ways.
Suppose we obtained a set of DNA barcodes in form of sample indexes from our
library preparation supplier. We want to analyse the set to understand the
errors that can be corrected or detected. The function analyse.barcodes
easily does just that:
analyse.barcodes(supplierSet)
## Description hamming seqlev levenshtein
## 1 Mean Distance 5.242908 3.656915 4.560284
## 2 Median Distance 5.000000 4.000000 5.000000
## 3 Minimum Distance 3.000000 1.000000 2.000000
## 4 Maximum Distance 7.000000 7.000000 7.000000
## 5 Guaranteed Error Correction 1.000000 0.000000 0.000000
## 6 Guaranteed Error Detection 2.000000 0.000000 1.000000
The output table lists mean, median, minimum, and maximum distances of each pair of DNA barcodes in the set. Finally, it lists the guaranteed error correction and detection capabilities that can be reached for the set. The analysis is presented for a choice of distance metrics: Hamming, Sequence-Levenshtein, and Levenshtein.
The DNABarcodes
package currently supports four distance metrics. Their
capabilities and properties are as follows:
A high enough Hamming Distance (metric = "hamming"
) allows the
correction/detection of substitutions. Due to the ignorance of insertions and
deletions, any changes to the length of the barcode as well as DNA context are
ignored, which makes the Hamming distance a simple choice.
A high enough Sequence Levenshtein distance (metric = "seqlev"
) allows the
correction/detection of insertions, deletions, and substitutions in scenarios
where the barcode was attached to a DNA sequence. Your sequence read, coming
from a Illumina, Roche, PacBio NGS machine of your choice, should then start
with that barcode, followed by the insert or an adapter or some random base
calls.
A high enough Levenshtein distance (metric = "levenshtein"
) allows the
correction/detection of insertions, deletions, and substitutions in scenarios
where the barcode was not attached anywhere, respective where the exact
beginning and end of the barcode is known. This is as far as we know in no
current NGS technology the case. Do not use this distance metric, except you
know what you are doing.
A high enough Phaseshift distance (metric = "phaseshift"
) allows the
correction/detection of substitutions as well as insertions or deletions that
occurred to the front of the DNA barcode. The intended target technology of
this distance is the Illumina Sequencing by Synthesis platform. We consider
this metric to be highly experimental.
We support four different algorithms for the generation of sets of DNA barcodes. Each has its particular advantages and disadvantages.
The heuristics conway
and clique
produce fast results but are not nearly as
good as the heuristics sampling
and ashlock
. The clique
heuristic is
marginally slower and needs more memory than the conway
heuristic because it
first constructs a graph representation of the pool.
The heuristic ashlock
is assumed to produce the best heuristic results with a
reasonable parameter configuration. New users should first try conway
and
then ashlock
with default parameters.
Ashlock
HeuristicThe Ashlock heuristic is an evolutionary algorithm. Briefly, it iterates over a population of so-called chromosomes (not to be confused with actual genomic chromosomes), while trying to improve set creation results in each step. In each iteration, it retains the best chromosomes and replaces the rest with mutated copies of the best ones.
The algorithm has two parameters:
iterations
)population
)More iterations and a larger population will (possibly) increase the size of the resulting set of DNA Barcodes but will also increase running time considerably.
We believe that 100 is a very good default for the number of iterations. The best chance, in our opinion, is to increase the number of chromosomes in the population considerably, for example up to 500 or even 1000.
The following examples show the increase in set size:
length(create.dnabarcodes(10, metric="seqlev", heuristic="ashlock",cores=32))
## 1) Creating pool ... of size 488944
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## 2126
length(create.dnabarcodes(10, metric="seqlev", heuristic="ashlock",cores=32, population=500, iterations=250))
## 1) Creating pool ... of size 488944
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## 2133