Key differences between DNAcycP2 and DNAcycP
Following the release of DNAcycP, it was found that the intrinsic cyclizability scores derived from Basu et al. (2021) retained residual bias from the biotin effect, resulting in inaccuracies (Kendall et al., 2025). To address this, we employed a data augmentation + moving average smoothing method to produce unbiased estimates of intrinsic DNA cyclizability for each sequence in the original training dataset. A new model, trained on this corrected data but using the same architecture as DNAcycP, was developed, resulting in DNAcycP2. This version also introduces improved computational efficiency through parallelization options. Further details are available in Kendall et al. (2025).
To demonstrate the differences, we compared predictions from DNAcycP and DNAcycP2 in a yeast genomic region at base-pair resolution (Figure 1). The predicted biotin-dependent scores (\(\tilde C_{26}\), \(\tilde C_{29}\), and $ C_{31}$, model trained separately) show 10-bp periodic oscillations due to biotin biases, each with distinct phases. DNAcycP’s predictions improved over the biotin-dependent scores, while still show substantial local fluctuations likely caused by residual bias in the training data (the called intrinsic cyclizability score \(\hat C_0\) from Basu et al. 2021). In contrast, DNAcycP2, trained on corrected intrinsic cyclizability scores, produces much smoother local-scale predictions, indicating a further improvement in removing the biotin bias.
The DNAcycP2 package retains all prediction functions from the original DNAcycP. The improved prediction model, based on smoothed data, can be accessed using the argument smooth=TRUE in the main function (see usage below).
Usage
Main Functions
The DNAcycP2 R package provides two primary functions for cyclizability prediction:
cycle
: Takes an R object (vector of strings) as input. Each element in the vector is a DNA sequence.
cycle_fasta
: Takes the path of a fasta file as input.
Selecting the Prediction Model
Both functions use the smooth
argument to specify the prediction model:
smooth=TRUE
: DNAcycP2 (trained on smoothed data, recommended).
smooth=FALSE
: DNAcycP (trained on original data).
Parallelization with cycle_fasta
The cycle_fasta
function is designed for handling larger files and supports parallelization. To enable parallelization, use the following arguments:
n_cores
: Number of cores to use (default: 1).
chunk_length
: Sequence length (in bp) each core processes at a time (default: 100,000).
The cycle_fasta
function is designed for larger files, so it has added parallelization capability. To utilize this capability, specify the number of cores to be greater than 1 using the n_cores
argument (default 1). You can also specify the length of the sequence that each core will predict on at a given time using the chunk_length
argument (default 100000).
For reference, on a personal computer (16 Gb RAM, M1 chip with 8-core CPU), prediction at full parallelization directly on the yeast genome FASTA file completes in 12 minutes, and on the hg38 human genome Chromosome I FASTA file in just over 4 hours. In our experience, selection of parallelization parameters (n_cores
and chunk_length
) has little affect when making predictions on a personal computer, but if using the package on a high- performance compute cluster, prediction time should decrease as the number of cores increases. If you do run into memory issues, we first suggest reducing chunk_length
.
Prediction on a list/vector of sequences
ex2_file <- system.file("extdata", "ex2.txt", package = "DNAcycP2")
ex2 <- read.csv(ex2_file, header = FALSE)
ex2_smooth <- DNAcycP2::cycle(ex2$V1, smooth=TRUE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
#> Completed 10 out of 100 total sequences
#> Completed 20 out of 100 total sequences
#> Completed 30 out of 100 total sequences
#> Completed 40 out of 100 total sequences
#> Completed 50 out of 100 total sequences
#> Completed 60 out of 100 total sequences
#> Completed 70 out of 100 total sequences
#> Completed 80 out of 100 total sequences
#> Completed 90 out of 100 total sequences
#> Completed 100 out of 100 total sequences
ex2_original <- DNAcycP2::cycle(ex2$V1, smooth=FALSE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
#> Completed 10 out of 100 total sequences
#> Completed 20 out of 100 total sequences
#> Completed 30 out of 100 total sequences
#> Completed 40 out of 100 total sequences
#> Completed 50 out of 100 total sequences
#> Completed 60 out of 100 total sequences
#> Completed 70 out of 100 total sequences
#> Completed 80 out of 100 total sequences
#> Completed 90 out of 100 total sequences
#> Completed 100 out of 100 total sequences
cycle
takes the sequences themselves as input, so we first read the file (ex2_file
) and then provide the sequences as input (ex2$V1
)
The output (ex2_smooth
or ex2_original
) is a list with indices corresponding to each sequence from the sequences
argument (here it is ex2$V1
). For example, ex2.txt
contains 100 sequences. Therefore, both ex2_smooth
and ex2_original
will be lists of length 100,
where each entry in the list corresponds to the sequence with its same index.
Each item in the list (e.g. ex2_smooth[[1]]
) is a data.frame object with three columns. The first columns is always position
. When smooth=TRUE
, the second and third columns are C0S_norm
and C0S_unnorm
, and when smooth=FALSE
the second and third columns are C0_norm
and C0_unnorm
.
DNAcycP2 prediction – Normalized vs unnormalized
Both cycle_fasta
and cycle
output the prediction results in normalized (C0_norm
,C0S_norm
) and unnomralized (C0_unnorm
,C0S_unnorm
) version.
In DNAcycP2, the predicted cyclizability always contains normalized and unnormalized values. the unnormalized results were based on the model trained on unnormalized \(\hat C_0\) or \(\hat C_0^s\) scores. In contrast, the normalized results were predicted by the model trained on the normalized \(\hat C_0\) or \(\hat C_0^s\) values. The cyclizability score from different loop-seq libraries may be subject to a systematic library-specific constant difference due to its definition (see
Basu et al 2021), and hence it’s a relative measure and not directly comparable between libraries. The normalization will force the training data to have mean = 0 and standard deviation = 1 such that the 50 bp sequences from yeast genome roughly have mean = 0 and standard deviation = 1 for intrinsic cyclizabilty score. Thus for any sequence under prediciton, the normalized C-score can be more informative in terms of its cyclizabilty relative to the population. For example, the C-score provides statisitcal significance indicator, i.e. a C-score of 1.96 indicates 97.5% in the distribution.
Save DNAcycP2 prediciton to external file
Both cycle_fasta
and cycle
provides an argument save_path_prefix
to save the prediction results onto local hard drive. For example:
ex2_smooth <- DNAcycP2::cycle(
ex2$V1,
smooth=TRUE,
save_path_prefix="ex2_smooth"
)
This will execute the same predictions as previously, and additionally save two files named ‘ex2_smooth_C0S_norm.txt’ and ‘ex2_smooth_C0S_unnorm.txt’ to the current working directory. The output files from cycle_fasta
have the same format as the function output, but for consistency with the Python pacakge it is important to note that the output files from cycle
have a different format than the function output. Namely, rather than writing a single file for every sequence, the function always writes two files (regardless of the number of sequences), one containing normalized predictions for every sequence (ending in ‘C0S_norm.txt’ or ‘C0_norm.txt’) and the other containing unnormalized predictions for every sequence (ending in ‘C0S_unnorm.txt’ or ‘C0_unnorm.txt’). C-scores in each line correspond to the sequence from the sequences
input in the same order.
For any input sequence, DNAcycP2 predicts the C-score for every 50 bp. Regardless of the input sequence format the first C-score in the output file corresponds to the sequence from position 1-50, second for 2-51 and so forth.
Example 3 (Single Sequence):
If you want the predict C-scores for a single sequence, you can follow the same protocol as Example 1 or 2, depending on the input format. We have included two example files representing the same 1000bp stretch of S. Cerevisiae sacCer3 Chromosome I (1:1000) in .fasta and .txt format.
First, we will consider the .fasta format:
ex3_fasta_file <- system.file(
"extdata", "ex3_single_seq.fasta", package = "DNAcycP2"
)
ex3_fasta_smooth <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=TRUE)
#> Sequence length for ID 1: 1000
#> Predicting cyclizability...
#> Chunk size: 100000, num threads: 1
ex3_fasta_original <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=FALSE)
#> Sequence length for ID 1: 1000
#> Predicting cyclizability...
#> Chunk size: 100000, num threads: 1
The output (ex3_fasta_smooth
or ex3_fasta_original
) is a list with 1 entry named “cycle_1”.
Let’s say we are interested only in the smooth (DNAcycP2), normalized predictions for the subsequence defined by the first 100bp (corresponding to subsequences defined by regions [1,50], [2,51], …, and [51-100], or position
s 25, 26, …, and 75). We can access the outputs for this subsequence using the following command:
ex3_fasta_smooth[[1]][1:51,c("position", "C0S_norm")]
#> position C0S_norm
#> 1 25 0.94794828
#> 2 26 0.88397902
#> 3 27 1.05313075
#> 4 28 1.31837559
#> 5 29 1.43364513
#> 6 30 1.47490740
#> 7 31 1.68857002
#> 8 32 1.79380059
#> 9 33 1.78517485
#> 10 34 1.51873457
#> 11 35 1.83588874
#> 12 36 1.81811345
#> 13 37 1.89403594
#> 14 38 1.94740200
#> 15 39 1.67138946
#> 16 40 1.55992007
#> 17 41 1.66846406
#> 18 42 1.62409890
#> 19 43 1.54238129
#> 20 44 1.46230042
#> 21 45 1.41134250
#> 22 46 1.27757108
#> 23 47 1.02153277
#> 24 48 1.01221037
#> 25 49 0.93556404
#> 26 50 1.00608754
#> 27 51 1.09152973
#> 28 52 1.32879055
#> 29 53 1.25343049
#> 30 54 1.34268761
#> 31 55 1.13988316
#> 32 56 1.00737357
#> 33 57 1.13837850
#> 34 58 1.11469960
#> 35 59 0.96556181
#> 36 60 0.85989267
#> 37 61 0.91100568
#> 38 62 0.87353712
#> 39 63 0.55437237
#> 40 64 0.60366714
#> 41 65 0.41495007
#> 42 66 0.22158629
#> 43 67 0.14434627
#> 44 68 0.08841624
#> 45 69 -0.09656694
#> 46 70 -0.19039956
#> 47 71 -0.34425002
#> 48 72 -0.41227332
#> 49 73 -0.38969830
#> 50 74 -0.33039862
#> 51 75 -0.44366446
Or, equivalently,
ex3_fasta_smooth$cycle_1[1:51,c("position", "C0S_norm")]
#> position C0S_norm
#> 1 25 0.94794828
#> 2 26 0.88397902
#> 3 27 1.05313075
#> 4 28 1.31837559
#> 5 29 1.43364513
#> 6 30 1.47490740
#> 7 31 1.68857002
#> 8 32 1.79380059
#> 9 33 1.78517485
#> 10 34 1.51873457
#> 11 35 1.83588874
#> 12 36 1.81811345
#> 13 37 1.89403594
#> 14 38 1.94740200
#> 15 39 1.67138946
#> 16 40 1.55992007
#> 17 41 1.66846406
#> 18 42 1.62409890
#> 19 43 1.54238129
#> 20 44 1.46230042
#> 21 45 1.41134250
#> 22 46 1.27757108
#> 23 47 1.02153277
#> 24 48 1.01221037
#> 25 49 0.93556404
#> 26 50 1.00608754
#> 27 51 1.09152973
#> 28 52 1.32879055
#> 29 53 1.25343049
#> 30 54 1.34268761
#> 31 55 1.13988316
#> 32 56 1.00737357
#> 33 57 1.13837850
#> 34 58 1.11469960
#> 35 59 0.96556181
#> 36 60 0.85989267
#> 37 61 0.91100568
#> 38 62 0.87353712
#> 39 63 0.55437237
#> 40 64 0.60366714
#> 41 65 0.41495007
#> 42 66 0.22158629
#> 43 67 0.14434627
#> 44 68 0.08841624
#> 45 69 -0.09656694
#> 46 70 -0.19039956
#> 47 71 -0.34425002
#> 48 72 -0.41227332
#> 49 73 -0.38969830
#> 50 74 -0.33039862
#> 51 75 -0.44366446
Next, we will consider the .txt format:
ex3_txt_file <- system.file(
"extdata",
"ex3_single_seq.txt",
package = "DNAcycP2"
)
ex3_txt <- read.csv(ex3_txt_file, header = FALSE)
ex3_txt_smooth <- DNAcycP2::cycle(ex3_txt$V1, smooth=TRUE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
ex3_txt_original <- DNAcycP2::cycle(ex3_txt$V1, smooth=FALSE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
The output (ex3_txt_smooth
or ex3_txt_original
) is a list with 1 entry (unnamed).
Note, that ex3_fasta_smooth
and ex3_txt_smooth
are essentially equivalent. The only exceptions are perhaps slight rounding differences that come from the computation, and that the list ex3_fasta_smooth
has named entries (‘cycle_1’) while ex3_txt_smooth
does not. The same applies for ex3_fasta_original
and ex3_txt_original
.
Therefore, we can use a similar command to access the outputs for our subsequence of interest:
ex3_txt_smooth[[1]][1:51,c("position", "C0S_norm")]
If there is a sequence (or group of sequences) we want to make predictions on, we can also input them directly as strings. For example:
input_seq1 =
"CATGACTGCAGCTAAAACGTTGACCTAGTCGTCAGTCTACGTACTAGCGTAGCTATATCGAGTCTAGCGTCTAG"
input_seq2 = "ATCTTTTGTATATCAAAAGACTAGATCGATTAGCGTACGCCCCTGACTAGATAGATCG"
seq1_smooth = DNAcycP2::cycle(c(input_seq1), smooth=TRUE)
both_seqs_smooth = DNAcycP2::cycle(c(input_seq1, input_seq2), smooth=TRUE)