Contents

1 Introduction

Calculating pairwise distances of either DNA or AA sequences is a common task for evolutionary biologist. The distance calculations are either based on specific nucleotide, codon or amino acid models or on a scoring matrix.

Note: Sequences need to be pre-aligned into so called multiple sequence alignments (MSA), which can be done with a multitude of existing software. Just to mention for example mafft, muscle or the R package msa.

The R package ape for example offers the ape::dist.dna() function, which has implemented a collection of different evolutionary models. MSA2dist extends the possibility to directly calculate pairwise nucloetide distances of an Biostrings::DNAStringSet object or pairwise amino acid distances of an Biostrings::AAStringSet object. The scoring matrix based calculations are implemented in c++ with RcppThread to parallelise pairwise combinations.

It is a non-trivial part to resolve haploid (1n) sequences from a diploid (2n) individual (aka phasing) to further use the haploid sequences for distance calculations. To cope with this situation, MSA2dist uses a literal distance (Chang et al. 2017) which can be directly applied on IUPAC nucleotide ambiguity encoded sequences with the dnastring2dist() function. IUPAC sequences can be for example obtained directly from mapped BAM files and the angsd -doFasta 4 option (Korneliussen, Albrechtsen, and Nielsen 2014).

The Grantham’s score (Grantham 1974) attempts to predict the distance between two amino acids, in an evolutionary sense considering the amino acid composition, polarity and molecular volume. MSA2dist offers with the aastring2dist() function the possibility to obtain pairwise distances of all sequences in an Biostrings::AAStringSet (needs to be pre-aligned). The resulting distance matrix can be used to calculate neighbor-joining trees via the R package ape.

Calculating synonymous (Ks) and nonsynonymous (Ka) substitutions from coding sequences and its ratio Ka/Ks can be used as an indicator of selective pressure acting on a protein. The dnastring2kaks() function can be applied on pre-aligned Biostrings::DNAStringSet() objects to calculate these values either according to (Li 1993) via the R package seqinr or according to the model of (Nei and Gojobori 1986).

Further, all codons can be evaluated among the coding sequence alignment and be plotted to for example protein domains with substitutions or indels with the codonmat2xy() function.

2 Installation

To install this package, start R (version “4.1”) and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSA2dist")

3 Load MSA2dist

# load MSA2dist
library(MSA2dist)
# load example data
data(hiv, package="MSA2dist")
data(AAMatrix, package="MSA2dist")
data(woodmouse, package="ape")

4 Sequence Format conversion

To be able to use distance calculation functions from other R packages, like ape or seqinr, it is necessary to have dedicated sequence format conversion functions. Here, some examples are shown, how to convert from and to a Biostrings::DNAStringSet object.

?Biostrings::DNAStringSet() >>> ?seqinr::as.alignment()

## define two cds sequences
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATG---CATTGC")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
## define names
names(cds1.cds2.aln) <- c("seq1", "seq2")
## convert into alignment
cds1.cds2.aln |> dnastring2aln()
## $nb
## [1] 2
## 
## $nam
## [1] "seq1" "seq2"
## 
## $seq
## [1] "atgcaacattgc" "atg---cattgc"
## 
## $com
## [1] NA
## 
## attr(,"class")
## [1] "alignment"

?seqinr::as.alignment() >>> ?Biostrings::DNAStringSet()

## convert back into DNAStringSet
cds1.cds2.aln |> dnastring2aln() |> aln2dnastring()
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]    12 ATGCAACATTGC                                      seq1
## [2]    12 ATG---CATTGC                                      seq2

?Biostrings::DNAStringSet() >>> ?ape::DNAbin()

## convert into alignment
cds1.cds2.aln |> dnastring2dnabin()
## 2 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 12 
## 
## Labels:
## seq1
## seq2
## 
## Base composition:
##     a     c     g     t 
## 0.286 0.238 0.190 0.286 
## (Total: 24 bases)

?ape::DNAbin() >>> ?Biostrings::DNAStringSet()

## convert back into DNAStringSet
cds1.cds2.aln |> dnastring2dnabin() |> dnabin2dnastring()
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]    12 ATGCAACATTGC                                      seq1
## [2]    12 ATG---CATTGC                                      seq2
## use woodmouse data
woodmouse |> dnabin2dnastring()
## DNAStringSet object of length 15:
##      width seq                                              names               
##  [1]   965 NTTCGAAAAACACACCCACTACT...GCCCAATTACTCAGACCCTATA No305
##  [2]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCTGTN No304
##  [3]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCTATA No306
##  [4]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAATACNNNN No0906S
##  [5]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCNNNN No0908S
##  ...   ... ...
## [11]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCNNNN No1007S
## [12]   965 NNNNNNNNNNNNNNNNNNNNNNN...GCCCAATTACTCAAACCCNNNN No1114S
## [13]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCNNNN No1202S
## [14]   965 ATTCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCNNNN No1206S
## [15]   965 NNNCGAAAAACACACCCACTACT...GCCCAATTACTCAAACCCNNNN No1208S

?Biostrings::AAStringSet() >>> ?seqinr::as.alignment()

## translate cds into aa
aa1.aa2.aln <- cds1.cds2.aln |> cds2aa()
## convert into alignment
aa1.aa2.aln |> aastring2aln()
## $nb
## [1] 2
## 
## $nam
## [1] "seq1" "seq2"
## 
## $seq
## [1] "mqhc" "mxhc"
## 
## $com
## [1] NA
## 
## attr(,"class")
## [1] "alignment"

?seqinr::as.alignment() >>> ?Biostrings::AAStringSet()

## convert back into AAStringSet
aa1.aa2.aln |> aastring2aln() |> aln2aastring()
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]     4 MQHC                                              seq1
## [2]     4 MXHC                                              seq2

?Biostrings::AAStringSet() >>> ?ape::as.AAbin()

## convert into AAbin
aa1.aa2.aln |> aastring2aabin()
## 2 amino acid sequences in a list
## 
## All sequences of the same length: 4

?ape::as.AAbin() >>> ?Biostrings::AAStringSet()

## convert back into AAStringSet
aa1.aa2.aln |> aastring2aabin() |> aabin2aastring()
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]     4 MQHC                                              seq1
## [2]     4 MXHC                                              seq2

5 Frame aware Biostrings::DNAStringSet translation (cds2aa())

To be able to translate a coding sequence into amino acids, sometimes the sequences do not start at the first frame. The cds2aa function can take an alternative codon start site into account (frame = 1 or frame = 2 or frame = 3). However, sometimes it is also necessary that the resulting coding sequence length is a multiple of three. This can be forced by using the shorten = TRUE option.

Simple translation:

## define two cds sequences
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATG---CATTGC")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
## define names
names(cds1.cds2.aln) <- c("seq1", "seq2")
## translate cds into aa
cds1.cds2.aln |> cds2aa()
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]     4 MQHC                                              seq1
## [2]     4 MXHC                                              seq2
aa1.aa2.aln <- cds1.cds2.aln |> cds2aa()

Translation keeping multiple of three sequence length:

## translate cds into aa using frame = 2
## result is empty due to not multiple of three
cds1.cds2.aln |> cds2aa(frame=2)
## AAStringSet object of length 0
## translate cds into aa using frame = 2 and shorten = TRUE
cds1.cds2.aln |> cds2aa(frame=2, shorten=TRUE)
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]     3 CNI                                               seq1
## [2]     3 XXI                                               seq2
## translate cds into aa using frame = 3 and shorten = TRUE
cds1.cds2.aln |> cds2aa(frame=3, shorten=TRUE)
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]     3 ATL                                               seq1
## [2]     3 XXL                                               seq2
## use woodmouse data
woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE)
## AAStringSet object of length 15:
##      width seq                                              names               
##  [1]   321 XRKTHPLLKXISHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTL No305
##  [2]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTL No304
##  [3]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTL No306
##  [4]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQIX No0906S
##  [5]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTX No0908S
##  ...   ... ...
## [11]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTX No1007S
## [12]   321 XXXXXXXXXXXXXXXIDLPAPSN...LLPFLHTSKQRSLIFRPITQTX No1114S
## [13]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTX No1202S
## [14]   321 IRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTX No1206S
## [15]   321 XRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLIFRPITQTX No1208S

Translation using alternative genetic code:

As you can see from the above example, the initial amino acids I will change into M due to the mitochondrial translation code and also some * stop codons will change into a W amino acid.

## alternative genetic code
## use woodmouse data
woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE,
    genetic.code=Biostrings::getGeneticCode("2"))
## AAStringSet object of length 15:
##      width seq                                              names               
##  [1]   321 XRKTHPLLKXISHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTL No305
##  [2]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTL No304
##  [3]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTL No306
##  [4]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQMX No0906S
##  [5]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTX No0908S
##  ...   ... ...
## [11]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTX No1007S
## [12]   321 XXXXXXXXXXXXXXXIDLPAPSN...LLPFLHTSKQRSLMFRPITQTX No1114S
## [13]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTX No1202S
## [14]   321 MRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTX No1206S
## [15]   321 XRKTHPLLKIINHSFIDLPAPSN...LLPFLHTSKQRSLMFRPITQTX No1208S

6 Pairwise sequence comparison

6.1 Calculate pairwise AA distances (aastring2dist())

6.1.1 Grantham’s distance

## calculate pairwise AA distances based on Grantham's distance
aa.dist <- hiv |> cds2aa() |> aastring2dist(score=granthamMatrix())
## 
Computing: [========================================] 100% (done)
## obtain distances
head(aa.dist$distSTRING)
##           U68496   U68497    U68498    U68499    U68500   U68501   U68502
## U68496  0.000000  4.43956 11.516484  9.879121 13.098901 17.05495 19.71429
## U68497  4.439560  0.00000 12.681319 10.406593 13.626374 18.21978 19.35165
## U68498 11.516484 12.68132  0.000000  8.769231  8.571429 13.25275 16.23077
## U68499  9.879121 10.40659  8.769231  0.000000  4.780220 15.58242 15.21978
## U68500 13.098901 13.62637  8.571429  4.780220  0.000000 15.38462 14.35165
## U68501 17.054945 18.21978 13.252747 15.582418 15.384615  0.00000 16.43956
##          U68503    U68504   U68505   U68506   U68507   U68508
## U68496 17.82418 13.857143 14.24176 14.92308 18.01099 19.01099
## U68497 19.80220 13.890110 14.27473 14.82418 18.54945 18.90110
## U68498 16.15385  9.516484 10.27473 11.04396 12.15385 15.56044
## U68499 16.58242 10.483516 12.47253 12.01099 14.24176 13.52747
## U68500 14.95604 10.571429 12.56044 12.06593 13.84615 10.89011
## U68501 17.50549 10.758242 13.08791 11.78022 14.91209 13.86813
## obtain pairwise sites used
head(aa.dist$sitesUsed)
##        U68496 U68497 U68498 U68499 U68500 U68501 U68502 U68503 U68504 U68505
## U68496     91     91     91     91     91     91     91     91     91     91
## U68497     91     91     91     91     91     91     91     91     91     91
## U68498     91     91     91     91     91     91     91     91     91     91
## U68499     91     91     91     91     91     91     91     91     91     91
## U68500     91     91     91     91     91     91     91     91     91     91
## U68501     91     91     91     91     91     91     91     91     91     91
##        U68506 U68507 U68508
## U68496     91     91     91
## U68497     91     91     91
## U68498     91     91     91
## U68499     91     91     91
## U68500     91     91     91
## U68501     91     91     91
## create and plot bionj tree
aa.dist.bionj <- ape::bionj(as.dist(aa.dist$distSTRING))
plot(aa.dist.bionj)

To use a different score matrix, here as an example the AAMatrix from the R package alakazam is used:

## use AAMatrix data
head(AAMatrix)
##   A B C D E F G H I J K L M N P Q R S T V W X Y Z * - .
## A 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0
## B 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 0
## C 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0
## D 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0
## E 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0
## F 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0
aa.dist.AAMatrix <- hiv |> cds2aa() |> aastring2dist(score=AAMatrix)
## 
Computing: [========================================] 100% (done)
head(aa.dist.AAMatrix$distSTRING)
##            U68496     U68497    U68498     U68499     U68500    U68501
## U68496 0.00000000 0.06593407 0.1758242 0.17582418 0.23076923 0.2417582
## U68497 0.06593407 0.00000000 0.2197802 0.19780220 0.25274725 0.2857143
## U68498 0.17582418 0.21978022 0.0000000 0.13186813 0.16483516 0.1868132
## U68499 0.17582418 0.19780220 0.1318681 0.00000000 0.08791209 0.2307692
## U68500 0.23076923 0.25274725 0.1648352 0.08791209 0.00000000 0.2637363
## U68501 0.24175824 0.28571429 0.1868132 0.23076923 0.26373626 0.0000000
##           U68502    U68503    U68504    U68505    U68506    U68507    U68508
## U68496 0.2857143 0.2527473 0.2417582 0.2307692 0.2527473 0.2747253 0.3186813
## U68497 0.2967033 0.2747253 0.2527473 0.2417582 0.2637363 0.3076923 0.3186813
## U68498 0.2087912 0.2417582 0.1758242 0.1758242 0.2087912 0.1868132 0.2637363
## U68499 0.2197802 0.2417582 0.1758242 0.1758242 0.2087912 0.2087912 0.2197802
## U68500 0.2527473 0.2527473 0.2087912 0.2087912 0.2307692 0.2417582 0.2197802
## U68501 0.2527473 0.2637363 0.2087912 0.2307692 0.2197802 0.2417582 0.2747253

6.2 Calculate pairwise DNA distances (dnastring2dist())

6.2.1 ape::dist.dna models

## use hiv data
dna.dist <- hiv |> dnastring2dist(model="K80")
## 
Computing: [========================================] 100% (done)
## obtain distances
head(dna.dist$distSTRING)
##            U68496     U68497     U68498     U68499     U68500    U68501
## U68496 0.00000000 0.03381189 0.07731910 0.08135801 0.11058044 0.1022214
## U68497 0.03381189 0.00000000 0.09396372 0.08960527 0.11916567 0.1194893
## U68498 0.07731910 0.09396372 0.00000000 0.05333488 0.07342797 0.0773191
## U68499 0.08135801 0.08960527 0.05333488 0.00000000 0.04143570 0.1022214
## U68500 0.11058044 0.11916567 0.07342797 0.04143570 0.00000000 0.1237391
## U68501 0.10222140 0.11948926 0.07731910 0.10222140 0.12373909 0.0000000
##           U68502    U68503     U68504     U68505    U68506     U68507    U68508
## U68496 0.1503488 0.1282034 0.11948926 0.11991759 0.1157063 0.13779195 0.1684152
## U68497 0.1594427 0.1280155 0.13725343 0.12865927 0.1381014 0.15620300 0.1682063
## U68498 0.1105804 0.1067010 0.09823725 0.09004207 0.1110678 0.08987206 0.1366493
## U68499 0.1150839 0.1157063 0.09808005 0.09412740 0.1108796 0.10254762 0.1232278
## U68500 0.1457404 0.1286593 0.12357099 0.11948926 0.1323187 0.12415474 0.1234296
## U68501 0.1150839 0.1241547 0.10670096 0.10711515 0.1108796 0.11570627 0.1368233
## obtain pairwise sites used
head(dna.dist$sitesUsed)
##        U68496 U68497 U68498 U68499 U68500 U68501 U68502 U68503 U68504 U68505
## U68496    273    273    273    273    273    273    273    273    273    273
## U68497    273    273    273    273    273    273    273    273    273    273
## U68498    273    273    273    273    273    273    273    273    273    273
## U68499    273    273    273    273    273    273    273    273    273    273
## U68500    273    273    273    273    273    273    273    273    273    273
## U68501    273    273    273    273    273    273    273    273    273    273
##        U68506 U68507 U68508
## U68496    273    273    273
## U68497    273    273    273
## U68498    273    273    273
## U68499    273    273    273
## U68500    273    273    273
## U68501    273    273    273
## create and plot bionj tree
dna.dist.bionj <- ape::bionj(as.dist(dna.dist$distSTRING))

It is also possible to compare the amino acid and nucleotide based trees:

## creation of the association matrix:
association <- cbind(aa.dist.bionj$tip.label, aa.dist.bionj$tip.label)
## cophyloplot
ape::cophyloplot(aa.dist.bionj,
    dna.dist.bionj,
    assoc=association,
    length.line=4,
    space=28,
    gap=3,
    rotate=FALSE)

6.2.2 IUPAC distance

## use hiv data
hiv.dist.iupac <- head(hiv |> dnastring2dist(model="IUPAC"))
## 
Computing: [========================================] 100% (done)
head(hiv.dist.iupac$distSTRING)
##            U68496     U68497     U68498     U68499     U68500     U68501
## U68496 0.00000000 0.03296703 0.07326007 0.07692308 0.10256410 0.09523810
## U68497 0.03296703 0.00000000 0.08791209 0.08424908 0.10989011 0.10989011
## U68498 0.07326007 0.08791209 0.00000000 0.05128205 0.06959707 0.07326007
## U68499 0.07692308 0.08424908 0.05128205 0.00000000 0.04029304 0.09523810
## U68500 0.10256410 0.10989011 0.06959707 0.04029304 0.00000000 0.11355311
## U68501 0.09523810 0.10989011 0.07326007 0.09523810 0.11355311 0.00000000
##           U68502    U68503     U68504     U68505    U68506     U68507    U68508
## U68496 0.1355311 0.1172161 0.10989011 0.10989011 0.1062271 0.12454212 0.1501832
## U68497 0.1428571 0.1172161 0.12454212 0.11721612 0.1245421 0.13919414 0.1501832
## U68498 0.1025641 0.0989011 0.09157509 0.08424908 0.1025641 0.08424908 0.1245421
## U68499 0.1062271 0.1062271 0.09157509 0.08791209 0.1025641 0.09523810 0.1135531
## U68500 0.1318681 0.1172161 0.11355311 0.10989011 0.1208791 0.11355311 0.1135531
## U68501 0.1062271 0.1135531 0.09890110 0.09890110 0.1025641 0.10622711 0.1245421
## run multi-threaded
system.time(hiv |> dnastring2dist(model="IUPAC", threads=1))
## 
Computing: [========================================] 100% (done)
##    user  system elapsed 
##   0.005   0.000   0.004
system.time(hiv |> dnastring2dist(model="IUPAC", threads=2))
## 
Computing: [========================================] 100% (done)
##    user  system elapsed 
##   0.005   0.000   0.004

Woodmouse data example:

## use woodmouse data
woodmouse.dist <- woodmouse |> dnabin2dnastring() |> dnastring2dist()
## 
Computing: [========================================] 100% (done)
head(woodmouse.dist$distSTRING)
##              No305       No304       No306     No0906S    No0908S    No0909S
## No305   0.00000000 0.016684046 0.013541667 0.018789144 0.01670146 0.01670146
## No304   0.01668405 0.000000000 0.005208333 0.013555787 0.01147028 0.01564129
## No306   0.01354167 0.005208333 0.000000000 0.009384776 0.00729927 0.01147028
## No0906S 0.01878914 0.013555787 0.009384776 0.000000000 0.01248699 0.01664932
## No0908S 0.01670146 0.011470282 0.007299270 0.012486993 0.00000000 0.01456816
## No0909S 0.01670146 0.015641293 0.011470282 0.016649324 0.01456816 0.00000000
##             No0910S     No0912S     No0913S     No1103S     No1007S    No1114S
## No305   0.017745303 0.014613779 0.018789144 0.012526096 0.016701461 0.01531729
## No304   0.012513034 0.013555787 0.005213764 0.011470282 0.015641293 0.01642935
## No306   0.008342023 0.009384776 0.005213764 0.007299270 0.011470282 0.01533406
## No0906S 0.009365245 0.014568158 0.012486993 0.012486993 0.016649324 0.02076503
## No0908S 0.011446410 0.012486993 0.012486993 0.010405827 0.014568158 0.02076503
## No0909S 0.015608741 0.010405827 0.016649324 0.008324662 0.002081165 0.02076503
##             No1202S     No1206S     No1208S
## No305   0.016701461 0.016701461 0.018828452
## No304   0.011470282 0.012513034 0.017782427
## No306   0.007299270 0.008342023 0.013598326
## No0906S 0.008324662 0.011446410 0.018789144
## No0908S 0.010405827 0.009365245 0.016701461
## No0909S 0.014568158 0.015608741 0.002087683

7 Coding sequences

7.1 Calculating synonymous and nonsynonymous substitutions (dnastring2kaks())

## use hiv data
## model Li
head(hiv |> dnastring2kaks(model="Li"))
## Joining with `by = join_by(seq1, seq2)`
## Joining with `by = join_by(seq1, seq2)`
## Joining with `by = join_by(seq1, seq2)`
##   Comp1 Comp2   seq1   seq2         ka         ks          vka          vks
## 1     1     2 U68496 U68497 0.03026357 0.03170319 0.0003051202 0.0004007730
## 2     1     3 U68496 U68498 0.09777332 0.01761416 0.0009314173 0.0005091970
## 3     1     4 U68496 U68499 0.10295875 0.01767311 0.0009595527 0.0005787387
## 4     1     5 U68496 U68500 0.13461355 0.04639690 0.0013731885 0.0020497481
## 5     1     6 U68496 U68501 0.12607831 0.02844294 0.0013277282 0.0006310195
## 6     1     7 U68496 U68502 0.17441037 0.10926532 0.0017871934 0.0040766687
## model NG86
head(hiv |> dnastring2kaks(model="NG86", threads=1))
##          Comp1 Comp2   seq1   seq2 Codons Compared Ambigiuous Indels Ns
## result.1     1     2 U68496 U68497     91       91          0      0  0
## result.2     1     3 U68496 U68498     91       91          0      0  0
## result.3     1     4 U68496 U68499     91       91          0      0  0
## result.4     1     5 U68496 U68500     91       91          0      0  0
## result.5     1     6 U68496 U68501     91       91          0      0  0
## result.6     1     7 U68496 U68502     91       91          0      0  0
##                        Sd               Sn                S                N
## result.1                3                6               57              216
## result.2              1.5             18.5             57.5            215.5
## result.3              1.5             19.5             56.5            216.5
## result.4              2.5             25.5 56.1666666666667 216.833333333333
## result.5              2.5             23.5 57.3333333333333 215.666666666667
## result.6 5.83333333333333 31.1666666666667             57.5            215.5
##                          ps                 pn             pn/ps
## result.1 0.0526315789473684 0.0277777777777778 0.527777777777778
## result.2 0.0260869565217391   0.08584686774942   3.2907965970611
## result.3 0.0265486725663717 0.0900692840646651  3.39260969976905
## result.4 0.0445103857566766  0.117601844734819  2.64212144504228
## result.5 0.0436046511627907  0.108964451313756  2.49891808346213
## result.6  0.101449275362319  0.144624903325599  1.42558833278091
##                          ds                 dn             dn/ds
## result.1 0.0545695157118212 0.0283052459871352 0.518700699793888
## result.2  0.026551445288187 0.0911703470876381  3.43372445823884
## result.3 0.0270299523623977 0.0959537646568416  3.54990505977829
## result.4 0.0458858673563109  0.127915513677956  2.78768869474935
## result.5 0.0449236061858017  0.117741219733924  2.62092093067847
## result.6  0.108999740568461  0.160668709585171  1.47402836692312

7.2 Codon comparison

As an example for the codon comparison data from the Human Immunodeficiency Virus Type 1 is used (Ganeshan et al. 1997), (Yang et al. 2000).

The window plots are constructed with the R package ggplot2.

7.2.1 Create codon matrix (dnastring2codonmat())

## define two cds sequences
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATG---CATTGC")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
## convert into alignment
cds1.cds2.aln |> dnastring2codonmat()
##      [,1]  [,2] 
## [1,] "ATG" "ATG"
## [2,] "CAA" "---"
## [3,] "CAT" "CAT"
## [4,] "TGC" "TGC"

Like the cds2aa() function, also the dnastring2codonmat() function is implemented to handle different frames.

## use frame 2 and shorten to circumvent multiple of three error
cds1 <- Biostrings::DNAString("-ATGCAACATTGC-")
cds2 <- Biostrings::DNAString("-ATG---CATTGC-")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
cds1.cds2.aln |> dnastring2codonmat(frame=2, shorten=TRUE)
##      [,1]  [,2] 
## [1,] "ATG" "ATG"
## [2,] "CAA" "---"
## [3,] "CAT" "CAT"
## [4,] "TGC" "TGC"

7.2.2 Calculate average behavior of each codon (codonmat2xy())

## use hiv data
## calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()
## Joining with `by = join_by(Codon)`
## Joining with `by = join_by(Codon)`
## Joining with `by = join_by(Codon)`

7.2.3 Plot average behavior of each codon

print(hiv.xy |> dplyr::select(Codon,SynMean,NonSynMean,IndelMean) |>
    tidyr::gather(variable, values, -Codon) |>
    ggplot2::ggplot(aes(x=Codon, y=values)) + 
    ggplot2::geom_line(aes(colour=factor(variable))) + 
    ggplot2::geom_point(aes(colour=factor(variable))) + 
    ggplot2::ggtitle("HIV-1 sample 136 patient 1 from
        Sweden envelope glycoprotein (env) gene"))

8 References

Chang, Peter L, Emily Kopania, Sara Keeble, Brice AJ Sarver, Erica Larson, Annie Orth, Khalid Belkhir, et al. 2017. “Whole Exome Sequencing of Wild-Derived Inbred Strains of Mice Improves Power to Link Phenotype and Genotype.” Mammalian Genome 28 (9): 416–25.

Ganeshan, Shanthi, Ruth E Dickover, BT Korber, Yvonne J Bryson, and Steven M Wolinsky. 1997. “Human Immunodeficiency Virus Type 1 Genetic Evolution in Children with Different Rates of Development of Disease.” Journal of Virology 71 (1): 663–77.

Goldman, Nick, and Ziheng Yang. 1994. “A Codon-Based Model of Nucleotide Substitution for Protein-Coding Dna Sequences.” Molecular Biology and Evolution 11 (5): 725–36.

Grantham, Richard. 1974. “Amino Acid Difference Formula to Help Explain Protein Evolution.” Science 185 (4154): 862–64.

Harr, Bettina, Emre Karakoc, Rafik Neme, Meike Teschke, Christine Pfeifle, Željka Pezer, Hiba Babiker, et al. 2016. “Genomic Resources for Wild Populations of the House Mouse, Mus Musculus and Its Close Relative Mus Spretus.” Scientific Data 3 (1): 1–14.

Korneliussen, Thorfinn Sand, Anders Albrechtsen, and Rasmus Nielsen. 2014. “ANGSD: Analysis of Next Generation Sequencing Data.” BMC Bioinformatics 15 (1): 1–13.

Li, Wen-Hsiung. 1993. “Unbiased Estimation of the Rates of Synonymous and Nonsynonymous Substitution.” Journal of Molecular Evolution 36 (1): 96–99.

Li, Wen-Hsiung, Chung-I Wu, and Chi-Cheng Luo. 1985. “A New Method for Estimating Synonymous and Nonsynonymous Rates of Nucleotide Substitution Considering the Relative Likelihood of Nucleotide and Codon Changes.” Molecular Biology and Evolution 2 (2): 150–74.

Nei, Masatoshi, and Takashi Gojobori. 1986. “Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions.” Molecular Biology and Evolution 3 (5): 418–26.

Pamilo, Pekka, and Nestor O Bianchi. 1993. “Evolution of the Zfx and Zfy Genes: Rates and Interdependence Between the Genes.” Molecular Biology and Evolution 10 (2): 271–81.

Tzeng, Yun-Huei, Runsun Pan, and Wen-Hsiung Li. 2004. “Comparison of Three Methods for Estimating Rates of Synonymous and Nonsynonymous Nucleotide Substitutions.” Molecular Biology and Evolution 21 (12): 2290–8.

Wang, Da-Peng, Hao-Lei Wan, Song Zhang, and Jun Yu. 2009. “\(\gamma\)-Myn: A New Algorithm for Estimating Ka and Ks with Consideration of Variable Substitution Rates.” Biology Direct 4 (1): 1–18.

Yang, Ziheng, Rasmus Nielsen, Nick Goldman, and Anne-Mette Krabbe Pedersen. 2000. “Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites.” Genetics 155 (1): 431–49.

Zhang, Zhang, Jun Li, and Jun Yu. 2006. “Computing Ka and Ks with a Consideration of Unequal Transitional Substitutions.” BMC Evolutionary Biology 6 (1): 1–10.

Appendix

Chang, Peter L, Emily Kopania, Sara Keeble, Brice AJ Sarver, Erica Larson, Annie Orth, Khalid Belkhir, et al. 2017. “Whole Exome Sequencing of Wild-Derived Inbred Strains of Mice Improves Power to Link Phenotype and Genotype.” Mammalian Genome 28 (9): 416–25.

Ganeshan, Shanthi, Ruth E Dickover, BT Korber, Yvonne J Bryson, and Steven M Wolinsky. 1997. “Human Immunodeficiency Virus Type 1 Genetic Evolution in Children with Different Rates of Development of Disease.” Journal of Virology 71 (1): 663–77.

Goldman, Nick, and Ziheng Yang. 1994. “A Codon-Based Model of Nucleotide Substitution for Protein-Coding Dna Sequences.” Molecular Biology and Evolution 11 (5): 725–36.

Grantham, Richard. 1974. “Amino Acid Difference Formula to Help Explain Protein Evolution.” Science 185 (4154): 862–64.

Harr, Bettina, Emre Karakoc, Rafik Neme, Meike Teschke, Christine Pfeifle, Željka Pezer, Hiba Babiker, et al. 2016. “Genomic Resources for Wild Populations of the House Mouse, Mus Musculus and Its Close Relative Mus Spretus.” Scientific Data 3 (1): 1–14.

Korneliussen, Thorfinn Sand, Anders Albrechtsen, and Rasmus Nielsen. 2014. “ANGSD: Analysis of Next Generation Sequencing Data.” BMC Bioinformatics 15 (1): 1–13.

Li, Wen-Hsiung. 1993. “Unbiased Estimation of the Rates of Synonymous and Nonsynonymous Substitution.” Journal of Molecular Evolution 36 (1): 96–99.

Li, Wen-Hsiung, Chung-I Wu, and Chi-Cheng Luo. 1985. “A New Method for Estimating Synonymous and Nonsynonymous Rates of Nucleotide Substitution Considering the Relative Likelihood of Nucleotide and Codon Changes.” Molecular Biology and Evolution 2 (2): 150–74.

Nei, Masatoshi, and Takashi Gojobori. 1986. “Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions.” Molecular Biology and Evolution 3 (5): 418–26.

Pamilo, Pekka, and Nestor O Bianchi. 1993. “Evolution of the Zfx and Zfy Genes: Rates and Interdependence Between the Genes.” Molecular Biology and Evolution 10 (2): 271–81.

Tzeng, Yun-Huei, Runsun Pan, and Wen-Hsiung Li. 2004. “Comparison of Three Methods for Estimating Rates of Synonymous and Nonsynonymous Nucleotide Substitutions.” Molecular Biology and Evolution 21 (12): 2290–8.

Wang, Da-Peng, Hao-Lei Wan, Song Zhang, and Jun Yu. 2009. “\(\gamma\)-Myn: A New Algorithm for Estimating Ka and Ks with Consideration of Variable Substitution Rates.” Biology Direct 4 (1): 1–18.

Yang, Ziheng, Rasmus Nielsen, Nick Goldman, and Anne-Mette Krabbe Pedersen. 2000. “Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites.” Genetics 155 (1): 431–49.

Zhang, Zhang, Jun Li, and Jun Yu. 2006. “Computing Ka and Ks with a Consideration of Unequal Transitional Substitutions.” BMC Evolutionary Biology 6 (1): 1–10.

A Session Info

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_3.4.2       tidyr_1.3.0         dplyr_1.1.2        
##  [4] ape_5.7-1           Biostrings_2.68.0   GenomeInfoDb_1.36.0
##  [7] XVector_0.40.0      IRanges_2.34.0      S4Vectors_0.38.0   
## [10] BiocGenerics_0.46.0 MSA2dist_1.4.0      BiocStyle_2.28.0   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.3            xfun_0.39               bslib_0.4.2            
##  [4] lattice_0.21-8          vctrs_0.6.2             tools_4.3.0            
##  [7] bitops_1.0-7            generics_0.1.3          parallel_4.3.0         
## [10] tibble_3.2.1            fansi_1.0.4             highr_0.10             
## [13] pkgconfig_2.0.3         lifecycle_1.0.3         GenomeInfoDbData_1.2.10
## [16] farver_2.1.1            compiler_4.3.0          stringr_1.5.0          
## [19] munsell_0.5.0           codetools_0.2-19        htmltools_0.5.5        
## [22] sass_0.4.5              RCurl_1.98-1.12         yaml_2.3.7             
## [25] pillar_1.9.0            crayon_1.5.2            jquerylib_0.1.4        
## [28] seqinr_4.2-30           MASS_7.3-59             cachem_1.0.7           
## [31] magick_2.7.4            iterators_1.0.14        foreach_1.5.2          
## [34] nlme_3.1-162            tidyselect_1.2.0        digest_0.6.31          
## [37] stringi_1.7.12          purrr_1.0.1             bookdown_0.33          
## [40] labeling_0.4.2          ade4_1.7-22             fastmap_1.1.1          
## [43] grid_4.3.0              colorspace_2.1-0        cli_3.6.1              
## [46] magrittr_2.0.3          utf8_1.2.3              withr_2.5.0            
## [49] scales_1.2.1            rmarkdown_2.21          evaluate_0.20          
## [52] knitr_1.42              GenomicRanges_1.52.0    doParallel_1.0.17      
## [55] rlang_1.1.0             Rcpp_1.0.10             glue_1.6.2             
## [58] BiocManager_1.30.20     jsonlite_1.8.4          R6_2.5.1               
## [61] zlibbioc_1.46.0