---
title: "Get started-with GenomAutomorphism"
author:
- name: Robersy Sanchez
affiliation: Department of Biology.
Pennsylvania State University, University Park, PA 16802
email: rus547@psu.edu
date: "`r format(Sys.time(), '%d %B %Y')`"
fontsize: 11pt
fontfamily: "serif"
output:
BiocStyle::html_document:
toc: true
toc_depth: 4
toc_float:
collapsed: false
smooth_scroll: true
number_sections: true
theme: united
geometry: margin=0.8in
highlight: tango
toc_float: true
toc_depth: 4
abstract: |
A fast introduction into the analysis of DNA mutational events
by means of automorphisms between two DNA sequences algebraically
represented as Abelian finite group.
vignette: >
%\VignetteIndexEntry{Get started-with GenomAutomorphism}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r set-options, echo=FALSE, cache=FALSE}
options(width = 120)
```
# Overview
This is a R package to compute the automorphisms between pairwise aligned DNA
sequences represented as elements from a Genomic Abelian group as described in
reference ([1](#1)). In a general scenario, whole chromosomes or genomic
regions from a population (from any species or close related species) can be
algebraically represented as a direct sum of cyclic groups or more specifically
Abelian *p*-groups. Basically, we propose the representation of multiple
sequence alignments (MSA) of length _N_ as a finite Abelian group created by
the direct sum of Abelian group of _prime-power order_:
$$
\qquad G = (\mathbb{Z}_{p^{\alpha_{1}}_1})^{n_1} \oplus
(\mathbb{Z}_{p^{\alpha_{2}}_1})^{n_2} \oplus \dots \oplus
(\mathbb{Z}_{p^{\alpha_{k}}_k})^{n_k}
$$
Where, the $p_i$'s are prime numbers, $\alpha_i \in \mathbb{N}$ and
$\mathbb{Z}_{p^{\alpha_{i}}_i}$ is the group of integer modulo
$p^{\alpha_{i}}_i$.
For the purpose of estimating the automorphism between two aligned DNA
sequences, $p^{\alpha_{i}}_i \in \{5, 2^6, 5^3 \}$.
## Automorphisms
Herein, automorphisms are considered algebraic descriptions of mutational event
observed in codon sequences represented on different Abelian groups. In
particular, as described in references ([3-4](#3)), for each representation of
the codon set on a defined Abelian group there are 24 possible isomorphic
Abelian groups. These Abelian groups can be labeled based on the DNA base-order
used to generate them. The set of 24 Abelian groups can be described as a group
isomorphic to the symmetric group of degree four ($S_4$, see reference
([4](#4))).
For further support about the symmetric group on the 24 Abelian group of
genetic-code cubes, users can also see
[Symmetric Group of the Genetic-CodeCubes.](https://is.gd/pTl8Js),
specifically the Mathematica notebook
_IntroductionToZ5GeneticCodeVectorSpace.nb_ and interact with it using
Wolfram Player, freely available (for Windows and Linux OS) at,
.
## Load the R libraries
```{r lib,warning=FALSE,message=FALSE}
library(Biostrings)
library(GenomAutomorphism)
```
# Read the alignment _FASTA_ and encode the sequences
A pairwise sequence alignment of protein coding regions SARS coronavirus GZ02
(GenBank: AY390556.1) and Bat SARS-like coronavirus isolate Rs7327 (GenBank:
KY417151.1) is provided with the package.
```{r fasta, message=FALSE}
data(covid_aln, package = "GenomAutomorphism")
covid_aln
```
# Group representations
Group operations defined on the sets of DNA bases and codons are associated to
physicochemical or/and biophysical relationships between DNA bases and between
codons and aminoacids. In other words, a proper definition of a group
operation on the set of bases or on the set of codons will encode the
physicochemical or/and biophysical relationships between the set’s elements.
Thus, *by group operations defined on the set of bases or on the set of
codons, we understand an encoding applied to represent specified
physicochemical or/and biophysical relationships as group operations between
the elements of the set*. Then, we shall say that _*such an encoding permits
the representation of DNA bases, codons, genes, and genomic sequences as
elements from algebraic structures*_.
The DNA base set can be represented in 24 possible base orders, which leads to
24 possible representations of the genetic code. Each genetic code
representation base-triplets on the Galois field *GF(4)* (or in *GF(5)*) leads
to genetic code vector 3D-space, which is mathematically equivalent to a cube
inserted in the 3D space (1). Each cube is denoted according to the
corresponding base order.
Given a base-order, say 'ACGT', the Abelian group defined on this ordered set
is isomorphic to the Abelian group defined on the set of integers modulo 4
($\mathbb{Z}_{4}$). In practical terms, this is equivalent to replace each DNA
base by the corresponding integer element. The base replacement in cube "ACGT
and group "Z4" ($\mathbb{Z}_{4}$) is:
```{r int}
base2int("ACGT", group = "Z4", cube = "ACGT")
```
The base replacement in cube "ACGT and group 'Z5' ($\mathbb{Z}_{5}$):
```{r int2}
base2int("ACGT", group = "Z5", cube = "ACGT")
```
After the DNA sequence is read, the corresponding codon sequences can be
represented in the Abelian group $\mathbb{Z}_{64}$ (i.e., the set of integers
remainder modulo 64). The codon coordinates are requested on the cube ACGT.
Following reference ([4](#4))), cubes are labeled based on the order of DNA
bases used to define the sum operation.
```{r url, message=FALSE}
codons <- codon_coord(
codon = covid_aln,
cube = "ACGT",
group = "Z64",
chr = 1L,
strand = "+",
start = 1,
end = 750)
codons
```
The codon sequences (_seq1_ and _seq2_) with their corresponding coordinates
(left) are returned, as well as the coordinated representation on
$\mathbb{Z}_{64}$ (_coord1_ and _coord2_).
## _"Dual"_ genetic-code cubes
The particular interest are the coordinate representation on _"dual"_
genetic-code cubes. These are cubes where codons with complementary base pairs
have the same coordinates in the corresponding cubes, as shown in reference
([4](#4))). Each pair of _"dual"_ cubes integrates a group.
For example, let's consider the complementary codons "ACG"
and "TGC", with complementary base pairs: A::T, C:::G, and G:::C, where symbol
":" denotes the hydrogen bonds between the bases.
```{r bp}
x0 <- c("ACG", "TGC")
x1 <- DNAStringSet(x0)
x1
```
Their representations on the dual cubes "ACGT" and "TGCA" on $\mathbb{Z}_{4}$
are:
```{r bp_coord}
x2 <- base_coord(x1, cube = "ACGT")
x2
x2. <- base_coord(x1, cube = "TGCA")
x2.
```
The sum of base coordinates modulo $\mathbb{Z}_{4}$ is 3.
```{r bp_sum.}
## cube "ACGT"
(x2$coord1 + x2$coord2) %% 4
## cube "TGCA"
(x2.$coord1 + x2.$coord2) %% 4
```
The same result for the same codon on different cubes
```{r bp_sum}
## Codon ACG
(x2$coord1 + x2.$coord1) %% 4
## Codon TGC
(x2$coord2 + x2.$coord2) %% 4
```
Their codon representation on $\mathbb{Z}_{64}$ are:
```{r codons}
## cube ACGT
x3 <- codon_coord(codon = x2, group = "Z64")
x3
## cube TGCA
x3. <- codon_coord(codon = x2., group = "Z64")
x3.
```
The sum of base coordinates modulo $\mathbb{Z}_{64}$ is 63.
```{r bp_sum2}
## cube "ACGT"
(as.numeric(x3$coord1) + as.numeric(x3$coord2)) %% 64
## cube "TGCA"
(as.numeric(x3.$coord1) + as.numeric(x3.$coord2)) %% 64
```
The same result for the same codon on different cubes
```{r bp_sum3}
## Codon ACG
(as.numeric(x3$coord1) + as.numeric(x3.$coord1)) %% 64
## Codon TGC
(as.numeric(x3$coord2) + as.numeric(x3.$coord2)) %% 64
```
# Automorphisms on $\mathbb{Z}_{64}$
Automorphisms can be computed starting directly from the _FASTA_ file. Notice
that we can work only with genomic regions of our interest by giving the
_start_ and _end_ alignment coordinates. In $\mathbb{Z}_{64}$ automorphisms
are described as functions $f(x) = k\,x\quad mod\,64$, where $k$ and $x$ are
elements from the set of integers modulo 64. Below, in function
[automorphism](https://is.gd/WMxUKz) three important arguments are given
values: _group = "Z64"_, _cube = c("ACGT", "TGCA")_, and _cube_alt = c("CATG",
"GTAC")_.
In groups "Z64" and "Z125" not all the mutational events can be described as
automorphisms from a given cube. The analysis of automorphisms is then
accomplished in the set of _dual_ genetic-code cubes. A character string
denoting pairs of _dual_ genetic-code cubes, is given as argument for _cube_.
Setting for group specifies on which group the automorphisms will be
computed. These groups can be: "Z5", "Z64", "Z125", and "Z5^3".
If automorphisms are not found in first set of dual cubes, then the algorithm
search for automorphisms in a alternative set of dual cubes.
```{r aut}
autm <- automorphisms(
seqs = covid_aln,
group = "Z64",
cube = c("ACGT", "TGCA"),
cube_alt = c("CATG", "GTAC"),
start = 1,
end = 750,
verbose = FALSE)
autm
```
Observe that two new columns were added, the automorphism coefficient $k$
(named as _autm_) and the genetic-code cube where the automorphism was found.
By convention the DNA sequence is given for the positive strand. Since the
_dual cube_ of **"ACGT"** corresponds to the complementary base order
**TGCA**, automorphisms described by the cube **TGCA** represent mutational
events affecting the DNA negative strand (-).
The last result can be summarized by gene regions as follow:
```{r range}
aut_range <- automorphismByRanges(autm)
aut_range
```
That is, function
[automorphismByRanges](https://is.gd/5RWHFF) permits the classification
of the pairwise alignment of protein-coding sub-regions based on the
mutational events observed on it quantitatively represented as automorphisms
on genetic-code cubes.
Searching for automorphisms on $\mathbb{Z}_{64}$ permits us a quantitative
differentiation between mutational events at different codon positions from a
given DNA protein-encoding region. As shown in reference ([4](#4)) a set of
different cubes can be applied to describe the best evolutionary aminoacid
scale highly correlated with aminoacid physicochemical properties describing
the observed evolutionary process in a given protein.
More information about this subject can be found in the supporting material
from reference ([4](#4))) at GitHub
[GenomeAlgebra_SymmetricGroup](https://is.gd/pTl8Js),
particularly by interacting with the Mathematica notebook
[Genetic-Code-Scales_of_Amino-Acids.nb](https://is.gd/LVd06G).
# Automorphisms between whole genomes of SARS-CoV-2 related coronaviruses
Next, the automorphism for the whole pairwise alignment of SARS-CoV-2 related
coronaviruses:
```{r aut_1, eval=FALSE}
## Do not need to run it.
covid_autm <- automorphisms(
seq = covid_aln,
group = "Z64",
cube = c("ACGT", "TGCA"),
cube_alt = c("CATG", "GTAC"),
verbose = FALSE)
```
This data is available with the package
```{r dat}
data(covid_autm, package = "GenomAutomorphism")
covid_autm
```
And the summary by range
```{r range_2}
aut_range <- automorphismByRanges(covid_autm)
aut_range
```
Regions no described by automorphism can be described as translations (labeled
"Trnl") and they can be shown as follow:
```{r non-aut}
idx = which(covid_autm$cube == "Trnl")
covid_autm[ idx ]
```
These codon positions cover insertion-deletion (_indel_) mutational events.
The wholes regions can be summarized typing:
```{r range_3}
idx = which(aut_range$cube == "Trnl")
aut_range[ idx ]
```
Only one indel mutation was found in the region where the spike glycoprotein is
located: 7076 - 8331. That is, the pairwise alignment of SARS coronavirus GZ02
and Bat SARS-like coronavirus (bat-SL-CoVZC45) reveals 8 single _indel_
mutational events, four regions with two _indel_ mutations and one region with
3 _indel_ mutations.
```{r indels}
data.frame(aut_range[idx])
## region width
width(aut_range[ idx ])
```
In general, _indel_ mutational event can be modeled as translations on
$\mathbb{Z}_{64}$.
## Bar plot automorphism distribution by cubes
The automorphism distribution by cubes can be summarized in the bar-plot
graphic
```{r barplot, fig.height = 5, fig.width = 6}
counts <- table(covid_autm$cube[ covid_autm$autm != 1 |
is.na(covid_autm$autm) ])
par(family = "serif", cex = 0.9, font = 2, mar=c(4,6,4,4))
barplot(counts, main="Automorphism distribution",
xlab="Genetic-code cube representation",
ylab="Fixed mutational events",
col=c("darkblue","red", "darkgreen"),
border = NA, axes = FALSE,
cex.lab = 2, cex.main = 1.5, cex.names = 2)
axis(2, at = c(0, 200, 400, 600, 800), cex.axis = 1.5)
mtext(side = 1,line = -1.5, at = c(0.7, 1.9, 3.1, 4.3, 5.5),
text = paste0( counts ), cex = 1.4,
col = c("white","yellow", "black"))
```
## Grouping automorphism by automorphism's coefficients. Types of mutations
```{r autby}
autby_coef <- automorphism_bycoef(covid_autm)
autby_coef <- autby_coef[ autby_coef$autm != 1 & autby_coef$autm != -1 ]
```
Barplot of frequency of mutation types greater than 2.
```{r barplot_2, fig.height = 12, fig.width = 14}
counts <- table(autby_coef$mut_type)
counts <- sort(counts, decreasing = TRUE)
count. <- counts[ counts > 2 ]
par(family = "serif", cex.axis = 2, font = 2, las = 1,
cex.main = 1.4, mar = c(6,2,4,4))
barplot(count., main="Automorphism distribution per Mutation type",
col = colorRampPalette(c("red", "yellow", "blue"))(36),
border = NA, axes = FALSE,las=2)
axis(side = 2, cex.axis = 2, line = -1.8 )
```
Every single base mutational event across the MSA was classified according
IUPAC nomenclature: 1) According to the number of hydrogen bonds (on DNA/RNA
double helix): strong S={C, G} (three hydrogen bonds) and weak W={A, U} (two
hydrogen bonds). According to the chemical type: purines R={A, G} and
pyrimidines Y={C, U}. 3). According to the presence of amino or keto groups on
the base rings: amino M={C, A} and keto K={G, T}. Constant (hold) base
positions were labeled with letter H. So, codon positions labeled as HKH means
that the first and third bases remains constant and mutational events between
bases G and T were found in the MSA.
```{r ct}
counts
```
The analysis of the frequency of mutational events (automorphisms, COVID:
human SARS coronavirus GZ02 vs Bat SARS-like coronavirus isolate
at-SL-CoVZC45) by mutation types is shown in the last figure. Results are
consistent with the well-known observation highlighted by Crick: the highest
mutational rate is found in the third base of the codon (HHY: 425, HHR: 189,
HHW: 88), followed by YHH: 34 in the first base, and the lowest rate is found
in the second one ([5](#5)).
## Conserved and non-conserved regions
### Conserved regions
Conserved and non-conserved gene regions can be easily observed in most of
MSA editing bioinformatic tools. However, here were interesting into get the
regions coordinates for further downstream analysis.
Conserved regions from pairwise comparisons are obtain with function
conserved_regions:
```{r conserved_regions}
conserv <- conserved_regions(covid_autm)
conserv
```
Several regions are similar for more than one comparison.
```{r uniq}
conserv_unique <- conserved_regions(covid_autm, output = "unique")
conserv_unique
```
# Automorphisms on $\mathbb{Z}_{125}$
Alternatively, we can use the algebraic representation on on
$\mathbb{Z}_{125}$.
```{r aut_2, eval=FALSE}
autm_z125 <- automorphisms(
seq = covid_aln,
group = "Z125",
cube = c("ACGT", "TGCA"),
cube_alt = c("CATG", "GTAC"),
verbose = FALSE)
```
For the sake of reducing computational time in this example, 'autm_z125'
is available with the package.
```{r autm_z125}
data(autm_z125, package = "GenomAutomorphism")
autm_z125
```
And the summary by range
```{r range_4}
aut_range_2 <- automorphismByRanges(autm_z125)
aut_range_2
```
The whole genome can be described by automorphisms on $\mathbb{Z}_{125}$.
```{r barplot_3, fig.height = 5, fig.width = 3}
counts <- table(autm_z125$cube[ autm_z125$autm != 1 ])
par(family = "serif", cex = 1, font = 2)
barplot(counts, main="Automorphism distribution",
xlab="Genetic-code cube representation",
ylab="Fixed mutational events",
col=c("darkblue","red"),
ylim = c(0, 1300),
border = NA, axes = TRUE)
mtext(side = 1,line = -2, at = c(0.7, 1.9, 3.1),
text = paste0( counts ), cex = 1.4,
col = c("white","red"))
```
# Automorphisms on the Genetic-code Cube Representation on GF(5)
The Genetic-code Cube Representations on the Galois Field GF(5) were studied in
([4](#4)). Each codon is represented by each coordinate in the 3D space.
Automorphisms are represented by diagonal matrices, with elements $x$ in
$x \in \mathbb{Z}_5$.
```{r aut_3, eval=FALSE}
autm_3d <- automorphisms(
seq = covid_aln,
group = "Z5^3",
cube = c("ACGT", "TGCA"),
cube_alt = c("CATG", "GTAC"),
verbose = FALSE)
```
The result is available with package
```{r autm_3d}
data(autm_3d, package = "GenomAutomorphism")
autm_3d
```
## Grouping automorphism by automorphism's coefficients
Automorphisms that preserved codons (DNA base-triplets) are represented by the
identity matrix, i.e., the matrix with diagonal elements "1,1,1".
```{r autby_3}
autby_coef_3d <- automorphism_bycoef(autm_3d)
autby_coef_3d <- autby_coef_3d[ autby_coef_3d$autm != "1,1,1" ]
autby_coef_3d
```
Conserved regions from pairwise comparisons are obtain with function
conserved_regions:
```{r conserved_regions_3}
conserv <- conserved_regions(autm_3d)
conserv
```
The whole genome mutational events represented as automorphisms on the 3D space
$\mathbb{Z}_{5}^3$, specifically on the cube ACGT (see [4](#4)).
```{r barplot_4, fig.height = 5, fig.width = 3}
counts <- table(autby_coef_3d$cube[ autby_coef_3d$autm != "1,1,1"])
par(family = "serif", cex = 1, font = 2, cex.main = 1)
barplot(counts, main="Automorphism distribution",
xlab="Genetic-code cube representation",
ylab="Fixed mutational events",
col=c("darkblue","red"),
ylim = c(0, 1300),
border = NA, axes = TRUE)
mtext(side = 1,line = -2, at = c(0.7, 1.9),
text = paste0( counts ), cex = 1.4,
col = c("white"))
```
# References
1. R. Sanchez. Symmetric Group of the Genetic-Code Cubes.
Effect of the Genetic-Code Architecture on the Evolutionary Process MATCH
Commun. Math. Comput. Chem. 79 (2018) 527-560.
[PDF](https://bit.ly/2Z9mjM7).
2. Sanchez R, Morgado E, Grau R. Gene algebra from a
genetic code algebraic structure. J Math Biol. 2005 Oct;51(4):431-57.
doi:10.1007/s00285-005-0332-8. Epub 2005 Jul 13. PMID: 16012800.
([PDF](https://arxiv.org/pdf/q-bio/0412033.pdf)).
3. Robersy Sanchez, Jesus Barreto (2021) Genomic Abelian
Finite Groups.
[doi:10.1101/2021.06.01.446543](https://doi.org/10.1101/2021.06.01.446543)
4. M. V Jose, E.R. Morgado, R. Sanchez, T. Govezensky,
The 24 possible algebraic representations of the standard genetic code in
six or in three dimensions, Adv. Stud. Biol. 4 (2012) 119-152.
[PDF](https://is.gd/na9eap).
5. Crick FHC. The Origin of the Genetic Code. J Mol Biol.
1968;38: 367–379.
# Session info {.unnumbered}
Here is the output of `sessionInfo()` on the system on which this document was
compiled:
```{r sessionInfo, echo=FALSE}
sessionInfo()
```