Pipeline of generating pairwise whole genome alignment.
CNEr 1.36.0
CNEr identifies conserved noncoding elments (CNEs) from pairwise whole genome alignment (net.axt files) of two species. UCSC has provided alignments between many species on the downloads, hence it is highly recommended to use their alignments when available. When the alignments of some new assemblies/species are not availble from UCSC yet, this vignette describes the pipeline of generating the alignments merely from soft-masked 2bit files or fasta files. This vignette is based on the genomewiki from UCSC.
Note:
List of external softwares have to be installed on the machine: * The sequence alignment program LASTZ * Kent Utilities. In this pipeline, lavToPsl, axtChain, chainMergeSort, chainPreNet, chainNet, netSyntenic, netToAxt, axtSort are essential. netClass is optional.
Here, as an example, we will get the pairwise alignment only on “chr1”, “chr2” and “chr3” between zebrafish(danRer10) and human(hg38).
First we need to download the 2bit files from UCSC, and set the appropriate paths of assemblyTarget
, assemblyQuery
and intermediate files.
Then we can run lastz
function to generate the lav files.
## lastz aligner
assemblyDir <- "/Users/gtan/OneDrive/Project/CSC/CNEr/2bit"
axtDir <- "/Users/gtan/OneDrive/Project/CSC/CNEr/axt"
assemblyTarget <- file.path(system.file("extdata",
package="BSgenome.Drerio.UCSC.danRer10"),
"single_sequences.2bit")
assemblyQuery <- file.path(system.file("extdata",
package="BSgenome.Hsapiens.UCSC.hg38"),
"single_sequences.2bit")
lavs <- lastz(assemblyTarget, assemblyQuery,
outputDir=axtDir,
chrsTarget=c("chr1", "chr2", "chr3"),
chrsQuery=c("chr1", "chr2", "chr3"),
distance="far", mc.cores=4)
## lav files to psl files conversion
psls <- lavToPsl(lavs, removeLav=FALSE, binary="lavToPsl")
One essential argument here is the distance
. It determines the scoring matrix to use in lastz
aligner. See ?scoringMatrix
for more details.
Note:
This step may encounter difficulties if the two assemblies are too fragmented, because there can be millions of combinations of the chromosomes/scaffolds. The lastz
is overwhelmed with reading small pieces from assembly for each combination, rather than doing actual alignment.
In this case, another aligner last is recommended and introduced in nex section.
last
aligner is considered faster and memory efficient.
It creates maf file, which can converted to psl files.
Then the same following processes can be used on psl files.
Different from lastz
, last
aligner starts with fasta files.
The target genome sequence has to build the index file first, and then align with the query genome sequence.
## Build the lastdb index
system2(command="lastdb", args=c("-c", file.path(assemblyDir, "danRer10"),
file.path(assemblyDir, "danRer10.fa")))
## Run last aligner
lastal(db=file.path(assemblyDir, "danRer10"),
queryFn=file.path(assemblyDir, "hg38.fa"),
outputFn=file.path(axtDir, "danRer10.hg38.maf"),
distance="far", binary="lastal", mc.cores=4L)
## maf to psl
psls <- file.path(axtDir, "danRer10.hg38.psl")
system2(command="maf-convert", args=c("psl",
file.path(axtDir, "danRer10.hg38.maf"),
">", psls))
Another alternative of alignment software is YASS. It may be added into this pipeline after we test the performance.
If two matching alignments next to each other are close enough, they are joined into one fragment. Then these chain files are sorted and combined into one big file.
## Join close alignments
chains <- axtChain(psls, assemblyTarget=assemblyTarget,
assemblyQuery=assemblyQuery, distance="far",
removePsl=FALSE, binary="axtChain")
## Sort and combine
allChain <- chainMergeSort(chains, assemblyTarget, assemblyQuery,
allChain=file.path(axtDir,
paste0(sub("\\.2bit$", "", basename(assemblyTarget),
ignore.case=TRUE), ".",
sub("\\.2bit$", "", basename(assemblyQuery),
ignore.case=TRUE), ".all.chain")),
removeChains=FALSE, binary="chainMergeSort")
In this step, first we filter out chains that are unlikely to be netted by chainPreNet
.
During the alignment, every genomic fragment can match with several others, and certainly we want to keep the best one.
This is done by chainNet
.
Then we add the synteny information with netSyntenic
.
## Filtering out chains
allPreChain <- chainPreNet(allChain, assemblyTarget, assemblyQuery,
allPreChain=file.path(axtDir,
paste0(sub("\\.2bit$", "",
basename(assemblyTarget),
ignore.case = TRUE), ".",
sub("\\.2bit$", "",
basename(assemblyQuery),
ignore.case = TRUE),
".all.pre.chain")),
removeAllChain=FALSE, binary="chainPreNet")
## Keep the best chain and add synteny information
netSyntenicFile <- chainNetSyntenic(allPreChain, assemblyTarget, assemblyQuery,
netSyntenicFile=file.path(axtDir,
paste0(sub("\\.2bit$", "",
basename(assemblyTarget),
ignore.case = TRUE), ".",
sub("\\.2bit$", "",
basename(assemblyQuery),
ignore.case = TRUE),
".noClass.net")),
binaryChainNet="chainNet", binaryNetSyntenic="netSyntenic")
As the last step, we create the .net.axt file from the previous net and chain files.
netToAxt(netSyntenicFile, allPreChain, assemblyTarget, assemblyQuery,
axtFile=file.path(axtDir,
paste0(sub("\\.2bit$", "",
basename(assemblyTarget),
ignore.case = TRUE), ".",
sub("\\.2bit$", "",
basename(assemblyQuery),
ignore.case = TRUE),
".net.axt")),
removeFiles=FALSE,
binaryNetToAxt="netToAxt", binaryAxtSort="axtSort")
Here is the output of sessionInfo()
on the system on which this
document was compiled:
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Gviz_1.44.0 BSgenome.Ggallus.UCSC.galGal3_1.4.0
## [3] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.68.0
## [5] rtracklayer_1.60.0 Biostrings_2.68.0
## [7] XVector_0.40.0 GenomicRanges_1.52.0
## [9] GenomeInfoDb_1.36.0 IRanges_2.34.0
## [11] S4Vectors_0.38.0 BiocGenerics_0.46.0
## [13] CNEr_1.36.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.14
## [3] jsonlite_1.8.4 magrittr_2.0.3
## [5] GenomicFeatures_1.52.0 magick_2.7.4
## [7] farver_2.1.1 rmarkdown_2.21
## [9] BiocIO_1.10.0 zlibbioc_1.46.0
## [11] vctrs_0.6.2 memoise_2.0.1
## [13] Rsamtools_2.16.0 RCurl_1.98-1.12
## [15] base64enc_0.1-3 htmltools_0.5.5
## [17] progress_1.2.2 curl_5.0.0
## [19] Formula_1.2-5 sass_0.4.5
## [21] pracma_2.4.2 bslib_0.4.2
## [23] htmlwidgets_1.6.2 plyr_1.8.8
## [25] cachem_1.0.7 GenomicAlignments_1.36.0
## [27] lifecycle_1.0.3 pkgconfig_2.0.3
## [29] Matrix_1.5-4 R6_2.5.1
## [31] fastmap_1.1.1 GenomeInfoDbData_1.2.10
## [33] MatrixGenerics_1.12.0 digest_0.6.31
## [35] colorspace_2.1-0 AnnotationDbi_1.62.0
## [37] Hmisc_5.0-1 RSQLite_2.3.1
## [39] filelock_1.0.2 labeling_0.4.2
## [41] fansi_1.0.4 httr_1.4.5
## [43] compiler_4.3.0 bit64_4.0.5
## [45] withr_2.5.0 htmlTable_2.4.1
## [47] backports_1.4.1 BiocParallel_1.34.0
## [49] DBI_1.1.3 highr_0.10
## [51] R.utils_2.12.2 biomaRt_2.56.0
## [53] rappdirs_0.3.3 poweRlaw_0.70.6
## [55] DelayedArray_0.26.0 rjson_0.2.21
## [57] tools_4.3.0 foreign_0.8-84
## [59] nnet_7.3-18 R.oo_1.25.0
## [61] glue_1.6.2 restfulr_0.0.15
## [63] checkmate_2.1.0 cluster_2.1.4
## [65] reshape2_1.4.4 generics_0.1.3
## [67] gtable_0.3.3 tzdb_0.3.0
## [69] R.methodsS3_1.8.2 ensembldb_2.24.0
## [71] data.table_1.14.8 hms_1.1.3
## [73] xml2_1.3.3 utf8_1.2.3
## [75] pillar_1.9.0 stringr_1.5.0
## [77] vroom_1.6.1 dplyr_1.1.2
## [79] BiocFileCache_2.8.0 lattice_0.21-8
## [81] deldir_1.0-6 bit_4.0.5
## [83] annotate_1.78.0 biovizBase_1.48.0
## [85] tidyselect_1.2.0 GO.db_3.17.0
## [87] knitr_1.42 gridExtra_2.3
## [89] bookdown_0.33 ProtGenerics_1.32.0
## [91] SummarizedExperiment_1.30.0 xfun_0.39
## [93] Biobase_2.60.0 matrixStats_0.63.0
## [95] stringi_1.7.12 lazyeval_0.2.2
## [97] yaml_2.3.7 evaluate_0.20
## [99] codetools_0.2-19 interp_1.1-4
## [101] archive_1.1.5 tibble_3.2.1
## [103] BiocManager_1.30.20 cli_3.6.1
## [105] rpart_4.1.19 xtable_1.8-4
## [107] munsell_0.5.0 jquerylib_0.1.4
## [109] dichromat_2.0-0.1 Rcpp_1.0.10
## [111] dbplyr_2.3.2 png_0.1-8
## [113] XML_3.99-0.14 parallel_4.3.0
## [115] ggplot2_3.4.2 readr_2.1.4
## [117] blob_1.2.4 prettyunits_1.1.1
## [119] jpeg_0.1-10 latticeExtra_0.6-30
## [121] AnnotationFilter_1.24.0 bitops_1.0-7
## [123] VariantAnnotation_1.46.0 scales_1.2.1
## [125] crayon_1.5.2 rlang_1.1.0
## [127] KEGGREST_1.40.0