%\VignetteIndexEntry{Getting Started DECIPHERing} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\C}{{\textsf{C}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \newcommand\ytl[2]{ \parbox[b]{8em}{\hfill{\bfseries\sffamily #1}~$\cdots\cdots$~}\makebox[0pt][c]{$\bullet$}\vrule\quad \parbox[c]{13cm}{\vspace{7pt}\raggedright\sffamily #2.\\[7pt]}\\[-3pt]} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Getting Started DECIPHERing} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) @ \tableofcontents %------------------------------------------------------------ \section{About DECIPHER} %------------------------------------------------------------ \Rpackage{DECIPHER} is a software that can be used for deciphering and managing biological sequences efficiently in the \R{} programming language. The program features tools falling into seven categories: \begin{itemize} \item Sequence databases: import, maintain, view, export, and interact with a massive number of sequences. \item Homology finding: rapidly query sequences for homologous hits among a set of target sequences or genomes. Cluster into groups of related sequences. \item Sequence alignment: accurately align thousands of DNA, RNA, or amino acid sequences. Quickly find and align the syntenic regions of multiple genomes. \item Oligo design: test oligos in silico, or create new primer and probe sequences optimized for a variety of objectives. \item Manipulate sequences: trim low quality regions, correct frameshifts, reorient nucleotides, determine consensus, or digest with restriction enzymes. \item Analyze sequences: find chimeras, classify into a taxonomy of organisms or functions, detect repeats, predict secondary structure, create phylogenetic trees, and reconstruct ancestral states. \item Gene finding: predict coding and non-coding genes in a genome, extract them from the genome, and export them to a file. \end{itemize} \Rpackage{DECIPHER} is available under the terms of the \underline{\href{http://www.gnu.org/copyleft/gpl.html}{GNU Public License version 3}}. %------------------------------------------------------------ \section{Design Philosophy} %------------------------------------------------------------ \Rpackage{DECIPHER} is designed for large-scale comparison of nucleotide and protein sequences. The package is built upon a foundation of sequence databases to curate large volumes of biological sequences and is dependent on the \Rpackage{Biostrings} package for low-level sequence storage. The guiding inspiration behind \Rpackage{DECIPHER} is to empower users with powerful tools that help to convert biological sequences into results. To this end, the package strives to be well-documented, well-maintained, multi-functional, state-of-the-art, scalable, and fast. \subsection{Curators Protect the Originals} One of the core principles of \Rpackage{DECIPHER} is the idea of the non-destructive workflow. This concept revolves around the view that the original sequence information should never be altered. Essentially, the sequence information in the database is thought of as a backup of the original sequence file and no function is able to directly alter the sequence data. All of the workflows simply \term{add} information to the database, which can be used to analyze, organize, and maintain the sequences. When it comes time to export all or part of the sequences they are preserved in their original state without alteration. Interaction with \Rpackage{DECIPHER} represents a different paradigm than typical bioinformatics pipelines. Data in \R{} is typically moved between functions without writing files. This avoids the clutter associated with most bioinformatics pipelines. Using \R{} packages, including \Rpackage{DECIPHER}, it is possible to go from input data to output results in a single series of commands, often without ever needing to leave \R{} or write intermediate results to files. This keeps the overall workflow tidier than traditional pipelines and supports repeatability and reproducibility. \subsection{Don't Reinvent the Wheel} \Rpackage{DECIPHER} makes use of the \Rpackage{Biostrings} package that is a core part of the \underline{\href{http://www.bioconductor.org/}{Bioconductor suite}}. This package contains numerous functions for common operations such as searching, manipulating, and reverse complementing sequences. Furthermore, \Rpackage{DECIPHER} makes use of the \Rpackage{Biostrings} interface for handling sequence data so that sequences are stored in \Robject{XStringSet} objects. These objects are compatible with many useful packages in the Bioconductor suite. A wide variety of user objectives necessitates that \Rpackage{DECIPHER} be extensible to customized projects. \R{} provides a simple way to place the power of thousands of packages at your fingertips. Likewise, \R{} enables direct access to the speed and efficiency of the programming language \C{} while maintaining the utility of a scripting language. Therefore, minimal code is required to solve complex new problems. Best of all, the \R{} statistical programming language is open source, and maintains a thriving user community so that direct collaboration with other \R{} users is available on \underline{\href{https://stat.ethz.ch/mailman/listinfo}{several Internet forums}}. \subsection{That Which is the Most Difficult, Make Fastest} A core objective of \Rpackage{DECIPHER} is to make massive tasks feasible in minimal time. To this end, many of the most time consuming functions are parallelized to make use of multiple processors. For example, the function \Rfunction{DistanceMatrix} gets almost an additional 1-fold speed boost for each processor core. A modern processor with \textit{N} cores can see a factor of close to \textit{N}-fold speed improvement. Similar speedups can be achieved in many other \Rpackage{DECIPHER} functions by setting the \Rfunarg{processors} argument. This is all made possible through the use of OpenMP in \C{}-level code. Other time consuming tasks are handled efficiently. The function \Rfunction{FindChimeras} can uncover sequence chimeras by searching through a reference database of over a million sequences for thousands of 30-mer fragments in a number of minutes. This incredible feat is accomplished by using the \Rclass{PDict} class provided by \Rpackage{Biostrings}. Similarly, the \Rfunction{SearchDB} function can obtain the one-in-a-million sequences that match a targeted query in a matter of seconds. Such high-speed functions enable the user to find solutions to problems that previously would have been extremely difficult or nearly impossible to solve using antiquated methods. \subsection{Stay Organized} It is no longer necessary to store related data in several different files. \Rpackage{DECIPHER} is enabled by \Rpackage{DBI}, which is an \R{} interface to a variety of databases. \Rpackage{DECIPHER} creates an organized collection of sequences and their associated information known as a sequence database. By default, \term{SQLite} databases are used if the \Rpackage{RSQLite} package is installed. \term{SQLite} databases are flat files, meaning they can be handled just like any other file. There is no setup required since \term{SQLite} does not require a server, unlike many other database engines. These attributes of \term{SQLite} databases make storing, backing-up, and sharing sequence databases relatively straightforward. However, \Rpackage{DECIPHER} supports other \term{SQL} databases, such as \term{MariaDB} via \Rpackage{RMariaDB}, if concurrency, scalability, or performance are critical. Separate projects can be stored in distinct tables in the same sequence database. Each new table is structured to include every sequence's description, identifier, and a unique key (called \term{row_names}) all in one place. The sequences are referenced by their \term{row_names} or \term{identifier} throughout most functions in the package. Using \term{row_names}, new information created with \Rpackage{DECIPHER} functions can be added as additional database columns to their respective sequences' rows in the database table. To prevent the database from seeming like a black box there is a function named \Rfunction{BrowseDB} that facilitates viewing of the database contents in a web browser. A similar function is available to view sequences called \Rfunction{BrowseSeqs}. The amount of DNA sequence information available is currently increasing at a phenomenal rate. \Rpackage{DECIPHER} stores individual sequences using a custom compression format, called \term{nbit}, so that the database file takes up much less drive space than a standard text file of sequences. The compressed sequences are stored in a hidden table that is linked to the main information table that the user interacts with regularly. For example, by default sequence information is stored in the table ``Seqs'', and the associated sequences are stored in the table ``_Seqs''. Storing the sequences in a separate table greatly improves access speed when there is a large amount of sequence information. Separating projects into distinct tables further increases query speed over that of storing every project in a single table. %------------------------------------------------------------ \section{Installation} %------------------------------------------------------------ \newlength\tindent \setlength{\tindent}{\parindent} \setlength{\parindent}{0pt} \label{sec:Installation} \subsection{Typical Installation (recommended)} \begin{enumerate} \item Install the latest version of \R{} from \underline{\url{http://www.r-project.org/}}, because the version of \Rpackage{DECIPHER} that will be installed from Bioconductor is dependent on the version of \R{}. \item Install the pre-built version of \Rpackage{DECIPHER} in R by entering: \begin{Schunk} \begin{Sinput} > if (!requireNamespace("BiocManager", quietly=TRUE)) + install.packages("BiocManager") > BiocManager::install("DECIPHER") \end{Sinput} \end{Schunk} \end{enumerate} Note that this automatic installation method does not enable multi-threading on macOS, where use of multiple processors requires manual installation. \subsection{Manual Installation} \subsubsection{All platforms} \begin{enumerate} \item Install the latest \R{} version from \underline{\url{http://www.r-project.org/}}. \item Install \Rpackage{Biostrings} in R by entering: \begin{Schunk} \begin{Sinput} > if (!requireNamespace("BiocManager", quietly=TRUE)) + install.packages("BiocManager") > BiocManager::install("Biostrings") \end{Sinput} \end{Schunk} \item Optionally, install the suggested package \Rpackage{RSQLite} in R by entering: \begin{Schunk} \begin{Sinput} > install.packages("RSQLite") \end{Sinput} \end{Schunk} \item Download \Rpackage{DECIPHER} from \underline{\url{http://DECIPHER.codes}} or \underline{\url{https://doi.org/doi:10.18129/B9.bioc.DECIPHER}{Bioconductor}}. \end{enumerate} \subsubsection{macOS} \begin{enumerate} \item First install Command Line Tools from Apple, which contains compliers that are required to build packages on the Mac. Then, in \R{} run: \begin{Schunk} \begin{Sinput} > install.packages("<>", repos=NULL) \end{Sinput} \end{Schunk} \end{enumerate} For parallelization on macOS, extra steps are required to \underline{\url{http://mac.r-project.org/openmp/}{enable OpenMP}} before install. In summary, run "clang -v" and then download the corresponding LLVM tar.gz file. Run the sudo tar command as shown in the OpenMP instructions, and then add these two lines to your $\sim$/.R/MAKEVARS file: CPPFLAGS += -Xclang -fopenmp\\ LDFLAGS += -lomp Then \Rpackage{DECIPHER} can be built and installed the usual way via the command line, as shown below for linux-alikes. \subsubsection{Linux} In a shell enter: \begin{Schunk} \begin{Sinput} R CMD build --no-build-vignettes "<>" R CMD INSTALL "<>" \end{Sinput} \end{Schunk} \subsubsection{Windows} Two options are available: the first is simplest, but requires the pre-built binary (DECIPHER.zip). \begin{enumerate} \item First Option: \begin{Schunk} \begin{Sinput} > install.packages("<>", repos=NULL) \end{Sinput} \end{Schunk} \item Second Option (more difficult): \begin{enumerate}[(a)] \item Install Rtools from \underline{\url{http://cran.r-project.org/bin/windows/Rtools/}}. Be sure to check the box that says edit PATH during installation. \item Open a MS-DOS command prompt by clicking Start -> All Programs -> Accessories -> Command Prompt. \item In the command prompt enter: \begin{Schunk} \begin{Sinput} R CMD build --no-build-vignettes "<>" R CMD INSTALL "<>" \end{Sinput} \end{Schunk} \end{enumerate} \end{enumerate} %------------------------------------------------------------ \section{Getting help} %------------------------------------------------------------ To get started we need to load the \Rpackage{DECIPHER} package, which automatically loads several other required packages: <>= library(DECIPHER) @ Help for any function can be accessed through a command such as: \begin{Schunk} \begin{Sinput} > ? DECIPHER \end{Sinput} \end{Schunk} Once \Rpackage{DECIPHER} is installed, a list of package vignettes (tutorials) can be found via: \begin{Schunk} \begin{Sinput} > browseVignettes("DECIPHER") \end{Sinput} \end{Schunk} %------------------------------------------------------------ \section{Functionality} %------------------------------------------------------------ The \Rpackage{DECIPHER} package contains many different functions. Figure \ref{f1} shows the relationships among functions based on how often they co-occur in the same \R{} script. Functions listed in the same color appear together in one of the package vignettes. Table \ref{t1} shows the timeline of publications associated with the flagship functions within \Rpackage{DECIPHER}. \begin{wrapfigure}{r}{1\textwidth} \includegraphics[width=1\textwidth]{FunctionalRelationships} \caption{\label{f1} Depiction of relationships among all functions in the package. Those near the center of tree are less related to any other function in particular, while the functions further from the center along the same branch are more related to each other.} \end{wrapfigure} \clearpage \begin{table} \caption{\label{t1} Evolution of the \Rpackage{DECIPHER} package for \R{}} \centering \begin{minipage}[t]{1.0\linewidth} \rule{\linewidth}{1pt} \ytl{2011}{\Rpackage{DECIPHER} first released in Bioconductor} \ytl{2012}{Chimera finding published \cite{Wright:2012} (\Rfunction{FindChimeras}, \Rfunction{FormGroups}, \Rfunction{FindChimeras})} \ytl{2012}{ProbeMelt published \cite{Yilmaz:2012} (\Rfunction{CalculateEfficiencyArray})} \ytl{2013}{Array design function published \cite{Wright:2013} (\Rfunction{DesignArray})} \ytl{2014}{Primer design functions published \cite{Wright:2014a} (\Rfunction{DesignPrimers}, \Rfunction{CalculateEfficiencyPCR}, \Rfunction{AmplifyDNA})} \ytl{2014}{FISH probe design functions published \cite{Wright:2014b} (\Rfunction{DesignProbes}, \Rfunction{CalculateEfficiencyFISH})} \ytl{2014}{Array analysis functions published \cite{Noguera:2014} (\Rfunction{Array2Matrix}, \Rfunction{NNLS})} \ytl{2015}{Protein alignment functions published \cite{Wright:2015} (\Rfunction{AlignSeqs}, \Rfunction{AlignProfiles}, \Rfunction{AlignTranslation}, \Rfunction{PredictHEC})} \ytl{2016}{Sequence databases published \cite{Wright:2016a} (\Rfunction{SearchDB}, \Rfunction{Codec}, \Rfunction{Add2DB}, \Rfunction{Seqs2DB}, \Rfunction{DB2Seqs}, \Rfunction{BrowseDB})} \ytl{2016}{General purpose primer design published \cite{Wright:2016b} (\Rfunction{DesignSignatures}, \Rfunction{MeltDNA}, \Rfunction{DigestDNA})} \ytl{2018}{IDTAXA for nucleotides published \cite{Murali:2018} (\Rfunction{IdTaxa}, \Rfunction{LearnTaxa})} \ytl{2020}{Nucleotide alignment functions published \cite{Wright:2020} (\Rfunction{AlignSeqs}, \Rfunction{PredictDBN})} \ytl{2021}{IDTAXA for proteins published \cite{Cooley:2021} (\Rfunction{IdTaxa}, \Rfunction{LearnTaxa})} \ytl{2022}{\Rfunction{FindNonCoding} published \cite{Wright:2022} (\Rfunction{FindNonCoding}, \Rfunction{LearnNonCoding})} \ytl{2024}{\Rfunction{Clusterize} published \cite{Wright:2024a} (\Rfunction{Clusterize})} \ytl{2024}{Search and mapping functions published \cite{Wright:2024b} (\Rfunction{IndexSeqs}, \Rfunction{SearchIndex}, \Rfunction{AlignPairs})} \bigskip \rule{\linewidth}{1pt}% \end{minipage}% \end{table} \clearpage \begin{thebibliography}{} \bibitem{Cooley:2021} {NP Cooley \& ES Wright} \newblock Accurate annotation of protein coding sequences with IDTAXA \newblock {\em NAR Genomics and Bioinformatics}, 3(3), lqab080, 2021. \bibitem{Murali:2018} {A Murali, A Bhargava, \& ES Wright} \newblock IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences \newblock {\em Microbiome}, 6, 1-14, 2018. \bibitem{Noguera:2014} {DR Noguera, ES Wright, P Camejo, \& LS Yilmaz} \newblock Mathematical tools to optimize the design of oligonucleotide probes and primers \newblock {\em Applied Microbiology and Biotechnology}, 98, 9595-9608, 2014. \bibitem{Wright:2012} {ES Wright, LS Yilmaz, \& DR Noguera} \newblock DECIPHER, a search-based approach to chimera identification for 16S rRNA sequences \newblock {\em Applied and Environmental Microbiology}, 78(3), 717-725, 2012. \bibitem{Wright:2013} {ES Wright, JM Strait, LS Yilmaz, GW Harrington, \& DR Noguera} \newblock Identification of Bacterial and Archaeal Communities From Source to Tap \newblock {\em Water Research Foundation}, 1-76. \bibitem{Wright:2014a} {ES Wright, LS Yilmaz, S Ram, JM Gasser, GW Harrington, \& DR Noguera} \newblock DECIPHER, a search-based approach to chimera identification for 16S rRNA sequences \newblock {\em Environmental Microbiology}, 16(5), 1354-1365, 2014. \bibitem{Wright:2014b} {ES Wright, LS Yilmaz, AM Corcoran, HE \"{O}kten, \& DR Noguera} \newblock Automated Design of Probes for rRNA-Targeted Fluorescence In Situ Hybridization Reveals the Advantages of Using Dual Probes for Accurate Identification \newblock {\em Applied and Environmental Microbiology}, 80(16), 5124-5133, 2014. \bibitem{Wright:2015} {ES Wright} \newblock DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment \newblock {\em BMC Bioinformatics}, 16(322), 1-14, 2015. \bibitem{Wright:2016a} {ES Wright} \newblock Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R \newblock {\em The R Journal}, 8(1), 352-359, 2016. \bibitem{Wright:2016b} {ES Wright \& KH Vetsigian} \newblock DesignSignatures: a tool for designing primers that yields amplicons with distinct signatures \newblock {\em Bioinformatics}, 32(10), 1565-1567, 2016. \bibitem{Wright:2020} {ES Wright} \newblock RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency \newblock {\em RNA}, 26(5), 531-540, 2020. \bibitem{Wright:2022} {ES Wright} \newblock FindNonCoding: rapid and simple detection of non-coding RNAs in genomes \newblock {\em Bioinformatics}, 38(3), 841-843, 2022. \bibitem{Wright:2024a} {ES Wright} \newblock Accurately clustering biological sequences in linear time by relatedness sorting \newblock {\em Nature Communications}, 15, 1-13, 2024. \bibitem{Wright:2024b} {ES Wright} \newblock Fast and Flexible Search for Homologous Biological Sequences with DECIPHER v3 \newblock {\em The R Journal}, \textit{in Press}. \bibitem{Yilmaz:2012} {LS Yilmaz, A Loy, ES Wright, M Wagner, \& DR Noguera} \newblock Modeling formamide denaturation of probe-target hybrids for improved microarray probe design in microbial diagnostics \newblock {\em PLOS ONE}, 7(8), e43862. \end{thebibliography} \end{document}