%\VignetteIndexEntry{How to forge a BSgenome data package} %\VignetteKeywords{Genome, BSgenome, DNA, Sequence, UCSC, BSgenome data package} %\VignettePackage{BSgenome} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \SweaveOpts{keep.source=TRUE} \documentclass[10pt]{article} %\usepackage{amsmath} %\usepackage[authoryear,round]{natbib} % % NOTE -- There is an obscure issue with the use of \url from the hyperref % package that will trigger a MiKTeX/pdflatex error: % ! pdfTeX error (ext4): \pdfendlink ended up in different nesting level than \pd % fstartlink. % \AtBegShi@Output ...ipout \box \AtBeginShipoutBox % \fi \fi % l.96 \end{document} % % ! ==> Fatal error occurred, no output PDF file produced! % Transcript written on BSgenomeForge1.log. % The error is hard to reproduce. I've observed it on the r34270 version of this % vignette and with the following version of the MiKTeX/pdflatex command: % MiKTeX-pdfTeX 2.7.3147 (1.40.9) (MiKTeX 2.7) \usepackage{hyperref} \usepackage{underscore} \textwidth=6.5in \textheight=8.5in \parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{\textsf{R}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\term}[1]{\emph{#1}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rmethod}[1]{\textit{#1}} \newcommand{\Rfunarg}[1]{\textit{#1}} \bibliographystyle{plainnat} \begin{document} \title{How to forge a BSgenome data package} \author{Herv\'e Pag\`es \\ Gentleman Lab \\ Fred Hutchinson Cancer Research Center \\ Seattle, WA} \date{\today} \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IMPORTANT NOTE: Starting with Bioconductor 2.14, the \Rfunction{forgeBSgenomeDataPkg} function doesn't handle masks anymore i.e. now it produces a \term{BSgenome data package} that contains only the bare sequences. A new function \Rfunction{forgeMaskedBSgenomeDataPkg} has been added to forge a \term{BSgenome data package} with masked sequences. This vignette has been updated to reflect these changes. This vignette describes the process of forging a \term{BSgenome data package}. It is intended for Bioconductor users who want to make a new \term{BSgenome data package}, not for regular users of these packages. Start {\R} (make sure you are using the latest release version), load the \Rpackage{BSgenome} package, and use the \Rfunction{available.genomes} function to get the list of \term{BSgenome data packages} available in the current release version of Bioconductor. So you confirm that none of those genomes suits your needs? And you want to make your own package? If you answer yes to these 2 questions, then you've come to the right place. Requirements: \begin{itemize} \item Some basic knowledge of the Unix/Linux command line is required. The commands that you will most likely need are: \code{cd}, \code{mkdir}, \code{mv}, \code{rmdir}, \code{tar}, \code{gunzip}, \code{unzip}, \code{ftp} and \code{wget}. Also you will need to create and edit some text files. \item You need access to a good Unix/Linux build machine with a decent amount of RAM (>= 4GB), especially if your genome is big. For smaller genomes, 2GB or even 1GB of RAM might be enough. \item You need the latest release versions of {\R} plus the \Rpackage{Biostrings} and \Rpackage{BSgenome} packages installed on the build machine. To check your installation, start {\R} and try to load the \Rpackage{BSgenome} package. \item Finally, you need to obtain the \term{source data files} of the genome that you want to build a package for. There are 2 groups of \term{source data files}: (1) the files containing the sequence data (those files are required), and (2) the files containing the mask data (those files are optional). For most organisms, these files have been made publicly available on the internet by genome providers like UCSC, NCBI, FlyBase, TAIR, etc. The next sections of this vignette explain how to obtain and prepare these files. \end{itemize} Refer to the \textit{R Installation and Administration} manual \footnote{\url{http://cran.r-project.org/doc/manuals/R-admin.html}} if you need to install {\R} or upgrade your {\R} version, and to the \textit{Bioconductor - Install} page \footnote{\url{http://bioconductor.org/install/}} on the Bioconductor website if you need to install or update the \Rpackage{Biostrings} or \Rpackage{BSgenome} packages. Questions, comments or bug reports about this vignette or about the BSgenomeForge functions are welcome. Please address them to the author (\code{hpages@fredhutch.org}) or post them on the Bioconductor support site \footnote{\url{https://support.bioconductor.org/}}. In the next section (``How to forge a BSgenome data package with bare sequences''), we describe how to forge a \term{BSgenome data package} with sequences that have no masks on them. If this is what you're after, then you'll be done at the end of the section. Only if you need to forge a package with masked sequences, you'll have to keep reading but you'll have to forge a \term{BSgenome data package} with bare sequences first. In this vignette, we call \term{target packages} the \term{BSgenome data packages} that we're going to forge (one target package with bare sequences and possibly a 2nd target package with masked sequences). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{How to forge a BSgenome data package with bare sequences} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Obtain and prepare the sequence data} \subsubsection{What do I need?} The sequence data must be in a single twoBit file (e.g. \textit{musFur1.2bit}) or in a collection of FASTA files (possibly gzip-compressed). If the latter, then you need 1 FASTA file per sequence that you want to put in the \term{target package}. In that case the name of each FASTA file must be of the form \textit{}\textit{}\textit{} where \textit{} is the name of the sequence in it and \textit{} and \textit{} are a prefix and a suffix (possibly empty) that are the same for all the FASTA files. \subsubsection{Examples} UCSC provides the genome sequences for Stickleback (gasAcu1) in various forms: (a) as a twoBit file (\textit{gasAcu1.2bit} located in \url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/bigZips/}, (b) as a big compressed tarball containing one file per chromosome (\textit{chromFa.tar.gz} located in the same folder), and (c) as a collection of compressed FASTA files (\texttt{chrI.fa.gz}, \texttt{chrII.fa.gz}, \texttt{chrIII.fa.gz}, ..., \texttt{chrXXI.fa.gz}, \texttt{chrM.fa.gz} and \texttt{chrUn.fa.gz} located in \url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/chromosomes/}). To forge a \term{BSgenome data package} for gasAcu1, the easiest would be to use (a) (the twoBit file). However it would also be possible to use (c) (the collection of compressed FASTA files). In that case the suffix would need to be set to \texttt{.fa.gz}. Note that the prefix here is empty and not \texttt{chr}, because \texttt{chr} is considered to be part of the sequence names (a commonly used chromosome naming convention at UCSC). Alternatively, the big compressed tarball (\texttt{chromFa.tar.gz}) could be used. After downloading and extracting it, we should end up with the same files as with (c). You can use the \Rfunction{fasta.info} function from the \Rpackage{Biostrings} package to see what's in a FASTA file: <<>>= library(Biostrings) file <- system.file("extdata", "ce2chrM.fa.gz", package="BSgenome") fasta.info(file) @ For the \Rpackage{BSgenome.Mfuro.UCSC.musFur1} package, the twoBit file was used and downloaded with: \begin{verbatim} wget http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/musFur1.2bit \end{verbatim} For the \Rpackage{BSgenome.Rnorvegicus.UCSC.rn4} package, the big compressed tarball was used. It was downloaded and extracted with: \begin{verbatim} wget http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/chromFa.tar.gz tar zxf chromFa.tar.gz \end{verbatim} \subsubsection{The \textit{} folder} So we assume that you've downloaded the \term{sequence data files} and that they are now located in a folder on your machine\footnote{Note that checking the md5sums after download is always a good idea.}. From now on, we'll refer to this folder as the \textit{} folder. Depending on what data files you downloaded (see previous sub-subsection), you might also need to extract and possibly rename the files. Note that all the \term{sequence data files} should be located directly in the \textit{} folder, not in subfolders of this folder. For example, depending on the genome, UCSC provides either a big \texttt{chromFa.tar.gz} or \texttt{chromFa.zip} file that contains the sequence data for all the chromosomes. But it could be that, after extraction of this big file, the individual FASTA files for each chromosome end up being located one level down the \textit{} folder (granted that you were in this folder when you extracted the file). If this is the case, then you will need to move them one level up (use \code{mv -i */*.fa .} for this, then remove all the empty subfolders with \code{rmdir *}). \subsection{Prepare the BSgenome data package seed file} \subsubsection{Overview} The \term{BSgenome data package seed file} will contain all the information needed by the \Rfunction{forgeBSgenomeDataPkg} function to forge the \term{target package}. The format of this file is DCF (Debian Control File), which is also the format used for the \texttt{DESCRIPTION} file of any {\R} package. The valid fields of a \term{seed file} are divided in 3 categories: \begin{enumerate} \item Standard \texttt{DESCRIPTION} fields. These fields are actually the mandatory fields found in any \texttt{DESCRIPTION} file. They will be copied to the \texttt{DESCRIPTION} file of the \term{target package}. \item Non-standard \texttt{DESCRIPTION} fields. These fields are specific to \term{seed files} and they will also be copied to the \texttt{DESCRIPTION} file of the \term{target package}. In addition, the values of these fields will be stored in the \Rclass{BSgenome} object that will be contained in the \term{target package}. This means that the users of the \term{target package} will be able to retrieve these values via the accessor methods defined for \Rclass{BSgenome} objects. See the man page for the \Rclass{BSgenome} class (\code{{?}`BSgenome-class`}) for a description of these methods. \item Additional fields that don't fall in the first 2 categories. \end{enumerate} The 3 following sub-subsections give an extensive description of all the valid fields of a \term{seed file}. Alternatively, the reader in a hurry can go directly to the last sub-subsection of this subsection for an example of a \term{seed file}. \subsubsection{Standard \texttt{DESCRIPTION} fields} \begin{itemize} \item \code{Package}: Name to give to the \term{target package}. The convention used for the packages built by the Bioconductor project is to use a name made of 4 parts separated by a dot. Part 1 is always \code{BSgenome}. Part 2 is the abbreviated name of the organism (when the name of the organism is made of 2 words, we put together the first letter of the first word in upper case followed by the entire second word in lower case e.g. \code{Rnorvegicus}). Part 3 is the name of the organisation who provided the genome (e.g. \code{UCSC}). Part 4 is the release string or number used by this organisation to identify this version of the genome (e.g. \code{rn4}). \item \code{Title}: The title of the \term{target package}. E.g. \code{Full genome sequences for Rattus norvegicus (UCSC version rn4)}. \item \code{Description}, \code{Version}, \code{Author}, \code{Maintainer}, \code{License}: Like the 2 previous fields, these are mandatory fields found in any \texttt{DESCRIPTION} file. Please refer to the \textit{The DESCRIPTION file} subsection of the \textit{Writing R Extensions} manual \footnote{\url{http://cran.r-project.org/doc/manuals/R-exts.html\#The-DESCRIPTION-file}} for more information about these fields. If you plan to distribute the \term{target package} that you are about to forge, please pickup the license carefully and make sure that it is compatible with the license of the \term{sequence data files} if any. \item \code{Suggests}: [OPTIONAL] If you're going to add examples to the \code{Examples} section of the \term{target package} (see \code{PkgExamples} field below for how to do this), use this field to list the packages used in these examples. \end{itemize} \subsubsection{Non-standard \texttt{DESCRIPTION} fields} \begin{itemize} \item \code{organism}: The scientific name of the organism in the format \code{Genus species} (e.g. \code{Rattus norvegicus}, \code{Homo sapiens}) or \code{Genus species subspecies} (e.g. \code{Homo sapiens neanderthalensis}). \item \code{common\_name}: The common name of the organism (e.g. \code{Rat} or \code{Human}). For organisms that have more than one commmon name (e.g. \code{Bovine} or \code{Cow} for Bos taurus), choose one. Note that for the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{SPECIES} column of the \textit{List of UCSC genome releases} table \footnote{\url{http://genome.ucsc.edu/FAQ/FAQreleases\#release1}}. \item \code{provider}: The provider of the \term{sequence data files} e.g. \code{UCSC}, \code{NCBI}, \code{BDGP}, \code{FlyBase}, etc... Should preferably match part 3 of the package name (field \code{Package}). \item \code{provider\_version}: The provider-side version of the genome. Should preferably match part 4 of the package name (field \code{Package}). For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{UCSC VERSION} field of the \textit{List of UCSC genome releases} table. \item \code{release\_date}: When this assembly of the genome was released. For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{RELEASE DATE} field of the \textit{List of UCSC genome releases} table. \item \code{release\_name}: The release name or build number of this assembly of the genome. For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{RELEASE NAME} field of the \textit{List of UCSC genome releases} table. \item \code{source\_url}: The permanent URL where the \term{sequence data files} used to forge the \term{target package} can be found. \item \code{organism\_biocview}: The official biocViews term for this organism. This is generally the same as the \code{organism} field except that spaces should be replaced by underscores. The value of this field matters only if the \term{target package} is going to be added to a Bioconductor repository since it will determine under which subview of the \textit{Bioconductor Task View} for \textit{Organism} \footnote{\url{http://bioconductor.org/packages/release/Organism.html}} the package will appear. Note that this is the only field in this category that won't be stored in the \Rclass{BSgenome} object that will be contained in the \term{target package}. \end{itemize} \subsubsection{Other fields} \begin{itemize} \item \code{BSgenomeObjname}: Should match part 2 of the package name (see \code{Package} field above). \item \code{seqnames}: [OPTIONAL] Needed only if you are using a collection of \term{sequence data files}. In that case \code{seqnames} must be an {\R} expression returning the names of the single sequences to forge (in a character vector). E.g. \code{paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), "\_random", sep="")), sep="")}. \item \code{circ\_seqs}: [OPTIONAL] An {\R} expression returning the names of the circular sequences (in a character vector). If the \code{seqnames} field is specified, then \code{circ\_seqs} must be a subset of it. E.g. \code{"chrM"} for rn4 or \code{c("chrM", "2micron")} for the sacCer2 genome (Yeast) from UCSC. The default value for \code{circ\_seqs} is \code{NULL} (no circular sequence). \item \code{mseqnames}: [OPTIONAL] An {\R} expression returning the names of the multiple sequences to forge (in a character vector). E.g. \code{c("random", "chrUn")} for \code{rn5}. The default value for \code{mseqnames} is \code{NULL} (no multiple sequence). \item \code{PkgDetails}: [OPTIONAL] Some arbitrary text that will be copied to the \code{Details} section of the man page of the \term{target package}. \item \code{SrcDataFiles}: [OPTIONAL] Some arbitrary text that will be copied to the \code{Note} section of the man pages of the \term{target package}. \code{SrcDataFiles} should describe briefly where the \term{sequence data files} are coming from. Will typically contain URLs (permanent URLs are a must). \item \code{PkgExamples}: [OPTIONAL] Some {\R} code (possibly with comments) that will be added to the \code{Examples} section of the man page of the \term{target package}. Don't forget to list the packages used in these examples in the \code{Suggests} field. \item \code{seqs\_srcdir}: The absolute path to the folder containing the \term{sequence data files}. \item \code{seqfile\_name}: Required if the \term{sequence data files} is a single twoBit file (e.g. \code{musFur1.2bit}). Ignored otherwise. \item \code{seqfiles\_prefix}, \code{seqfiles\_suffix}: [OPTIONAL] Needed only if you are using a collection of \term{sequence data files}. The common prefix and suffix that need to be added to all the sequence names (fields \code{seqnames} and \code{mseqnames}) to get the name of the corresponding FASTA file. Default values are the empty prefix for \code{seqfiles\_prefix} and \code{.fa} for \code{seqfiles\_suffix}. \item \code{ondisk\_seq\_format}: [OPTIONAL] Specifies how the single sequences should be stored in the \term{target package}. Can be \code{2bit}, \code{rda}, \code{fa.rz}, or \code{fa}. If \code{2bit} (the default), then all the single sequences are stored in a single twoBit file. If \code{rda}, then each single sequence is stored in a separated serialized \Rclass{XString} object (one per single sequence). If \code{fa.rz} or \code{fa}, then all the single sequences are stored in a single FASTA file (compressed in the RAZip format if \code{fa.rz}). \end{itemize} \subsubsection{An example} The \term{seed files} used for the packages forged by the Bioconductor project are included in the \Rpackage{BSgenome} package: <<>>= library(BSgenome) seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome") tail(list.files(seed_files, pattern="-seed$")) ## Display seed file for musFur1: musFur1_seed <- list.files(seed_files, pattern="\\.musFur1-seed$", full.names=TRUE) cat(readLines(musFur1_seed), sep="\n") ## Display seed file for rn4: rn4_seed <- list.files(seed_files, pattern="\\.rn4-seed$", full.names=TRUE) cat(readLines(rn4_seed), sep="\n") @ From now we assume that you have managed to prepare the \term{seed file} for your \term{target package}. \subsection{Forge the \term{target package}} To forge the \term{target package}, start \R, load the \Rpackage{BSgenome} package, and call the \Rfunction{forgeBSgenomeDataPkg} function on your \term{seed file}. For example, if the path to your \term{seed file} is \texttt{"path/to/my/seed"}, do: <>= library(BSgenome) forgeBSgenomeDataPkg("path/to/my/seed") @ Depending on the size of the genome and your hardware, this can take between 2 minutes and 1 or 2 hours. By default \Rfunction{forgeBSgenomeDataPkg} will create the source tree of the \term{target package} in the current directory. Once \Rfunction{forgeBSgenomeDataPkg} is done, ignore the warnings (if any), quit \R, and build the source package (tarball) with \begin{verbatim} R CMD build \end{verbatim} where \code{} is the path to the source tree of the package. Then check the package with \begin{verbatim} R CMD check \end{verbatim} where \code{} is the path to the tarball produced by \code{R CMD build}. Finally install the package with \begin{verbatim} R CMD INSTALL \end{verbatim} and use it! %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{How to forge a BSgenome data package with masked sequences} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Obtain and prepare the mask data} \subsubsection{What do I need?} The mask data are not available for all organisms. What you download exactly depends of course on what's available and also on what built-in masks you want to have in the \term{2nd target package}. 4 kinds of built-in masks are currently supported by BSgenomeForge: \begin{itemize} \item the masks of assembly gaps, aka ``the AGAPS masks''; \item the masks of intra-contig ambiguities, aka ``the AMB masks''; \item the masks of repeat regions that were determined by the RepeatMasker software, aka ``the RM masks''; \item the masks of repeat regions that were determined by the Tandem Repeats Finder software (where only repeats with period less than or equal to 12 were kept), aka ``the TRF masks''. \end{itemize} For the AGAPS masks, you need UCSC ``gap'' or NCBI ``agp'' files. It can be one file per chromosome or a single big file containing the assembly gap information for all the chromosomes together. In the former case, the name of each file must be of the form \textit{}\textit{}\textit{}. Like for the FASTA files in group 1, \textit{} must be the name of the sequence (sequence names for FASTA files and AGAPS masks must match) and \textit{} and \textit{} must be a prefix and a suffix (possibly empty) that are the same for all the files (this prefix/suffix doesn't need to be, and typically is not, the same as for the FASTA files in group 1). You don't need any file for the AMB masks. For the RM masks, you need RepeatMasker \texttt{.out} files. Like for the AGAPS masks, it can be one file per chromosome or a single big file containing the RepeatMasker information for all the chromosomes together. In the former case, the name of each file must also be of the form \textit{}\textit{}\textit{}. Same comments apply as for the AGAPS masks above. For the TRF masks, you need Tandem Repeats Finder \texttt{.bed} files. Again, it can be one file per chromosome or a single big file. In the former case, the name of each file must also be of the form \textit{}\textit{}\textit{}). Same comments apply as for the AGAPS masks above. Again, for some organisms none of the masks above are available or only some of them are. \subsubsection{An example} Here is how the \term{mask data files} for the \Rpackage{BSgenome.Rnorvegicus.UCSC.rn4} package were obtained and prepared: \begin{itemize} \item AGAPS masks: all the \texttt{chr*\_gap.txt.gz} files (UCSC ``gap'' files) were downloaded from the UCSC \textit{database} folder \footnote{\url{http://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/}} for \code{rn4}. This was done with the standard Unix/Linux \code{ftp} command: \begin{verbatim} ftp hgdownload.cse.ucsc.edu # login as "anonymous" cd goldenPath/rn4/database prompt mget chr*_gap.txt.gz \end{verbatim} Then all the downloaded files were uncompressed with: \begin{verbatim} for file in chr*_gap.txt.gz; do gunzip $file ; done \end{verbatim} \item RM masks: file \texttt{chromOut.tar.gz} was downloaded from the UCSC \textit{bigZips} folder and extracted with: \begin{verbatim} tar zxf chromOut.tar.gz \end{verbatim} \item TRF masks: file \texttt{chromTrf.tar.gz} was downloaded from the UCSC \textit{bigZips} folder and extracted with: \begin{verbatim} tar zxf chromTrf.tar.gz \end{verbatim} \end{itemize} \subsubsection{The \textit{} folder} From now we assume that you've downloaded (checking the md5sums is always a good idea) and possibly extracted all the \term{mask data files}, and that they are located in the \textit{} folder. Note that all the \term{mask data files} should be located directly in the \textit{} folder, not in subfolders of this folder. \subsection{Prepare the seed file for the masked BSgenome data package} \subsubsection{Overview} The \term{seed file} for the \term{BSgenome data package} with masked sequences (i.e. \term{2nd target package}) is similar to the \term{seed file} for the \term{BSgenome data package} containing the corresponding bare sequences (a.k.a. the \term{reference target package} i.e. the package you forged in the previous section). It will contain all the information needed by the \Rfunction{forgeMaskedBSgenomeDataPkg} function to forge the \term{2nd target package}. The fields of this \term{seed file} are described in the 3 following sub-subsections. Alternatively, the reader in a hurry can go directly to the last sub-subsection of this subsection for an example of \term{seed file}. \subsubsection{Standard \texttt{DESCRIPTION} fields} \begin{itemize} \item \code{Package}: Name to give to the \term{2nd target package}. The convention used for the packages built by the Bioconductor project is to use the same name as the \term{reference target package} with the \code{.masked} suffix added to it. \item \code{Title}: The title of the package. E.g. \code{Full masked genome sequences for Rattus norvegicus (UCSC version rn4)}. \item \code{Description}, \code{Version}, \code{Author}, \code{Maintainer}, \code{License}: See previous section. \end{itemize} \subsubsection{Non-standard \texttt{DESCRIPTION} fields} \begin{itemize} \item \code{organism\_biocview}: Same as for the \term{reference target package}. \item \code{source\_url}: The permanent URL where the \term{mask data files} used to forge this package can be found. \end{itemize} \subsubsection{Other fields} \begin{itemize} \item \code{RefPkgname}: The name of the \term{reference target package}. \item \code{nmask\_per\_seq}: The number of masks per sequence (1 to 4). \item \code{PkgDetails}, \code{PkgExamples}: See previous section. \item \code{SrcDataFiles}: [OPTIONAL] Some arbitrary text that will be copied to the \code{Note} section of the man pages of the \term{target package}. \code{SrcDataFiles} should describe briefly where the \term{mask data files} are coming from. Will typically contain URLs (permanent URLs are a must). \item \code{masks\_srcdir}: The path to the \textit{} folder. \item \code{AGAPSfiles\_type}: [OPTIONAL] Must be \code{gap} (the default) if the \term{source data files} containing the AGAPS masks information are UCSC ``gap'' files, or \code{agp} if they are NCBI ``agp'' files. \item \code{AGAPSfiles\_name}: [OPTIONAL] Omit this field if you have one \term{source data file} per single sequence for the AGAPS masks and use the \code{AGAPSfiles\_prefix} and \code{AGAPSfiles\_suffix} fields below instead. Otherwise, use this field to specify the name of the single big file. \item \code{AGAPSfiles\_prefix}, \code{AGAPSfiles\_suffix}: [OPTIONAL] Omit these fields if you have one single big \term{source data file} for all the AGAPS masks and use the \code{AGAPSfiles\_name} field above instead. Otherwise, use these fields to specify the common prefix and suffix that need to be added to all the single sequence names (field \code{seqnames}) to get the name of the file that contains the corresponding AGAPS mask information. Default values are the empty prefix for \code{AGAPSfiles\_prefix} and \code{\_gap.txt} for \code{AGAPSfiles\_suffix}. \item \code{RMfiles\_name}, \code{RMfiles\_prefix}, \code{RMfiles\_suffix}: [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above but for the RM masks. Default values are the empty prefix for \code{RMfiles\_prefix} and \texttt{.fa.out} for \code{RMfiles\_suffix}. \item \code{TRFfiles\_name}, \code{TRFfiles\_prefix}, \code{TRFfiles\_suffix}: [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above but for the TRF masks. Default values are the empty prefix for \code{TRFfiles\_prefix} and \texttt{.bed} for \code{TRFfiles\_suffix}. \end{itemize} \subsubsection{An example} The \term{seed files} used for the packages forged by the Bioconductor project are included in the \Rpackage{BSgenome} package: <<>>= library(BSgenome) seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome") tail(list.files(seed_files, pattern="\\.masked-seed$")) rn4_masked_seed <- list.files(seed_files, pattern="\\.rn4\\.masked-seed$", full.names=TRUE) cat(readLines(rn4_masked_seed), sep="\n") @ From now we assume that you have managed to prepare the \term{seed file} for the \term{2nd target package}. \subsection{Forge the \term{2nd target package}} To forge the package, start \R, make sure the \term{reference target package} is installed (try to load it), load the \Rpackage{BSgenome} package, and call the \Rfunction{forgeMaskedBSgenomeDataPkg} function on the \term{seed file} you made for the \term{2nd target package}. For example, if the path to your \term{seed file} is \texttt{"path/to/my/seed"}, do: <>= library(BSgenome) forgeMaskedBSgenomeDataPkg("path/to/my/seed") @ Depending on the size of the genome and your hardware, this can take between 2 minutes and 1 or 2 hours. By default \Rfunction{forgeMaskedBSgenomeDataPkg} will create the source tree of the \term{2nd target package} in the current directory. Once \Rfunction{forgeMaskedBSgenomeDataPkg} is done, ignore the warnings (if any), quit \R, and build the source package (tarball) with \begin{verbatim} R CMD build \end{verbatim} where \code{} is the path to the source tree of the package. Then check the package with \begin{verbatim} R CMD check \end{verbatim} where \code{} is the path to the tarball produced by \code{R CMD build}. Finally install the package with \begin{verbatim} R CMD INSTALL \end{verbatim} and use it! If you want to distribute this package, you need to distribute it with the \term{reference target package} since it depends on it. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session information} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The output in this vignette was produced under the following conditions: <<>>= sessionInfo() @ \end{document}