% \VignetteIndexEntry{SAGElyzer}
% \VignetteDepends{tkWidgets, annotate}
% \VignetteKeywords{Expression Analysis}
% \VignettePackage{SAGElyzer}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\documentclass[12pt]{article}
\usepackage{hyperref}
\usepackage{graphicx}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in


\begin{document}
\author{Jianhua Zhang}

\title{An introduction to SAGElyzer}

\maketitle

\copyright{2003 Bioconductor}

\section{Introduction}

\Rpackage{SAGElyzer} is a system for SAGE data management, analysis,
and annotation with a graphical interface built using R tcltk. Its
functionality is not exhaustive but can be extended easily when
needed. At the time of this release (1.3.0), \Rpackage{SAGElyzer}
allows users to manage data from SAGE libraries, analyze the data, and
annotate SAGE tags identified by the analysis.

As SAGE libraries are potentially large, a database is required to
support data storage and retrieval. The current version of
\Rpackage{SAGElyzer} has been tested against a PostgreSQL database
under both Windows and Unix. However, \Rpackage{SAGElyzer} is expected
to work with any database management system as long as a
connection to the database can be made.

\section{Setting up a database}

Once a database management system has been chosen, consult its manual or
tutorials on the Internet for setting up the database. For a PostgreSQL
database on Linux, readers are referred to
\url{http://techrepublic.com.com/5100-6261-1054332.html} for the
procedures involved in setting up a database.

Windows users need additional steps to set up a DSN (data source
name).
\url{http://www.webwizguide.info/asp/tutorials/setting_up_dsn.asp}
provides step by step instructions on how to make connections to an
existing database through a DSN.

\section{Using SAGElyzer}

\Rfunction{SAGElyzer} operates on potentially large data sets. Memory
may be an issue under Windows. Windows users are
advised to (a) keep as few applications open as possible and (b)
increase the memory allocated to R. One way to increase the memory
allocated to R is to right click the R icon, click \textit{Properties}, and
append \texttt{--max-mem-size=XXXM} (where XXX is the desired memory size
in megabytes, which may vary between systems) to the end of the text in
the entry box for \textit{Target}. For example, suppose we have
\begin{verbatim}
"C:\Program Files\R\rw1080\bin\Rgui.exe"
\end{verbatim}
in the entry box for \textit{Target}, the text will become
\begin{verbatim}
"C:\Program Files\R\rw1080\bin\Rgui.exe" --max-mem-size=512M
\end{verbatim}
if we would like to have 512M allocated to R.

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer1}
     \caption{A snapshot of the widget when \Rfunction{SAGElyzer} is loaded}
     \label{Figure 1}
   \end{center}
\end{figure}

\Rfunction{SAGElyzer} is launched with the call below (the check for an
interactive session keeps the code from running during automatic package
building):

<<>>=
if(interactive()){
    SAGElyzer()
}
@

The widget shown in Figure 1 will appear. The widget contains a
\textit{Connect} button in the upper right corner, two boxes in the
middle, and a status bar at the bottom. As a connection to a database
is always required, the \textit{Connect} button is the only
interactive feature at this point. When the \textit{Connect} button
is clicked, a widget pops up prompting users for the inputs needed to
make a connection to an existing database. The widget looks different
depending on the operating system. Figure 2 shows the widget for
Unix. The one for Windows does not have the entry boxes for
\textit{Database}, \textit{User}, \textit{Password}, or \textit{Host}
but has an entry box for \textit{DSN} instead. Entry boxes that are
blank have to be filled in by the user (if an argument is not required,
e.g.\ a password, the entry box for that argument can be left empty).
Some of the entry boxes are already filled with default values. Users
may change those values. However, if a user is not sure what to
enter, it is usually safe to keep the default values.

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer2}
     \caption{A snapshot of the widget for taking inputs for database
     connection}
     \label{Figure 2}
   \end{center}
\end{figure}

For both Windows and Unix systems, three database tables named by
the entries in the last three entry boxes (\textit{Counts},
\textit{Info}, and \textit{Map}) will be created and kept until they
are updated. The table for \textit{Counts} contains counts of SAGE tags
across libraries. The table for \textit{Info} contains information
about the original SAGE libraries and about the database table that
stores counts for SAGE tags across libraries. The table for
\textit{Map} contains mappings between SAGE tags and UniGene ids.

After a connection to an existing database has been made, the tasks
that can be performed using \Rfunction{SAGElyzer} are made available
through the \textit{Tasks} box. Clicking a button in that box lists the
procedures involved in the task as buttons in the \textit{Procedures}
box. Each button in the \textit{Procedures} box performs one job
related to the task. Figure 3 shows the result after a database
connection was made and the \textit{Manage Data} button in the
\textit{Tasks} box was clicked.

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer4}
     \caption{A snapshot of the widget when the
       \textit{Manage Data} button in \textit{Tasks} was clicked after
       a connection to an existing database had been made}
     \label{Figure 3}
   \end{center}
\end{figure}

If \Rfunction{SAGElyzer} is installed for the first time, users have
to go through the procedures of \textit{Manage Data} to have database
tables created for later use. Below are descriptions of the three
procedures of \textit{Manage Data}:

\begin{description}
\item[Get GEO SAGE] downloads all the SAGE libraries available at the Gene
  Expression Omnibus (GEO, \url{http://www.ncbi.nlm.nih.gov/geo/}) for
  a given organism. The downloaded SAGE library data files are saved
  in a local directory specified by the user and used to create database
  tables later.
\item[Integrate SAGE] integrates data from SAGE libraries stored
    locally and writes the integrated data to database tables for
    later use. The procedure populates the tables for
    \textit{Counts} and \textit{Info}.
\item[Map SAGE] downloads data that map SAGE tags to UniGene ids and
    stores the mappings in the database table for \textit{Map} for
    later use.
\end{description}

Now, suppose we would like to get SAGE libraries for human from
\textit{GEO} and click \textit{Get GEO SAGE}. The resulting
widget (Figure 4) prompts for the name of the organism
we are interested in (\textit{Organism}), the name of an existing
directory in which to save the downloaded SAGE libraries (\textit{Save To}),
and the URL from which SAGE libraries are available (\textit{Source
  URL}). The default value for \textit{Source URL} was correct at the
time of writing.

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer5}
     \caption{A snapshot of the widget when the
       \textit{Get GEO SAGE} button in \textit{Procedures} was clicked}
     \label{Figure 4}
   \end{center}
\end{figure}

If we keep the defaults and click \textit{Continue}, the procedure
takes a while to finish because there are quite a few files
to be downloaded from the source. Downloaded SAGE library files
have the extension \texttt{.sage}. If you do not want to wait, you may
skip this procedure, as two sample SAGE data files (with the extension
\texttt{.test}) are stored in the temp directory for you to use for now.

When we have SAGE libraries (downloaded or created) stored in a local
directory, we can invoke \textit{Integrate SAGE} to integrate the data
and write them to database tables. Clicking \textit{Integrate SAGE} invokes
another widget (Figure 5) that takes inputs for four arguments. The input
for \textit{Library} can be the name of a file or of a directory
containing SAGE library data files. The radio buttons for
\textit{Directory} have to be set to TRUE if \textit{Library} is the
name of a directory or FALSE if it is a file. The value for \textit{Skip}
determines how many rows to skip from the top of each SAGE
library data file when \Rfunction{SAGElyzer} reads the data. The value
for \textit{Pattern} tells \Rfunction{SAGElyzer} to process only
data files whose names have the extension or match the pattern
defined by the user.
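
The role of the four inputs can be illustrated outside the widget with a
few lines of base R. The sketch below is not the code used by
\Rfunction{SAGElyzer}: the helper name \texttt{integrateLibs}, the
two-column tag and count file layout, and the toy files are assumptions
made for illustration only, and the result is kept in a data frame rather
than written to a database.

\begin{verbatim}
# Hypothetical helper, not part of SAGElyzer: merge the tag counts from
# the files in a directory (or from a single file) into one table.
# 'path' plays the role of the Library input of the widget.
integrateLibs <- function(path, directory = TRUE, skip = 0,
                          pattern = "\\.sage$") {
    files <- if (directory) {
        list.files(path, pattern = pattern, full.names = TRUE)
    } else {
        path
    }
    merged <- NULL
    for (f in files) {
        # Assumed layout: two whitespace-separated columns (tag, count)
        lib <- read.table(f, skip = skip,
                          col.names = c("tag", basename(f)),
                          colClasses = c("character", "numeric"))
        merged <- if (is.null(merged)) {
            lib
        } else {
            merge(merged, lib, by = "tag", all = TRUE)
        }
    }
    merged[is.na(merged)] <- 0  # a tag absent from a library counts as 0
    merged
}

# Two toy libraries written to a temporary directory
d <- file.path(tempdir(), "sagedemo")
dir.create(d, showWarnings = FALSE)
writeLines(c("aaaaaaaaaa 5", "cccccccccc 2"), file.path(d, "lib1.test"))
writeLines(c("aaaaaaaaaa 1", "gggggggggg 7"), file.path(d, "lib2.test"))
counts <- integrateLibs(d, directory = TRUE, skip = 0,
                        pattern = "\\.test$")
print(counts)
\end{verbatim}

Each file contributes one column of counts, and a tag that does not occur
in a library receives a count of zero for that library.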

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer6}
     \caption{A snapshot of the widget when the
       \textit{Integrate SAGE} button in \textit{Procedures} was clicked}
     \label{Figure 5}
   \end{center}
\end{figure}

In our example, we have saved the downloaded (or stored) SAGE library
data from \textit{GEO} to the directory defined by the default value for
\textit{Library}. As SAGE library data from \textit{GEO} have no
headers, we set the value for \textit{Skip} to 0 and set
\textit{Directory} to TRUE. If you have downloaded SAGE library files
from \textit{GEO} and are willing to wait, keep the default value
for \textit{Pattern} and click \textit{Continue}. Otherwise, change
the value for \textit{Pattern} from \texttt{.sage} to \texttt{.test} to
process only the two sample files and click \textit{Continue}. The system
will integrate the data and write them to the database. Again, the
procedure takes a while to finish if you choose to process data files
downloaded from \textit{GEO}. While any of the procedures is running, the
status bar at the bottom of the main \Rfunction{SAGElyzer} widget reads
``Running procedure 'XXX'. Please wait.''

\textit{Integrate SAGE} requires that all the data files to be
processed be stored in the same directory, consistently either have or
lack a header, and follow the same pattern in their names. These
requirements are usually not a problem when only SAGE data downloaded
from \textit{GEO} are used. If both downloaded and local data are
used, the files have to be arranged to satisfy the requirements before
the data are integrated.

Now we have the SAGE libraries downloaded from \textit{GEO} written
to the database. The next step is to get the mappings
from SAGE tags to UniGene ids. If we click \textit{Map SAGE},
another widget (Figure 6) prompts for three arguments. \textit{DB
  Table Name} is the name of the database table where the mappings
will be stored. \textit{Map} indicates whether the mappings are from
SAGE tags to UniGene ids or from UniGene ids to SAGE tags. \textit{Source URL}
is the URL for the source file (SAGEmap\_tag\_ug-rel.zip) containing
mappings between SAGE tags and UniGene ids. In our example, we keep
the defaults and click \textit{Continue}. The procedure will
finish after a while.
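
The two directions offered by \textit{Map} can be pictured with a small
base R sketch. The tags and UniGene ids below are made up, and the data
frame only stands in for the database table; how \Rfunction{SAGElyzer}
stores the mappings internally is not shown here.

\begin{verbatim}
# Toy mapping table; tags and UniGene ids are made up for illustration
tag2ug <- data.frame(tag = c("aaaaaaaaaa", "cccccccccc", "gggggggggg"),
                     unigene = c("Hs.1", "Hs.2", "Hs.1"),
                     stringsAsFactors = FALSE)

# SAGE tag to UniGene id: group the ids by tag
tagToUG <- split(tag2ug$unigene, tag2ug$tag)

# UniGene id to SAGE tag: the same table grouped the other way
ugToTag <- split(tag2ug$tag, tag2ug$unigene)

ugToTag[["Hs.1"]]
\end{verbatim}

Both lookups are built from the same pairs, so the choice of direction
only affects which column serves as the key.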

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer7}
     \caption{A snapshot of the widget when the
       \textit{Map SAGE} button in \textit{Procedures} was clicked}
     \label{Figure 6}
   \end{center}
\end{figure}

The task \textit{Manage Data} only needs to be performed when
\Rfunction{SAGElyzer} is first installed or when the existing database
tables need to be updated. The database tables created will be available
whenever \Rfunction{SAGElyzer} is loaded thereafter.

Once the database tables have been created in a previous session, we
can invoke the procedures that rely on the database tables for
analysis or annotation. If we click \textit{knn} in \textit{Tasks}, the
\textit{Procedures} box will be populated with five buttons with the
following functionalities:

\begin{description}
\item[Set arguments] sets values for the arguments that are required to
    perform knn.
\item[Run knn] performs knn using the arguments set by \textit{Set
      arguments}.
\item[Get counts] gets counts for the tags found by \textit{knn} across
    the selected SAGE libraries.
\item[Map SAGE] maps the tags found by \textit{knn} to UniGene ids and
    links the UniGene ids to the UniGene web site.
\item[Find neighbor genes] finds genes neighboring the tags found by
    \textit{knn} within a given range upstream and downstream on the
    chromosome. This procedure requires that a data package
    \textit{XXXCHRLOC} (available from Bioconductor) be installed,
    where XXX is the name of the organism of interest (e.g.\ ``human'').
\end{description}

\textit{Set arguments} and \textit{Run knn} have to be run in sequence
before the other procedures can show any useful results. Let's run
\textit{Set arguments} first. The arguments needed for knn are
listed by the resulting widget shown in Figure 7.

\begin{figure}[htb]
   \begin{center}
     \includegraphics{SLyzer8}
     \caption{A snapshot of the widget when the
       \textit{Set Arguments} button in \textit{Procedures} was
       clicked. The snapshot was from an example using the two test
       SAGE data files.}
     \label{Figure 7}
   \end{center}
\end{figure}

The widget has an entry box for \textit{Target tag}, the tag for which a
given number of tags (defined by \textit{k value}) with a similar pattern
of expression will be found across the libraries selected by the
user. \Rfunction{SAGElyzer} currently only takes ten-letter tags for
NlaIII. If a user does not select any library using the widget, all
the libraries will be used to identify the tags with a similar pattern
of expression. Values for \textit{Normalization}, \textit{Distance},
\textit{k value}, and \textit{Transformation} can be chosen from a set
of options. Let's type
in ``aaaaaaaaaa'' for \textit{Target tag}, select two libraries, set
\textit{k value} to 50, keep the default values for the others, and
click \textit{Continue}.

When the arguments have been set, we can click \textit{Run knn} to
find similarly expressed tags. The time the procedure takes to finish
depends on the size of the data set involved. The result
is a widget showing a list of
tags and the calculated distances. The user has the option to save
or discard the result.
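
The idea behind the search can be sketched in base R. This is an
illustration under stated assumptions, not the implementation used by
\Rfunction{SAGElyzer}: the helper name \texttt{nearestTags}, the
normalization to counts per 10,000 tags, and the Euclidean distance are
choices made for the example, and the options offered by the widget may
differ.

\begin{verbatim}
# Hypothetical helper, not SAGElyzer's implementation: return the k tags
# whose count profiles are closest to that of the target tag.
nearestTags <- function(counts, target, k) {
    # Normalize each library (column) to counts per 10,000 tags
    norm <- sweep(counts, 2, colSums(counts), "/") * 10000
    # Euclidean distance from every tag profile to the target profile
    diffs <- sweep(norm, 2, norm[target, ], "-")
    d <- sqrt(rowSums(diffs^2))
    d <- d[names(d) != target]       # drop the target itself
    sort(d)[seq_len(min(k, length(d)))]
}

# Toy count matrix: rows are tags, columns are libraries
counts <- matrix(c(10, 20,
                    9, 21,
                    1, 40,
                   30,  2),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("aaaaaaaaaa", "aaaaaaaaac",
                                   "cccccccccc", "gggggggggg"),
                                 c("lib1", "lib2")))
res <- nearestTags(counts, target = "aaaaaaaaaa", k = 2)
print(res)
\end{verbatim}

The returned vector is named by tag and holds the distances in increasing
order, which mirrors the list of tags and distances shown by the widget.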

Users can also view the counts for the tags identified by
\textit{Run knn} across the selected libraries using \textit{Get
  counts}. Again, the result is a widget that lists the tags and
their counts across the selected libraries. The procedure may take a
while to finish when the database table is large or a large number of
SAGE libraries have been selected.

\textit{Map SAGE} generates an HTML file that maps the identified
tags to UniGene ids. Clicking the UniGene id mapped to any of the tags
takes the user to the UniGene web site, where information about the
UniGene cluster is available.

\textit{Find neighbor genes} invokes a widget with the tags identified
by \textit{Run knn} mapped to UniGene ids. The mappings are listed in
a set of list boxes. Clicking any of the entries in the list boxes
shows the LocusLink ids of genes that are within a specified range
upstream and downstream of the tag along the chromosome on which the tag
is located.
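
The range search can be pictured as a simple filter on chromosomal
positions. The sketch below uses a made-up table of gene locations and a
hypothetical helper \texttt{neighborGenes}; it does not read the
\textit{XXXCHRLOC} data packages, whose layout is not assumed here.

\begin{verbatim}
# Made-up gene locations (positions in base pairs), for illustration only
genes <- data.frame(locuslink = c("101", "102", "103", "104"),
                    chromosome = c("1", "1", "1", "2"),
                    start = c(1000, 52000, 400000, 51000),
                    stringsAsFactors = FALSE)

# Hypothetical helper: LocusLink ids of genes on the given chromosome
# that start within 'range' bases upstream or downstream of 'pos'
neighborGenes <- function(genes, chromosome, pos, range) {
    near <- genes$chromosome == chromosome &
        abs(genes$start - pos) <= range
    genes$locuslink[near]
}

nb <- neighborGenes(genes, chromosome = "1", pos = 50000, range = 50000)
print(nb)
\end{verbatim}

Genes on other chromosomes are excluded regardless of position, and a
wider range returns more neighbors.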

\end{document}