---
title: "How To Use GOstats and Category to do Hypergeometric testing
with unsupported model organisms"
authors:
  - name: "M. Carlson"
  - name: "Sonali Kumari"
    affiliation: "Vignette translation from Sweave to Rmarkdown / HTML"
date: "`r format(Sys.time(), '%B %d , %Y')`"
package: GOstats
vignette: >
  %\VignetteIndexEntry{How To Use GOstats and Category to do Hypergeometric testing with unsupported model organisms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

# Introduction

This vignette is meant as an extension of what already exists in the
[GOstatsHyperG](GOstatsHyperG.html) vignette. It is intended to explain how a
user can run hypergeometric testing on GO terms or KEGG terms when the organism
in question is not supported by an annotation package. The 1st thing for a user
to determine then is whether or not their organism is supported by an organism
package through `r Biocpkg("AnnotationForge")`. In order to do that, they need
only to call the available.dbschemas() method from `r Biocpkg("AnnotationForge")`.

```{r available Schemas, message=FALSE, warning=FALSE}
library("AnnotationForge")
available.dbschemas()
```

If the organism that you are using is listed here, then your organism is
supported. If not, then you will need to find a source or GO (org KEGG) to gene
mappings. One source for GO to gene mappings is the blast2GO project. But you
might also find such mappings at Ensembl or NCBI. If your organism is not a
typical model organism, then the GO terms you will find are probably going to be
predictions based on sequence similarity measures instead of direct
measurements. This is something that you might want to bear in mind when you
draw conclusions later.

In preparation for our subsequent demonstrations, lets get some data to work
with by borrowing from an organism package. We will assume that you will use
something like `read.table` to load your own annotation packages into a
data.frame object. The starting object needs to be a data.frame with the GO Id's
in the 1st col, the evidence codes in the 2nd column and the gene Id's in the
3rd.

```{r Acquire annotation data, message=FALSE}
library("org.Hs.eg.db")
frame = toTable(org.Hs.egGO)
goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
head(goframeData)
```

## Preparing GO to gene mappings

When using GO mappings, it is important to consider the data structure of the GO
ontologies. The terms are organized into a directed acyclic graph. The structure
of the graph creates implications about the mappings of less specific terms to
genes that are mapped to more specific terms. The `r Biocpkg("Category")` and 
`r Biocpkg("GOstats")` packages normally deal with this kind of complexity by 
using a special GO2All mapping in the annotation packages. You won't have one of
those, so instead you will have to make one. `r Biocpkg("AnnotationDbi")`
provides some simple tools to represent your GO term to gene mappings so that
this process will be easy. First you need to put your data into a GOFrame
object. Then the simple act of casting this object to a GOAllFrame object will
tap into the GO.db package and populate this object with the implicated GO2All
mappings for you.

```{r transformGOFrame, message=FALSE, warning=FALSE}
goFrame=GOFrame(goframeData,organism="Homo sapiens")
goAllFrame=GOAllFrame(goFrame)
```

In an effort to standardize the way that we pass this kind of custom
information around, we have chosen to support geneSetCollection objects
from `r Biocpkg("GSEABase")` package. You can generate one of these objects
in the following way:

```{r Make GSC, message=FALSE}
library("GSEABase")
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
```

## Setting up the parameter object

Now we can make a parameter object for `r Biocpkg("GOstats")` by using a special
constructor function. For the sake of demonstration, I will just use all the EGs
as the universe and grab some arbitrarily to be the geneIds tested. For your
case, you need to make sure that the gene IDs you use are unique and that they
are the same type for both the universe, the geneIds and the IDs that are part
of your geneSetCollection.

```{r make parameter, message=FALSE}
library("GOstats")
universe = Lkeys(org.Hs.egGO)
genes = universe[1:500]
params <- GSEAGOHyperGParams(name="My Custom GSEA based annot Params", 
                             geneSetCollection=gsc, 
                             geneIds = genes, 
                             universeGeneIds = universe, 
                             ontology = "MF", 
                             pvalueCutoff = 0.05, 
                             conditional = FALSE, 
                             testDirection = "over")
```

And finally we can call `hyperGTest` in the same way we always have before.

```{r call HyperGTest}
Over <- hyperGTest(params)
head(summary(Over))
```

## Preparing KEGG to gene mappings

This is much the same as what you did with the GO mappings except for two
important simplifications. First of all you will no longer need to track
evidence codes, so the object you start with only needs to hold KEGG Ids and
gene IDs. Seconly, since KEGG is not a directed graph, there is no need for a
KEGG to All mapping. Once again I will borrow some data to use as an example.
Notice that we have to put the KEGG Ids in the left hand column of our initial
two column data.frame.

```{r KEGGFrame object}
frame = toTable(org.Hs.egPATH)
keggframeData = data.frame(frame$path_id, frame$gene_id)
head(keggframeData)
keggFrame=KEGGFrame(keggframeData,organism="Homo sapiens")
```

The rest of the process should be very similar.

```{r KEGG Parameters}
gsc <- GeneSetCollection(keggFrame, setType = KEGGCollection())
universe = Lkeys(org.Hs.egGO)
genes = universe[1:500]
kparams <- GSEAKEGGHyperGParams(name="My Custom GSEA based annot Params", 
                               geneSetCollection=gsc, 
                               geneIds = genes, 
                               universeGeneIds = universe,  
                               pvalueCutoff = 0.05, 
                               testDirection = "over")
kOver <- hyperGTest(params)
head(summary(kOver))
```

```{r info}
sessionInfo()
```