---
title: "Exporatory data analysis by querying the ToppGene Suite"
shorttitle: "toppgene"
author:
  - name: Pariksheet Nanda
    affiliation: University of Pittsburgh
    email: pan79@pitt.edu
package: toppgene
abstract: >  
  The ToppGene Suite is a one-stop portal for gene list enrichment analysis and
  candidate gene prioritization based on functional annotations and protein
  interactions network.  Although the ToppCluster web application provides
  convenient graphical access to the ToppGene Suite, the OpenAPI 3.0 compliant
  interface of ToppGene is better suited for automation and reproducibility.
  This package was initial generated from OpenAPI Generator and supplemented
  with Bioconductor class interfaces and more relevant biological examples.
bibliography: references.bib
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{toppgene}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitr-opts}
#| include = FALSE
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

<!-- markdownlint-disable MD025 -->

# Overview

The `r BiocStyle::Biocpkg("toppgene")` package
is a client for the ToppGene Suite webserver
that takes as input a gene list
to perform enrichment analysis.

To demonstrate the use of ToppGene,
below are the two test cases from the publication [@chen_improved_2007]
of congenital heart disease (CHD) and diabetic retinopathy (DR).

# Installation

To install this package, start R and enter:

```{r install-from-bioc, eval = FALSE}
if (! require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("toppgene")
```

# Usage

## Prepare the gene lists

A query may contain one or more genes.
ToppGene `enrich()` requires gene Entrez ID integers.
However, symbol conversion with ToppGene
is more permissive than Bioconductor,
therefore use ToppGene's `lookup()` function
to convert gene symbols to Entrez IDs.
The published example provides gene symbols for CHD (n = 28) and DR (n = 27)
that we will also use here.

```{r setup}
genes_chd_sym <- c(
    "ADD1", "CITED2", "DTNA", "CKM", "GATA4", "GJA1", "HAND1", "HAND2", "HEY2",
    "HOXC4", "HOXC5", "ITGB3", "JARID2", "MTHFD1", "MTHFR", "MTRR", "NKX2-5",
    "NOS3", "NPPA", "NPPB", "RFC1", "SALL4", "TBX1", "TBX5", "TBX20",
    "TGFB1", "ZFPM2", "ZIC3")
genes_dr_sym <- c(
    "ACE", "ADRB3", "AGT", "AGTR2", "AKR1B1", "APOE", "AR", "CMA1", "EDN1",
    "GNB3", "HFE", "HLA-DPB1", "HLA-DRB1", "ICAM1", "ITGA2B", "ITGB2", "LTA",
    "NOS2A", "NOS3", "NPY", "PECAM1", "PON1", "RAGE", "SELE", "SERPINE1",
    "TIMP3", "TNF")
```

## Convert gene symbol IDs to Entrez IDs

```{r toppgene-lookup}
library(toppgene)

genes_chd <- lookup(genes_chd_sym)
genes_chd
genes_dr <- lookup(genes_dr_sym)
genes_dr
```

## Run enrichment queries

```{r toppgene-enrich}
enrich_chd <- enrich(genes_chd$Entrez)
enrich_chd
enrich_dr <- enrich(genes_dr$Entrez)
enrich_dr
```

## View enrichment of publication top-ranked gene

```{r toppgene-compare}
library(IRanges) # CharacterList
library(DFplyr)  # (DataFrame support for various dplyr functions)
## Show all DataFrame rows of top_results().
orig <- options(showHeadLines = 20L)

top_results <- function(df) {
    df |>
        group_by(Category) |>
        slice(1) |>
        ungroup() |>
        ## Shorten GeneOntology to GO.
        mutate(Category = gsub(x = Category, "GeneOntology", "GO")) |>
        select(Category, ID, Name, GenesSymbol)
}
enrich_chd |>
    filter(any(GenesSymbol %in% CharacterList("HAND2"))) |>
    top_results()
enrich_dr |>
    filter(any(GenesSymbol %in% CharacterList("HLA-DPB1"))) |>
    top_results()

options(showHeadLines = orig)
```

## Convert drug database identifiers to PubChem CIDs

```{r toppgene-pubchem}
enrich_chd |>
    lookup_pubchem()
enrich_dr |>
    lookup_pubchem()
```

## Change default limits of enrichment queries

One can change the various cut-offs of a query using the
`CategoriesDataFrame()` to limit or expand the number of results.

```{r toppgene-modify-defaults}
## Default cut-offs.
cats <- CategoriesDataFrame()
cats

## Limit to 10 results for each category, and lower PValue for GeneOntology.
cats <-
    cats |>
    mutate(
        PValue = case_when(
            grepl("GeneOntology", rownames(cats)) ~ 1e-7,
            .default = PValue),
        MaxResults = 10L)
cats

enrich_chd_mod <-
    enrich(
        genes_chd$Entrez,
        cats)
enrich_chd_mod

## MaxResults limited to at most 10.
enrich_chd_mod |>
    count(Category)

## PValue limited to below 1e-7.
enrich_chd_mod |>
    arrange(desc(PValue)) |>
    filter(grepl(x = Category, "Onto")) |>
    group_by(Category) |>
    slice(1)
```

# Session Info {.unnumbered}

```{r session-info}
sessionInfo()
```

# References {.unnumbered}