%\VignetteIndexEntry{TCC}
%\VignettePackage{TCC}

\documentclass[oneside,12pt]{article}
\usepackage[a4paper,left=2.2cm,top=2.6cm,bottom=2.3cm,right=2.2cm]{geometry}
\usepackage{cite}
\usepackage{color}
\usepackage{Sweave}
%%\usepackage{here}
\SweaveOpts{pdf=TRUE}




\definecolor{TccBlue}{cmyk}{0.74,0.26,0,0.14}
\definecolor{TccRed}{cmyk}{0,1,0.78,0.1}
\definecolor{TccGreen}{cmyk}{0.99,0,0.29,0.47}
\definecolor{TccOrange}{cmyk}{0,0.27,1,0.03}

\renewcommand{\floatpagefraction}{0.9}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\Robject{#1}}}
\newcommand{\Rpackage}[1]{\textbf{#1}}
\newcommand{\Rclass}[1]{{\texttt{#1}}}

\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=n,formatcom=\color{TccRed}, fontsize=\footnotesize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontshape=n,formatcom=\color{TccBlue}, fontsize=\footnotesize}
\fvset{baselinestretch=0.9}

\author {\small Jianqiang Sun$^{1\S}$, Tomoaki Nishiyama$^{2\S}$, Kentaro Shimizu$^1$, and Koji Kadota$^1$\\
\texttt{\footnotesize $^1$ The University of Tokyo, Tokyo, Japan}\\
\texttt{\footnotesize $^2$ Kanazawa University, Kanazawa, Japan}\\
\texttt{\footnotesize $^\S$ Maintainer: Jianqiang Sun (sun@biunit.dev),}\\
\texttt{\footnotesize Tomoaki Nishiyama (tomoakin@staff.kanazawa-u.ac.jp)}
}
\title{TCC: Differential expression analysis for tag count data
with robust normalization strategies}

\begin{document}

\maketitle\thispagestyle{empty}

\begin{abstract}
The R/Bioconductor package, \Rpackage{TCC}, provides users with a
robust and accurate framework to perform differential expression (DE)
analysis of tag count data. We developed a multi-step
normalization method  (TbT; Kadota et al., 2012) for
two-group RNA-seq data. The strategy (called DEGES) is to remove data
that are potential differentially expressed genes (DEGs) before
performing the data normalization. DEGES in \Rpackage{TCC} is essential
for accurate normalization of tag count data, especially when the up-
and down-regulated DEGs in one of the groups are extremely biased in
their number. A major characteristic of \Rpackage{TCC} is to provide
the DEGES-based normalization methods for several kinds of count data
(two-group, multi-group, and so on)
by virtue of the use of combinations of functions in other sophisticated
packages (especially \Rpackage{edgeR}).
The appropriate combination provided by \Rpackage{TCC}
allows a more robust and accurate estimation to be performed more easily
than directly using original packages and \Rpackage{TCC} provides a simple
unified interface to perform the robust normalization.
\end{abstract}

\newpage

\tableofcontents

\newpage

\section{Introduction}

Differential expression analysis based on tag count data has become a 
fundamental task for identifying differentially expressed genes or 
transcripts (DEGs). The \Rpackage{TCC} package (Tag Count Comparison; 
Sun et al., 2013) provides users with a robust and accurate 
framework to perform differential expression analysis of tag count data. 
\Rpackage{TCC} provides integrated analysis pipelines with improved 
data normalization steps, compared with other packages such as 
\Rpackage{edgeR}
by appropriately combining their functionalities. 
The package incorporates multi-step normalization methods whose strategy
is to remove data that are potential DEGs before performing 
the data normalization.

Kadota et al. (2012) recently reported that 
the normalization methods implemented in R packages 
(such as \Rpackage{edgeR} (Robinson et al., 2010), 
and \Rpackage{DESeq} (Anders and Huber, 2010) for
differential expression (DE) analysis between samples 
are inadequate when the up- and down-regulated DEGs in one of the samples 
are extremely biased in their number (i.e., biased DE). This is because the 
current methods implicitly assume a balanced DE, 
wherein the numbers of highly and lowly expressed DE entities in 
samples are (nearly) equal. As a result, 
methods assuming unbiased DE will not work well on data with biased DE. 
Although a major purpose of data normalization is to detect such DE entities, 
their existence themselves consequently interferes with their opportunity to 
be top-ranked. Conventional procedures for identifying DEGs from tag count 
data consisting of two steps (i.e., data normalization and identification of 
DEGs) cannot in principle eliminate the potential DE entities before 
data normalization.

To normalize data that potentially has various scenarios (including unbiased 
and biased DE), we recently proposed a multi-step normalization strategy 
(called TbT, an acronym for the TMM-baySeq-TMM pipeline; Kadota 
et al., 2012), in which the TMM normalization method 
(Robinson and Oshlack, 2010) is used in steps 1 and 3 
and an empirical Bayesian method implemented 
in the baySeq package 
(Hardcastle and Kelly, 2010) is used in step 2.
Although this multi-step DEG elimination strategy (called "DEGES" 
for short) can successfully remove potential DE entities identified in 
step 2 prior to the estimation of the normalization factors using the TMM 
normalization method in step 3, the baySeq package used in step 
2 of the TbT method is much more computationally intensive than competing 
packages like \Rpackage{edgeR} and \Rpackage{DESeq2}. While the three-step 
TbT normalization method performed best on simulated and real tag count data, 
it is practically possible to make different choices for the methods in 
each step. A more comprehensive study regarding better choices for DEGES 
is needed. 

This package provides tools to perform multi-step normalization methods 
based on DEGES and enables differential expression analysis of tag count 
data without having to worry much about biased distributions of DEGs. 
The DEGES-based normalization function implemented in \Rpackage{TCC} includes 
the TbT method based on DEGES for two-group data, 
much faster method, and methods for multi-group comparison. \Rpackage{TCC} 
provides a simple unified interface to perform data normalization with 
combinations of functions provided by \Rpackage{DESeq2}
and \Rpackage{edgeR}. Functions to produce simulation data under various 
conditions and to plot the data are also provided.

\subsection{Installation}
\label{section-1-1}

This package is available from the Bioconductor website
(http://bioconductor.org/). 
To install the package, enter the following command after starting R:


\begin{Schunk}
\begin{Sinput}
> if (!requireNamespace("BiocManager", quietly=TRUE))
    > install.packages("BiocManager")
> BiocManager::install("TCC")
\end{Sinput}
\end{Schunk}


\subsection{Citations}
\label{section-1-2}

This package internally uses many of the functions implemented in the other 
packages. This is because our normalization procedures consist, in part, of 
combinations of existing normalization methods and differential expression 
(DE) methods. 

For example, the TbT normalization method 
(Kadota et al., 2012), which is 
a particular functionality of the \Rpackage{TCC} package 
(Sun et al., 2013), 
consists of the TMM normalization method 
(Robinson and Oshlack, 2010) implemented 
in the \Rpackage{edgeR} package 
(Robinson et al., 2010) and the empirical Bayesian 
method implemented in the \Rpackage{baySeq} package 
(Hardcastle and Kelly, 2010). 
Therefore, please cite the appropriate references 
when you publish your results.

\begin{Schunk}
\begin{Sinput}
> citation("TCC")
\end{Sinput}
\end{Schunk}


\subsection{Quick start}
\label{section-1-3}

Let us begin by showing a example of identifying DEGs 
between two groups from tag count data consisting of $1,000$ 
genes and a total of six samples 
(each group has three biological replicates). The hypothetical 
count data (termed "\Robject{hypoData}") is stored in this package 
(for details, see section \ref{section-2-1}).

DE analysis for two-group count data with replicates by using 
the F-test (see \Rfunction{glmQLFTest} in \Rpackage{edgeR} package) coupled with
iterative DEGES/edgeR normalization (i.e., the iDEGES/edgeR-edgeR 
combination). This is an alternative pipeline designed to reduce 
the runtime (approx. $20$ sec.), yet its performance is comparable to 
the above pipeline. Accordingly, we recommend using this pipeline as 
a default when analyzing tag count data with replicates. 
A notable advantage of this pipeline is that the multi-step 
normalization strategy only needs the methods 
implemented in the \Rpackage{edgeR} package. The suggested citations are as 
follows: \Rpackage{TCC} (Sun et al., 2013), 
TMM (Robinson and Oshlack, 2010), 
and \Rpackage{edgeR} (Robinson et al., 2010). 
For details, see section \ref{section-3-1-3} and \ref{section-4-1-1}.

<<3408021901>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@

\newpage

\section{Preparations}

\subsection{Preparing the count data}
\label{section-2-1}

Similar to the other packages, \Rpackage{TCC} typically starts the DE analysis
with a count table matrix where each row indicates a gene (or transcript),
each column indicates a sample (or library),
and each cell indicates the number of counts for a gene in a sample.
There are many ways to obtain the count data from aligned read files
such as BAM (Binary Alignment/Map).
This includes the \Rfunction{islandCounts} function in
\Rpackage{htSeqTools} (Planet et al., 2012),
\Rfunction{summarizeOverlaps} in \Rpackage{GenomicRanges}
(Lawrence et al., 2013), \Rfunction{qCount} in \Rpackage{QuasR}, and so on.
For Windows end users, we recommend to use the \Rpackage{QuasR} package.
It provides a comprehensive workflow for many kinds of NGS data including
ChIP-seq, RNA-seq, smallRNA-seq, and BS-seq.
The main functionalities are: (1) preprocessing raw sequence reads,
(2) aligning reads to the reference genome or transcriptome using
\Rpackage{Rbowtie}, and (3) producing count matrix from aligned reads.
\Rpackage{TCC} uses the raw count data (a matrix of integer values) as input.
As also clearly mentioned in \Rpackage{DESeq2},
the raw count data corresponds to a matrix of integer values.
Please DO NOT use any normalized counts such as RPM (reads per million),
CPM (counts per million), and RPKM. 


\subsection{Reading the count data}
\label{ss_reading_the_count_data}

Here, we assume a hypothetical count matrix consisting of
1,000 rows (or genes) and a total of six columns:
the first three columns are produced from biological replicates
of Group 1 (G1\_rep1, G1\_rep2, and G1\_rep3) and the remaining
columns are from Group 2 (G2\_rep1, G2\_rep2, and G2\_rep3).
We start by loading the hypothetical data (\Robject{hypoData})
from \Rpackage{TCC} and giving a numeric vector (\Robject{group})
indicating which group each sample belongs to.


<<4378919285>>=
library(TCC)
data(hypoData)
head(hypoData)
dim(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
@

If you want to analyze another count matrix consisting of nine columns 
(e.g., the first four columns are produced from biological replicates 
of G1, and the remaining five columns are from G2), the \Robject{group} 
vector should be indicated as follows.

<<1209835923>>=
group <- c(1, 1, 1, 1, 2, 2, 2, 2, 2)
@


\subsection{Constructing TCC class object}
\label{section-2-2}

The \Rfunction{new} function has to be used to perform the main 
functionalities of \Rpackage{TCC}. 
This function constructs a \Rpackage{TCC} class object, 
and subsequent analyses are performed on this class object. The object 
is constructed from i) a count matrix (\Robject{hypoData}) and ii) the 
corresponding numeric vector (\Robject{group}) as follows.

<<4389570100>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc
@

The count matrix and group vector information can be retrieved from 
the stored class object by using \Robject{tcc\$count} 
and \Robject{tcc\$group}, respectively.

<<5048219812>>=
head(tcc$count)
tcc$group
@


\subsection{Filtering low-count genes (optional)}
\label{section-2-3}

The way to filter out genes with low-count tags across samples depends 
on the user's philosophy. Although we recommend removing tags with zero 
counts across samples as a minimum filtering, this effort is optional. 
The \Rfunction{filterLowCountGenes} function performs this filtering.

<<4085249378>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- filterLowCountGenes(tcc)
dim(tcc$count)
@

It can be seen that $\Sexpr{nrow(hypoData) - nrow(tcc$count)} (= \Sexpr{nrow(hypoData)} -\Sexpr{ nrow(tcc$count)})$ 
genes were filtered as non-expressed. 
The same procedure can be performed without 
the \Rfunction{filterLowCountGenes} function,
in which case the filtering is performed before
the \Rpackage{TCC} class object is constructed.

<<filter_low_count_genes>>=
filter <- as.logical(rowSums(hypoData) > 0)
dim(hypoData[filter, ])
tcc <- new("TCC", hypoData[filter, ], group)
dim(tcc$count)
@

\newpage

\section{Normalization}
\label{s_normalization}

This chapter describes the details of our robust normalization strategy
(i.e., DEGES) implemented in \Rpackage{TCC}. Alternative DEGES procedures
consisting of functions in other packages (\Rpackage{edgeR} and \Rpackage{DESeq2})
are also provided,
enabling {\it advanced} users familiar with the existing packages can easily
learn what is done in \Rpackage{TCC}.
Note that {\it end} users, who just want to perform robust
differential expression analysis using \Rpackage{TCC},
can skip this chapter (\ref{s_normalization} Normalization) and
move from here to, for example, the next chapter
(\ref{s_differential_expression} Differential expression).
Note also that the purpose here is to obtain
accurate normalization factors to be used with
statistical models (e.g., the exact test or
empirical Bayes) for the DE analysis described in the next
section (\ref{s_differential_expression} Differential expression).
\Rpackage{TCC} can manipulate various kinds of experimental designs.
Followings are some examples for individual designs.
Appropriate sections should be referred for your specific experimental designs.

\begin{table}[h]
  \begin{center}
  \caption{\ref{ss_norm_twogroup_reps} unpaired samples (two-group)}
    \begin{tabular}{|ccc|ccc|}
      \hline
      Label & Example 1 & Example 2 & Label & Example 3 & Example 4 \\ \hline
      G1\_rep1 & G1(mouse\_A) & Tumor(patient\_A) & G1\_rep1 & G1(mouse\_A) & Tumor(patient\_A) \\
      G1\_rep2 & G1(mouse\_B) & Tumor(patient\_B) & G1\_rep2 & G1(mouse\_B) & Tumor(patient\_B) \\
      G1\_rep3 & G1(mouse\_C) & Tumor(patient\_C) & G2\_rep1 & G2(mouse\_D) & Normal(patient\_G) \\
      G2\_rep1 & G2(mouse\_D) & Normal(patient\_G) & G2\_rep2 & G2(mouse\_E) & Normal(patient\_H) \\
      G2\_rep2 & G2(mouse\_E) & Normal(patient\_H) &  &  &  \\
      G2\_rep3 & G2(mouse\_F) & Normal(patient\_K) &  &  &  \\ \hline
    \end{tabular}
  \end{center}
\end{table}

\begin{table}[h]
  \begin{center}
  \caption{\ref{ss_norm_twogroup_without_reps} paired samples (two-group)}
    \begin{tabular}{|ccc|ccc|}
      \hline
      Label & Example 1 & Example 2 & Label & Example 3 & Example 4 \\ \hline
      G1\_rep1 & G1(mouse\_A) & Tumor(patient\_G) & G1\_rep1 & G1(mouse\_B) & Tumor(patient\_J) \\
      G1\_rep2 & G1(mouse\_B) & Tumor(patient\_H) & G1\_rep2 & G1(mouse\_C) & Tumor(patient\_K) \\
      G1\_rep3 & G1(mouse\_C) & Tumor(patient\_K) & G2\_rep1 & G2(mouse\_B) & Normal(patient\_J) \\
      G2\_rep1 & G2(mouse\_A) & Normal(patient\_G) & G2\_rep2 & G2(mouse\_C) & Normal(patient\_K) \\
      G2\_rep2 & G2(mouse\_B) & Normal(patient\_H) &  &  &  \\
      G2\_rep3 & G2(mouse\_C) & Normal(patient\_K) &  &  &  \\ \hline
    \end{tabular}
  \end{center}
\end{table}

\newpage

\begin{table}[h]
  \begin{center}
  \caption{\ref{ss_norm_multifactor_reps} unpaired samples (multi-group)}
    \begin{tabular}{|ccc|ccc|}
      \hline
      Label & Example 1 & Example 2 & Label & Example 3 & Example 4 \\ \hline
      G1\_rep1 & G1(mouse\_A) & Kidney(sample\_A) & G1\_rep1 & G1(mouse\_A) & Liver(sample\_G) \\
      G1\_rep2 & G1(mouse\_B) & Kidney(sample\_B) & G1\_rep2 & G1(mouse\_B) & Liver(sample\_H) \\
      G1\_rep3 & G1(mouse\_C) & Kidney(sample\_C) & G2\_rep1 & G2(mouse\_D) & Brain(sample\_Y) \\
      G2\_rep1 & G2(mouse\_D) & Liver(sample\_G)  & G2\_rep2 & G2(mouse\_E) & Brain(sample\_Z) \\
      G2\_rep2 & G2(mouse\_E) & Liver(sample\_H)  & G3\_rep1 & G3(mouse\_U) & Kidney(sample\_B) \\
      G2\_rep3 & G2(mouse\_F) & Liver(sample\_K)  & G3\_rep2 & G3(mouse\_T) & Kidney(sample\_C) \\
      G3\_rep1 & G3(mouse\_G) & Brain(sample\_X)  &  &  &  \\
      G3\_rep2 & G3(mouse\_H) & Brain(sample\_Y)  &  &  &  \\
      G3\_rep3 & G3(mouse\_I) & Brain(sample\_Z)  &  &  &  \\ \hline
    \end{tabular}
  \end{center}
\end{table}




\subsection{Normalization of two-group count data with replicates}
\label{ss_norm_twogroup_reps}

This package provides robust normalization methods based on DEGES 
proposed by Kadota et al. (2012). When obtaining normalization 
factors from two-group data with replicates, users can select 
a total of six combinations (two normalization methods $\times$ 
three DEG identification methods) coupled with an arbitrary number 
of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based 
normalization pipeline. We show some of the practical combinations 
below.

Since the three-step TbT normalization method was originally designed 
for normalizing tag count data with (biological) replicates,
we present three shorter alternatives (\ref{section-3-1-2} DEGES/edgeR, 
\ref{section-3-1-3} iDEGES/edgeR, and \ref{section-3-1-4} 
DEGES/DESeq2).



\subsubsection{DEGES/edgeR}
\label{section-3-1-2}

Now let us describe an alternative approach that is roughly $200$-$400$ times 
faster than DEGES/TbT, yet has comparable performance. The 
TMM-edgeR-TMM pipeline (called DEGES/edgeR) 
employs the exact test implemented in \Rpackage{edgeR} in step 2. 
To use this pipeline, we have to provide a reasonable threshold for 
defining potential DEGs in step 2. We will define the threshold as 
an arbitrary false discovery rate (FDR) with 
a floor value of $P_{\mbox{\tiny DEG}}$. 
The default FDR is $< 0.1$, and the default floor $P_{\mbox{\tiny DEG}}$ is 
$5\%$, but different choices are of course possible. For example, 
in case of the default settings, $x\% (x > 5\%)$ of the top-ranked 
potential DEGs are eliminated in step 2 if the percentage ($= x\%$) of 
genes satisfying FDR $< 0.1$ is over $5\%$. The DEGES/edgeR 
pipeline has an apparent advantage over TbT in computation time. 
It can be performed as follows:

<<degesEdger_tcc>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@

The normalization factors calculated from the DEGES/edgeR 
are very similar to those of DEGES/TbT with the default parameter 
settings (i.e., \Robject{samplesize = 10000}). For \Rpackage{edgeR} 
users, we provide commands, consisting of functions in \Rpackage{edgeR}, 
to perform the DEGES/edgeR pipeline without \Rpackage{TCC}. 
The \Rfunction{calcNormFactors} function together 
with the above parameter settings 
can be regarded as a wrapper function for the following commands.

<<degesEdger_edger>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
design <- model.matrix(~group)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
out <- glmQLFTest(fit, coef = 2)
pval <- out$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) /
                  colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@


\subsubsection{iDEGES/edgeR}
\label{section-3-1-3}

Our multi-step normalization can be repeated until the calculated 
normalization factors converge (Kadota et al., 2012). 
An iterative version of DEGES/TbT (i.e., iDEGES/TbT) can be described 
as the TMM-(baySeq-TMM)$_{n}$ pipeline with $n \ge 2$. 
Although the iDEGES/TbT would not be practical in terms of 
the computation time, the TMM-(edgeR-TMM)$_{n}$ pipeline 
(iDEGES/edgeR) is potentially superior to both the 
DEGES/edgeR and the DEGES/TbT. A suggested 
iDEGES/edgeR implementation ($n = 3$) 
consists of seven steps, as follows:

<<idegesEdger_tcc>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@


\subsubsection{DEGES/DESeq2}
\label{section-3-1-4}

The DEGES pipeline can also be performed by using only the
functions in the \Rpackage{DESeq2} package.
Similar to the \Rpackage{DESeq} case above,
this DESeq2-DESeq2-DESeq2 pipeline (DEGES/DESeq2) changes
the corresponding arguments of the
\Robject{norm.method} and \Robject{test.method} as follows:

<<4390811048>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", test.method = "deseq2",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@

For \Rpackage{DESeq2} users, we also provide commands,
consisting of functions in \Rpackage{DESeq2},
to perform the DEGES/DESeq2 pipeline without \Rpackage{TCC}.
The \Rfunction{calcNormFactors} function together with the above
arguments can be regarded as a wrapper function for the following commands.

<<0869034832>>=
library(TCC)
data(hypoData)
FDR <- 0.1
floorPDEG <- 0.05
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl,
                              design = design)
### STEP 1 ###
dds <- estimateSizeFactors(dds)
sizeFactors(dds) <- sizeFactors(dds) / mean(sizeFactors(dds))

### STEP 2 ###
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
result <- results(dds)
result$pvalue[is.na(result$pvalue)] <- 1
pval <- result$pvalue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
dds <- DESeqDataSetFromMatrix(countData = hypoData[!is.DEG, ], colData = cl,
                              design = design)
dds <- estimateSizeFactors(dds)
norm.factors <- sizeFactors(dds) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@




\subsection{Normalization of two-group count data without replicates (paired)}
\label{ss_norm_twogroup_without_reps}


\Rpackage{edgeR} and \Rpackage{DESeq2} employs generalized
linear models (GLMs) which can apply to detect DEGs from paired two-group count data.
When obtaining normalization factors from paired two group samples,
users can select a total of four combinations
(two normalization methods $\times$ two DEG identification methods) coupled with
an arbitrary number of iterations ($n = 0, 1, 2, \cdots, 100$) in our DEGES-based
normalization pipeline.
The analysis for paired samples can be performed by indicating
(1) the pair information when constructing the \Robject{TCC} class object and
(2) \Robject{paired = TRUE} when performing the \Robject{calcNormFactors} function.
For analyzing paired data, we here use the hypothetical count data
(\Robject{hypoData}; for details see \ref{ss_reading_the_count_data}) by changing
the label information, i.e.,

<<9347533928>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
head(hypoData)
@

This count data consists of three individuals (or sibships),
dogA, dogB, and dogC. "T" and "N" indicate tumor and normal tissues, respectively.



\subsubsection{DEGES/edgeR}
\label{norm_twogroup_without_reps_degesedger}

The DEGES/edgeR pipeline for two-group paired data can be performed as follows:

<<2394782901>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05, paired = TRUE)
tcc$norm.factors
@

For \Rpackage{edgeR} users, we provide commands,
consisting of functions in \Rpackage{edgeR},
to perform the DEGES/edgeR pipeline without \Rpackage{TCC}.
The \Rfunction{calcNormFactors} function together with the
above parameter settings can be regarded as a wrapper function
for the following commands.


<<8939487592>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- factor(c(1, 1, 1, 2, 2, 2))
pair <- factor(c(1, 2, 3, 1, 2, 3))
design <- model.matrix(~ group + pair)
coef <- 2
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
out <- glmQLFTest(fit, coef = coef)
pval <- out$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))){
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) /
colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@



\subsection{Normalization of multi-group count data with replicates}
\label{ss_norm_multifactor_reps}

Many R packages (including \Rpackage{edgeR})
support DE analysis for multi-group tag count data. 
\Rpackage{TCC} provides some prototypes of DEGES-based pipelines for 
such data. Here, we analyze another hypothetical three-group count matrix, 
the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. 
It consists of $1,000$ genes and a total of nine columns for 
testing any difference among three groups that each have triplicates.

<<load_hypoData_mg>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc
dim(tcc$count)
@

Of the $1,000$ genes, the first $200$ genes are DEGs and the remaining $800$
genes are non-DEGs. The breakdowns for the $200$ DEGs are as follows: 
$140$, $40$, and $20$ DEGs are up-regulated in Groups 1, 2, and 3. 
Below, we show a DEGES-based normalization pipeline for this 
multi-group data here.


\subsubsection{DEGES/edgeR}
\label{section-3-3-2}

\Rpackage{edgeR} employs generalized linear models (GLMs) to find DEGs 
between any of the groups. The DEGES/edgeR normalization 
pipeline in \Rpackage{TCC} internally uses functions for the GLM approach 
that require two models (a full model and a null model). The full model 
corresponds to a design matrix to describe sample groups. The null model 
corresponds to the model coefficients. 
The two models can be defined as follows:

<<degesEdger_multigroup_tcc>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
@

The design matrix (\Robject{design}) can be constructed by using the
\Robject{model.matrix} function. For the model coefficients (\Robject{coef}),
the user should specify all the coefficients except for the intercept term.
The DEGES/edgeR pipeline with the two models (\Robject{design} and \Robject{coef})
can be performed as follows.

<<degesEdger_multigroup_tcc_nf>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, design = design, coef = coef)
tcc$norm.factors
@

The two models (\Robject{design} and \Robject{coef}) will automatically
be generated when performing the following \Rfunction{calcNormFactors} function
if those models are not explicitly indicated. That is

<<3498028981>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
tcc$norm.factors
@

For \Rpackage{edgeR} users, we provide commands, consisting of functions in 
\Rpackage{edgeR}, to perform the DEGES/edgeR pipeline without 
\Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the 
above parameter settings can be regarded as a wrapper function for the 
following commands.

<<degesEdger_multigroup_edger>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
FDR <- 0.1
floorPDEG <- 0.05
design <- model.matrix(~ as.factor(group))
coef <- 2:ncol(design)
d <- DGEList(counts = hypoData_mg, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
out <- glmQLFTest(fit, coef = coef)
result <- as.data.frame(topTags(out, n = nrow(hypoData_mg)))
result <- result$table[rownames(hypoData_mg), ]
pval <- out$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(qval < FDR)
} else  {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData_mg[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData_mg[!is.DEG, ]) /
                colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@


\subsection{Retrieving normalized data}
\label{s_retr_norm_data}

Similar functions for calculating normalization factors are the 
\Rfunction{calcNormFactors} function in \Rpackage{edgeR} and the 
\Rfunction{estimateSizeFactors} function in \Rpackage{DESeq2}. Note that the 
terminology used in \Rpackage{DESeq2} (i.e., size factors) is 
different from that used in \Rpackage{edgeR} (i.e., effective 
library sizes) and ours. The effective library size in \Rpackage{edgeR} 
is calculated as the library size multiplied by the normalization factor. 
The size factors in both \Rpackage{DESeq2} package are comparable to 
the {\it normalized} effective library sizes wherein the summary statistics 
for the effective library sizes are adjusted to one. Our normalization 
factors, which can be obtained from \Robject{tcc\$norm.factors}, 
have the same names as those in \Rpackage{edgeR}. Accordingly, the 
normalization factors calculated from \Rpackage{TCC} with arbitrary 
options should be manipulated together with the library sizes when 
normalized read counts are to be obtained. Since biologists are often 
interested in such information (Dillies et al., 2012), 
we provide the \Rfunction{getNormalizedData} function 
for retrieving normalized data. 

Note that the \Robject{hypoData} consists of $1,000$ genes and a total 
of six samples (three biological replicates for G1 and three biological 
replicates for G2); i.e., \{G1\_rep1, G1\_rep2, G1\_rep3\} vs. 
\{G2\_rep1, G2\_rep2, G2\_rep3\}. These simulation data have basically 
the same conditions as shown in Fig. 1 of the TbT paper (Kadota et 
al., 2012); i.e., (i) the first $200$ genes are 
DEGs ($P_{\mbox{\tiny DEG}} = 200/1000 = 20\%$), 
(ii) the first $180$ genes of the $200$ DEGs are higher in G1 
($P_{\mbox{\tiny G1}} = 180/200 = 90\%$), 
and the remaining $20$ DEGs are higher in G2, and (iii) the level of 
DE is four-fold. The last $800$ genes were designed to be non-DEGs. 
The different normalization strategies can roughly be evaluated in terms 
of the similarity of their summary statistics for {\it normalized} data 
labeled as non-DEGs in one group (e.g., G1) to those of the other group 
(e.g., G2). The basic statistics for the non-DEGs are as follows.

<<hypoData_deg_label>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
summary(hypoData[nonDEG, ])
@

From now on, we will display only the median values for simplicity, i.e.,

<<calc_median_of_nondeg_hypoData>>=
apply(hypoData[nonDEG, ], 2, median)
@
<<calc_median_of_nondeg_hypoData_hide, echo = FALSE>>=
hypoData.median <- apply(hypoData[nonDEG, ], 2, median)
hypoData.14.median <- apply(hypoData[nonDEG, c(1, 4)], 2, median)
@

In what follows, we show detailed examples using \Robject{hypoData}. 
Note, however, that the basic usage is simple.

<<get_normalized_data>>=
normalized.count <- getNormalizedData(tcc)
@


\subsubsection{Retrieving two-group DEGES/edgeR-normalized data with replicates}
\label{section-3-4-1}


The \Rfunction{getNormalizedData} function can be applied to
the \Robject{TCC} class object after the normalization
factors have been calculated.
The DEGES/edgeR-normalized data can be retrieved as follows.


<<degesEdger_tcc_normed_data>>=
library(TCC)
data(hypoData)  
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@

<<4390011292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@

For comparison, the summary statistics for TMM-normalized
data produced using the original normalization method
(i.e., TMM) in \Rpackage{edgeR} can be obtained as follows.

<<2391104042>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
d <- DGEList(count = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2,
                    mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@

<<4330011292,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@

It is obvious that the summary statistics
(ranging from 
$\Sexpr{sprintf("%.5f", min(buff_1))}$
to
$\Sexpr{sprintf("%.5f", max(buff_1))}$
) from DEGES/edgeR-normalized data are closer to the truth
(i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to 
$\Sexpr{sprintf("%.1f",max(hypoData.median))}$)
than those (ranging from 
$\Sexpr{sprintf("%.5f", min(buff_2))}$
to
$\Sexpr{sprintf("%.5f", max(buff_2))}$
from TMM-normalized data.




\subsubsection{Retrieving two-group DEGES/edgeR-normalized data without replicates (paired)}
\label{retriev_twogroup_paired_withoutreps}

We here analyze the \Robject{hypoData} object provided in \Rpackage{TCC}.
As described in \ref{ss_norm_twogroup_without_reps},
we regard \Robject{hypoData} as a hypothetical paired data.
The DEGES/edgeR-normalized data can be retrieved as follows. 

<<5647309237>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05, paired = TRUE)
normalized.count <- getNormalizedData(tcc)
head(normalized.count, n = 4)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<1674110292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@

For comparison, the summary statistics for TMM-normalized data produced
using the original normalization method (i.e., TMM) in
\Rpackage{edgeR} can be obtained as follows.

<<4387803948>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
nonDEG <- 201:1000
group <- factor(c(1, 1, 1, 2, 2, 2))
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2,
                          mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@

<<1674110992,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@

It is obvious that the summary statistics
(ranging from 
$\Sexpr{sprintf("%.5f", min(buff_1))}$
to
$\Sexpr{sprintf("%.5f", max(buff_1))}$
) from DEGES/edgeR-normalized data are closer to the truth
(i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to 
$\Sexpr{sprintf("%.1f",max(hypoData.median))}$)
than those (ranging from 
$\Sexpr{sprintf("%.5f", min(buff_2))}$
to
$\Sexpr{sprintf("%.5f", max(buff_2))}$)
from TMM-normalized data.



\subsubsection{Retrieving multi-group iDEGES/edgeR-normalized data with replicates}
\label{section-3-4-4}

Here, we analyze another hypothetical three-group count matrix, 
the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. 
It consists of $1,000$ genes and a total of nine columns for testing 
any difference among three groups that each have triplicates. 
Similar to the \Robject{hypoData} object, the first $200$ genes 
are DEGs and the remaining $800$ genes are non-DEGs. 
The basic statistics for the non-DEGs are as follows.

<<hypoData_mg_deg_label>>=
library(TCC)
data(hypoData_mg)
nonDEG <- 201:1000
summary(hypoData_mg[nonDEG, ])
@

From now on, we will display only the median values for simplicity, i.e.,

<<calc_median_of_nondeg_hypoData_mg>>=
apply(hypoData_mg[nonDEG, ], 2, median)
@
<<NormHypoDataMultiGGet, echo = FALSE>>=
hypoData_mg.median <- apply(hypoData_mg[nonDEG, ], 2, median)
@

The iDEGES/edgeR-normalized data can be retrieved as follows.

<<degesEdger_tcc_multigroup_normed_data>>=
library(TCC)
data(hypoData_mg)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<degesEdger_tcc_multigroup_normed_data_hdie, echo = FALSE>>=
normByiDEGES <- range(apply(normalized.count[nonDEG, ], 2, median))
@

For comparison, the summary statistics for TMM-normalized data 
produced using the original normalization method (i.e., TMM) 
in \Rpackage{edgeR} are obtained as follows.

<<tmm_tcc_multigroup_normed_data>>=
library(TCC)
data(hypoData_mg)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
d <- DGEList(tcc$count, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2,
                    mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<tmm_tcc_multigroup_normed_data_hide, echo = FALSE>>=
normByTMM <- range(apply(normalized.count[nonDEG, ], 2, median))
@

It is obvious that the summary statistics 
(ranging from $\Sexpr{sprintf("%.5f", min(normByiDEGES))}$ 
to $\Sexpr{sprintf("%.5f", max(normByiDEGES))}$) 
from iDEGES/edgeR-normalized data are closer to the truth 
(i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData_mg.median))}$ to 
$\Sexpr{sprintf("%.1f", max(hypoData_mg.median))}$) than those 
(ranging from $\Sexpr{sprintf("%.5f", min(normByTMM))}$ 
to $\Sexpr{sprintf("%.5f", max(normByTMM))}$) 
from TMM-normalized data.

\newpage

\section{Differential expression (DE)}
\label{s_differential_expression}

The particular feature of \Rpackage{TCC} is that it calculates 
robust normalization factors. Moreover, end users would like to 
have some accessory functions for subsequent analyses. Here, 
we provide the \Rfunction{estimateDE} function for identifying DEGs. 
Specifically, the function internally uses the corresponding functions 
implemented in three packages.
Similar to the usage in the \Rfunction{calcNormFactors} function 
with the \Rfunction{test.method} argument in \Rpackage{TCC}, 
those DE methods in \Rpackage{edgeR} can be performed 
by using the \Rfunction{estimateDE} function with 
\Rfunction{test.method = "edger"}, and \Rfunction{"deseq2"}, respectively.
Here, we show some  examples of DE analysis for two-group data with replicates 
(\ref{section-4-1}), two-group data without replicates (\ref{section-4-2}), 
paired two-group data without replicates (\ref{ss_de_2gnr_p}), 
and multi-group data with replicates (\ref{section-4-3}).

\subsection{DE analysis for two-group data with replicates}
\label{section-4-1}

\subsubsection{edgeR coupled with iDEGES/edgeR normalization}
\label{section-4-1-1}

We give a procedure for DE analysis using the exact test implemented 
in \Rpackage{edgeR} together with iDEGES/edgeR normalization 
factors (i.e., the iDEGES/edgeR-edgeR combination) 
for the hypothetical two-group count data with replicates (i.e., 
the \Robject{hypoData} object). If the user wants to determine the 
genes having an FDR threshold of $< 10\%$ as DEGs, one can do as follows.

<<idegesEdger_tcc>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
@

The results of the DE analysis are stored in the \Rpackage{TCC} 
class object. The summary statistics for top-ranked genes can 
be retrieved by using the \Rfunction{getResult} function.

<<get_estimated_results>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@

The DE results can be broken down as follows.

<<summary_of_estimated_deg>>=
table(tcc$estimatedDEG) 
@

This means $\Sexpr{sum(tcc$estimatedDEG == FALSE)}$ non-DEGs 
and $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs satisfy FDR $< 0.1$.
The \Rfunction{plot} function generates an M-A plot, 
where "M" indicates the log-ratio 
(i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" 
indicates average read count 
(i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), 
from the normalized count data. 
The magenta points indicate the identified DEGs at FDR $< 0.1$.

\begin{center}
<<maplot_of_estimated_result, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}

Since we know the true information for \Robject{hypoData}
(i.e., 200 DEGs and 800 non-DEGs),
we can calculate the area under the ROC curve
(i.e., AUC; $0 \le$ AUC $\le 1$)
between the ranked gene list and the truth and thereby evaluate
the sensitivity and specificity simultaneously.
A well-ranked gene list should have a high AUC value
(i.e., high sensitivity and specificity).

<<3498757592>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

For comparison, the DE results from the original
procedure in \Rpackage{edgeR} can be obtained as follows.

<<5901287562>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
design <- model.matrix(~ as.factor(group))
d <- DGEList(count = hypoData, group = group)
d <- calcNormFactors(d)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
out <- glmQLFTest(fit, coef = 2)
result <- as.data.frame(topTags(out, n = nrow(hypoData), sort.by = "PValue"))
head(result)
@

This is the same as

<<1027518347>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@


The results of the DE analysis are stored in the \Robject{TCC} class object.
The summary statistics for top-ranked genes can be retrieved by using the
\Rfunction{getResult} function.

<<5029042481>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@

The DE results can be broken down as follows.

<<7512913498>>=
table(tcc$estimatedDEG)
@
<<7512013498, echo = FALSE>>=
tb <- table(tcc$estimatedDEG)
@

This means $\Sexpr{tb[["0"]]}$ non-DEGs and $\Sexpr{tb[["1"]]}$ DEGs satisfy FDR $< 0.1$.
The \Rfunction{plot} function generates an M-A plot, where "M" indicates the
log-ratio 
(i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$)
and "A" indicates average read count
(i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), 
from the normalized count data.
The magenta points indicate the identified DEGs at FDR $< 0.1$.

\begin{center}
<<3489400103, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}

Since we know the true information for \Robject{hypoData}
(i.e., 200 DEGs and 800 non-DEGs),
we can calculate the area under the ROC curve
(i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the
truth and thereby evaluate the sensitivity and specificity simultaneously.
A well-ranked gene list should have a high AUC value
(i.e., high sensitivity and specificity).

<<4012399910>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@



\subsubsection{DESeq2 coupled with iDEGES/DESeq2 normalization}
\label{section-4-1-2}

For comparison, the DE results from the original procedure in
\Rpackage{DESeq2} can be obtained as follows.

<<4930011190>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl,
                              design = design)
dds <- estimateSizeFactors(dds)
sizeFactors(dds) <- sizeFactors(dds) / mean(sizeFactors(dds))
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
result <- results(dds)
head(result[order(result$pvalue),])
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
result$pvalue[is.na(result$pvalue)] <- 1
AUC(rocdemo.sca(truth = truth, data = -rank(result$pvalue)))
@

This is the same as

<<4390023901>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", iteration = 0)
tcc <- estimateDE(tcc, test.method = "deseq2", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
tcc$norm.factors
@
<<4090911011, echo = FALSE>>=
buff_1 <- AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

As described in \ref{s_retr_norm_data},
the size factors termed in \Rpackage{DESeq2}
is comparable to the {\it normalized} effective library sizes termed
in \Rpackage{TCC} and \Rpackage{edgeR}.
The effective library size in \Rpackage{edgeR} is calculated as the
library size multiplied by the normalization factor.
The normalization factors and effective library sizes in \Rpackage{DESeq2}
can be retrieved as follows.

<<0309001992>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl,
                              design = design)
dds <- estimateSizeFactors(dds)
size.factors <- sizeFactors(dds)
size.factors
hoge <- size.factors / colSums(hypoData)
norm.factors <- hoge / mean(hoge)
norm.factors
ef.libsizes <- norm.factors * colSums(hypoData)
ef.libsizes
@

Note that the following commands should be the
simplest procedure provided in \Rpackage{DESeq2}.

<<4328593702>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl,
                              design = design)
dds <- estimateSizeFactors(dds)
sizeFactors(dds) <- sizeFactors(dds) / mean(sizeFactors(dds))
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
result <- results(dds)
head(result[order(result$pvalue),])
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
result$pvalue[is.na(result$pvalue)] <- 1
AUC(rocdemo.sca(truth = truth, data = -rank(result$pvalue)))
@


\subsection{DE analysis for two-group data without replicates (paired)}
\label{ss_de_2gnr_p}

\Rpackage{edgeR} and \Rpackage{DESeq2} employs
generalized linear models (GLMs) which can apply to detect DEGs from
paired two-group count data. The analysis for paired samples can be
performed by indicating (1) the pair information when constructing
the \Robject{TCC} class object and (2) \Robject{paired = TRUE} when
performing the \Rfunction{calcNormFactors} and \Rfunction{estimateDE}
functions.
For analyzing paired data, we here use the hypothetical count data
(\Robject{hypoData}; for details see \ref{ss_reading_the_count_data})
by changing the label information, i.e.,

<<0285012872>>=
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
head(hypoData)
@

This count data consists of three individuals (or sibships),
dogA, dogB, and dogC. "T" and "N" indicate tumor and normal tissues,
respectively.


We give a procedure for DE analysis using the likelihood ratio
test for GLMs implemented in \Rpackage{edgeR} together with
iDEGES/edgeR normalization factors
(i.e., the iDEGES/edgeR-edgeR combination) for the paired
two-group count data without replicates (i.e., the above
\Robject{hypoData} object). If the user wants to determine
the genes having an FDR threshold of $< 10$\% as DEGs,
one can do as follows.

<<4891209302>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05, paired = TRUE)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, paired = TRUE)
@

The results of the DE analysis are stored in the \Robject{TCC} class object.
The summary statistics for top-ranked genes can be retrieved by using the
\Rfunction{getResult} function.

<<4390911012>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@

The DE results can be broken down as follows.

<<3909884931>>=
table(tcc$estimatedDEG)
@
<<3934884932, echo = FALSE>>=
tb <- table(tcc$estimatedDEG)
@

This means $\Sexpr{as.numeric(tb[["0"]])}$ non-DEGs and
$\Sexpr{as.numeric(tb[["1"]])}$ DEGs satisfy FDR $< 0.1$.
The \Rfunction{plot} function generates an M-A plot,
where "M" indicates the log-ratio
(i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$)
and "A" indicates average read count
(i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), 
from the normalized count data.
The magenta points indicate the identified DEGs at FDR $< 0.1$.

\begin{center}
<<3482340103, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}

Since we know the true information for \Robject{hypoData}
(i.e., 200 DEGs and 800 non-DEGs), we can calculate the area
under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the
ranked gene list and the truth and thereby evaluate the
sensitivity and specificity simultaneously.
A well-ranked gene list should have a high AUC value
(i.e., high sensitivity and specificity).

<<3490930192>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

For comparison, the DE results from the original procedure in
\Rpackage{edgeR} can be obtained as follows.

<<8402288128>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- factor(c(1, 1, 1, 2, 2, 2))
pair <- factor(c(1, 2, 3, 1, 2, 3))
design <- model.matrix(~ group + pair)
coef <- 2
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
out <- glmQLFTest(fit, coef=2)

topTags(out, n = 6)
p.value <- out$table$PValue
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -rank(p.value)))
@

This is the same as

<<4209576561>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0, paired = TRUE)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, paired = TRUE)
result <- getResult(tcc, sort = TRUE)
head(result)
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@


\subsection{DE analysis for multi-group data with replicates}
\label{section-4-3}

Here, we give three examples of DE analysis coupled with 
DEGES/edgeR normalization for the hypothetical three-group 
data with replicates, i.e., the \Robject{hypoData\_mg} object. 
The use of the DEGES/edgeR normalization 
factors is simply for reducing the computation time.



\subsubsection{edgeR coupled with DEGES/edgeR normalization}
\label{section-4-3-2}

The exact test implemented in \Rpackage{edgeR} after executing 
the DEGES/edgeR normalization 
(i.e., the DEGES/edgeR-edgeR combination)
can be performed as follows.

<<degerTbt_tcc_edger_multigroup_summary>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
###  Normalization  ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
###  DE analysis  ###
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@




\section{Generation of simulation data}

\subsection{Introduction and basic usage}
\label{section-5-1}

As demonstrated in our previous study (Kadota et al., 2012), 
the DEGES-based normalization methods implemented in \Rpackage{TCC} 
theoretically outperform the other normalization methods when the numbers 
of DEGs (G1 vs. G2) in the tag count data are biased. However, 
it is difficult to determine whether the up- and down-regulated DEGs in 
one of the groups are actually biased in their number when analyzing 
real data (Dillies et al., 2012). 
This means we have to evaluate the 
potential performance of our DEGES-based methods using mainly simulation 
data. The \Rfunction{simulateReadCounts} function generates simulation 
data under various conditions. This function can generate simulation 
data analyzed in the TbT paper (Kadota et al., 2012), 
and that means it enables other 
researchers to compare the methods they develop with our DEGES-based 
methods. For example, the \Robject{hypoData} object, a hypothetical 
count dataset provided in \Rpackage{TCC}, was generated by using this 
function. The output of the \Rfunction{simulateReadCounts} function is stored 
as a \Rpackage{TCC} class object and is therefore ready-to-analyze.

Note that different trials of simulation analysis generally yield 
different count data even under the same simulation conditions. 
We can call the \Rfunction{set.seed} 
function in order to obtain reproducible results 
(i.e., the \Robject{tcc\$count}) with the 
\Rfunction{simulateReadCounts} function. 

<<generate_simulation_count_data>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2, 
                              DEG.assign = c(0.9, 0.1),
                              DEG.foldchange = c(4, 4), 
                              replicates = c(3, 3))
dim(tcc$count)
head(tcc$count)
tcc$group
@

The simulation conditions for comparing two groups (G1 vs. G2) with 
biological replicates are as follows: (i) the number of genes is 
$1,000$ (i.e., \Robject{Ngene = 1000}), 
(ii) the first $20\%$ of genes are DEGs 
(\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are 
up-regulated in G1, and the remaining $10\%$ are up-regulated in 
G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold 
in both groups (\Robject{DEG.foldchange = c(4, 4)}), 
and (v) there are a total of six samples (three biological 
replicates for G1 and three biological replicates for G2) 
(\Robject{replicates = c(3, 3)}). 
The variance of the NB distribution can be modeled as
 $V = \mu + \phi\mu^{2}$. 
The empirical distribution of the read counts for producing 
the mean ($\mu$) and dispersion ($\phi$) parameters of the 
model was obtained from {\it Arabidopsis} data (three biological 
replicates for each of the treated and non-treated groups) 
in \Rpackage{NBPSeq} (Di et al., 2011).

The \Robject{tcc\$count} object is essentially the same as 
the \Robject{hypoData} object of \Rpackage{TCC}. 
The information about the simulation conditions can be viewed as follows.

<<str_of_simulation_field>>=
str(tcc$simulation)
@

Specifically, the entries for $0$ and $1$ in the 
\Robject{tcc\$simulation\$trueDEG} 
object are for non-DEGs and DEGs respectively.
The breakdowns for individual entries are the same as 
stated above: $800$ entries are non-DEGs, $200$ entries are DEGs.

<<tale_of_simulation_trueDEG>>=
table(tcc$simulation$trueDEG)
@

This information can be used to evaluate the performance of the DEGES-based 
normalization methods in terms of the sensitivity and specificity of the 
results of their DE analysis. A good normalization method coupled with a 
DE method such as the exact test 
(Robinson and Smyth, 2008) and the 
empirical Bayes (Hardcastle and Kelly, 2010) should produce well-ranked 
gene lists in which the true DEGs are top-ranked and non-DEGs are 
bottom-ranked when all genes are ranked according to the degree of DE. 
The ranked gene list after performing 
the DEGES/edgeR-edgeR 
combination can be obtained as follows.

<<analysis_simulation_data>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@

We can now calculate the area under the \Rpackage{ROC} curve 
(i.e., AUC; $0 \le $AUC$ \le 1$) between the ranked gene list and the 
truth (i.e., DEGs or non-DEGs) and thereby evaluate the sensitivity 
and specificity simultaneously. A well-ranked gene list should have 
a high AUC value (i.e., high sensitivity and specificity). 
The \Rfunction{calcAUCValue} function calculates the AUC value 
based on the information stored in the \Rpackage{TCC} class object.

<<calc_AUC_of_simulation_data>>=
calcAUCValue(tcc)
@
<<calc_AUC_of_simulation_data_hide, echo = FALSE>>=
auc.degesedger <- calcAUCValue(tcc)
@

This is essentially the same as

<<calc_AUC_of_simulation_data_ROC>>=
AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0),
                data = -tcc$stat$rank))
@

The following classic \Rpackage{edgeR} procedure 
(i.e., the TMM-edgeR combination) 
make it clear that  the DEGES-based normalization method (i.e., 
the DEGES/edgeR pipeline) outperforms the default 
normalization method (i.e., TMM) implemented in \Rpackage{edgeR}.

<<calc_AUC_of_tmm_tcc>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@

The following is an alternative procedure for \Rpackage{edgeR} users.

<<calc_AUC_of_tmm_org>>=
design <- model.matrix(~ as.factor(tcc$group$group))
d <- DGEList(counts = tcc$count, group = tcc$group$group)
d <- calcNormFactors(d)
d$samples$norm.factors <- d$samples$norm.factors / 
                          mean(d$samples$norm.factors)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
result <- glmQLFTest(fit, coef = 2)
result$table$PValue[is.na(result$table$PValue)] <- 1
AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0),
                data = -rank(result$table$PValue)))
@




\subsection{Multi-group data with and without replicates}
\label{section-5-3}

The \Rfunction{simulateReadCounts} function can generate simulation data 
with a more complex design. First, we generate a dataset consisting 
of three groups. The simulation conditions for this dataset are as 
follows: (i) the number of genes is $1,000$ 
(i.e., \Robject{Ngene = 1000}), 
(ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.3}), 
(iii) the breakdowns of the up-regulated DEGs are respectively 
$70\%$, $20\%$, and $10\%$ in Groups 1-3 
(\Robject{DEG.assign = c(0.7, 0.2, 0.1)}), 
(iv) the levels of DE are $3$-, $10$-, and $6$-fold in individual 
groups (\Robject{DEG.foldchange = c(3, 10, 6)}), 
and (v) there are a total of nine libraries (2, 4, and 3 
replicates for Groups 1-3) (\Robject{replicates = c(2, 4, 3)}).

<<generate_simulation_data_multigroup>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.3,
                         DEG.assign = c(0.7, 0.2, 0.1),
                         DEG.foldchange = c(3, 10, 6),
                         replicates = c(2, 4, 3))
dim(tcc$count)
tcc$group
head(tcc$count)
@

The pseudo-color image for the generated simulation data regarding 
the DEGs can be obtained from the \Rfunction{plotFCPseudocolor} function. 
The right bar (from white to magenta) indicates the degree of 
fold-change (FC). As expected, it can be seen that the first 
$210$, $60$, and $30$ genes are up-regulated in G1, G2, and G3, 
respectively.

\begin{center}
<<plot_fc_pseudocolor_multigroup, fig = TRUE, size = 0.8, png = TRUE, pdf = FALSE, eps = FALSE>>=
plotFCPseudocolor(tcc)
@
\end{center}

Now let us  see how the DEGES/edgeR-edgeR combination
with the original edgeR-edgeR combination performs.
First we calculate the AUC value for the ranked gene list obtained from the
DEGES/edgeR-edgeR combination.

<<degesEdger_edger_tcc_multigroup>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@

Next, we calculate the corresponding value using the original \Rpackage{edgeR}
procedure for single factor experimental design
(i.e., the edgeR-edgeR combination).

<<tmm_edger_tcc_multigroup>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@

It can be seen that the DEGES/edgeR-edgeR combination
outperforms the original \Rpackage{edgeR} procedure under the given simulation
conditions. Note that the \Robject{test.method} argument will be ignored when
\Robject{iteration = 0} is specified.

Next, let us generate another dataset consisting of a total of eight groups. 
The simulation conditions for this dataset are as follows: 
(i) the number of genes is $10,000$ 
(i.e., \Robject{Ngene = 10000}), 
(ii) the first $34\%$ of genes are DEGs (\Robject{PDEG = 0.34}), 
(iii) the breakdowns of the up-regulated DEGs are respectively 
$10\%$, $30\%$, $5\%$, $10\%$, $5\%$, $21\%$, $9\%$, and $10\%$ in Groups 1-8 
(\Robject{DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1)}), 
(iv) the levels of DE are $3.1$-, $13$-, $2$-, $1.5$-, $9$-, $5.6$-, $4$-, 
and $2$-fold in individual groups 
(\Robject{DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2)}), 
and (v) there are a total of nine libraries (except for G3, 
none of the groups have replicates) 
(\Robject{replicates = c(1, 1, 2, 1, 1, 1, 1, 1)}).

\begin{center}
<<plot_fc_pseudocolor_multigroup_2, fig = TRUE, size = 0.8, png = TRUE, pdf = FALSE, eps = FALSE>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.34,
               DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1),
               DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2),
               replicates = c(1, 1, 2, 1, 1, 1, 1, 1))
dim(tcc$count)
tcc$group
head(tcc$count)
plotFCPseudocolor(tcc)
@
\end{center}


This kind of simulation data may be useful for evaluating methods 
aimed at identifying tissue-specific (or tissue-selective) genes.

\subsection{Multi-factor data}
\label{section-5-3-p1}

The \Rfunction{simulateReadCounts} function can also generate simulation
data in multi-factor experimental design. Different from above single-factor
experimental design, the \Robject{group} argument should be used instead of
\Robject{replicates} for specifying sample conditions (or factors) when
generating simulation data in multi-factor design.
In relation to the \Robject{group} specification,
the \Robject{DEG.foldchange} argument should also be specified as a data
frame object.

We generate a dataset consisting of two factors for comparing
(i) two Groups (i.e., "WT" vs. "KO") as the first factor,
at (ii) two time points (i.e., "1d" vs. "2d") as the second factor,
with all samples obtained from independent subjects.
There are a total of four conditions
("WT\_1d", "WT\_2d", "KO\_1d", and "KO\_2d") each of which has two
biological replicates, comprising a total of eight samples.
The \Robject{group} argument for this experimental design can be described as follows:

<<generate_simulation_data_multifactor_group>>=
group <- data.frame(
    GROUP = c("WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO"),
    TIME = c("1d", "1d", "2d", "2d", "1d", "1d", "2d", "2d")
)
@

Next, we design the number of types of DEGs and the levels of fold-change
by the \Robject{DEG.foldchange} argument. We here introduce three types of
DEGs: (a) 2-fold up-regulation in the first four samples (i.e., "WT"),
(b) 3-fold up-regulation in the last four samples (i.e., "KO"),
and (c) 2-fold down-regulation at "2d" in "WT" and 4-fold up-regulation
at "2d" in "KO". This implies that the first two types of DEGs are related
to the first factor (i.e., "WT" vs. "KO") and the third type of DEG is related
to the second factor (i.e., "1d" vs. "2d").

<<generate_simulation_data_multifactor_fc>>=
DEG.foldchange <- data.frame(
    FACTOR1.1 = c(2, 2, 2, 2, 1, 1, 1, 1),
    FACTOR1.2 = c(1, 1, 1, 1, 3, 3, 3, 3),
    FACTOR2 = c(1, 1, 0.5, 0.5, 1, 1, 4, 4)
)
@

The other simulation conditions for this dataset are as follows:
(1) the number of gene is 1,000 (i.e., \Robject{Ngene = 1000}),
(2) the first 20\% of genes are DEGs (i.e., \Robject{PDEG = 0.2}),
and (3) the breakdowns of the three types of DEGs are 50\%, 20\%,
and 30\% (i.e., \Robject{DEG.assign = c(0.5, 0.2, 0.3)}).

<<generate_simulation_data_multifactor>>=
set.seed(1000)
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.2,
               DEG.assign = c(0.5, 0.2, 0.3),
               DEG.foldchange = DEG.foldchange,
               group = group)
@

Since the first six rows in the dataset corresponds to the first type
of DEGs, we can see the 2-fold up-regulation in the first four columns
(i.e., WT-related samples) compared to the last four columns
(i.e., KO-related samples).

\begin{center}
<<plot_fc_pseudocolor_multifactor, fig = TRUE, size = 0.8, png = TRUE, pdf = FALSE, eps = FALSE>>=
head(tcc$count)
tcc$group
plotFCPseudocolor(tcc)
@
\end{center}


\subsection{Other utilities}
\label{section-5-4}

Recall that the simulation framework can handle different levels 
of DE for DEGs in individual groups, and the shape of the distribution 
for these DEGs is the same as that of non-DEGs. Let us confirm those 
distributions by introducing more drastic simulation conditions for 
comparing two groups (G1 vs. G2) with biological replicates; i.e., 
(i) the number of genes is $20,000$ (i.e., \Robject{Ngene = 20000}), 
(ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.30}), 
(iii) the first $85\%$ of the DEGs are up-regulated in G1 and the remaining 
$15\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.85, 0.15)}), 
(iv) the levels of DE are eight-fold in G1 and sixteen-fold in 
G2 (\Robject{DEG.foldchange = c(8, 16)}), and (v) there are a total of 
four samples (two biological replicates for G1 and two biological 
replicates for G2) (\Robject{replicates = c(2, 2)}).

<<simulate_data_1000>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 20000, PDEG = 0.30, 
               DEG.assign = c(0.85, 0.15),
               DEG.foldchange = c(8, 16), 
               replicates = c(2, 2))
head(tcc$count)
@

An M-A plot for the simulation data can be viewed as follows; 
the points for DEGs are colored red and the non-DEGs are colored black:

\begin{center}
<<maplot_simulate_data_1000, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}

This plot is generated from simulation data that has been scaled 
in such a way that the library sizes of each sample are the same 
as the mean library size of the original data. That is,

<<norm_simulate_data_1000>>=
normalized.count <- getNormalizedData(tcc)
colSums(normalized.count)
colSums(tcc$count)
mean(colSums(tcc$count))
@

<<maplot_normed_simulate_data_1000_hide, echo = FALSE, fig = FALSE>>=
xy <- plot(tcc)
isnot.na <- as.logical(xy[, 1] != min(xy[, 1]))
upG1 <- as.numeric(xy$m.value < 0)
upG2 <- as.numeric(xy$m.value > 0)
median.G1 <- median(xy[((tcc$simulation$trueDEG == 1) & isnot.na) & upG1, 2])
median.G2 <- median(xy[((tcc$simulation$trueDEG == 1) & isnot.na) & upG2, 2])
median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2])
@

The summary statistics for non-DEGs and up-regulated DEGs in G1 and G2 
are upshifted compared with the original intentions of the user (i.e., 
respective M values of $0$, $-3$, and $4$ for non-DEGs and up-regulated 
DEGs in G1 and G2). Indeed, the median values, indicated as horizontal lines, 
are respectively $\Sexpr{sprintf("%.3f", median.nonDEG)}$, 
$\Sexpr{sprintf("%.3f", median.G1)}$, 
and $\Sexpr{sprintf("%.3f", median.G2)}$ for non-DEGs 
and up-regulated DEGs in G1 and G2.

\begin{center}
<<maplot_normed_simulate_data_1000, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc, median.lines = TRUE) 
@
\end{center}

<<maplot_normed_simulate_data_1000_2_hide, echo = FALSE, fig = FALSE>>=
tcc <- calcNormFactors(tcc, "tmm", "edger", iteration = 3, 
                       FDR = 0.1, floorPDEG = 0.05)
xy <- plot(tcc)
isnot.na <- as.logical(xy[, 1] != min(xy[, 1]))
median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2])
@

These upshifted M values for non-DEGs can be modified after performing the 
iDEGES/edgeR normalization, e.g., the median M value 
($= \Sexpr{sprintf("%.3f", median.nonDEG)}$) 
for non-DEGs based on the iDEGES/edgeR-normalized 
data is nearly zero. 

\begin{center}
<<maplot_normed_simulate_data_1000_2, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", 
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
plot(tcc, median.line = TRUE)
@
\end{center}

The comparison of those values obtained from different normalization 
methods might be another evaluation metric.

\newpage

\section{Session info}

<<2390118599>>=
sessionInfo()
@

\newpage

\section{References}
\renewcommand{\refname}{}
\renewcommand\refname{\vskip -1cm}

\begin{thebibliography}{99}
\bibitem{anders}
Anders S, Huber W. 2010.
Differential expression analysis for sequence count data.
Genome Biol 11: R106.
\bibitem{lovemi}
Love MI, Huber W, Anders S. 2014.
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
Genome Biol 15: R550.
\bibitem{di}
Di Y, Schafer DW, Cumbie JS, Chang JH. 2011.
The NBP negative binomial model for assessing differential
gene expression from RNA-Seq.
Stat Appl Genet Mol Biol 10.
\bibitem{dillies}
Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N,
Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L,
Lalo\"e D, Le Gall C, Scha\"effer B, Le Crom S, Guedj M, Jaffr\'ezic F;
French StatOmique Consortium. 2013.
A comprehensive evaluation of normalization methods for Illumina
high-throughput RNA sequencing data analysis.
Brief Bioinform 14: 671-683.
\bibitem{glaus}
Glaus P, Honkela A, and Rattray M. 2012.
Identifying differentially expressed transcripts from RNA-seq
data with biological variation.
Bioinformatics 28: 1721-1728.
\bibitem{wad}
Kadota K, Nakai Y, Shimizu K. 2008.
A weighted average difference method for detecting differentially expressed
genes from microarray data.
Algorithms Mol Biol 3: 8.
\bibitem{aicentropy}
Kadota K, Nishimura SI, Bono H, Nakamura S, Hayashizaki Y, Okazaki Y,
Takahashi K. 2003.
Detection of genes with tissue-specific expression patterns using
Akaike's Information Criterion (AIC) procedure.
Physiol Genomics 12: 251-259.
\bibitem{kadota}
Kadota K, Nishiyama T, and Shimizu K. 2012.
A normalization strategy for comparing tag count data.
Algorithms Mol Biol 7: 5.
\bibitem{roku}
Kadota K, Ye J, Nakai Y, Terada T, Shimizu K. 2006.
ROKU: a novel method for identification of tissue-specific genes. 
BMC Bioinformatics 7: 294.
\bibitem{mccarthy}
McCarthy DJ, Chen Y, Smyth GK. 2012.
Differential expression analysis of multifactor RNA-Seq experiments
with respect to biological variation.
Nucleic Acids Res 40: 4288-4297.
\bibitem{robinson}
Robinson MD, McCarthy DJ, Smyth GK. 2010.
edgeR: A Bioconductor package for differential expression analysis
of digital gene expression data.
Bioinformatics 26: 139-140.
\bibitem{robinson2}
Robinson MD and Oshlack A. 2010.
A scaling normalization method for differential expression analysis
of RNA-seq data. 
Genome Biol 11: R25.
\bibitem{robinson3}
Robinson MD and Smyth GK. 2008.
Small-sample estimation of negative binomial dispersion,
with applications to SAGE data.
Biostatistics 9: 321-332.
\bibitem{sun}
Sun J, Nishiyama T, Shimizu K, and Kadota K.
TCC: an R package for comparing tag count data with robust
normalization strategies.
BMC Bioinformatics 14: 219.
\bibitem{ueda}
Ueda T. 1996.
Simple method for the detection of outliers.
Japanese J Appl Stat 25: 17-26.
\end{thebibliography}
\end{document}