% \VignetteEngine{knitr::knitr}
% \VignetteIndexEntry{ccrepe}
\documentclass{article}

%% -------------------------
%% Pre-amble:
%% -------------------------
\usepackage{float}
\usepackage[nonamebreak,square]{natbib}
\usepackage{amsmath}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex(width=78, use.unsrturl=FALSE)
@

<<echo=FALSE>>=
library(ccrepe)
@

\title{CCREPE: Compositionality Corrected by PErmutation and REnormalization}
\author{Emma Schwager, George Weingart, Craig Bielski, Curtis Huttenhower}
%% -------------------------
%% End pre-amble
%% -------------------------

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}

\Biocpkg{ccrepe} is a package for the analysis of sparse compositional data.
Specifically, it determines the significance of association between features
in a composition, using any similarity measure (e.g.\ Pearson correlation,
Spearman correlation). The CCREPE methodology stands for Compositionality
Corrected by REnormalization and PErmutation, as detailed below. The package
also provides a novel similarity measure, the N-dimensional checkerboard
score (NC-score), particularly appropriate to compositions derived from
microbial community sequencing data. The result is p-values and false
discovery rate q-values corrected for the effects of compositionality. The
package contains two functions, \Rfunction{ccrepe} and \Rfunction{nc.score},
and is maintained by the Huttenhower lab
(\email{ccrepe-users@googlegroups.com}).

%---------------------------------------------------------
\section{ccrepe}
%---------------------------------------------------------

\Rfunction{ccrepe} is the main package function. It calculates
compositionality-corrected p-values and q-values for a user-selected
similarity measure, operating on either one or two input matrices. If given
one matrix, all features (columns) in the matrix are compared to each other
using the selected similarity measure. If given two matrices, each feature in
the first is compared against all features in the second.

\subsection{General functionality}

Compositional data induce spurious correlations between features because the
values are not independent: they must sum to a fixed total. CCREPE corrects
for this when determining the significance of a similarity measure for each
feature pair using two main steps, permutation/renormalization and
bootstrapping. First, given two features to compare, CCREPE generates a null
distribution of the similarity expected due to compositionality alone by
iteratively permuting one feature, renormalizing all samples in the
composition to their previous sum, and computing the resulting similarity
measures. Second, CCREPE bootstraps over sample subsets in order to assess
confidence in the ``true'' similarity measure. Finally, the two resulting
distributions are compared using a pooled-variance Z-test to give a
compositionality-corrected p-value. False discovery rate q-values are
additionally calculated using the Benjamini-Hochberg-Yekutieli procedure. For
greater detail, see \citet{faust2012microbial} and
\citet{schwager2012unpublished}.\\
CCREPE employs several filtering steps before the data are processed. It
removes any missing subjects using \Rcode{na.omit}: in the two-dataset case,
any subjects missing in \emph{either} dataset will be removed. Any subjects
or features which are all zero are removed as well: an all-zero subject
cannot be normalized (its sum is 0) and an all-zero feature has standard
deviation 0 (in addition to being uninteresting biologically).
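To make the procedure concrete, the following is a minimal sketch for a
single feature pair; it is not the package's internal implementation. The
helper name \Rcode{ccrepe.sketch} is hypothetical, Spearman correlation
stands in for an arbitrary similarity measure, and the pooled-variance Z-test
is shown in a simplified form.

<<eval=FALSE>>=
# Illustrative sketch only; use ccrepe() for real analyses.
ccrepe.sketch <- function(dat, i, j, iterations=1000) {
    row.sums <- rowSums(dat)
    # Null distribution: permute feature i, renormalize each sample (row)
    # to its previous sum, and recompute the similarity measure.
    null.scores <- replicate(iterations, {
        perm <- dat
        perm[,i] <- sample(perm[,i])
        perm <- perm / rowSums(perm) * row.sums
        cor(perm[,i], perm[,j], method="spearman")
    })
    # Bootstrap distribution: resample subjects to assess confidence in the
    # observed similarity measure.
    boot.scores <- replicate(iterations, {
        idx <- sample(nrow(dat), replace=TRUE)
        cor(dat[idx,i], dat[idx,j], method="spearman")
    })
    # Pooled-variance Z-test comparing the two distributions.
    z <- (mean(boot.scores) - mean(null.scores)) /
        sqrt(var(boot.scores) + var(null.scores))
    2*pnorm(-abs(z))
}
@

\Rfunction{ccrepe} carries out these steps itself (together with the
filtering described above) for every requested feature pair; the reported
q-values correspond to a Benjamini-Hochberg-Yekutieli adjustment of the
resulting p-values, analogous to \Rcode{p.adjust(p, method="BY")}.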
\subsection{Arguments}

\begin{description}
\setlength{\itemsep}{1em}
\item[\Rcode{x}] First \Rclass{dataframe} or \Rclass{matrix} containing
  relative abundances. Columns are features, rows are samples. Rows should
  therefore sum to a constant. Row names are used for identification if
  present.
\item[\Rcode{y}] Second \Rclass{dataframe} or \Rclass{matrix} (optional)
  containing relative abundances. Columns are features, rows are samples.
  Rows should therefore sum to a constant. If both \Rcode{x} and \Rcode{y}
  are specified, they will be merged by row names. If no row names are
  specified for either or both datasets, the default is to merge by row
  number.
\item[\Rcode{sim.score}] Similarity measure, such as \Rfunction{cor} or
  \Rfunction{nc.score}. This can be either an existing R function or
  user-defined. If the latter, certain properties should be satisfied as
  detailed below (also see the examples). The default similarity measure is
  Spearman correlation. A user-defined similarity measure should mimic the
  interface of \Rfunction{cor}:
  \begin{enumerate}
  \item Take either two \Rclass{vector} inputs or one \Rclass{matrix} or
    \Rclass{dataframe} input.
  \item In the case of two inputs, return a single number.
  \item In the case of one input, return a matrix in which the
    (\Rcode{i},\Rcode{j})th entry is the similarity score for column
    \Rcode{i} and column \Rcode{j} in the original matrix.
  \item The resulting matrix (in the case of one input) must be symmetric.
  \item The inputs must be named \Rcode{x} and \Rcode{y}.
  \end{enumerate}
\item[\Rcode{sim.score.args}] An optional list of arguments for the
  measurement function. When given, they are passed to the \Rcode{sim.score}
  function directly. For example, in the case of \Rfunction{cor}, the
  following would be acceptable:
<<eval=FALSE>>=
sim.score.args = list(method="spearman", use="complete.obs")
@
\item[\Rcode{min.subj}] Minimum number (count) of samples that must be
  non-missing in order to apply the similarity measure. This is to ensure
  that there are sufficient samples to perform a bootstrap (default: 20).
\item[\Rcode{iterations}] The number of iterations for both the bootstrap and
  permutation calculations (default: 1000).
\item[\Rcode{subset.cols.x}] A vector of column indices from \Rcode{x}
  indicating which features to compare.
\item[\Rcode{subset.cols.y}] A vector of column indices from \Rcode{y}
  indicating which features to compare.
\item[\Rcode{errthresh}] If a feature has a proportion of zeros greater than
  \Rcode{errthresh}$^{1/n}$, where $n$ is the number of samples, that feature
  is excluded.
\item[\Rcode{verbose}] If \Rcode{TRUE}, print periodic progress of the
  algorithm through the dataset(s), as well as more detailed debugging output
  (default: \Rcode{FALSE}).
\item[\Rcode{iterations.gap}] If \Rcode{verbose=TRUE}, the number of
  iterations between issuing status messages (default: 100).
\item[\Rcode{distributions}] Optional output file to which a detailed log of
  all intermediate permutation and renormalization distributions is written.
\item[\Rcode{compare.within.x}] A boolean value indicating whether to do
  comparisons given by taking all subsets of size 2 from
  \Rcode{subset.cols.x}, or to do comparisons given by taking all possible
  combinations of \Rcode{subset.cols.x} and \Rcode{subset.cols.y}. If
  \Rcode{TRUE} but \Rcode{subset.cols.y=NA}, returns all comparisons
  involving any features in \Rcode{subset.cols.x}. This argument is only
  used when \Rcode{y=NA}.
\item[\Rcode{concurrent.output}] Optional output file to which each
  comparison will be written as it is calculated.
\item[\Rcode{make.output.table}] A boolean value indicating whether to include
  table-formatted output.
\end{description}

\subsection{Output}

\Rfunction{ccrepe} returns a \Rclass{list} containing both the calculation
results and the parameters used:

\begin{description}
\setlength{\itemsep}{1em}
\item[\Rcode{sim.score}] \Rclass{matrix} of similarity scores for all
  requested comparisons. In the one-dataset case, the
  (\Rcode{i},\Rcode{j})th element is the similarity score of column
  \Rcode{i} (or the \Rcode{i}th column of \Rcode{subset.cols.x}) and column
  \Rcode{j} (or the \Rcode{j}th column of \Rcode{subset.cols.x}). In the
  two-dataset case, it is the similarity score of column \Rcode{i} (or the
  \Rcode{i}th column of \Rcode{subset.cols.x}) in dataset \Rcode{x} and
  column \Rcode{j} (or the \Rcode{j}th column of \Rcode{subset.cols.y}) in
  dataset \Rcode{y}.
\item[\Rcode{p.values}] \Rclass{matrix} of the corrected p-values for all
  requested comparisons. The (\Rcode{i},\Rcode{j})th element corresponds to
  the p-value of the (\Rcode{i},\Rcode{j})th element of \Rcode{sim.score}.
\item[\Rcode{q.values}] \Rclass{matrix} of the Benjamini-Hochberg-Yekutieli
  corrected p-values (false discovery rate q-values). The
  (\Rcode{i},\Rcode{j})th element corresponds to the q-value of the
  (\Rcode{i},\Rcode{j})th element of \Rcode{sim.score}.
\item[\Rcode{z.stat}] \Rclass{matrix} of the z-statistics used in generating
  the p-values for all requested comparisons. The (\Rcode{i},\Rcode{j})th
  element corresponds to the z-statistic generating the
  (\Rcode{i},\Rcode{j})th element of \Rcode{p.values}.
\end{description}

\subsection{Usage}

<<eval=FALSE>>=
ccrepe(
  x = NA,
  y = NA,
  sim.score = cor,
  sim.score.args = list(),
  min.subj = 20,
  iterations = 1000,
  subset.cols.x = NULL,
  subset.cols.y = NULL,
  errthresh = 1e-04,
  verbose = FALSE,
  iterations.gap = 100,
  distributions = NA,
  compare.within.x = TRUE,
  concurrent.output = NA,
  make.output.table = FALSE)
@

\subsection{Example 1}

An example of how to use \Rfunction{ccrepe} with one dataset.

<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(c(
    "Sample 1","Sample 2","Sample 3","Sample 4","Sample 5",
    "Sample 6","Sample 7","Sample 8","Sample 9","Sample 10"),
    c("Feature 1","Feature 2","Feature 3","Feature 4"))
test.output <- ccrepe(x=test.input, iterations=20, min.subj=10)
@

<<fig.cap="Non-normalized and normalized associations between feature 1 and feature 2. In this case we would expect feature 1 and feature 2 to be associated. In the output we see this by the positive sim.score value in the [1,2] element of test.output\\$sim.score and the small q-value in the [1,2] element of test.output\\$q.values.", fig.width=7, fig.height=3.5, fig.pos="H">>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output
@

\subsection{Example 2}

An example of how to use \Rfunction{ccrepe} with two datasets.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm

data2 <- matrix(rlnorm(105,meanlog=0,sdlog=1),nrow=15,ncol=7)
aligned.rows <- c(seq(1,4),seq(6,9),11,12)  # The datasets don't need
                                            # to have subjects line up exactly
data2[aligned.rows,1] <- 2*data[,3] + rnorm(10,0,0.01)
data2.rowsum <- apply(data2,1,sum)
data2.norm <- data2/data2.rowsum
apply(data2.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input.2 <- data2.norm

dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))
dimnames(test.input.2) <- list(paste("Sample",c(seq(1,4),11,seq(5,8),12,9,10,13,14,15)),paste("Feature",seq(1,7)))

test.output.two.datasets <- ccrepe(x=test.input, y=test.input.2, iterations=20, min.subj=10)
@

Note that a warning is issued because the subjects in the two datasets do not
match exactly; only the paired observations (subjects present in both
datasets) are used.

<<fig.cap="Non-normalized and normalized associations between feature 1 of the second dataset and feature 3 of the first dataset. In this case we would expect these two features to be associated. In the output we see this by the positive sim.score value and the small q-value for this comparison in test.output.two.datasets.", fig.width=7, fig.height=3.5, fig.pos="H">>=
par(mfrow=c(1,2))
plot(data2[aligned.rows,1],data[,3],xlab="dataset 2: Feature 1",ylab="dataset 1: Feature 3",main="Non-normalized")
plot(data2.norm[aligned.rows,1],data.norm[,3],xlab="dataset 2: Feature 1",ylab="dataset 1: Feature 3",
     main="Normalized")
test.output.two.datasets
@

\subsection{Example 3}

An example of how to use \Rfunction{ccrepe} with \Rfunction{nc.score} as the
similarity score.

<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output.nc.score <- ccrepe(x=test.input, sim.score=nc.score, iterations=20, min.subj=10)
@

<<fig.cap="Non-normalized and normalized associations between feature 1 and feature 2. In this case we would expect feature 1 and feature 2 to be associated. In the output we see this by the positive sim.score value in the [1,2] element of test.output.nc.score\\$sim.score and the small q-value in the [1,2] element of test.output.nc.score\\$q.values. In this case, however, the sim.score represents the NC-score between the two features rather than the Spearman correlation.", fig.width=7, fig.height=3.5, fig.pos="H">>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output.nc.score
@

\subsection{Example 4}

An example of how to use \Rfunction{ccrepe} with a user-defined
\Rfunction{sim.score} function.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

my.test.sim.score <- function(x,y=NA,constant=0.5){
    if(is.vector(x) && is.vector(y)) return(constant)
    if(is.matrix(x) && is.na(y)) return(matrix(rep(constant,ncol(x)^2),ncol=ncol(x)))
    if(is.data.frame(x) && is.na(y)) return(matrix(rep(constant,ncol(x)^2),ncol=ncol(x)))
    else stop('ERROR')
}

test.output.sim.score <- ccrepe(x=test.input, sim.score=my.test.sim.score, iterations=20, min.subj=10, sim.score.args=list(constant=0.6))
@

<<fig.cap="Non-normalized and normalized associations between feature 1 and feature 2. In this case we would expect feature 1 and feature 2 to be associated. Note that the values of sim.score are all 0.6 and none of the p-values are very small because of the arbitrary definition of the similarity score.", fig.width=7, fig.height=3.5, fig.pos="H">>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output.sim.score
@

\subsection{Example 5}

An example of how to use \Rfunction{ccrepe} when specifying column subsets.

<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output.1.3 <- ccrepe(x=test.input, iterations=20, min.subj=10, subset.cols.x=c(1,3))
test.output.1 <- ccrepe(x=test.input, iterations=20, min.subj=10, subset.cols.x=c(1), compare.within.x=FALSE)
test.output.12.3 <- ccrepe(x=test.input, iterations=20, min.subj=10, subset.cols.x=c(1,2), subset.cols.y=c(3), compare.within.x=FALSE)
test.output.1.3$sim.score
test.output.1$sim.score
test.output.12.3$sim.score
@

%---------------------------------------------------------
\section{nc.score}
%---------------------------------------------------------

The \Rfunction{nc.score} similarity measure is an N-dimensional extension of
the checkerboard score particularly suited to similarity score calculations
between compositions derived from ecological relative abundance
measurements. In such cases, features typically represent species
abundances, and the NC-score discretizes these continuous values into one of
N bins before computing a normalized similarity of co-occurrence or
co-exclusion. This can be used as a standalone function or with
\Rfunction{ccrepe} as above to obtain compositionality-corrected p-values.

\subsection{General Functionality}

The NC-score is an extension of Diamond's checkerboard score (see
\citet{cody1975ecology}) to ordinal data, and simplifies to a calculation of
Kendall's $\tau$ on binned data instead of ranked data. Let two features in a
dataset with $n$ subjects be denoted by
\[
\left[
\begin{array}{cccc}
x_1 & x_2 & \dots & x_n\\
y_1 & y_2 & \dots & y_n\\
\end{array}
\right].
\]
The binning function maps from the original data to $b$ numbered bins in
$\{1,\dots,b\}$. Let the binning function be denoted by $B(\cdot)$. The
co-variation and co-exclusion patterns are the same as the concordant and
discordant pairs in Kendall's $\tau$.
Considering a $2 \times 2$ submatrix of the form
\[
\left[
\begin{array}{cc}
B(x_i) & B(x_j)\\
B(y_i) & B(y_j)\\
\end{array}
\right],
\]
a co-variation pattern is counted when
$(B(x_i) - B(x_j))(B(y_i) - B(y_j)) > 0$ and a co-exclusion pattern,
conversely, when $(B(x_i) - B(x_j))(B(y_i) - B(y_j)) < 0$. The NC-score
statistic for features $x$ and $y$ is then defined as
\[
(\text{number of co-variation patterns}) - (\text{number of co-exclusion patterns}),
\]
normalized by the Kendall's $\tau$ normalization factor accounting for ties
described in \citet{kendall1970}.

\subsection{Arguments}

\begin{description}
\setlength{\itemsep}{1em}
\item[\Rcode{x}] First numerical \Rclass{vector}, or single
  \Rclass{dataframe} or \Rclass{matrix}, containing relative abundances. If
  the latter, columns are features, rows are samples. Rows should therefore
  sum to a constant.
\item[\Rcode{y}] If provided, second numerical \Rclass{vector} containing
  relative abundances. If given, \Rcode{x} must be a \Rclass{vector} as well.
\item[\Rcode{nbins}] A non-negative integer giving the number of bins to
  generate (cutoffs will be generated by the \Rfunction{discretize} function
  from the \CRANpkg{infotheo} package).
\item[\Rcode{bin.cutoffs}] A list of values demarcating the bin cutoffs. The
  binning is performed using the \Rfunction{findInterval} function.
\item[\Rcode{use}] An optional character string giving a method for computing
  covariances in the presence of missing values. This must be (an
  abbreviation of) one of the strings \Rcode{"everything"},
  \Rcode{"all.obs"}, \Rcode{"complete.obs"}, \Rcode{"na.or.complete"}, or
  \Rcode{"pairwise.complete.obs"}.
\end{description}

\subsection{Output}

\Rfunction{nc.score} returns either a single normalized score (if called with
two vectors) or a \Rclass{matrix} of all pairwise normalized scores (if
called with a \Rclass{matrix} or \Rclass{dataframe}). This behavior is
precisely analogous to the \Rfunction{cor} function in R.

\subsection{Usage}

<<eval=FALSE>>=
nc.score(
  x,
  y = NULL,
  use = "everything",
  nbins = NULL,
  bin.cutoffs = NULL)
@

\subsection{Example 1}

An example of using \Rfunction{nc.score} to get a single similarity score or
a matrix.

<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output.matrix <- nc.score(x=test.input)
test.output.num <- nc.score(x=test.input[,1],y=test.input[,2])
@

<<fig.cap="Non-normalized and normalized associations between feature 1 and feature 2. Again, we expect to observe a positive association between feature 1 and feature 2. In terms of generalized checkerboard scores, we would expect to see more co-variation patterns than co-exclusion patterns. This is shown by the positive and relatively high value of the [1,2] element of test.output.matrix (which is identical to test.output.num).", fig.height=3, fig.pos="H">>=
par(mfrow=c(1, 2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output.matrix
test.output.num
@

\subsection{Example 2}

An example of using \Rfunction{nc.score} with an arbitrary bin number. A
short sketch of the binning and pattern counting that \Rcode{nbins} controls
is shown first, followed by the example itself.
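The sketch below is purely illustrative and is not the package's
implementation: the helper name \Rcode{raw.nc.score} is hypothetical, and
equal-width \Rfunction{findInterval} bins stand in for the
\Rfunction{discretize}-based binning that \Rfunction{nc.score} uses by
default. It computes the unnormalized NC-score of two vectors, i.e.\ the
number of co-variation patterns minus the number of co-exclusion patterns on
the binned data.

<<eval=FALSE>>=
# Illustrative sketch only; use nc.score() for real analyses.
raw.nc.score <- function(x, y, nbins=4) {
    # Bin each vector into nbins equal-width bins (the package itself uses
    # infotheo's discretize unless bin.cutoffs are supplied).
    bin <- function(v) findInterval(v, seq(min(v), max(v), length.out=nbins+1),
                                    all.inside=TRUE)
    bx <- bin(x)
    by <- bin(y)
    subject.pairs <- combn(length(x), 2)   # all pairs of subjects (i,j)
    signs <- sign((bx[subject.pairs[1,]] - bx[subject.pairs[2,]]) *
                  (by[subject.pairs[1,]] - by[subject.pairs[2,]]))
    sum(signs > 0) - sum(signs < 0)        # co-variation minus co-exclusion
}
raw.nc.score(data.norm[,1], data.norm[,2])
@

\Rfunction{nc.score} additionally applies the Kendall's $\tau$ tie correction
described above; the example below simply calls it with \Rcode{nbins=4}.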
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output <- nc.score(x=test.input,nbins=4)
@

<<fig.cap="Non-normalized and normalized associations between feature 1 and feature 2. Again, we expect to observe a positive association between feature 1 and feature 2. In terms of generalized checkerboard scores, we would expect to see more co-variation patterns than co-exclusion patterns. This is shown by the positive and relatively high value in the [1,2] element of test.output. In this case, a smaller bin number yields a smaller NC-score because of the coarser partitioning of the data.", fig.height=3, fig.pos="H">>=
par(mfrow=c(1, 2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output
@

\subsection{Example 3}

An example of using \Rfunction{nc.score} with user-defined bin edges.

<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output <- nc.score(x=test.input,bin.cutoffs=c(0.1,0.2,0.3))
@

<<fig.cap="Non-normalized and normalized associations between feature 1 and feature 2. Again, we expect to observe a positive association between feature 1 and feature 2. In terms of generalized checkerboard scores, we would expect to see more co-variation patterns than co-exclusion patterns. This is shown by the positive and relatively high value in the [1,2] element of test.output. The bin edges specified here partition the relative abundances into nearly absent ([0,0.1)), low abundance ([0.1,0.2)), medium abundance ([0.2,0.3)), and high abundance ([0.3,1]).", fig.height=3, fig.pos="H">>=
par(mfrow=c(1, 2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output
@

%---------------------------------------------------------
\section{References}
%---------------------------------------------------------
\bibliographystyle{plainnat}
\bibliography{ccrepe}

\end{document}