RCM package: vignette

Introduction

The RCM package combines unconstrained and constrained ordination of microbiome read count data into a single package. The package functions allow fitting of the ordination model, plotting it and performing diagnostic checks.

Installation

The package can be installed and loaded using the following commands:

# library(BiocManager)
# BiocManager::install("RCM")
library(RCM)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
cat("RCM package version", as.character(packageVersion("RCM")), "\n")
## RCM package version 1.0.0

Dataset

As example data we use a study on the microbiome of colorectal cancer patients “Potential of fecal microbiota for early-stage detection of colorectal cancer” (2014) by Zeller et al..

Unconstrained RCM

Fitting the unconstrained RCM

The unconstrained RC(M) method represents all variability present in the data, regardless of covariate information. It should be used as a first step in an exploratory analysis. The RCM function is the front end function for this purpose, behind it is the RCM_NB function that does the hard work but requires numeric matrix imputs. We first fit a model with two dimensions,

which took 0.7 minutes. Here we supplied a phyloseq object to the RCM function which is preferable. Alternatively one can also provide a count matrix with samples in the rows and taxa in the columns, but then all plotting variables should be supplied manually as vectors.

Adding dimensions

By default, only 2 dimensions are fitted, but more can be requested throught the k argument.

The total runtime for all dimensions combined was 1.8 minutes.

Conditioning

In order to condition on certain variables, supply their names to the confounders argument. Here we condition on the country variable

Conditioning can be applied for unconstrained as well as constrained RCM, the ensuing plots have exactly the same interpretation as before, only now the variability due to the confounding variable has been filtered out.

Plotting the uconstrained RCM

Monoplots

To plot only samples or taxa, specify the plotType argument in the generic plot function.

No clear signal is present at first sight. Note also that the plot is rectangular according to the values of the importance parameters \(\psi\). In order to truthfully represent the distances between samples all axis must be on the same scale. We can add a colour code for the cancer diagnosis contained in the phyloseq object.

It is clear that some of the variability of the samples is explained by the cancer status of the patients.

We can also add a richness measure as a colur, see ?phyloseq::estimate_richness for a list of available richness measures. Here we plot the Shannon diversity

In order to plot only the species, modify the plotType argument.

The researchers found that species from the Fusobacteria genus are associated with cancer. We can plot only these species using a regular expression.

It is clear that these Fusobacterium species behave very differently between the species. We can also colour the species plots by phylogenetic level, e.g. order level, if available in the dataset.

Biplots

Finally we can combine both plots into an interpretable biplot, which is the default for unconstrained RC(M). To avoid overplotting we only show the taxa with the 10 most important departures from independence.

Samples are represented by dots, taxa by arrows. Both represent vectors with the origin as starting point.

Valid interpretatations are the following:

  • Samples (endpoints of sample vectors, the red dots) close together depart from independence in a similar way
  • The orthogonal projection of the taxon arrows on the sample arrows are proportional to the departure from independence of that taxon in that sample on the log scale, in the first two dimensions. For example Fusobacterium mortiferum is more abundant than average in samples on the left side of the plot, and more abundant in samples on the right side.
  • The importance parameters \(\psi\) shown for every axis reflect the relative importance of the dimensions

Distances between endpoints of taxon vectors are meaningless.

Adding projections

We can also graphically highlight the departure from independence for a particular taxon and sample using the addOrthProjection function:

The projection of the species vector is graphically shown here, the orange bar representing the extent of the departure from independence. Note that we providid the exact taxon name, and approximate sample coordinates visually derived from the graph, but any combination of both is possible.

We can then plot any combination of two dimensions we want, e.g. the first and the third.

The third dimension also correlates with a separation of cancer patients vs. healthy and small adenoma patients.

Assessing the goodness of fit

Some taxa (or samples) may not follow a negative binomial distribution, or their departures from independence may not be appropriately represented in low dimensions. We visualize these taxa and samples through their deviances, which are the squared sums of their deviance residuals. This allows us to graphically identify taxa and samples that are poorly represented in the current ordination.

The deviances per sample can be plotted by just supplying the argument “Deviance” to samColour

Samples with the largest scores exhibit the poorest fit. This may indicate that samples with strong departures from independence acquire large scores, but still are not well represented in lower dimensions. Especially the one bottom right may be a problematic case.

The same principle can be applied to the taxa

For the taxa it appears to be the taxa with smaller scores are the more poorly fitted ones. Note that since the count table is not square, we cannot compare sample and taxon deviances. They have not been calculated based on the same number of taxa. Also, one cannot do chi-squared tests based on the deviances since this is not a classical regression model, but an overparametrized one.

Constrained RCM

In this second step we look for the variability in the dataset explained by linear combinations of covariates that maximally separate the niches of the species. This should be done in a second step, and preferably only with variables that are believed to have an impact on the species’ abundances. Here we used the variables age, gender, BMI, country and diagnosis in the gradient. In this analysis all covariates values of a sample are projected onto a single scalar, the environmental score of this sample. The projection vector is called the environmental gradient, the magnitude of its components reveals the importance of each variable. The taxon-wise response functions then describe how the logged mean abundance depends on the environmental score.

Fitting the constrained RCM model

In order to request a constrained RCM fit it suffises to supply the names of the constraining variables to the covariates argument. The shape of the response function can be either “linear”, “quadratic” or “nonparametric” and must be provided to the responseFun argument. Here we illustrate the use of linear and nonparametric response functions. Linear response functions may be too simplistic, they have the advantage of being easy to interpret (and plot). Non-parametric ones (based on splines) are more flexible but are harder to plot.

Plotting the constrained RCM model

Monoplots

As before we can make monoplots of only samples or taxa. Starting with the samples

we clearly see three groups of samples appearing.

One group are the healthy patients, the other two are cancer patients. The cancer patients are separated by country. Note that from Germany there are only cancer patients in this dataset. In case of non-parametric response functions, the samples mononoplot is not interpretable.

Unique to the constrained ordinations are monoplots of the variables. Do note however, that the axes have not been scaled, and trends on the x-and y-axis should be interpreted independently.

Variables far away from the origin have a strong role in shaping the environmental gradient.

The envrionmental gradients are quite different from the case with the linear response functions. Especially age is an important driver of the environmental gradient here. Still the results for cancer diagnosis, country and gender are similar to before.

Biplots

In the constrained case two different biplots are meaningful: sample-taxon biplots and variable-taxon biplots.

Sample-taxon biplot

The interpretation is similar as before: the orthogonal projection of a taxon’s arrow on a sample represents the departure from independence for that taxon in that sample, explained by environmental variables. New is also that the taxa arrows do not start from the origin, but all have their own starting point. This starting point represents the environmental scores for which there is no departure from independence. The direction of the arrow then represents in which direction of the environmental gradient its expected abundance increases. Again we can show this projection visually:

Note that the projection bar does not start from the origin in this case either.

Variable-taxon biplot

The projection of species arrows on environmental variables (starting from the origin) represents the sensitivity of this taxon to changes in this variables. Note that the fact that the arrows of BMI and gender are of similar length indicates that one standard deviation in BMI has a similar effect to gender.

Also this interpretation we can show visually on the graph

We observe that country and diagnosis are the main drivers of the environmental gradient. We also see that the healthy status has a very similar to the small adenoma status, but that they are very different from cancer patients.

Triplot

Triplots combine all the three components in a single plot. This is the default behaviour of the plot function.

Note that the samples and the environmental variables cannot be related to each other, but the sample-taxon and variable-taxon relationships discussed before are still valid on the triplot.

For the non-parametric response functions we can only plot one dimensional triplots, whereby the shape of the response functions of the most strongly reacting taxa is shown. For this we use the plotRespFun function.

The first dimension is shown by default, the environmental scores of the samples are shown below as dots, the y-axis shows the response functions of the 8 most reacting species, whereby the x-axis represents the independence model.

Let us have a look at how Fusobacterium species react to this gradient, as the authors found them to be more abundant in cancer patients .

Assessing the goodness of fit

Also for constrained ordination it can be interesting to use the deviance residuals. We can use them in the unconstrained case by summing over taxa or samples, or plot them versus the environmental gradient to detect lack of fit for the shape of the response function. For this latter goal we provide two procedures: a diagnostic plot of the taxa with the strongest response to the environmental gradient, or an automatic trend detection using the runs test statistic by Ward and Wolfowitz. Checking the linearity (or Gaussian) assumption of the response function is crucial: excessive departure invalidate the interpretation of the ordination plot.

A deviance residual plot for the strongest responders, made with the residualPlot function, specified through the numTaxa argument

The same taxa but with Pearson residuals

Most species do not exhibit any obvious pattern, although departures seem to increase with environmental scores.

The runs test is a test that automatically attempts to detect non randomness in the sequence of positive and negative residuals. We plot the residual plots for the taxa with the largest run statistic (since the sample sizes and null distributions are equal this corresponds with the smallest p-values, but we do not do inference here because of the non-classical framework).

For these 6 taxa we see no signal at all.

Identifying influential observations

It can also be interesting to identify samples have the strongest influence on the environmental gradient. For this aim we can colour the samples by influence, by specifying Influence = TRUE.

Below we see a couple of very influential observations. They may be very young?

Indeed, a few young subjects affect the estimation of the age parameter most. One should always be wary of these kind of influential observations that strongly affect the ordination. The age coefficient is largest in the second dimension actually, let’s look at the influence on that component

Importance of dimensions

There are many quantities that reflect the (relative) importance of the different dimensions.

Importance parameters

The simplest axis labels are simply the importance parameters \(\psi_m\) from the RC(M)-model. They directly reflect the impact on the mean counts in the respective dimension, as the factors to which they are multplied are normalized. Hence the \(\psi_m\)’s are the only components that can grow in size to add more weight to a dimension.

On the other hand, the sample scores have already been scaled with respect to these parameters, and they do not add addtional information to the plot itself. Also, their order is often but not necessarily strictly decreasing. Only the ratios of the different \(\psi_m\)’s are direclty interpretable. Still plotting the \(\psi_m\)’s is the default option of the RCM package.

Log-likelihoods

Another way to quantify the importance of the dimensions is to compare their differences in log-likelihoods of model m (\(ll_m\)) with respect to the saturated model (“deviances”). The log-likelihood saturared model (\(ll_{sat}\)) is calculated using the Poisson density, setting mean and variance equal to the observed counts. These differences in log-likelihood are then normalized with respect to the difference in loglikelihood between the independence model (\(ll_{independence}\)) and the saturated model. The terms obtained for dimension \(m\) is then:

\[\frac{ll_{m}-ll^*}{ll_{sat}-ll_{independence}}\]

, with \(ll^*\) the log-likelihood of the lower dimension, which can be the independence model, the model after filtering on confounders or simply the lower dimension of the RC(M) model.

This approach has the advantages of also providing a measure of importance for the confounders. Also it uses the saturated model as a reference and thus provides a fraction of “total variability”.

Disadvantages are the difficulty in intepreting log-likelihoods, and the fact that in some corner case the log-likelihood drops with higher dimensions. This is because the estimation of the dispersions and environmental gradients is not full maximum likelihood.

Inertia

As with correspondence analysis, the fraction of total inertia explained by the different dimensions can be plotted on the axes. The inertia is defined as the sum of squared Pearson residuals, or

\[\sum_{i=1}^n\sum_{j=1}^p \frac{(x_{ij}-e_{ijm})^2}{e_{ijm}}\]

with \(x_{ij}\) the observed count and \(e_{ijm}\) the expected count under the model with \(m\) dimensions. The inertia has the advantage that also the variance explained by the filtering on confouders step can be plotted, and that there is a measure of residual variance.

On the other hand, as we argue in the manuscript, the inertia is not a good measure of variability for overdispersed data, as it implicitly assumes the mean to equal the variance. Hence this criterion of dimension importance should be interpreted with caution.

Session info

This vignette was generated with following version of R:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phyloseq_1.28.0 RCM_1.0.0      
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.44.0      tseries_0.10-46     VGAM_1.1-1         
##  [4] jsonlite_1.6        splines_3.6.0       foreach_1.4.4      
##  [7] assertthat_0.2.1    TTR_0.23-4          stats4_3.6.0       
## [10] yaml_2.2.0          numDeriv_2016.8-1   pillar_1.3.1       
## [13] lattice_0.20-38     glue_1.3.1          quadprog_1.5-6     
## [16] alabama_2015.3-1    digest_0.6.18       RColorBrewer_1.1-2 
## [19] XVector_0.24.0      colorspace_1.4-1    htmltools_0.3.6    
## [22] Matrix_1.2-17       plyr_1.8.4          pkgconfig_2.0.2    
## [25] zlibbioc_1.30.0     purrr_0.3.2         scales_1.0.0       
## [28] tensor_1.5          tibble_2.1.1        nleqslv_3.3.2      
## [31] mgcv_1.8-28         IRanges_2.18.0      ggplot2_3.1.1      
## [34] BiocGenerics_0.30.0 lazyeval_0.2.2      quantmod_0.4-14    
## [37] survival_2.44-1.1   magrittr_1.5        crayon_1.3.4       
## [40] evaluate_0.13       nlme_3.1-139        MASS_7.3-51.4      
## [43] xts_0.11-2          vegan_2.5-4         tools_3.6.0        
## [46] data.table_1.12.2   formatR_1.6         stringr_1.4.0      
## [49] Rhdf5lib_1.6.0      S4Vectors_0.22.0    munsell_0.5.0      
## [52] cluster_2.0.9       Biostrings_2.52.0   ade4_1.7-13        
## [55] compiler_3.6.0      rlang_0.3.4         rhdf5_2.28.0       
## [58] grid_3.6.0          iterators_1.0.10    biomformat_1.12.0  
## [61] igraph_1.2.4.1      labeling_0.3        rmarkdown_1.12     
## [64] gtable_0.3.0        codetools_0.2-16    multtest_2.40.0    
## [67] curl_3.3            reshape2_1.4.3      R6_2.4.0           
## [70] zoo_1.8-5           knitr_1.22          dplyr_0.8.0.1      
## [73] permute_0.9-5       ape_5.3             stringi_1.4.3      
## [76] parallel_3.6.0      Rcpp_1.0.1          tidyselect_0.2.5   
## [79] xfun_0.6