\documentclass{article} \usepackage[utf8]{inputenc} \usepackage{parskip} \usepackage{Sweave} \usepackage{amsmath} \usepackage{hyperref} %% \VignetteIndexEntry{mmod-demo} \begin{document} \title{\textsc{mmod} vignette} \author{David Winter\\ david.winter@gmail.com} \maketitle \tableofcontents \newpage \section{Why use mmod (or what's wrong with $G_\text{ST}$?)} Population geneticists, molecular ecologists and evolutionary biologists often want to be able to determine the degree to which populations are divided into smaller sub-populations. One very widely used approach to this question uses ``F analogues'' (measures based on Wrigtht's $F_\text{ST}$) to compare diversity within and between predefined sub-populations. Until recently, the most widely used of these statistics has been Nei's $G_\text{ST}$. Unfortunately, the value of $G_\text{ST}$ is a at least partially dependent on the number of alleles at each locus and the number of populations sampled. This makes simple interpretations of $G_\text{ST}$ difficult, and comparisons between studies (or even between loci in the same population) potentially misleading. A number of new $F_\text{ST}$ analogues have been developed that compensate for these short comings, and give values that can be compared between studies. \textsc{mmod} is a package that allows three of these statistics, $G''_\text{ST}$, $D_\text{est}$ and $\varphi'_\text{ST}$, to be calculated from \texttt{genind} objects (the standard representation of genetic datasets in the \texttt{adegenet} library) \section{Which statistic should I use?} With the proliferation of $F_\text{ST}$ analogues, it can be hard to decide on the most appropriate measure to use for your study. I encourage you to read Meirmans and Hedrick (2011 doi:\href{http://dx.doi.org/10.1111/j.1755-0998.2010.02927.x}{10.1111/j.1755-0998.2010.02927.x}), which includes a discussion on this topic. As you'll see in the demonstration below, the corrected statistics often tell a similar story. Interestingly, $G''_\text{ST}$ can be directly related to the rate of migration between populations while $D_\text{est}$ and $\varphi'_\text{ST}$ are about partitioning distances or diversity between genes. You may consider which approach is most appropriate for the specific questions you wish to ask. \section{Which statistics can \textsc{mmod} not calculate} There are at least two population genetic statistics related to the ones discussed above that \textsc{mmod} can't calculate. $R_\text{ST}$ was developed for micorsattelite data, and takes the relationship between alleles (and therefore the mutation rate) into account when measuring between-allele distances. It is not clear how the maximum potential value of $R_\text{ST}$ for a given dataset can be calculated, so it is not possible to correct this statistic in a way similar to $G''_\text{ST}$ and $\varphi'_\text{ST}$. Similarly, the calculation of the maximum value of Weir and Cockerham's $\theta$ is complex (and not yet published). If you wish to calculate a corrected version of this statistic you can use RecodeData (\url{http://www.bentleydrummer.nl/software/software/}) to create a dataset in which all between-population differences are maximised. You can then calculate $\theta$ for each dataset using \texttt{Fst} from the package \texttt{pegas}. If the statistic calculated form the recoded data is ${\theta_\text{max}}$ then the corrected statistic is simply $\frac{\theta}{\theta_\text{max}}$. \section{An Example - differentiation in the \textsc{nancycats} data} With the description out of the way, let's see how these functions work in practice. As an example, we are going to examine the \texttt{nancycats} data that comes with \texttt{adegenet}. This dataset contains microsattelite genotypes taken from feral cats in Nancy, France. So let's start. <>= library(mmod) data(nancycats) nancycats @ The nancycats data comes in \texttt{adegenet}'s default class for genotypic data, the \texttt{genind} class. The functions in \texttt{mmod} work on genind objects, so you would usually start by reading in your data using \texttt{read.genpop} or \texttt{read.fstat} depending on the format it's in. Now that we have our data on hand, our goal is to see \begin{itemize} \item Whether this population is substantially differentiated into smaller sub-populations \item Whether such differentiation can be explained by the geographical distance between sub-populations. \end{itemize} We can look at several statistics to ask answer the first question by using the \verb+diff_stats()+ function: <>= diff_stats(nancycats) @ OK, so what is that telling us? The first table has statistics calculated individually for each locus in the dataset. \texttt{Hs} and \texttt{Ht} are estimates of the heterozygosity expected for this population with and without the sub-populations defined in the \texttt{nancycats} data respectively. We need to use those to calculate the measures of population divergence so we might as well display them at the same time. \texttt{Gst} is the standard (Nei) $G_\text{ST}$, \verb+Gprime_st+ is Hedrick's $G''_\text{ST}$ and \texttt{D} is Jost's $D_\text{est}$. Because all of these statistics are estimated from estimators of $H_\text{S}$ and $H_\text{T}$, it's possible to get negative values for each of these differentiation measures. Populations can't be negatively differentiated, so you should think of these as estimates of a number close to zero (it's up to you and your reviewers to decide if you report the negative numbers of just zeros). $D_\text{est}$ is the easiest statistic to interpret, as you expect to find $D=0$ for populations with no differentiation and $D=1$ for completely differentiated populations. As you can see, different loci give quite different estimates of divergence but they range from $\sim$0.1--0.4. \textsc{mmod} can calculate another statistic of differentiation called $\varphi'_\text{ST}$. This statistic is based on the Analysis of Molecular Variance (AMOVA) method, which partitions the variance in genetic distances in a dataset to among-population and within-population components (it is possible to use this framework to partition variance using more than two levels of population structure, but that has not been implemented in \textsc{mmod} yet). Because $\varphi'_\text{ST}$ can take some time to calculate it's not included in \verb+diff_stat+ by default (but you can include it using \verb+diff_stat(x, phi_st=TRUE)+). You might want to see how all these different measures compare to each other across the loci we've looked at. You can see the corrected measures (all those other than $G_\text{ST}$) show a similar pattern, and $G_\text{ST}$ is a bit strange (Figure~\ref{Comparison}): <>= nc.diff_stats <- diff_stats(nancycats, phi_st=TRUE) with(nc.diff_stats, pairs(per.locus[,3:6], upper.panel=panel.smooth)) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Comparison of differentiation measures} \label{Comparison} \end{figure} The second part of the list returned by \verb+diff_stat+ contains global estimates of each of these statistics. For $G_\text{ST}$ and $G''_\text{ST}$ these are based on the average of \texttt{Hs} and \texttt{Ht} across loci. For $D_\text{est}$ you get two, the harmonic mean of the $D_\text{est}$ for each locus and, because that method won't work if you end up with negative estimates of $D_\text{est}$, one calculated as per $G_\text{ST}$ and $G''_\text{ST}$. The global estimate of $\varphi'_\text{ST}$ is calculated from the average distance among individuals across all loci. Now that we have a point-estimate for how differentiated these populations are we will want to have some idea of how robust this result is. \textsc{mmod} has a few functions for performing bootstrap samples of \texttt{genind} objects and calculating statistics from those samples. Because some of these functions can take a long time to run, we will create a very small (10 repetition) bootstrap sample of the \texttt{nancycats} data, then calculate $D_\text{est}$ from that sample: <>= bs <- chao_bootstrap(nancycats, nreps=10) bs.D <- summarise_bootstrap(bs, D_Jost) bs.D @ As you can see, printing a summarised bootstrap sample gives us shows a basic overview of that data. In this case the confidence intervals are calculated using the ``normal method'', which it to say the the intervals are the observed value statistic +/- 1.96 x the standard error of the boostrap sample.There is more to these objects than gets printed --- use \texttt{str(bs.D)} to check it out. I don't think there is much point trying to interpret confidence intervals estimated from 10 samples, but the point estimates seem to show a population with some substantial differentiation. Next, we want to know if geography can explain that differentiation. The \texttt{nancycats} data comes with coordinates for each population. We can use these to get Euclidean distances: <>= head(nancycats@other$xy, 4) nc.pop_dists <- dist(nancycats@other$xy, method="euclidean") @ \texttt{mmod} provides functions to calculate pairwise versions of each of the differentiation statistics. Because we want to perform a Mantel test, we'll use the ``linearized'' version of $D_\text{est}$, which is just $x/(1-x)$ (each of the pairwise stats has and argument to return this version). <>= nc.pw_D <- pairwise_D(nancycats, linearized=TRUE) @ The library \texttt{ade4}, which is loaded with \texttt{mmod}, provides functions to perform Mantel tests on distance matrices. <>= mantel.rtest(nc.pw_D, log(nc.pop_dists), 999) @ So, the geographic distance between these populations can't explain the genetic divergences we see: the correlation is small and non-significant. If you like, we can also visualize this relationship (Figure~\ref{Distances}). <>= fit <- lm(as.vector(nc.pw_D) ~ as.vector(nc.pop_dists)) plot(as.vector(nc.pop_dists), as.vector(nc.pw_D), ylab="pairwise D", xlab="physical distance") abline(fit) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Geographic distance does not explain genetic differentiation} \label{Distances} \end{figure} There are a couple of other functions that are not used here, and a few of use the functions we have used have help messages that guide interpretation of their ressults - use \texttt{help(package="mmod")} to see the full documentation. \end{document}