---
title: "Pedigree Analysis and Familial Aggregation"
graphics: yes
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{Pedigree Analysis and Familial Aggregation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{FamAgg,BiocStyle}
%\VignettePackage{FamAgg}
%\VignetteKeywords{Genetics}
bibliography: references.bib
csl: biomed-central.csl
references:
- id: dummy
title: no title
author:
- family: noname
given: noname
---
```{r echo = FALSE, results = "hide" }
isWindows <- .Platform$OS.type != "unix"
library(FamAgg)
library(BiocStyle)
```
**Package**: `r BiocStyle::Biocpkg("FamAgg")`
**Authors**: `r packageDescription("FamAgg")$Author`
**Modified**: `r file.info("FamAgg.Rmd")$mtime`
**Compiled**: `r date()`
# Introduction
This package provides basic pedigree analysis and plotting utilities as well as
a variety of methods to evaluate familial clustering of cases from a given
trait. Identification of families or groups of individuals within families with
significant aggregation of cases can aid also in the selection of interesting
and promising individuals for whole genome or exome sequencing projects.
For kinship coefficient calculations and pedigree plotting the package relies
and extends the functionality of the `kinship2` package [@Sinnwell:2014kd].
If you use this package please cite Rainer et al. [@Rainer:2016hk].
# Basic pedigree operations
In the examples below we perform some simple pedigree operations, such as
plotting the pedigree for an individual or family, finding the closest common
ancestor for a set of individuals in a pedigree or retrieving the identifiers
(IDs) of all ancestors for an individual. Basic pedigree information is stored
in `FAData` objects, thus we first generate such an object from a subset of the
Minnesota Breast Cancer Study provided by the `kinship2` package. In the example
below, we generate the `FAData` providing a `data.frame` with the pedigree data,
alternatively, the pedigree information could be imported from a file (see
Section [3](#org99a8d9e)). Upon data set creation the kinship matrix (i.e. a
matrix containing the kinship coefficient between each pair of individuals in
the whole pedigree) is internally calculated using the functionality from the
`kinship2` package [@Sinnwell:2014kd].
```{r libraries, warning = FALSE, message = FALSE }
library(FamAgg)
data(minnbreast)
## Subsetting to only few families of the whole data set.
mbsub <- minnbreast[minnbreast$famid %in% 4:14, ]
mbped <- mbsub[, c("famid", "id", "fatherid", "motherid", "sex")]
## Renaming column names.
colnames(mbped) <- c("family", "id", "father", "mother", "sex")
## Defining the optional argument age.
endage <- mbsub$endage
names(endage) <- mbsub$id
## Create the object.
fad <- FAData(pedigree = mbped, age = endage)
```
We can access all the pedigree information stored in this object using the
`pedigree` method, but also using `$`. The row names of the pedigree
`data.frame` as well as the names of the vectors returned by `$` are the IDs of
the individuals in the pedigree.
```{r access-data }
## Use the pedigree method to access the full pedigree
## data.frame,
head(pedigree(fad))
## or access individual columns using $.
## The ID of the father (0 representing "founders"):
head(fad$father)
## Mother:
head(fad$mother)
## Sex:
head(fad$sex)
## We can also access the age of each individual, if
## provided.
head(age(fad))
```
To extract the pedigree for a single family we can use the `family` method,
specifying either the ID of the family or the ID of an individual in the family.
```{r family }
## Extract the pedigree information from family "4"...
nrow(family(fad, family = 4))
head(family(fad, family = 4))
## ...which is the same as extracting the family pedigree
## for an individual of this family.
head(family(fad, id = 3))
## Note that IDs are internally always converted to character,
## thus, using id=3 and id="3" return the same information.
head(family(fad, id = "3"))
```
Alternatively, we could subset the `FAData` to individuals of a single family.
```{r subsetting }
## Subset the object to a single family.
fam4 <- fad[fad$family == "4", ]
table(fam4$family)
```
To explore this family we can plot its pedigree. By default, the plotting
capabilities of the `kinship2` package are used to plot pedigrees, but
alternatively, if all required dependencies are available, the `HaploPainter`
[@Thiele:2005] perl script () can be
used instead. The `switchPlotfun` function can be used to switch the plotting
back-end. Available arguments are `ks2paint` and `haplopaint` for `kinship2` and
`HaploPainter` plotting, respectively. Note however, that `HaploPainter` only
allows to export plots to a file, while `kinship2` plotting allows, in addition
to export the plot, also to show it as a *standard* `R` plot.
Below we use the `switchPlotfun` to ensure the use of `kinship2` plotting
(usually not required) and plot the full available pedigree of individual `3`.
If the age of individuals is available, it will be plotted below the
individual's ID.
```{r plotPed, message=FALSE, fig.align='center', warning = FALSE }
switchPlotfun("ks2paint")
## By supplying device="plot", we specify that we wish to visualize the
## pedigree in an R plot. This is the default for "ks2paint", anyway.
plotPed(fad, id = 3, device = "plot")
```
The pedigree for an individual or a list of individuals can be extracted using
the `buildPed` method. By default the method first tries to identify all parents
up to 3 generations in the pedigree, and subsequently all children of the
individuals and all identified parents.
```{r buildPed, message = FALSE }
## Build the pedigree for individual 3.
fullPed <- buildPed(fad, id = "3")
nrow(fullPed)
```
Alternatively, we can extract the smallest possible pedigree for a list of
individuals by specifying `prune=TRUE`. Internally, the function transforms the
pedigree into a graph, tries to find all paths between the individuals and
returns the sub-graph of all individuals along with individuals along the paths
between them.
```{r buildPed-prune, message = FALSE }
## Find the subpedigree for individuals 21, 22 and 17.
buildPed(fad, id = c(21, 22, 17), prune = TRUE)
```
And the pedigree plot for that subset of the whole family:
```{r plotPed-3ids, message=FALSE, fig.align='center', warning = FALSE }
plotPed(fad, id = c(21, 22, 17), prune = TRUE)
```
Note that the pedigree returned by the `buildPed` method for an individual might
be different than the pedigree of a whole family. The pedigree returned by
`buildPed` contains only individuals that share kinship with the specified
individual. To exemplify this, we plot the pedigree for the family `14` in the
Minnesota Breast Cancer data set. Note that the individuals in the pedigree plot
depicted as diamonds are individuals with unknown gender. (The message "Did not
plot…" is issued by the `kinship2` plotting function and indicates singletons
that are assigned to the family but do neither have parents nor children.)
```{r plotPed-family-14, message=FALSE, fig.align='center', warning = FALSE }
plotPed(fad, family = "14", cex = 0.4)
```
In this family, founder `441` is the founder of two family branches. Building
the pedigree for individual `440` will not include any of the individuals of the
second branch, as he does not share kinship with any of them. The pedigree built
for `447` on the other hand contains also individuals from the second branch as
she shares kinship with them (*via* her mother `441`).
```{r buildPed-for-individual, message = FALSE }
## Check if we have individual 26064 from the second branch in the pedigree
## of individual 440.
any(buildPed(fad, id = "440")$id == "26064")
## What for the pedigree of 447?
any(buildPed(fad, id = "447")$id == "26064")
```
A family pedigree may consist of many founder couples (i.e. individuals for
which neither father nor mother is defined in the pedigree). To identify the
pedigree's founder couple (being the couple with the largest number of offspring
generations in the pedigree) the `findFounders` method can be used. Note that
the function returns always only one couple, even if there might be two founder
couples in the family pedigree with the same number of offspring generations.
```{r findFounders, message = FALSE }
## Find founders for family 4.
findFounders(fad, "4")
```
Alternatively, it might be of interest to determine the closest common ancestor
between individuals in a pedigree. Below we use the `getCommonAncestor` method
to identify the common ancestor for individuals `21`, `22` and `17` (which we
know from the pedigree a bit above are `1` and `2`).
```{r getCommonAncestors, message = FALSE }
## Find the closest common ancestor.
getCommonAncestor(fad, id = c(21, 22, 17))
```
Other useful methods are `getChildren`, `getAncestors` and `getSiblings`, that
return the children (or all offspring generations up to a specified level), the
parents (or all ancestors) or the siblings for the specified individuals,
respectively.
```{r getChildren-and-others, message = FALSE }
## Get the children of ID 4.
getChildren(fad, id = "4", max.generations = 1)
## Get the offsprings.
getChildren(fad, id = "4")
## Get all ancestors.
getAncestors(fad, id = "4")
## Get the siblings.
getSiblings(fad, id = c("4"))
```
In the whole Minnesota Breast Cancer data set there are 426 families
corresponding to 426 founders that had cancer during the screening phase between
1944 and 1952. In the code block below we identify the affected founders per
family.
```{r affected-founders, message = FALSE }
## Add the trait information to the FAData object.
cancer <- mbsub$cancer
names(cancer) <- as.character(mbsub$id)
trait(fad) <- cancer
## Identify the affected founders.
## First all affected individuals.
affIds <- affectedIndividuals(fad)
## Identify founders for each family.
founders <- lapply(unique(fad$family), function(z){
return(findFounders(fad, family = z))
})
names(founders) <- unique(fad$family)
## Track the affected founder.
affFounders <- lapply(founders, function(z){
return(z[z %in% affIds])
})
## Interestingly, not all founders are affected! It seems in some cases
## parents of the affected participants in the screening phase have also
## been included.
affFounders <- affFounders[unlist(lapply(affFounders, length)) > 0]
## The number of families analyzed.
length(founders)
## The number of families with affected founder.
length(affFounders)
```
Unexpectedly, only in few families one of the founders is affected. For the
other families additional (unaffected) ancestors might have been added at a
later time point.
Next we get the number of affected individuals that are related to these
affected founders.
```{r affected-for-founders, message = FALSE }
kin2affFounders <- shareKinship(fad, unlist(affFounders))
## How many of these are affected?
sum(kin2affFounders %in% affIds)
## How many affected are not related to an affected founder?
sum(!(affIds %in% kin2affFounders))
```
## Pedigree analysis methods
In this section we perform some more advanced pedigree operations. First, we
identify all individuals in the pedigree that share kinship with individual `4`.
```{r shareKinship, message = FALSE }
## Get all individuals sharing kinship with individual 4.
shareKinship(fad, id = "4")
```
It is also possible to prune remote relatives. If we don't want to include grand
children, (first) cousins and everbody else more remotely related to individual
`4`, we use the option `rmKinship`. Essentially, only siblings, children, and
parents remain.
```{r shareKinshipRm, message = FALSE }
## Get all individuals sharing kinship with individual 4, but only with kinship
## higher than 0.125 (exclude first cousins, grand children, great grand
## parents etc, i.e. everybody with kinship 0.125 or lower)
shareKinship(fad, id = "4", rmKinship = 0.125)
```
Next, we determine generations within the pedigree. Generations can only be
estimated for a single family, since in most instances e.g. the year of birth is
not available. Thus, generations are estimated considering the relation between
individuals, starting from the founder couple, i.e. generation 0, assigning
generation 1 to their children and all the mates of their children and so
on. The `estimateGenerations` method calculates such generation numbers for each
family defined in the object (or for a single family, if the family ID is
provided). The result is returned as a list with the list names corresponding to
the family ID and the list elements being the estimated generation numbers (with
names corresponding to the ID of the respective individual).
```{r estimageGenerations, message = FALSE }
## Estimate generation levels for all families.
estimateGenerations(fad)[1:3]
```
Individuals without generation level (i.e. with an `NA`) are not connected to
any other individual in the pedigree (and thus most likely represent errors in
the pedigree).
In addition, it is also possible to calculate generation levels relative to a
(single) specified individual:
```{r generationsFrom, message = FALSE }
gens <- generationsFrom(fad, id = "4")
```
We can render these generation numbers into the pedigree:
```{r plotPed-with-generations, message=FALSE, fig.align='center', warning = FALSE }
plotPed(fad, family = 4, label2 = gens)
```
## Additional plotting options
If a trait information is available it might be of interest to highlight
affected individuals in the pedigree. Trait information should always be coded
as `0` (or `FALSE`) for unaffected and `1` (or `TRUE`) for affected. In the
example below, we use the *cancer* information from the Minnesota Breast Cancer
Study.
```{r set-trait, results='hide', message=FALSE }
## Extract the cancer trait information.
tcancer <- mbsub$cancer
names(tcancer) <- mbsub$id
## Set the trait.
trait(fad) <- tcancer
```
We can now extract the trait information from the object or identify directly
the phenotyped or affected individuals.
```{r affectedIndividuals, message = FALSE }
## Extract the trait information.
head(trait(fad))
## We can also extract the IDs of the affected individuals.
head(affectedIndividuals(fad))
## Or the IDs of the phenotyped individuals.
head(phenotypedIndividuals(fad))
```
Plotting a `FAData` object with trait information results in a pedigree plot
with highlighted affected individuals (for `kinship2` pedigree plotting:
affected, unaffected and not phenotyped are represented as filled symbols, open
symbols and symbols with a question mark inside, respectively).
```{r plotPed-with-trait, message=FALSE, fig.align='center', warning = FALSE }
## Plotting the pedigree for family "9".
plotPed(fad, family = "9")
```
In addition, we can manually highlight individuals using the `highlight.ids`
argument. For `kinship2` pedigree plotting, a list of length 2 is supported as
argument `highlight.ids`, with the first element being plotted on the top left
corner of the symbol and the second element on the top right corner.
```{r plotPed-trait-highlight, message=FALSE, fig.align='center', warning = FALSE }
## Plotting the pedigree for family "9".
plotPed(fad, family = "9", highlight.ids = list(a = c("185", "201", "198"),
b = c("193")))
```
An alternative way to highlight individuals or add text to the plot is to use
the arguments `label1`, `label2` and `label3` or the `plotPed` method.
## Graph utilities
Pedigrees can also be transformed to graphs using the `ped2graph` function. That
way all graph theory methods implemented in e.g. the `igraph` package can be
applied to pedigrees.
```{r ped2graph, message = FALSE }
## Transform the full pedigree to a graph.
fullGraph <- ped2graph(pedigree(fad))
## In addition, build the graph for a single family.
singleFam <- ped2graph(family(fad, family=4))
```
We can plot these pedigrees also as graph and could use any of the layout
methods provided in the `igraph` package.
```{r plot-igraph, fig.align='center', message = FALSE, fig.cap = "Pedigrees represented as graphs." }
## Build the layout.
plot(fullGraph)
lay <- layout_(singleFam, on_grid())
plot(singleFam, layout = lay)
```
The `connectedSubgraph` function implemented in the `FamAgg` package provides
additional functionality to find the smallest connected subgraph of a list of
submitted nodes (i.e. individuals).
In the code below we want to extract the smallest possible connected subgraph of
the pedigree-graph of family 4 containing individuals `7`, `8`, `27` and `17`.
```{r connectedSubgraph, message = FALSE }
subgr <- connectedSubgraph(singleFam, nodes = c("7", "8", "27", "17"))
```
This is in principle what the `buildPed` method with the option `prune=TRUE`
does to find the smallest pedigree for a set of individuals, only that
`buildPed` ensures that also eventually missing parents are added.
```{r plot-subgraph, subgraph-plot, fig.align='center' }
## Plot the graph.
plot(subgr)
## Similar to buildPed/plotPed with prune=TRUE.
plotPed(fad, id=c("7", "8", "17", "27"), prune=TRUE)
```
# Importing and exporting pedigree data
Besides providing the pedigree data as a `data.frame`, the `FAData` constructor
can also read pedigree data from various file formats, such as plink
[@Purcell:2007dg] *ped* or *fam* files
() or generic text files.
```{r import, message=FALSE }
## Import a "ped" file.
pedFile <- system.file("txt/minnbreastsub.ped.gz", package = "FamAgg")
## Quick glance at the file.
readLines(pedFile, n = 1)
fad <- FAData(pedFile)
head(pedigree(fad))
```
Alternatively, we can import pedigree data from generic input files.
```{r import-generic, message=FALSE }
## Create the FAData by reading data from a txt file.
pedFile <- system.file("txt/minnbreastsub.txt", package = "FamAgg")
fad <- FAData(pedigree = pedFile, header = TRUE, id.col = "id",
family.col = "famid", father.col = "fatherid",
mother.col = "motherid")
```
And we can export pedigree data again using the `export` method. In the example
below, we subset the whole pedigree to the pedigree of family 4 and export this
as a *ped* file.
```{r export }
tmpF <- tempfile()
## Subset the pedigree to family 4
fam4 <- fad[fad$family == 4, ]
## Export data in ped format.
export(fam4, tmpF, format = "ped")
```
# Testing for familial aggregation
Familial aggregation aims to identify families within large ancestral pedigrees
that show a non-random aggregation of traits.
As an example, we analyze here data from the Minnesota Breast Cancer Record,
which is provided by the `kinship2` package. In brief, this data set consists of
genealogical information from 426 unrelated founders diagnosed with breast
cancer whose families entered a longitudinal study on cancer in the state of
Minnesota (USA) in 1944. Cancer cases are encoded with a `1` in column `cancer`
in the `minnbreast` `data.frame`. Note however that, besides breast cancer, also
prostate cancer cases are reported. This unfortunately causes a systematic bias
in the data set as families were only included if a founder was diagnosed with
breast cancer, but all occurrences of both breast and prostate cancer are
reported. Based on this bias many of the results below should be taken with
caution. Another important information is provided in column `endage`, which
represents either the age of cancer onset, the age at the end of the study or
the age at death of the participant.
Note that, to reduce computation time, we perform the analysis only on a subset
of families from the Minnesota Breast Cancer record and reduce the number of
simulation runs. We specifically selected some families with a high percentage
of cancer cases, thus, the analysis presented here is biased. Also, in a real
analysis you should increase the `nsim` argument.
```{r famagg-setup, warning=TRUE, message=FALSE }
library(FamAgg)
set.seed(18011977)
data(minnbreast)
## Subset the dataset to reduce processing time.
mbsub <- minnbreast[minnbreast$famid %in% c(4:100, 173, 432), ]
## Uncomment the line below to use the whole dataset instead.
## mbsub <- minnbreast
## Define the number of simulations we perform.
## nsim <- 10000
nsim <- 1000
mbped <- mbsub[, c("famid", "id", "fatherid", "motherid", "sex")]
## Renaming column names.
colnames(mbped) <- c("family", "id", "father", "mother", "sex")
## Create the FAData object.
fad <- FAData(pedigree = mbped)
## Define the trait.
tcancer <- mbsub$cancer
names(tcancer) <- as.character(mbsub$id)
```
In the following section we analyze the data set first using the *genealogical
index* method [@Hill:1980tz] (Section [4.1](#org1dbf9f5)), then we estimate the
per-individual risk of disease using the *familial incidence rate* (FIR, also
abbreviated as *FR* in the original work) [@Kerber:1995cx] (Section
[4.2](#org7913823)) and apply our *kinship sum test* to identify affected individuals
exhibiting a higher relationship to other affected individuals than what would
be expected by chance (Section [4.3](#orgb323644)). Subsequently, we apply our
*kinship group test* (Section [4.4](#orgaac6b02)) that allows to identify highly
clustered affected individuals within families.
The *genealogical index of familiality* and the *familial incidence rate* test are
well established methods while the *kinship sum test* and the *kinship group test*
are novel approaches presented here for the first time.
## *Genealogical index of familiality*
We next calculate the *genealogical index of familiality* (GIF) [@Hill:1980tz]
(referred to as the *genealogical index* in the original work) for cancer
occurrence in a subset of the Minnesota Breast Cancer Record data set. For a
given trait (e.g. whether or not an individual was diagnosed with a certain type
of cancer), the method computes the mean kinship between affected individuals
(cases) in the whole pedigree along with mean kinship values of randomly drawn
sets of individuals. The distribution of average kinship values among the
control sets is then used to estimate the probability that the observed level of
kinship among the cases is due to chance.
Below, we perform the analysis using the `genealogicalIndexTest` method on the
`cancer` trait. In its default setting, the `genealogicalIndexTest` function uses
all phenotyped individuals in the pedigree as control population from which sets
of random samples equal in size to the number of affected are drawn.
Note that by default the function excludes all singletons (i.e. unconnected
individuals in the pedigree) from the analysis. Changing the argument
`rm.singletons` to `FALSE` will estimate the GIF on the full data set.
```{r gif, warning=FALSE, message=FALSE }
## Calculate the genealogical index of familiality.
gi <- genealogicalIndexTest(fad, trait = tcancer,
traitName = "cancer", nsim = nsim)
## Display the result.
result(gi)
```
The column *genealogical index* of the result `data.frame` shown above represents
the mean kinship between all pairs of affected individuals in the pedigree
multiplied by `100000` for easier interpretation. Thus, according to the GIF
test, a clustering of cancer cases is present in the analyzed pedigree. The
output messages from the method call indicate that some individuals have been
excluded from the test since they were either not phenotyped in the trait
(i.e. have a missing value in trait), or are not *connected* in the family
pedigree (do not share kinship with any other individual in the pedigree after
removing non-phenotyped individuals).
The genealogical index of familiality implementation in this package adds some
more flexibility to the original approach. The definition of the appropriate set
of control individuals from which random samples are drawn can be specified with
the `controlSetMethod` argument. Also, it is possible to perform a stratified
sampling, e.g. if the group of affected cases in a pedigree consists of 5 female
and 3 male individuals, submitting the sex of each individual in the pedigree
with the argument `strata` (i.e. `strata=fad$sex`, with `fad` being the `FAData` object
on which the analysis is performed) allows the function to define random control
sets with the same proportion of male/female individuals.
In the next example, we use the `getSexMatched` function to define the set of
control individuals and also the `getExternalMatched` submitting the gender
information of each individual. The results from both approaches are essentially
identical, and in the present data set not that useful, as the Minnesota Breast
Cancer data set lists both, breast cancer and prostate cancer in column `cancer`,
thus, the set of control individuals will contain all individuals with known
sex.
```{r gif-2, warning=FALSE, eval=FALSE }
## Calculate the genealogical index of familiality using random sampling from
## a sex matched control set.
giSexMatch <- genealogicalIndexTest(fad, trait = tcancer,
traitName = "cancer", nsim = nsim,
controlSetMethod = "getSexMatched")
## Use an external vector to perform the matching.
## The results are essentially identical.
giExtMatch <- genealogicalIndexTest(fad, trait = tcancer,
traitName = "cancer", nsim = nsim,
controlSetMethod = "getExternalMatched",
match.using = fad$sex)
```
Note that any matching or stratified sampling can lead to the exclusion of
individuals with missing values in either the matching criteria or the strata.
In the Minnesota Breast Cancer data set, the number of prostate cancer cases is
much lower than the number of breast cancer cases, thus, simple random sampling
might result in an biased genealogical index of familiality estimate since about
the same proportion of male and female individuals will be sampled. To account
for such cases a stratified sampling, as performed below, can be used instead.
```{r gif-3, message=FALSE }
## Evaluate the proportion of male and femal cases.
table(gi$sex[affectedIndividuals(gi)])
## We can use the gender information to perform stratified sampling, i.e.
## in each permutation a random set of 3 male and 15 females will be selected.
giStrata <- genealogicalIndexTest(fad, trait = tcancer,
traitName = "cancer", nsim = nsim,
strata = fad$sex)
result(giStrata)
```
Finally, we plot the result from the simulation. The blue vertical line in the
plot below represents the mean kinship value between all affected individuals in
the pedigree. The distribution of mean kinship values from the 1000 randomly
drawn sets are shown in grey color.
```{r gif-4-plot, mbreast-genealogical-index-result, message=FALSE, warning=FALSE, fig.align='center' }
## Plot the result.
plotRes(giStrata)
```
The genealogical index of familiality can also be estimated by the `gif`
function from the `gap` R-package. Below we calculate the estimate using both
methods and compare the resulting estimate. Note that the `gif` method reports
only the genealogical index of familiality estimate but does not estimate
significance.
```{r gif-gap, message=FALSE }
library(gap)
## Adding the trait information, so the extracted pedigree data.frame will
## also contain a column "affected" with that information.
trait(fad) <- tcancer
## Extract the pedigree and re-format it for the gif function.
pedi <- pedigree(fad)
## Remove singletons.
pedi <- removeSingletons(pedi)
pedi[is.na(pedi$father), "father"] <- 0
pedi[is.na(pedi$mother), "mother"] <- 0
## Identify the affected individuals.
affIds <- as.numeric(pedi$id[which(pedi$affected == 1)])
## Execute the gif method contained in the gap package.
gifRes <- gif(pedi[, c("id", "father", "mother")], affIds)
## Calculate the GIF using FamAgg's genealogicalIndexTest.
gifT <- genealogicalIndexTest(fad, trait = tcancer, nsim = 100)
## Comparing the results:
all.equal( result(gifT)$genealogical_index, gifRes[[1]] )
```
Thus, the GIF estimate from the `gap` package is identical to the one from the
`FamAgg` package.
In the examples above, we tested for an enrichment of cancer cases in the full
data set, i.e. across all families. In addition, we can perform the test
individually for each family, by setting the `perFamilyTest` parameter of the
`genealogicalIndexTest` to `TRUE`, and thus test for a clustering of cancer
cases within each family.
```{r gif-5, message=FALSE, warning=FALSE }
## Perform the analysis (no strata etc) separately for each family.
giFam <- genealogicalIndexTest(fad, trait = tcancer, nsim = nsim,
perFamilyTest = TRUE,
traitName = "Cancer")
## Display the result from the analysis.
head(result(giFam))
```
## *Familial incidence rate* (FIR)
A per-individual risk of e.g. disease can be calculated using the *familial
incidence rate* (FIR, abbreviated as *FR* in the original work)
[@Kerber:1995cx]. This measure considers the kinship of each individual with
any affected in a given trait in the pedigree and the time at risk for each
individual. Thus, the FIR is an estimate for the risk per gene-time for each
individual given the disease-experience in the cohort.
As *time at risk* for each individual we use the `endage` column in the
Minnesota Breast Cancer data set, which represents the participant's age at the
last follow-up or at cancer incidence. This estimate of time at risk is rather
crude and in a real life situation a better, more accurate, estimate that is
based e.g. on the birth dates and dates of last follow up or incidence might be
used instead. See the help of functions `estimateTimeAtRisk` and `sliceAge` for
details and options related to *time at risk*.
```{r fir-1, warning=FALSE, message = FALSE }
## Estimate the risk for each individual using the familial incidence
## rate method. We use the "endage" provided in the Minnesota Breast Cancer
## Record as a measure for time at risk.
fr <- familialIncidenceRate(fad, trait = tcancer, timeAtRisk = mbsub$endage)
```
A note on singletons: for all per-individual measures unconnected individuals
within the pedigree are automatically excluded from the calculations as no
kinship-based statistics can be estimated for them (they do, by definition, not
share kinship with any other individual in the pedigree, thus their kinship
coefficient with any other individual in the pedigree will be `0`). Note also
that the removal of e.g. not phenotyped individuals prior to the calculation can
also *generate* singletons, that additionally become removed. This removal
results in an estimate with the value `NA` for all singletons as well as not
phenotyped individuals.
Next, we calculate the mean FIR within each family and plot this information.
```{r fir-2, mbreast-mean-fr-per-family, message=FALSE, warning=FALSE, fig.align='center' }
## Split the FIR by family and average the values within each.
frFam <- split(fr, f = fad$family)
frFamAvg <- lapply(frFam, mean, na.rm = TRUE)
## Sort and plot the averages.
frFamAvg <- sort(unlist(frFamAvg), decreasing = TRUE)
plot(frFamAvg, type = "h", xaxt = "n", xlab = "", ylab = "mean FIR",
main = "Per family averaged familial incidence rate")
axis(side = 1, las = 2, at = 1:length(frFamAvg), label = names(frFamAvg))
```
Not unexpectedly, individuals in some families have on average a higher familial
incidence rate, and thus a higher risk of cancer than others.
In the next example, we calculate the familial incidence rate assessing in
addition the significance of each estimate using Monte Carlo simulations. This
extension to the original approach from Kerber [@Kerber:1995cx] does also
allow stratified sampling.
```{r fir-3, warning=FALSE, message=FALSE }
## Estimate the risk for each individual using the familial incidence
## rate method. We use the endage provided in the Minnesota Breast Cancer
## Record as a measure for time at risk.
frTest <- familialIncidenceRateTest(fad, trait = tcancer,
traitName = "cancer",
timeAtRisk = mbsub$endage,
nsim = nsim)
```
The familial incidence rate can be extracted easily from the result object using
the `familialIncidenceRate` method or using `$fir`. Also, the empirical p-value
from the simulation analysis and the time at risk can be accessed using the `$`
operator (i.e. using `$pvalue`, `$tar` or `$timeAtRisk`, respectively).
```{r fir-4 }
head(familialIncidenceRate(frTest))
head(frTest$fir)
```
Finally, we inspect the results from the analysis.
```{r fir-5 }
head(result(frTest))
```
We can also identify the families containing individuals with a significant FIR.
```{r fir-6 }
frRes <- result(frTest)
frSig <- frRes[which(frRes$padj < 0.05), ]
## Split by family.
frFam <- split(frSig, frSig$family)
frRes <- data.frame(family = names(frFam),
no_sign_fir = unlist(lapply(frFam, nrow)))
## Determine the number of phenotyped and affected individuals per family.
noPheNAff <- sapply(names(frFam), function(z){
fam <- family(frTest, family = z)
return(c(no_pheno = sum(!is.na(fam$affected)),
no_aff = length(which(fam$affected == 1))
))
})
frRes <- cbind(frRes, t(noPheNAff))
## Display the number of phenotyped and affected individuals as well as
## the number of individuals within the families with a significant FIR.
frRes[order(frRes[, "no_sign_fir"], decreasing = TRUE), ]
```
We have an enrichment of affected cases in families 173, 13 and 432.
## *Kinship sum test*
Next, we use the *kinship sum test* that evaluates familial aggregation based on
the sum of kinship values between affected cases. The test identifies affected
individuals exhibiting a higher relationship to other affected individuals than
would be expected by chance. By specifying the `strata` we perform
sex-stratified random sampling, i.e. ensure that the proportion of male and
female individuals in each randomly sampled group matches the corresponding
proportions in the *real*, observed, affected.
```{r kinsum-1, message = FALSE }
## Perform the kinship sum test.
kinSum <- kinshipSumTest(fad, trait = tcancer, traitName = "cancer",
nsim = nsim, strata = fad$sex)
head(result(kinSum))
```
Next, we identify those individuals that have a significant kinship sum
accepting a 10% false discovery rate (FDR).
```{r kinsum-2, message = FALSE }
## Extract the IDs of the individuals with significant kinship. By default,
## the raw p-values are adjusted for multiple hypothesis testing using the
## method from Benjamini and Hochberg.
kinSumRes <- result(kinSum)
kinSumIds <- as.character(kinSumRes[kinSumRes$padj < 0.1, "affected_id"])
## From which families are these?
table(kinSumRes[kinSumIds, "family"])
```
Thus, most of the identified significant individuals are from two families.
Next, we compare the FIR scores of affected or unaffected (but phenotyped)
individuals in this family to the FIR scores of affected or unaffected
individuals of all other families.
```{r kinsum-3, mbreast-family-432-FIR-compared-to-others, message=FALSE, warning=FALSE, fig.align='center' }
## Get the familial ratio of the significant in this family, of all in
## this family, and of all others.
famId <- kinSumRes[1, "family"]
## Extract the family.
fam <- family(kinSum, family = famId)
## Stratify individuals in affected/unaffected.
strat <- rep("All, unaff.", length(kinSum$id))
strat[which(kinSum$affected > 0)] <- "All, aff."
strat[kinSum$id %in% fam$id] <- paste0("Fam ", famId, ", unaff.")
strat[kinSum$id %in% fam$id[which(fam$affected > 0)]] <-
paste0("Fam ",famId,", aff.")
famData <- data.frame(fr = fr, group = strat)
boxplot(fr~group, data = famData, na.rm = TRUE, ylab = "FIR",
col = rep(c("#FBB4AE", "#B3CDE3"), 2))
```
As expected, the familial incidence rate (i.e., in the present data set, the
risk of individuals to get cancer, given their kinship to other cancer cases)
for individuals (whether affected or yet unaffected) in this family is higher
than in the data set analyzed here.
Next, we plot the pedigree of this family.
```{r kinsum-4, mbreast-family-432-affected, message=FALSE, warning=FALSE, fig.align='center' }
## Plot the pedigree for the family of the selected individual removing
## all individuals that were not phenotypes.
plotPed(kinSum, id = kinSumIds[1], cex = 0.3, only.phenotyped = TRUE)
```
And finally, also plot the kinship sum for the individuals with the largest
kinship sum in relation to the *expected* kinship sums from the Monte Carlo
simulations.
```{r kinsum-5, mbreast-family-432-affecte-res, message=FALSE, warning=FALSE, fig.align='center' }
plotRes(kinSum, id = kinSumIds[1])
```
## *Kinship group test*
Here we apply the *kinship group test* to the data set. This test first defines
for each affected individual a group of individuals considering only individuals
that are as closely related as the most distant affected individual. For each
of these kinship groups two tests are then performed, one by comparing the mean
kinship among affected in the group with the mean kinship from Monte Carlo
simulations (ratio test) and one evaluating the largest observed kinship value
between affected individuals with those of random samples from the simulation
(kinship group test).
In the example below we specify again the `strata` argument and thus perform
sex-stratified random sampling.
```{r kingroup-1, message=FALSE }
## Calculate the kinship test.
kinGroup <- kinshipGroupTest(fad, trait = tcancer,
traitName = "cancer",
nsim = nsim, strata = fad$sex)
head(result(kinGroup))
```
The kinship group test finds a significant aggregation of cases in families 13,
72, 173 and 432. In fact, as we see further below, the test identified a
subgroup in the latter which shows with an exceptional high proportion of cases.
Below, we summarize the results further by listing the total number of families
in the pedigree and the number of families in which kinship groups with
significant kinship p-value and significant ratio p-value (both at a 5% FDR).
```{r kingroup-2, message = FALSE }
kinGroupRes <- result(kinGroup)
## Create a data.frame with the summarized results.
resTab <- data.frame(total_families = length(unique(kinGroup$family)),
ratio_sign = length(unique(
kinGroupRes[kinGroupRes$ratio_padj < 0.05, "family"]
)),
kinship_sign = length(unique(
kinGroupRes[kinGroupRes$kinship_padj < 0.05, "family"]
))
)
resTab
```
The most significant kinship group identified by the kinship group test is shown
in the figure below. The mother (individual `17609`) of the nuclear family
representing this group and all her daughters have cancer (see figure
below). This mother is however not directly related to the affected founder of
this family, individual `17517`, but did marry her son (id `17530`; see figure above
for the full pedigree of this family `432`).
We are also submitting the familial incidence rate values calculated above with
argument `label1` which are then displayed below the ID of each individual in the
plot.
```{r kingroup-3, mbreast-family-432-affecte-res-kinship, message=FALSE, warning=FALSE, fig.align='center' }
plotPed(kinGroup, id = kinGroupRes[kinGroupRes$family == "432",
"group_id"][1],
prune = TRUE, label1 = fr)
```
## Binomial test
The binomial test evaluates whether the number of affected in a family (or the
whole pedigree) is significantly higher than what would be expected by chance
(given a probability of being affected in a trait). In contrast to most other
methods this test does not take the degree of kinship between individuals into
account and is hence independent of the family structure in the pedigree. We can
perform this type of test using the `binomialTest` function on any `FAData` object
or any object extending it. Below we use the binomial test to evaluate a
significant enrichment of affected individuals in any family in the pedigree.
```{r bintest-1, message = FALSE }
binRes <- binomialTest(fad, trait = tcancer, traitName = "Cancer")
binResTab <- result(binRes)
head(binResTab)
```
The probability used on the binomial test is shown in column `"prob"` and is in
essence the ratio between the affected and phenotyped in the pedigree
(i.e. 154/2202). This might be an overestimation, especially if the provided
pedigree is not representative of the population. A population-based probability
can however be provided with argument `prob`. Below we test specifically whether
we have families in which the number of individuals with breast cancer is
significantly higher than expected. To this end we set the trait status of all
male individuals to `NA` and repeat the test providing the probability of
developing breast cancer during in women, which, according to the U.S. Breast
Cancer Statistics (from breastcancer.org) is 1 out of 8 in their life time.
```{r bintest-2, message = FALSE }
## Set the trait status to NA for all male individuals.
tcancer[fad$sex == "M" | is.na(fad$sex)] <- NA
## Perform the test providing also the population probability
binRes <- binomialTest(fad, trait = tcancer, prob = 1/8)
binResTab <- result(binRes)
head(binResTab)
```
Below we plot the pedigree for the family with the strongest enrichment with
affected individuals.
```{r bintest-3, message = FALSE, fig.align = "center", fig.pos = "h!" }
plotPed(binRes, family = 173)
```
# References