Susan Holmes, Wolfgang Huber
2022-06-24
R.A. Fisher, pioneer of experimental design:
"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of."
(Fisher 1935; Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14-17)
A well-characterized cell line growing under laboratory conditions on defined media, at controlled temperature and atmosphere.
We administer a precise amount of a drug, and after 72h we measure the activity of a specific pathway reporter.
We recruit 200 patients who have a disease, fulfill the inclusion criteria (e.g., age, comorbidities, mental capacity), and ask them to take a drug each day at exactly 6 am. After 3 months, we take an MRI scan and measure lots of other biomarkers to see whether and how the disease has changed, and whether there were any side effects.
What to do about this?
library("DESeq2")
library("airway")
library("ggplot2")
library("dplyr")
library("gridExtra")
data("airway")
aw = DESeqDataSet(airway, design = ~ cell + dex) %>% estimateSizeFactors
sizeFactors(aw)
samples = c("SRR1039513", "SRR1039517")
myScatterplot = function(x) {
as_tibble(x) %>%
mutate(rs = rowSums(x)) %>%
filter(rs >= 2) %>%
ggplot(aes(x = asinh(SRR1039513),
y = asinh(SRR1039517))) + geom_hex(bins = 50) +
coord_fixed() +
geom_abline(slope = 1, intercept = 0, col = "orange") +
theme(legend.position = "none")
}
grid.arrange(
myScatterplot(counts(aw)),
myScatterplot(counts(aw, normalized = TRUE)),
ncol = 2)
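As a quick sanity check (a sketch; it assumes that only size factors, and no gene-specific normalization factors, have been estimated), the normalized counts are simply the raw counts divided column-wise by the size factors:
# sketch: normalized counts should equal raw counts divided by the size factors
all.equal(counts(aw, normalized = TRUE),
          sweep(counts(aw), 2, sizeFactors(aw), "/"))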
What do we want from a good normalization method:
Possible figure of merit?
William of Ockham
If one can explain a phenomenon without assuming this or that hypothetical entity, there is no ground for assuming it.
One should always opt for an explanation in terms of the fewest possible causes, factors, or variables.
The efficiency of most biochemical or physical processes involving DNA-polymers depends on their sequence content, for instance, occurrences of long homopolymer stretches, palindromes, GC content.
These effects are not universal, but can also depend on factors like concentration, temperature, which enzyme is used, etc.
When looking at RNA-Seq data, should we treat GC content as noise or as bias?
We may think that the outcome of tossing a coin is completely random.
If we meticulously registered the initial conditions of the coin flip and solved the mechanical equations, we could predict which side has a higher probability of coming up: noise becomes bias.
We use probabilistic modelling as a method to bring some order into our ignorance – but let’s keep in mind that these models are a reflection of our subjective ignorance, not an objective property of the world.
Imagine we want to test whether a weight loss drug works. Which of the following designs is better?
A person is weighed on milligram-precision scales, with 20 replicates. He/she follows the treatment, and four weeks later (s)he is weighed again, with 20 replicates. The data are recorded by a specially trained technician.
5 people weigh themselves on their bathroom scales and self-report the number. Four weeks later, they weigh themselves and report again. So we have 5 replicates.
This example shows the difference between technical and biological replicates.
Some design decisions in HT biology are similar, if more subtle:
For reliable variant calling with current sequencing technology, you need ca. \(30\times\) coverage per genome.
In the 1000 genomes project, the average depth of the data produced was 5.1 for 1,092 individuals. Why was that study design chosen?
In fact, the terminology is too simplistic. Error can creep in at many different levels:
We will use the notion of blocks.
If we know about these nuisance factors and have kept track of them, we can (and should) include them explicitly as bias terms in our models.
If we did not keep track, we can try to use latent factor models (SVA, PEER, RUV) to identify them from the data.
Pre-modern measurement systems measured lengths in feet, arms, inches (first joint of an index finger), weights in stones, volumes in multiples of the size of a wine jar, etc.
In the International System of Units: meters, seconds, kilograms, Ampères, … are defined based on universal physical constants. A meter measured by a lab in Australia using one instrument has the same meaning as a meter measured a year later by a lab in Stanford using a different instrument, by a space probe in the Kuiper belt, or ETs on Proxima Cen b.
Measurements in biology are, unfortunately, rarely that comparable.
Often, absolute values are not reported (these would require units), but only fold changes with regard to some local reference.
Even when absolute values exist (e.g., read counts in an RNA-Seq experiment) they usually do not translate into universal units such as molecules per cell or mole per milliliter.
Regular noise can be modelled by simple probability models such as independent normal distributions, Poissons, or mixtures such as Gamma-Poisson or Laplace.
We can use relatively straightforward methods to take noise into account in our data analyses.
The real world is more complicated: measurements can be completely off the scale (a sample swap, a contamination or a software bug), and multiple errors often come together: e.g., a whole microtiter plate went bad, affecting all data measured from it.
Such events are harder to model or even correct for – our best chance to deal with them is data quality assessment, outlier detection and documented removal.
A film director will view daily takes to correct potential lighting or shooting issues before they affect too much footage. It is a good idea not to wait until all the runs of an experiment have been finished before looking at the data.
Intermediate data analyses and visualizations will reveal any unexpected sources of variation in the data and enable you to adjust the protocol.
It is important to be aware of sources of variation as they occur and adjust for them.
A pharmacist’s balance: a weighing analogy (Hotelling (1944) and Mood (1946)).
Experimental design aims to make the best use of the available resources. One strategy is to capitalize on cancellations and symmetries. An interesting example is a weighing scheme devised by Hotelling.
Consider a set of eight objects of unknown weights \(m=(m_1,\ldots,m_8)\).
For our simulation study, we create a vector of true \(m\) randomly.
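The chunk that creates \(m\) is not shown here; a sketch, using the same construction as in the simulation further below (your random draw will differ):
# sketch: eight random true weights
m = sample(seq(2, 16, by = 2), 8) + round(rnorm(8, 1), 1)
m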
## [1]  8.6 10.9  4.4 16.9  3.5 15.0  7.8 13.6
First, let’s weigh each \(m_i\) individually.
Suppose our scale has errors distributed normally with a SD of 0.1. We simulate the measurements (incl. individual errors) as follows. We also compute the root of the mean of their squares (RMS) as a measure of overall error.
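A sketch of this step, with a small rms helper (its exact definition is an assumption; because the draws are random, your numbers will differ from those printed below):
# root of the mean of the squared errors
rms = function(x) sqrt(mean(x^2))
# scheme 1: weigh each object individually, with normal errors of sd 0.1
X = m + rnorm(length(m), mean = 0, sd = 0.1)
errors1 = X - m
X
errors1
rms(errors1)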
## [1] 8.359209 10.868363 4.298189 16.896361 3.454035 14.967027
## [7] 7.857463 13.452812
## [1] -0.240791027 -0.031636891 -0.101811022 -0.003639091 -0.045965424
## [6] -0.032972940 0.057463277 -0.147188095
## [1] 0.1104119
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 1 1 1 1 1 1 1
## [2,] 1 -1 1 -1 1 -1 1 -1
## [3,] 1 1 -1 -1 1 1 -1 -1
## [4,] 1 -1 -1 1 1 -1 -1 1
## [5,] 1 1 1 1 -1 -1 -1 -1
## [6,] 1 -1 1 -1 -1 1 -1 1
## [7,] 1 1 -1 -1 -1 -1 1 1
## [8,] 1 -1 -1 1 -1 1 1 -1
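This is an \(8\times 8\) Hadamard matrix. One way to construct it (a sketch; the original construction is not shown) is by repeated Kronecker products of the \(2\times 2\) block:
# sketch: Sylvester construction of the 8 x 8 Hadamard matrix used as coef8
H2 = matrix(c(1, 1, 1, -1), nrow = 2)
coef8 = H2 %x% H2 %x% H2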
We use the columns of this matrix to guide our new weighing scheme.
The first column says we should put all 8 objects on one side of the balance and weigh them; call this Y[1].
The second column says: place objects 1, 3, 5, 7 on one side of the balance and objects 2, 4, 6, 8 on the other, and record the difference; call the result Y[2].
And so on.
Each of the eight weighings has a normal error with sd = 0.1.
Now, because we know the true \(m\), we can compute the errors, their RMS, and the ratio to the RMS of the first scheme:
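A sketch of this computation (again, your random numbers will differ from those printed below):
# sketch, scheme 2: Hotelling's eight combined weighings, then recover the
# individual weights using the orthogonality of coef8 (t(coef8) %*% coef8 = 8 * diag(8))
Y = coef8 %*% m + rnorm(8, mean = 0, sd = 0.1)
mhat = t(coef8) %*% Y / 8
errors2 = as.vector(mhat - m)
errors2
rms(errors2)
rms(errors1) / rms(errors2)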
## [1] 0.052211813 0.035408972 -0.051291112 0.027410736 -0.028415118
## [6] -0.006958278 0.034178628 -0.049007538
## [1] 0.03838579
## [1] 2.876375
We saw that the second method had RMS about 3 times smaller than the first. Were we just lucky?
Repeat Hotelling’s weighing experiment B = 10,000 times with different \(m\) and look at the sampling distributions of the RMS errors in both schemes.
Guess at the relationship between the RMS results.
tcoef8 = t(coef8) / 8
errs = replicate(10000, {
  # draw a new set of true weights
  m = sample(seq(2, 16, by = 2), 8) + round(rnorm(8, 1), 1)
  # scheme 1: weigh each object individually
  X_1to8 = m + rnorm(length(m), mean = 0, sd = 0.1)
  err_1to8 = X_1to8 - m
  # scheme 2: Hotelling's combined weighings, then recover the estimates
  Y = coef8 %*% m + rnorm(length(m), mean = 0, sd = 0.1)
  mhat = tcoef8 %*% Y
  err_hotel = mhat - m
  c(rms(err_1to8), rms(err_hotel))
})
library("ggplot2")
library("tibble")
ggplot(tibble(ratio = errs[1,] / errs[2,]), aes(x = log2(ratio))) +
geom_histogram(bins = 50)
We say that the second scheme is more efficient than the first by a factor of \(\sqrt{8}\) because the RMS errors generated by the measurement are \(\sqrt{8}\) times lower (\(\log_2\sqrt{8}\approx\) 1.5).
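The factor \(\sqrt{8}\) follows from the orthogonality of the design matrix: writing \(H\) for coef8 and \(\sigma\) for the measurement error standard deviation, \[\hat{m}=\frac{1}{8}H^\top Y,\qquad H^\top H=8\,I_8 \quad\Longrightarrow\quad \operatorname{Var}(\hat{m}_i)=\frac{\sigma^2}{8},\] so each estimated weight has standard deviation \(\sigma/\sqrt{8}\), compared with \(\sigma\) for an individual weighing.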
Ibn Sina (Avicenna), physician and scientist. His Canon of Medicine (1020) lists seven rules of experimental design, including the need for controls and replication, the danger of confounding, and the necessity of changing only one factor at a time.
However, suppose the experiment was done in two “batches”, and we color the data accordingly:
We cannot draw a conclusion, because the factor of interest is confounded with the batch.
Now suppose the experiment has higher noise levels and the same number of points as in the previous study. Then the sample size (2 x 6) is not enough. With the same error and a bigger sample size (2 x 60), it looks better.
The experiment at \(n_1=n_2=6\) is not powerful enough.
##
## Welch Two Sample t-test
##
## data: Biomarker B by states
## t = -2.7115, df = 9.9937, p-value = 0.02189
## alternative hypothesis: true difference in means between group healthy and group tumor is not equal to 0
## 95 percent confidence interval:
## -1.1592407 -0.1133796
## sample estimates:
## mean in group healthy mean in group tumor
## 1.549076 2.185387
With the same effect size and a larger sample size, we have the power to see the difference:
##
## Welch Two Sample t-test
##
## data: Biomarker C by states_60
## t = -7.822, df = 116.83, p-value = 2.56e-12
## alternative hypothesis: true difference in means between group healthy and group tumor is not equal to 0
## 95 percent confidence interval:
## -0.5516679 -0.3287520
## sample estimates:
## mean in group healthy mean in group tumor
## 1.493643 1.933853
Depends on:
Each pot in Darwin’s Zea Mays experiment is a block: only the factor of interest (pollination method) should differ; all other factors should be kept equal within a block.
A balanced design is an experimental design in which all the different factor combinations have the same number of observation replicates. Such data are particularly easy to analyse because the effect of each factor is identifiable. When there are (likely) nuisance factors, it is good to make sure they are balanced with the factors of interest. Sometimes this is inconvenient or impractical for logistic or economic reasons – but in such cases analysts are on thin ice and need to proceed with caution.
When comparing various possible designs, we do power simulations.
Here, we suppose the sample size is 15 in each group and the effect size is 0.2. We also need to make assumptions about the standard deviations of the measurements; here we suppose both groups have the same sd = 0.25.
n = 15
effect = 0.2
pots   = rnorm(n, 0, 1)      # per-pot effect, shared by the pair of plants in a pot
noiseh = rnorm(n, 0, 0.25)   # measurement noise, hybrid plants
noisea = rnorm(n, 0, 0.25)   # measurement noise, autoz plants
autoz  = pots + noisea
hybrid = pots + effect + noiseh
Perform a simple and a paired \(t\)-test. Which is more powerful in this case?
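The calls behind the output shown below are presumably of this form (a sketch):
t.test(hybrid, autoz)                 # unpaired (Welch) two-sample t-test
t.test(hybrid, autoz, paired = TRUE)  # paired t-test, exploiting the pot blocking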
##
## Welch Two Sample t-test
##
## data: hybrid and autoz
## t = 0.58292, df = 27.908, p-value = 0.5646
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6118805 1.0985542
## sample estimates:
## mean of x mean of y
## 0.2016872 -0.0416497
##
## Paired t-test
##
## data: hybrid and autoz
## t = 3.3113, df = 14, p-value = 0.005145
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.08572565 0.40094807
## sample estimates:
## mean difference
## 0.2433369
Maybe we were just lucky (due to randomness in our simulation) here.
To confirm, run a parametric bootstrap experiment: generate data as above B = 1000 times and compute the average probability of rejection, allowing for a false positive rate of \(\alpha=0.05\).
pvals = replicate(1000, {
pots = rnorm(15,0,1)
noiseh = rnorm(15,0,0.25)
noisea = rnorm(15,0,0.25)
autoz = pots + noisea
hybrid = pots + effect + noiseh
sapply(c(FALSE, TRUE), function(prd)
setNames(t.test(hybrid, autoz, paired = prd)$p.value, paste(prd))
)
})
rowMeans(pvals < 0.05)
## FALSE TRUE
## 0.000 0.538
We can plot the p-values obtained using both methods. To explore how the power depends on the simulation parameters, we wrap the comparison into a function:
powercomparison = function(potsd = 1, noisesd = 0.25, n = 15, effect = 0.2,
B = 1000, pthresh = 0.05) {
pv = replicate(B, {
pots = rnorm(n, 0, potsd)
noiseh = rnorm(n, 0, noisesd)
noisea = rnorm(n, 0, noisesd)
autoz = pots + noisea
hybrid = pots + effect + noiseh
sapply(c(FALSE, TRUE), function(prd)
setNames(t.test(hybrid, autoz, paired = prd)$p.value, paste(prd))
)
})
rowMeans(pv < pthresh)
}
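Usage then looks like this (a sketch; the parameter values are illustrative and not necessarily those behind the outputs printed below):
# sketch: vary the ratio of pot-to-pot sd to measurement-noise sd
powercomparison(potsd = 0.5,  noisesd = 0.25)
powercomparison(potsd = 0.25, noisesd = 0.25)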
Here are a few examples showing that when the pots sd is smaller than the noise sd, pairing hardly makes a difference. If the pots variability is larger than that of the measurement noise, pairing makes a big difference.
## FALSE TRUE
## 0.037 0.515
## FALSE TRUE
## 0.134 0.175
## FALSE TRUE
## 0.511 0.998
## FALSE TRUE
## 0.472 0.510
For 100 plants of each type and the two SDs being 0.5, the power of the paired test is nearly 80%.
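This corresponds to a call along these lines (assuming the default effect size of 0.2):
powercomparison(potsd = 0.5, noisesd = 0.5, n = 100)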
## FALSE TRUE
## 0.512 0.787
Lesson: take into account a natural pairing of the observations – for instance, studies of patients before and after a treatment.
What can be done when pairing is not available?
Try to create pairs of subjects that are as similar as possible, by matching on age, gender (sex), background health, etc. One is treated, the other serves as a control.
“Block what you can, randomize what you cannot.” (George Box, 1978)
Often we don’t know which nuisance factors will be important, or we cannot plan for them ahead of time.
In such cases, randomization is a practical strategy: at least in the limit of large enough sample size, the effect of any nuisance factor should average out.
“The design space is divided into uniform units to account for any variation so that observed differences within units are largely due to true differences between treatments. Treatments are then assigned at random to the subjects in the blocks - once in each block. The defining feature of the Randomized Complete Block Design is that each block sees each treatment exactly once.” (Trudi Grant)
(Latin square: n × n array filled with n different symbols, each occurring exactly once in each row and exactly once in each column)
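A small sketch of how one could generate such a design in R (a cyclic construction, for illustration only):
# sketch: a cyclic 4 x 4 Latin square: each treatment (letter) occurs
# exactly once in each row and each column
k = 4
latin = outer(seq_len(k), seq_len(k),
              function(i, j) LETTERS[((i + j - 2) %% k) + 1])
latin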
We often need to remove variation due to unknown factors, or decompose variability according to its different sources; this is classically done using analysis of variance, which can accommodate factors such as subject or batch effects.
Usually these decompositions require at least 3 replicate measurements in each cell.
Removal of effects from unknown sources can only be done through the use of negative controls.
Calibration of the effect size in an experiment also requires positive controls: spike-ins (for instance, the External RNA Control Consortium controls used in Risso et al. (2014)), in which a known quantity or a known expression level aids the calibration, are a standard part of many experimental protocols.
Validation in independent data / independent lines of argument
If it is too good to be true, it usually is….
Synonymous mutations in representative yeast genes are mostly strongly non-neutral
We saw how we can use preliminary data and simulations to calculate how many replicates are needed to achieve, say, an 80% true positive rate, if we know the effect size.
Now, recall the discussion of experiments versus studies.
For the cell line experiment, we might get the correct result already from one replicate; usually we’ll do two or three to be sure.
On the other hand, for the study, our intuition tells us that there is so much uncontrolled variability that 20 is likely far too few, and we may need 200 or 2,000 patients for a reliable result. The number depends on the amount of uncontrolled variability, and the effect size. A pragmatic approach is to check out previous successful (or unsuccessful) experiments or studies that did something comparable and use simulations, subsampling or bootstrapping to get an estimate of our proposed new study’s power. Here are more details about how to go about this in practice.
Dependent data require larger samples (intuition: each sample contains less information)
A sample of independent observations is more informative than the same number of dependent observations.
Suppose you want to do an opinion poll by knocking at people’s doors and asking them a question. In Scenario 1, you poll people sampled at random from across the country; in Scenario 2, you poll a few randomly chosen people and, for convenience, also two neighbors of each.
In both cases, the number of people polled is the same, but if we assume that people living in the same neighborhood are more likely to have the same opinion, the data from Scenario 2 are (positively) correlated. To explore this, let’s do a simulation.
doPoll = function(n = 100, numPeoplePolled = 12) {
  opinion = sort(rnorm(n))                # opinions; sorted to mimic spatial structure
  i1 = sample(n, numPeoplePolled)         # Scenario 1: independent random sample
  i2 = sample(seq(3, n, by = 3), numPeoplePolled / 3)
  i2 = c(i2, i2 - 1, i2 - 2)              # Scenario 2: each sampled person plus two neighbors
  c(independent = mean(opinion[i1]), correlated = mean(opinion[i2]))
}
library("tidyr")
responses = replicate(5000, doPoll())
ggplot(pivot_longer(as_tibble(t(responses)), cols = rownames(responses)),
       aes(x = value, col = name)) + geom_density() +
  geom_vline(xintercept = 0) + xlab("Opinion poll result")
There are 100 people in the country, of which in the first approach (i1) we randomly sample 12. In the second approach, we sample 4 people as well as two neighbors for each (i2). The opinion in our case is a real number, normally distributed in the population with mean 0 and standard deviation 1.
We model the spatio-sociological structure of our country by sorting the houses from most negative to most positive opinion in the first line of the doPoll function.
You’ve seen examples for data transformations that compress or stretch the space of quantitative measurements in such a way that the measurements’ noise variance (not: the signal variance) is more similar throughout – i.e., no longer highly dependent on the mean value.
The mean-variance relationship of a measurement technology can in principle be any function, but in many cases, the following prototypic relationships hold:
constant: the variance is independent of the mean (“additive noise”, \(Y=\mu+\varepsilon\))
Poisson: the variance is proportional to the mean (“shot noise”, \(N\sim\text{Poi}(\mu)\))
quadratic: the standard deviation is proportional to the mean, therefore the variance grows quadratically (“multiplicative noise”, \(Y=\mu \cdot e^\varepsilon\))
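A small simulation sketch of these three prototypes (parameter values are arbitrary, chosen only for illustration):
# sketch: three prototypic mean-variance relationships
mu = 2^seq(0, 10, length.out = 1000)                    # range of true mean values
additive       = mu + rnorm(length(mu), sd = 5)         # constant variance
shot_noise     = rpois(length(mu), lambda = mu)         # variance proportional to the mean
multiplicative = mu * exp(rnorm(length(mu), sd = 0.2))  # sd proportional to the mean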
Do you know examples for biological assays (or measurement technologies) whose data show these types of relationships?
Real data can also be affected by a combination of these. For instance:
with DNA microarrays, the fluorescence intensities are subject to a combination of background noise that is largely independent of the signal (additive noise), and multiplicative noise whose standard deviation is proportional to the signal (Rocke and Durbin 2001). Therefore, for bright spots the multiplicative noise dominates, whereas for faint ones, the background.
in RNA-Seq, low counts are dominated by Poisson noise, high counts by the multiplicative noise.
We recall from Chapter 5 that for data with a mean-variance relationship \(v(\mu)\) the variance-stabilizing transformation \(g:\mathbb{R}\to\mathbb{R}\) fulfills the condition \[g'(x)=\frac{1}{\sqrt{v(x)}}\]
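For the three prototypic mean-variance relationships above, this condition yields the familiar transformations (a worked example, with constants omitted): \[v(x)=c \;\Rightarrow\; g(x)\propto x, \qquad v(x)=x \;\Rightarrow\; g(x)=2\sqrt{x}, \qquad v(x)\propto x^2 \;\Rightarrow\; g(x)\propto \log x.\]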
These pervade all phases of an analysis, from assembling the raw data, through transformation, summarization, model fitting, hypothesis testing or screening for “hits”, to interpretation. QA-related questions include:
For the last two sets of questions, heatmaps, principal component plots and other ordination plots are useful.
Henry Ford’s (possibly apocryphal) quote, “If I had asked people what they wanted, they would have said faster horses.” (https://corporate.ford.com/history.html), expresses the view of quality as fitness for purpose, versus adherence to specifications.
It is not easy to nail down (mathematically define) quality, and the word is used with many meanings. See also http://en.wikipedia.org/wiki/Quality_(business)
Longitudinal data have time as a covariate.
A handful of time points (e.g., the response of a cell line measured 48h, 72h and 84h after exposure to a drug) vs. a long and densely sampled time series (e.g., patch clamp data in electrophysiology or a movie from live cell microscopy).
In the first case, think of time as just another experimental factor, like concentration or choice of drug. One analysis strategy could be to first identify the “best”, or biologically most indicative, time point and then focus on that; or to ask whether there is any effect at all, regardless of the time.
In the second case, time series, we will often want to fit dynamical models to the data. We have many choices:
There is some attraction to seemingly “unbiased” approaches that analyse the data at hand without any reference to what is already known. Such tendencies are reinforced by the fact that statistical methods have often been developed to be generic, for instance, working on a general matrix without specific reference to what the rows and columns might signify.
Generic approaches are a good way to get started, and for analyses that are highly powered and straightforward, such an approach might work out. But often, it is wasteful. Recall the example of an RNA-Seq experiment for differential expression. As we saw in the RNA-Seq and Hypothesis Testing lectures, we could perform a hypothesis test for differential expression for each annotated gene, regardless of its read counts, and then run a multiple testing method that treats all tests as exchangeable. But this is inefficient – we can improve our detection power by filtering out, or downweighting, the hypotheses with lower test power, or with lower prior probability of being alternative.
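For the airway example above, a minimal sketch of such a filter (the threshold is arbitrary, for illustration only):
# sketch: drop genes with very low total counts before testing; they have
# essentially no power and only add to the multiple-testing burden
keep = rowSums(counts(aw)) >= 10
aw_filtered = aw[keep, ]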
Other examples include:
Penalisation or feature selection in high-dimensional regression or classification. It is easy to use schemes that treat all features the same, for instance, standardize all of them to zero mean and unit variance. But sometimes we know that some classes of features are likely to be more or less informative than others.
You can also use graphs to represent “other” data (networks) and use approaches like the group or graph lasso to structure your penalties in high-dimensional modelling.
Unsupervised clustering of your objects of interest (samples, genes, etc.) and subsequent search for over-represented annotations. Instead, just check whether the members of groups defined by the annotations are more similar to each other than two randomly picked objects are.
Data analysis pipelines in high-throughput biology often work as funnels that successively summarise and compress the data. In high-throughput sequencing, we may start with individual sequencing reads, then align them to a reference, then only count the aligned reads for each position, summarise positions to genes (or other kinds of regions), then “normalize” these numbers by library size, etc.
At each step, we lose some information, and it is important to make sure we still have enough information for the task at hand. The problem is particularly acute if we use a data pipeline built from a series of separate components without enough care being taken to ensure that all the information necessary for later steps is conserved.
Statisticians have a concept for whether certain summaries enable the reconstruction of all the relevant information in the data: sufficiency. E.g., in a Bernoulli random experiment with a known number of trials, \(n\), the number of successes is a sufficient statistic for estimating the probability of success \(p\).
In a 4-state Markov chain (A,C,G,T) such as the one we saw in Lecture 2, what are the sufficient statistics for the estimation of the transition probabilities?
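One way to see the answer in code (a sketch; the toy sequence below is just i.i.d. letters): the \(4\times 4\) table of transition counts, together with the initial state, carries all the information needed.
# sketch: the table of observed transition counts is sufficient for
# estimating the transition probabilities
nucs = sample(c("A", "C", "G", "T"), 1000, replace = TRUE)   # toy sequence
trans_counts = table(from = head(nucs, -1), to = tail(nucs, -1))
trans_probs  = trans_counts / rowSums(trans_counts)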
Iterative approaches akin to what we saw when we used the EM algorithm can sometimes help avoid information loss. For instance, when analyzing mass spectrometry data, a first run guesses at peaks individually for every sample. After this preliminary peak-spotting, another iteration allows us to borrow strength from the other samples and spot peaks that may initially have been labeled as noise.
Analysis projects often begin with a simple script, perhaps to try out a few initial ideas or explore the quality of the pilot data. Then more ideas are added, more data come in, other datasets are integrated, more people become involved. Eventually the paper needs to be written, figures be redone properly, and the analysis be saved for the scientific record and to document its integrity. Here are a few tools that can help with such a process.
Use an integrated development environment. RStudio is a great choice; there are also other platforms such as Emacs or Eclipse.
Use literate programming tools such as Rmarkdown or Jupyter. This is more readable (for yourself and for others) than burying explanations and usage instructions in comments in the source code or in separate README files, in addition you can directly embed figures and tables in these documents. Such documents are often good starting points for the supplementary material of your paper.
Anticipate re-engineering of the data formats and the software. The first version of how you choose to represent the data and structure the analysis workflows will rarely be the best. Don’t be afraid to make a clean cut and redesign them as soon as you notice that you are doing a lot of awkward data manipulations or repetitive steps. This is time well-invested. Sometimes it also helps to unearth bugs.
Reuse existing tools. Don’t reinvent the wheel and rather spend your time on things that are actually new. Before implementing a ‘heuristic’ or a temporary hack that analyses your data in a certain way, spend a couple of minutes researching to see if something like this hasn’t been done before. More often than not, it has, and there is a clean, scalable, and already tested solution.
Use version control, such as git. This takes some time to learn, but in the long run will be infinitely better than all your self-grown attempts at managing evolving code with version numbers, switches and the like. Moreover, this is also the sanest option for collaborative work on code, and it provides an extra backup of your codebase, especially if the server is distinct from your workplace machine.
Use functions rather than copy-pasting stretches of code.
Use the R package system. Soon you’ll note recurring function or variable definitions that you want to share between your individual scripts. It is fine to use the R source() function to manage them initially, but it is never too early to move them into your own package – at the latest when you find yourself starting to write README files or long emails explaining to others how to use some script or another. Assembling existing code into an R package is not hard by any means, and offers you many goodies including standardized and convenient ways to provide documentation, to show code usage examples, to test the correct functioning of your code, and to version it. Quite likely you’ll soon appreciate the benefit of using namespaces.
Centralize the location of the raw data files and streamline the derivation of intermediate data. Store the input data at a centralized file server that is professionally backed up. Mark the files as read-only. Have a clear and linear workflow for computing the derived data (e.g., normalised, summarised, transformed etc.) from the raw files, and store these in a separate directory. Anticipate that this workflow will need to be re-run several times, and version it. Use file synchronization tools to mirror these files on your personal computer.
Integration. When developing downstream analysis ideas that bring together several different data types, you don’t want to do the conversion from data type specific formats to the representations that machine learning or generic statistical methods use each time on an ad hoc basis. Have a recipe script that assembles the different ingredients and cooks them up as an easily consumable matrix, data frame or Bioconductor object (such as a SummarizedExperiment).
Keep a hyperlinked webpage with an index of all analyses. This is helpful for collaborators (especially if the page and the analysis can be accessed via a web browser) and also a good starting point for the methods part of your paper. Structure it in chronological or logical order, or a combination of both.
Combining all the data so it is ready for analysis or visualisation often involves a lot of shuffling around of the data, until they are in the right shape and format for an analytical algorithm or a graphics routine.
Errors can occur – labels get lost, information gets lost: be safe, redundancy is good.
Recall Hiiragi data (for space reasons we print only the first five rows and columns):
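A sketch of the setup assumed by the code below (the exact loading chunk is not shown; x is the Hiiragi2013 ExpressionSet used in earlier chapters):
# sketch: load the Hiiragi et al. microarray data and turn the expression
# matrix into a wide-format dataframe
library("Hiiragi2013")
library("tidyr")       # pivot_longer, used below
library("magrittr")    # the %T>% pipe, used below
data("x", package = "Hiiragi2013")
dfw = as.data.frame(Biobase::exprs(x))
dim(dfw)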
## 1 E3.25 2 E3.25 3 E3.25 4 E3.25 5 E3.25
## 1415670_at 4.910459 7.526672 6.956328 6.424048 5.105808
## 1415671_at 9.768979 9.144228 9.295010 11.059831 9.376749
## 1415672_at 10.411893 10.918942 9.495738 10.317996 11.143684
## 1415673_at 5.618108 6.439416 6.730465 4.914527 5.619778
## 1415674_a_at 7.541891 8.380285 8.480580 7.977363 8.650312
## [1] 45101 101
This dataframe has columns of data, one for each sample (annotated by the column names). Its rows correspond to the probes, annotated by the probe identifiers. This is an example of a data table in wide format.
dfw$probeid = rownames(dfw)
dfl = pivot_longer(dfw,
                   cols = colnames(x),
                   names_to = "Embryonic Day") %T>% print
## # A tibble: 4,555,201 × 3
## probeid `Embryonic Day` value
## <chr> <chr> <dbl>
## 1 1415670_at 1 E3.25 4.91
## 2 1415670_at 2 E3.25 7.53
## 3 1415670_at 3 E3.25 6.96
## 4 1415670_at 4 E3.25 6.42
## 5 1415670_at 5 E3.25 5.11
## 6 1415670_at 6 E3.25 5.86
## 7 1415670_at 7 E3.25 5.06
## 8 1415670_at 8 E3.25 4.57
## 9 1415670_at 9 E3.25 8.12
## 10 1415670_at 10 E3.25 5.46
## # … with 4,555,191 more rows
In dfl, each row corresponds to exactly one measured value, stored in the column named value. Then there are additional columns that store the associated covariates.
Now suppose we want to store somewhere not only the probe identifiers but also the associated gene symbols.
In dfw, we could stick them in as an additional column, and perhaps also throw in the genes’ ENSEMBL identifier for good measure. But now we have a problem: the dataframe has some columns that represent different samples, and others that refer to information for all samples (the gene symbol and identifier), and we somehow have to know this when interpreting the dataframe. This is what Hadley Wickham calls untidy data.
In contrast, in dfl, we can add these columns, yet still know that each row forms exactly one observation, and all information associated with that observation is in the same row.
In tidy data
each variable forms a column,
each observation forms a row,
each type of observational unit forms a table.
See also https://www.huber.embl.de/msmb/Chap-Design.html#tidy-data-using-it-wisely
Even though there are only 45101 probe–gene symbol relationships, we are storing them 4555201 times in the rows of dfl.
We may choose to call these columns probeid and geneid, but the next person might call them probe_id and gene_id, or even something completely different. When we find a dataframe that was made by someone else and that contains such-named columns, we may hope, but have no guarantees, that these are valid gene symbols.
Addressing such issues is behind the object-oriented design of the specialized data structures in Bioconductor, such as the class SummarizedExperiment.
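A minimal sketch of such an object for these data (the constructor arguments are illustrative):
# sketch: bundle the assay matrix with row (probe) and column (sample)
# annotations in one self-describing object
library("SummarizedExperiment")
se = SummarizedExperiment(
  assays  = list(exprs = Biobase::exprs(x)),
  rowData = DataFrame(probeid = rownames(Biobase::exprs(x))),
  colData = DataFrame(sample  = colnames(Biobase::exprs(x))))
se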
A matrix representation may also simply be more natural and mathematically stringent – e.g., if we want to apply PCA / dimension reduction, or classification.
There are two types of variation in an experiment or study: those of interest and those that are unwanted.
We usually cannot rid ourselves of all the unwanted variation, but we saw how we can use balanced, randomized designs and data transformations to mitigate it.
Reproducible workflows
Open science: make data and code available
Book chapter
This lecture and chapter presented merely a pragmatic introduction to design; there are many book-length treatments of the subject that offer detailed advice on setting up experiments: Wu and Hamada (2011); Box, Hunter, and Hunter (1978); Glass (2007).