CSAMA 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 23 June, 2014
This case study serves as a refresher on basic input and manipulation of data.
Input a file that contains ALL (acute lymphoblastic leukemia) patient information
fname <- file.choose() ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)
Check out the help page ?read.delim
for input options, and explore
basic properties of the object you've created, for instance…
class(pdata)
## [1] "data.frame"
colnames(pdata)
## [1] "id" "diagnosis" "sex" "age"
## [5] "BT" "remission" "CR" "date.cr"
## [9] "t.4.11." "t.9.22." "cyto.normal" "citog"
## [13] "mol.biol" "fusion.protein" "mdr" "kinet"
## [17] "ccr" "relapse" "transplant" "f.u"
## [21] "date.last.seen"
dim(pdata)
## [1] 127 21
head(pdata)
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 6 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 6 FALSE complex alt. NEG <NA> NEG hyperd. FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 6 TRUE FALSE REL 12/15/1997
summary(pdata$sex)
## F M NA's
## 42 83 2
summary(pdata$cyto.normal)
## Mode FALSE TRUE NA's
## logical 69 24 34
Remind yourselves about various ways to subset and access columns of a data.frame
pdata[1:5, 3:4]
## sex age
## 1 M 53
## 2 M 19
## 3 F 52
## 4 M 38
## 5 M 57
pdata[1:5, ]
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
head(pdata[, 3:5])
## sex age BT
## 1 M 53 B2
## 2 M 19 B2
## 3 F 52 B4
## 4 M 38 B1
## 5 M 57 B2
## 6 M 17 B1
tail(pdata[, 3:5], 3)
## sex age BT
## 125 M 19 T2
## 126 M 30 T3
## 127 M 29 T2
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 10 8001 1/15/1997 M 40 B2 CR CR 3/26/1997 FALSE FALSE
## 11 8011 8/21/1998 M 33 B3 CR CR 10/8/1998 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 10 FALSE del(p15) BCR/ABL p190 NEG <NA> FALSE
## 11 FALSE del(p15/p16) BCR/ABL p190/p210 NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 10 TRUE FALSE REL 7/11/1997
## 11 FALSE TRUE BMT / DEATH IN CR <NA>
It seems from below that there are 17 females over 40 in the data set, but subsetting pdata to contain just those individuals selects 19 rows. Why? What can we do to correct this?
idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
## idx
## FALSE TRUE
## 108 17
dim(pdata[idx,])
## [1] 19 21
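One explanation, and a possible correction, as a sketch: comparisons such as == and > return NA for NA values, and logical subsetting retains rows where the index is NA (as rows of NAs). Dropping the NAs from the index, e.g., with which(), recovers the expected count:

dim(pdata[which(idx), ])         ## which() keeps only TRUE indices; expect 17 rows
dim(pdata[!is.na(idx) & idx, ])  ## an explicitly NA-safe logical index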
Use the mol.biol
column to subset the data to contain just
individuals with 'BCR/ABL' or 'NEG', e.g.,
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
The mol.biol
column is a factor, and retains all levels even after
subsetting. How might you drop the unused factor levels?
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
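An alternative sketch, using base R's droplevels(), which drops unused levels from all factor columns of a data.frame at once:

bcrabl <- droplevels(bcrabl)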
The BT
column is a factor describing B- and T-cell subtypes
levels(bcrabl$BT)
## [1] "B" "B1" "B2" "B3" "B4" "T" "T1" "T2" "T3" "T4"
How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T?
table(bcrabl$BT)
##
## B B1 B2 B3 B4 T T1 T2 T3 T4
## 4 9 35 22 9 4 1 15 9 2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
##
## B T
## 79 31
Use xtabs()
(cross-tabulation) to count the number of samples with
B- and T-cell types in each of the BCR/ABL and NEG groups
xtabs(~ BT + mol.biol, bcrabl)
## mol.biol
## BT BCR/ABL NEG
## B 37 42
## T 0 31
Use aggregate()
to calculate the average age of males and females in
the BCR/ABL and NEG treatment groups.
aggregate(age ~ mol.biol + sex, bcrabl, mean)
## mol.biol sex age
## 1 BCR/ABL F 39.94
## 2 NEG F 30.42
## 3 BCR/ABL M 40.50
## 4 NEG M 27.21
Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that the variance of ages in the two groups is identical. What parts of the test output change?
t.test(age ~ mol.biol, bcrabl)
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 4.817, df = 68.53, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.135 17.224
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40.25 28.07
boxplot(age ~ mol.biol, bcrabl)
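A sketch of the suggested re-analysis: ?t.test documents the var.equal argument; setting it to TRUE performs the classical pooled-variance test, changing the degrees of freedom and, with them, the confidence interval and p-value.

t.test(age ~ mol.biol, bcrabl, var.equal=TRUE)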
This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.
Input the data using read.csv(), creating a variable brfss to hold it. Use file.choose() to locate the data file BRFSS-subset.csv
fname <- file.choose() ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)
Explore the data using class(), dim(), head(), summary(), etc. Use xtabs() to summarize the number of males and females in the study, in each of the two years.
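For instance, a sketch assuming the column names Sex and Year used elsewhere in this lab:

class(brfss)
dim(brfss)
head(brfss)
xtabs(~ Year + Sex, brfss)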
Use aggregate()
to summarize the average weight in each sex and
year.
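A sketch using the formula interface, as in the ALL case study:

aggregate(Weight ~ Sex + Year, brfss, mean)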
Create a scatterplot showing the relationship between the square
root of weight and height, using the plot()
function and the
main
argument to annotate the plot. Note the transformed
Y-axis. Experiment with different plotting symbols (try the command
example(points)
to view different points).
plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")
Color the female and male points differently. To do this, use the col argument to plot(). Provide as a value to that argument a vector of colors, subset by brfss$Sex.
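A minimal sketch: because Sex is a factor, it can index a two-color vector (the red/blue choice here is arbitrary):

with(brfss, plot(Height, sqrt(Weight), main="All Years, Both Sexes",
    col=c("red", "blue")[Sex]))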
Create a subset of the data containing only observations from 2010.
brfss2010 <- brfss[brfss$Year == "2010", ]
Create the figure below (two panels in a single figure). Do this by using the par() function with the mfcol argument before calling plot(). You'll need to create two more subsets of data, perhaps when you are providing the data to the function plot().
opar <- par(mfcol=c(1, 2))
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Female", ],
main="2010, Female")
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Male", ],
main="2010, Male")
par(mfcol=c(1, 1))
Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot() to address over-plotting, e.g., pch='.' or semi-transparent colors (an alpha of .4, e.g., via adjustcolor()). Try using the smoothScatter() function (the data have to be presented as x and y, rather than as a formula). Try adding the hexbin library to your R session (using library()) and creating a hexbinplot().
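Sketches of the suggested alternatives; hexbin is a contributed package that must be installed before it can be loaded:

with(brfss2010, smoothScatter(Height, sqrt(Weight)))  ## x and y, not a formula
library(hexbin)
hexbinplot(sqrt(Weight) ~ Height, brfss2010)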
R has a number of additional plotting facilities, both as part of the 'base' distribution and user-contributed packages. The lattice package adopts a particular philosophy about the presentation of data, and can be a very effective visualization tool.
Use library()
to load the lattice package.
library(lattice)
Create the following figure using the xyplot() function with a formula and the brfss2010 data. The formula is sqrt(Weight) ~ Height | Sex, which can be read as 'square root of Weight as a function of Height, conditioned on Sex'.
xyplot(sqrt(Weight) ~ Height | Sex, brfss2010)
Add a background grid and a regression line to each panel using the argument type=c('p', 'g', 'r'); change the width (lwd) and color (col.line) of the regression line. For some hints on other arguments, see the help pages ?xyplot (for the overall structure of the plot) and ?panel.xyplot (for how each 'panel' of data is plotted).
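A sketch combining the suggested arguments:

xyplot(sqrt(Weight) ~ Height | Sex, brfss2010,
    type=c("p", "g", "r"), lwd=2, col.line="red")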
Create the following figure. Use the densityplot() function with the formula ~ sqrt(Weight). The group=Sex argument creates separate lines for each sex. Be sure to use plot.points=FALSE to avoid a 'rug' of points at the base of the figure. Can you add a key (legend)?
densityplot(~sqrt(Weight), brfss2010, group=Sex, plot.points=FALSE)
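One way to add a key is lattice's auto.key argument, e.g.:

densityplot(~sqrt(Weight), brfss2010, group=Sex, plot.points=FALSE,
    auto.key=TRUE)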
Create the figure below using the bwplot() function. The formula requires that Year be coerced to a factor, factor(Year).
bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss)
Create the figure below, a violin plot, using bwplot() and the panel argument set to panel.violin. From ?bwplot we learn that panel is a function that determines how each panel is drawn; the details for controlling the violin panel plot are described on the help page ?panel.violin.
bwplot(sqrt(Weight) ~ factor(Year) | Sex, brfss, panel=panel.violin)
(Advanced) We can write our own panel argument for lattice functions to influence how each panel is displayed. Here we add a point at the median height and weight.
xyplot(sqrt(Weight) ~ Height|Sex, brfss2010,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.points(median(x, na.rm=TRUE), median(y, na.rm=TRUE),
cex=2, pch=20, col="red")
},
type=c("p", "g", "r"), lwd=2, col.line="red", xlim=c(120, 210))
This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.
Sequence analysis is specialized, and brings additional considerations. The solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.
The IRanges package defines an important class for specifying integer ranges, e.g.,
library(IRanges)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, as.vector, cbind, colnames, do.call,
## duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unlist
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges of length 3
## start end width
## [1] 10 14 5
## [2] 20 24 5
## [3] 30 34 5
There are many interesting operations to be performed on ranges, e.g.,
flank()
identifies adjacent ranges
flank(ir, 3)
## IRanges of length 3
## start end width
## [1] 7 9 3
## [2] 17 19 3
## [3] 27 29 3
Consult the help page for flank, ?flank, and explore other range-based operations.
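A few examples of other range operations to try, as a sketch:

shift(ir, 2)    ## shift each range right by two positions
reduce(ir)      ## merge overlapping and adjacent ranges
resize(ir, 2)   ## adjust widths, anchoring at the start by default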
The IRanges class is part of a class hierarchy. To see this, ask R for the class of ir, and for the class definition of the IRanges class
class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClassDef(class(ir))
## Class "IRanges" [package "IRanges"]
##
## Slots:
##
## Name: start width NAMES elementType
## Class: integer integer characterORNULL character
##
## Name: elementMetadata metadata
## Class: DataTableORNULL list
##
## Extends:
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
##
## Known Subclasses: "NormalIRanges"
Notice that IRanges extends the Ranges class. Now try entering ?"flank,<tab>, where <tab> means to press the tab key to ask for tab completion (may not be necessary in RStudio). You can see that there are help pages for several different classes. Tab-complete to ?"flank,Ranges-method" and verify that you're at the page that describes the method relevant to an IRanges instance.
The GenomicRanges package extends the notion of ranges to include
features relevant to application of ranges in sequence analysis,
particularly the ability to associate a range with a sequence name
(e.g., chromosome) and a strand. Create a GRanges
instance based on
our IRanges
instance, as follows
library(GenomicRanges)
## Loading required package: GenomeInfoDb
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [10, 14] +
## [2] chr1 [20, 24] -
## [3] chr2 [30, 34] +
## ---
## seqlengths:
## chr1 chr2
## NA NA
The notion of flanking sequence has a more nuanced meaning in
biology. In particular we might expect that flanking sequence on the
+
strand would precede the range, but on the minus strand would
follow it. Verify that flank
applied to a GRanges
object has this
behavior.
flank(gr, 3)
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [ 7, 9] +
## [2] chr1 [25, 27] -
## [3] chr2 [27, 29] +
## ---
## seqlengths:
## chr1 chr2
## NA NA
Discover what classes GRanges
extends, find the help page
documenting the behavior of flank
when applied to a GRanges object,
and verify that the help page documents the behavior we just observed.
class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
getClassDef(class(gr))
## Class "GRanges" [package "GenomicRanges"]
##
## Slots:
##
## Name: seqnames ranges strand elementMetadata
## Class: Rle IRanges Rle DataFrame
##
## Name: seqinfo metadata
## Class: Seqinfo list
##
## Extends:
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
?"flank,GenomicRanges-method"
Notice that the available flank()
methods have been augmented by the
methods defined in the GenomicRanges package.
It seems like there might be a number of helpful methods available for
working with genomic ranges; we can discover some of these from the
command line, indicating that the methods should be on the current
search()
path
showMethods(class="GRanges", where=search())
Use help() to list the help pages in the GenomicRanges package, and vignette() to view and access available vignettes; these are also available in the RStudio 'Help' tab.
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of the material will be covered in greater detail in subsequent labs and lectures.
Basics
A package needs to be installed once, using the instructions on the landing page. Once installed, the package can be loaded into an R session
library(GenomicRanges)
and the help system queried interactively, as outlined above:
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.
Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.
- Biostrings: see ?consensusMatrix, for instance. Also check out the BSgenome package for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method".
- GenomicAlignments: see the ?readGAlignments help page and vignette(package="GenomicAlignments", "summarizeOverlaps").
- rtracklayer: its import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.

Annotation: Bioconductor provides extensive access to 'annotation' resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday's lab, but some interesting examples to explore during this lab include:

- AnnotationDbi and its ?select interface for querying 'org' and other annotation packages.
- GenomicFeatures: see the ?exonsBy page to retrieve all exons grouped by gene or transcript.

The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.
- Check correctness with identical() or all.equal(); be careful with NA values, …
- Measure execution time with system.time() or the microbenchmark package.
- Profile with the Rprof() function, or packages such as lineprof or aprof.

Vectorize – operate on vectors, rather than explicit loops
x <- 1:10
log(x) ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
## [1] 0.0000 0.6931 1.0986 1.3863 1.6094 1.7918 1.9459 2.0794 2.1972 2.3026
Pre-allocate memory, then fill in the result
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
result[i] <- runif(1) * result[i - 1]
result
## [1] 0.72601 0.64980 0.44125 0.25742 0.09304 0.08861 0.07351 0.07070
## [9] 0.05750 0.02301
Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once; e.g., hoist a constant multiplier outside a for loop, or use lm.fit() rather than repeatedly fitting the same design matrix.

Re-use existing, tested code
Re-think how to attack the problem
Compile your script with the byte compiler
Use parallel evaluation
Speak in tongues – 'foreign' languages like C, Fortran
Here's an obviously inefficient function:
f0 <- function(n, a=2) {
## stopifnot(is.integer(n) && (length(n) == 1) &&
## !is.na(n) && (n > 0))
result <- numeric()
for (i in seq_len(n))
result[[i]] <- a * log(i)
result
}
Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.
system.time(f0(10000))
## user system elapsed
## 0.256 0.003 0.258
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
Remember the current 'correct' value, and an approximate time
n <- 10000
system.time(expected <- f0(n))
## user system elapsed
## 0.252 0.003 0.254
head(expected)
## [1] 0.000 1.386 2.197 2.773 3.219 3.584
Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same. Use the microbenchmark package to compare the two versions
f1 <- function(n, a=2) {
result <- numeric()
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
## expr min lq median uq max neval
## f0(n) 231.6 234.1 235.2 349.7 352.2 5
## f1(n) 229.6 232.3 348.2 348.8 357.9 5
Adopt a 'pre-allocate and fill' strategy
f2 <- function(n, a=2) {
result <- numeric(n)
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
## expr min lq median uq max neval
## f0(n) 235.23 236.08 236.31 351.28 363.17 5
## f2(n) 16.68 16.81 17.15 17.91 18.37 5
Use an *apply()
function to avoid having to explicitly pre-allocate,
and make opportunities for vectorization more apparent.
f3 <- function(n, a=2)
a * sapply(seq_len(n), log)
identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## f0(n) 234.50 237.66 298.54 359.67 365.47 10
## f2(n) 16.84 17.24 17.32 18.30 21.26 10
## f3(n) 20.37 21.04 22.21 22.55 27.46 10
Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:
f4 <- function(n, a=2)
a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
## expr min lq median uq max neval
## f0(n) 233304.5 237095.1 298124 359998.1 364520.8 10
## f3(n) 20219.3 20356.3 20679 21235.9 22307.4 10
## f4(n) 473.4 474.3 488 488.7 527.9 10
Returning to our explicit iteration f2(), in these situations it can be helpful to compile the code to a more efficient representation. Do this using the compiler package.
library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))
## [1] TRUE
microbenchmark(f2(n), f2c(n), times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## f2(n) 17.214 17.670 18.173 19.025 22.358 10
## f2c(n) 7.272 7.374 7.533 7.835 9.165 10
f4() definitely seems to be the winner. How does it scale with n? (Repeat several times.)
n <- 100000 * seq(1, 20, 2) # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, type="b")
Any explanations for the different pattern of response?
Lessons learned: *apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.

It can be very helpful to reason about an algorithm in an abstract sense, to gain understanding about how an operation might scale. Here's an interesting problem, taken from StackOverflow: Suppose one has a very long sorted vector
vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing
f0 <- function(v) sum(v < 0)
This algorithm compares each element of vec to zero, creating a logical vector as long as the original, length(v). This logical vector is then scanned by sum() to count the number of elements satisfying the condition.
Questions:
- How many vectors as long as v need to be allocated for this algorithm?
- How does the algorithm scale with the length of v? Verify this with a simple figure.

N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
v <- sort(rnorm(n))
system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")
Is there a better algorithm, i.e., an approach that arrives at the same answer but takes less time (and / or space)? The vector is sorted, and we can take advantage of that by doing a binary search. The algorithm is surprisingly simple: create an index to the minimum (first) element, and the maximum (last) element. Check to see if the element half way between is greater than or equal to zero. If so, move the maximum index to just before that point. Otherwise, make the point just after it the new minimum. Repeat this procedure until the minimum index exceeds the maximum index.
f3 <- function(x, threshold=0) {
imin <- 1L
imax <- length(x)
while (imax >= imin) {
imid <- as.integer(imin + (imax - imin) / 2)
if (x[imid] >= threshold)
imax <- imid - 1L
else
imin <- imid + 1L
}
imax
}
Approximately half of the possible values are discarded each iteration, so we expect on average to arrive at the end after about log2(length(v)) iterations – the algorithm scales with the log of the length of v, rather than with the length of v, and no long vectors are created. These differences become increasingly important as the length of v grows.
Questions:
- Verify that f3() and f0() result in identical() answers.
- Explore how f3() scales with vector length.

## identity
identical(f0((-2):2), f3((-2):2))
## [1] TRUE
identical(f0(2:4), f3(2:4))
## [1] TRUE
identical(f0(-(4:2)), f3(-(4:2)))
## [1] TRUE
identical(f0(vec), f3(vec))
## [1] TRUE
## scale
N <- 10^(1:7)
Time <- sapply(N, function(n) {
v <- sort(rnorm(n))
system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")
- Compare the timing of f0() and f3() with the original data, vec.
- Compile f3() using compiler::cmpfun(), and compare the result using microbenchmark.

## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))
## Unit: microseconds
## expr min lq median uq max neval
## f0(vec) 26688.44 27822.29 27924.6 28301.45 32728.0 100
## f3(vec) 52.44 55.19 72.3 79.73 121.5 100
library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))
## Unit: microseconds
## expr min lq median uq max neval
## f3(vec) 53.10 55.28 59.17 60.93 69.20 100
## f3c(vec) 23.11 23.85 24.83 25.71 35.18 100
We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won't do that!
It is useful to ask what is lost by f3() compared to f0(). For instance, does the algorithm work on character vectors? What about when the vector contains NA values? How are ties at 0 treated?
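A quick, illustrative probe of the NA question (a sketch, not an exhaustive test):

v <- c(-1, NA, 1)
f0(v)                   ## NA: the comparison propagates NA through sum()
sum(v < 0, na.rm=TRUE)  ## one remedy on the f0() side
## f3(v) fails outright: the 'if' condition cannot handle an NA comparison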
findInterval() is probably a better built-in way to solve the original problem, and generalizes to additional situations. The idea is to take the query that we are interested in, 0, and find the interval specified by vec in which it occurs.
f4 <- function(v, query=0)
findInterval(query - .Machine$double.eps, v)
identical(f0(vec), f4(vec))
## [1] TRUE
microbenchmark(f0(vec), f3(vec), f4(vec))
## Unit: microseconds
## expr min lq median uq max neval
## f0(vec) 26911.29 27841.17 27956.62 28343.8 31953.2 100
## f3(vec) 52.59 58.17 70.79 78.8 137.9 100
## f4(vec) 21972.92 22035.44 22062.42 22158.6 23855.8 100
The fact that it is flexible and well tested means that it would often be preferred to f3(), even though it is less speedy. For instance, compare the time it takes to query 10000 different points using f3() and iteration, versus findInterval() and vectorization.
threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))
## [1] TRUE
microbenchmark(sapply(threshold, f3, x=vec), f4(vec, threshold))
Some R functions that implement efficient algorithms are sort() (including radix sort), match() (hash table look-up), and tabulate(); these can be useful in your own code.
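A small illustration of the latter two:

match(c("b", "d"), letters)   ## fast hash-based lookup: 2 4
tabulate(c(1, 2, 2, 5))       ## counts per integer bin: 1 2 0 0 1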
Lessons learned:
- The choice of algorithm matters: f3()'s binary search scales with the log of the vector length, while f0()'s linear scan scales with the length itself.
- Flexible, well-tested built-in functions such as findInterval() are often preferable to bespoke code, even when the bespoke version is faster.