Last modified: 31 May, 2019
This tutorial is for all who wish to write R packages. R is a fantastic language for you to develop new statistical approaches for the analysis and comprehension of real-world data. R packages provide a way to capture your new approach in a reproducible, documented unit. An R package is surprisingly easy to create, and creating an R package has many benefits. In this tutorial we create an R package. We start with a data set and a simple script transforming the data in a useful way; perhaps you have your own data set and script? We replace the script with a function, and place the function and data into an R package. We then add documentation, so that our users (and our future selves) understand what the function does and how the function applies to new data sets. With an R package in hand, we can tackle more advance challenges: vignettes for rich narrative description of the package; unit tests to make our package more robust; and version control to document how we change the package. The final step in the development of our package is to share it with others, through github, through CRAN, or though domain-specific channels such as Bioconductor.
The ‘central dogma’ of molecular biology: genes encoded in DNA (chromosomes) are transcribed to mRNA and then translated to protiens.
RNA-sequencing (bulk RNA-seq)
Single-cell RNA-seq
Parameters:
n_genes <- 20000
n_cells <- 100
## gamma-distributed gene means
rate <- .1
shape <- .1
## negative binomial counts
dispersion <- 0.1
A very rough simulation:
set.seed(123)
gene_means <- rgamma(n_genes, shape = shape, rate = rate)
cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
cell_means <- outer(gene_means, cell_size_factors, `*`)
counts <- matrix(
rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
nrow = n_genes, ncol = n_cells
)
Basic properties of the simulated data
range(counts)
## [1] 0 256
## proportion of zeros
mean(counts == 0)
## [1] 0.793133
## 'library size' -- reads mapped per cell
hist(colSums(counts), main = "Library Size")
## average experssion per gene
hist(rowMeans(log1p(counts)), main = "log Gene Expression")
log_counts <- log(counts)
centered <- log_counts - rowMeans(log_counts)
filtered_median <- function(x)
median(x[is.finite(x)])
size_factors <- exp(apply(centered, 2, filtered_median))
hist(size_factors)
range(size_factors)
## [1] 0.4438464 2.8639456
median(size_factors)
## [1] 1.025781
Collection of files and directories on disk. A complete package might have a structure like that illustrated below.
SCSimulate
DESCRIPTION
NAMESPACE
R/
simulate.R
size_factors.R
man/
simulate.Rd
size_factors.Rd
vignettes/
Using_this_package.Rmd
tests/
testthat.R
testthat/
test_simulate.R
test_size_factor.R
DESCRIPTION
Other packages this package depends on (‘dependencies’)
Depends
: Data structures or work flows required for use of this
package.Imports
: Used inside the current package. For instance, we will
use function like rgamma()
, rnorm()
, and rnbinom()
from the
stats
package.Suggests
: Used in examples or vignettes.NAMESPACE
import()
, importFrom()
export()
R/
man/
vignettes/
tests/
Create a package
devtools::create("SCSimulate")
## ✔ Creating 'SCSimulate/'
## ✔ Setting active project to '/Users/ma38727/b/github/BiocIntro/vignettes/SCSimulate'
## ✔ Creating 'R/'
## ✔ Writing 'DESCRIPTION'
## Package: SCSimulate
## Title: What the Package Does (One Line, Title Case)
## Version: 0.0.0.9000
## Authors@R (parsed):
## * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
## Description: What the package does (one paragraph).
## License: What license it uses
## Encoding: UTF-8
## LazyData: true
## ✔ Writing 'NAMESPACE'
## ✔ Setting active project to '<no active project>'
Edit the DESCRIPTION file (plain text)
Package: SCSimulate
Title: Simulate single-cell RNA seq data
Version: 0.0.0.9000
Authors@R:
c(person(
given = "Martin",
family = "Morgan",
role = c("aut", "cre"),
email = "Martin.Morgan@RoswellPark.org",
comment = c(ORCID = "YOUR-ORCID-ID")
), person(
given = "Another",
family = "Author",
role = "aut"
))
Description: This package simulates single-cell RNA seq data using a gamma
distribution of gene expression values, and negative binomial model of
counts per cell. The package also contains functions for preprocessing,
including a simple calculation of library scaling factors.
License: Artistic-2.0
Imports: stats
Encoding: UTF-8
LazyData: true
So far, our package looks like
SCSimulate
DESCRIPTION
NAMESPACE
R/
Transform the part of the script describing the simulation to a
function simulate()
. Use function arguments to capture default
values.
simulate <-
function(n_genes = 20000, n_cells = 100,
rate = 0.1, shape = 0.1, dispersion = 0.1)
{
gene_means <- rgamma(n_genes, shape = shape, rate = rate)
cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
cell_means <- outer(gene_means, cell_size_factors, `*`)
matrix(
rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
nrow = n_genes, ncol = n_cells
)
}
Transform the part of the script describing calculation of size
factors to a function size_factors()
. The only argument to
size_factors()
is a matrix of counts.
.filtered_median <- function(x)
median(x[is.finite(x)])
size_factors <-
function(counts)
{
log_counts <- log(counts)
centered <- log_counts - rowMeans(log_counts)
exp(apply(centered, 2, .filtered_median))
}
Check that we haven’t made any mistakes.
set.seed(123)
counts <- simulate()
size_factors <- size_factors(counts)
range(size_factors)
## [1] 0.4438464 2.8639456
median(size_factors)
## [1] 1.025781
Place functions into files in the R/
directory. Typically, name the
file after the function / group of functions in the file. E.g.,
file: R/simulate.R
simulate <-
function(n_genes = 20000, n_cells = 100,
rate = 0.1, shape = 0.1, dispersion = 0.1)
{
gene_means <- rgamma(n_genes, shape = shape, rate = rate)
cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
cell_means <- outer(gene_means, cell_size_factors, `*`)
matrix(
rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
nrow = n_genes, ncol = n_cells
)
}
file: R/size_factors.R
.filtered_median <- function(x)
median(x[is.finite(x)])
size_factors <-
function(counts)
{
log_counts <- log(counts)
centered <- log_counts - rowMeans(log_counts)
exp(apply(centered, 2, .filtered_median))
}
Our package now looks like
SCSimulate
DESCRIPTION
NAMESPACE
R/
simulate.R
size_factors.R
Use roxygen2
for documentation by adding tagged lines starting with
#'
immediatly above each function. Common tags are illustrated
below.
@title
is a one-line description appearing at the top of a help page.@description
provides a short description of the function,
presented after the title. Use @details
for more extensive
description appearing after the ‘Usage’ section (generated based on
the signature of the function after the @export
tag) of a help
page.@param
) and return (@return
) values
carefully. The @param
values are used to form the ‘Arguments’
section of the help page. The @return
value appears in the
‘Returns’ section of the help page.@examples
are include in the ‘Examples’ section of the help page,
and must be complete and syntactically correct R code (examples are
evaluated when a package is built and checked).@importFrom
to indicate that a particular package provides
specific functions used in the current package.file: R/simulate.R
#' @title Simulate single-cell data
#'
#' @description `simulate()` produces a genes x cells count matrix of
#' simulated single-cell RNA-seq data. Gene expression is modelled
#' using a gamma distribution. Counts are simulated using a
#' negative binomial distribution.
#'
#' @param n_genes integer(1) the number of genes (rows) to simulate.
#'
#' @param n_cells integer(1) the number of cells (columns) to simulate.
#'
#' @param rate numeric(1) rate parameter of the `rgamma()` distribution.
#'
#' @param shape numeric(1) shape parameter of the `rgamma()` distribution.
#'
#' @param dispersion numeric(1) size (`1 / dispersion`) parameter of
#' the `rnbinom()` distribution.
#'
#' @return `simulate() returns a `n_genes x n_cells` matrix of
#' simulated single-cell RNA-seq counts.
#'
#' @examples
#' counts <- simulate()
#' dim(counts)
#' mean(counts == 0) # fraction of '0' cells
#' range(counts)
#'
#' @importFrom stats rgamma rnorm rnbinom
#'
#' @export
simulate <-
function(n_genes = 20000, n_cells = 100,
rate = 0.1, shape = 0.1, dispersion = 0.1)
{
gene_means <- rgamma(n_genes, shape = shape, rate = rate)
cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
cell_means <- outer(gene_means, cell_size_factors, `*`)
matrix(
rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
nrow = n_genes, ncol = n_cells
)
}
file: R/size_factors.R
#' @importFrom stats median
.filtered_median <- function(x)
median(x[is.finite(x)])
#' @title Calculate geometric mean-centered median scaled cell scaling
#' factors.
#'
#' @description `size_factors()` centers the log counts of each row
#' (gene) by the row mean of the log counts. The finite centered
#' values are then used to compute column-wise geometric median
#' scaling factors.
#'
#' @param counts matrix() of gene x cell RNA-seq counts.
#'
#' @return `size_factors()` returns a `numeric(ncol(counts))` vector
#' of scaling factors.
#'
#' @examples
#' set.seed(123)
#' counts <- simulate()
#' size_factors <- size_factors(counts)
#' median(size_factors) # approximately 1
#'
#' @export
size_factors <-
function(counts)
{
log_counts <- log(counts)
centered <- log_counts - rowMeans(log_counts)
exp(apply(centered, 2, .filtered_median))
}
devtools::document("SCSimulate")
## Updating SCSimulate documentation
## Writing NAMESPACE
## Loading SCSimulate
## Writing NAMESPACE
Updates the NAMESPACE file
stats::rgamma()
,
stats::rnorm()
, stats::rnbinom()
).simulate()
and size_factors()
, but not
.filtered_median()
).Transforms the documentation introduced above into stand-alone files
man/simulate.Rd
Our package now looks like
SCSimulate
DESCRIPTION
NAMESPACE
R/
simulate.R
size_factors.R
man/
simulate.Rd
size_factors.Rd
The NAMESPACE file has been updated to
cat(readLines("SCSimulate/NAMESPACE"), sep="\n")
## # Generated by roxygen2: do not edit by hand
##
## export(simulate)
## export(size_factors)
## importFrom(stats,median)
## importFrom(stats,rgamma)
## importFrom(stats,rnbinom)
## importFrom(stats,rnorm)
SCSimulate_0.0.0.9000.tar.gz
.devtools::check("SCSimulate")
## Updating SCSimulate documentation
## Writing NAMESPACE
## Loading SCSimulate
## Writing NAMESPACE
## Writing size_factors.Rd
## ── Building ────────────────────────────────────────────────────── SCSimulate ──
## Setting env vars:
## ● CFLAGS : -Wall -pedantic -fdiagnostics-color=always
## ● CXXFLAGS : -Wall -pedantic -fdiagnostics-color=always
## ● CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
## ────────────────────────────────────────────────────────────────────────────────
## ✔ checking for file ‘/Users/ma38727/a/github/BiocIntro/vignettes/SCSimulate/DESCRIPTION’
## ─ preparing ‘SCSimulate’:
## ✔ checking DESCRIPTION meta-information
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building ‘SCSimulate_0.0.0.9000.tar.gz’
##
## ── Checking ────────────────────────────────────────────────────── SCSimulate ──
## Setting env vars:
## ● _R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
## ● _R_CHECK_CRAN_INCOMING_REMOTE_ : FALSE
## ● _R_CHECK_CRAN_INCOMING_ : FALSE
## ● _R_CHECK_FORCE_SUGGESTS_ : FALSE
## ── R CMD check ─────────────────────────────────────────────────────────────────
## Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
## ─ using log directory '/private/var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T/Rtmp6S4exQ/SCSimulate.Rcheck'
## ─ using R Under development (unstable) (2019-12-01 r77489)
## ─ using platform: x86_64-apple-darwin17.7.0 (64-bit)
## ─ using session charset: UTF-8
## ─ using options '--no-manual --as-cran'
## ✔ checking for file 'SCSimulate/DESCRIPTION'
## ─ this is package 'SCSimulate' version '0.0.0.9000'
## ─ package encoding: UTF-8
## ✔ checking package namespace information
## ✔ checking package dependencies (3.4s)
## ✔ checking if this is a source package
## ✔ checking if there is a namespace
## ✔ checking for executable files
## ✔ checking for hidden files and directories
## ✔ checking for portable file names
## ✔ checking for sufficient/correct file permissions
## ✔ checking serialization versions
## ✔ checking whether package 'SCSimulate' can be installed (1.8s)
## ✔ checking installed package size
## ✔ checking package directory
## ✔ checking for future file timestamps (505ms)
## ✔ checking DESCRIPTION meta-information
## ✔ checking top-level files
## ✔ checking for left-over files
## ✔ checking index information
## ✔ checking package subdirectories
## ✔ checking R files for non-ASCII characters
## ✔ checking R files for syntax errors
## ✔ checking whether the package can be loaded
## ✔ checking whether the package can be loaded with stated dependencies
## ✔ checking whether the package can be unloaded cleanly
## ✔ checking whether the namespace can be loaded with stated dependencies
## ✔ checking whether the namespace can be unloaded cleanly
## ✔ checking loading without being on the library search path
## ✔ checking dependencies in R code
## ✔ checking S3 generic/method consistency (651ms)
## ✔ checking replacement functions
## ✔ checking foreign function calls
## ✔ checking R code for possible problems (2.1s)
## ✔ checking Rd files
## ✔ checking Rd metadata
## ✔ checking Rd line widths
## ✔ checking Rd cross-references
## ✔ checking for missing documentation entries
## ✔ checking for code/documentation mismatches (407ms)
## ✔ checking Rd \usage sections (792ms)
## ✔ checking Rd contents
## ✔ checking for unstated dependencies in examples
## ✔ checking examples (1.7s)
## ✔ checking for non-standard things in the check directory
## ✔ checking for detritus in the temp directory
##
##
## ── R CMD check results ────────────────────────────── SCSimulate 0.0.0.9000 ────
## Duration: 14.9s
##
## 0 errors ✔ | 0 warnings ✔ | 0 notes ✔
devtools::install("SCSimulate")
## ✔ checking for file ‘/Users/ma38727/a/github/BiocIntro/vignettes/SCSimulate/DESCRIPTION’
## ─ preparing ‘SCSimulate’:
## ✔ checking DESCRIPTION meta-information
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building ‘SCSimulate_0.0.0.9000.tar.gz’
##
## Running /Users/ma38727/bin/R-devel/bin/R CMD INSTALL \
## /var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//Rtmp6S4exQ/SCSimulate_0.0.0.9000.tar.gz \
## --install-tests
## * installing to library ‘/Users/ma38727/Library/R/4.0/Bioc/3.11/library’
## * installing *source* package ‘SCSimulate’ ...
## ** using staged installation
## ** R
## ** byte-compile and prepare package for lazy loading
## ** help
## *** installing help indices
## ** building package indices
## ** testing if installed package can be loaded from temporary location
## ** testing if installed package can be loaded from final location
## ** testing if installed package keeps a record of temporary installation path
## * DONE (SCSimulate)
library(SCSimulate)
?simulate
?size_factors
example(size_factors)
##
## sz_fct> set.seed(123)
##
## sz_fct> counts <- simulate()
##
## sz_fct> size_factors <- size_factors(counts)
##
## sz_fct> median(size_factors) # approximately 1
## [1] 1.025781
sessionInfo()
sessionInfo()
## R Under development (unstable) (2019-12-01 r77489)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-devel/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] SCSimulate_0.0.0.9000 BiocStyle_2.15.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 compiler_4.0.0 BiocManager_1.30.10
## [4] prettyunits_1.0.2 remotes_2.1.0 tools_4.0.0
## [7] testthat_2.3.1 digest_0.6.23 pkgbuild_1.0.6
## [10] pkgload_1.0.2 evaluate_0.14 memoise_1.1.0
## [13] rlang_0.4.2 rstudioapi_0.10 cli_2.0.0
## [16] yaml_2.2.0 xopen_1.0.0 xfun_0.11
## [19] xml2_1.2.2 roxygen2_7.0.2 withr_2.1.2
## [22] stringr_1.4.0 knitr_1.26 desc_1.2.0
## [25] fs_1.3.1 devtools_2.2.1 rprojroot_1.3-2
## [28] glue_1.3.1 R6_2.4.1 processx_3.4.1
## [31] fansi_0.4.0 rcmdcheck_1.3.3 rmarkdown_2.0
## [34] bookdown_0.16 sessioninfo_1.1.1 purrr_0.3.3
## [37] whisker_0.4 callr_3.4.0 magrittr_1.5
## [40] codetools_0.2-16 clisymbols_1.2.0 backports_1.1.5
## [43] ps_1.3.0 ellipsis_0.3.0 htmltools_0.4.0
## [46] usethis_1.5.1 assertthat_0.2.1 stringi_1.4.3
## [49] crayon_1.3.4