Motivation and relevance
Access to existing libraries (e.g., samtools)
Fast or memory efficient implementation of advanced algorithms
(e.g., findOverlaps
).
NA
values.Exercise: sum a numeric vector, in C
Create a file in your home directory ~/c_sum.c
containing the
following lines of code
void c_sum(const double *x, const int *length, double *answer)
{
answer[0] = 0;
for (int i = 0; i < length[0]; ++i)
answer[0] += x[i];
}
We'll pass the arguments x
, length
, and answer
in to C from
R; everything in R is a vector, so all values seen in C are
pointers.
Compile this function to a 'dynamic library' by opening a shell,
navigating to the directory where c_sum.c
is located, and
evaluating the command
R CMD SHLIB c_sum.c
This creates (if there are no syntax or other errors!) a file c_sum.so
.
In R, write a short wrapper function to call c_sum
c_sum <- function(x)
.C("c_sum", as.numeric(x), length(x), answer=numeric(1))$answer
The first argument to .C
is the name of the C function we want to
invoke. The remainder of the arguments are the values we pass to C;
all are vectors in R, hence pointers in C. The type of the R
argument must match the type of the C signature; there is no
type-checking provided (e.g., it would be a mistake to pass the
integer vector x <- 1:10
, without first coercing to numeric). x
and length(x)
MUST BE treated as read-only in C; we've enforced
this by declaring them as const
, e.g., const double *
. We'll
modify the final argument, but to do so we must be confident that
there are no other references to it in R, i.e., that it's NAMED
level is 0.
Arguments to .C
can be named; names are not relevant at the C
level, but are useful in interpretting the return value. The return
value is a list()
of the original arguments, perhaps updated
during the call The paradigm adopted above names the argument that
we wish to use in subsequent calculations, and selects that value
from the returned list.
Use the function after loading the dynamic library
dyn.load("~/c_sum.so")
c_sum(1:100)
Compare the time required for c_sum()
to sum a large vector,
e.g., 100M values, with the time required for base R's sum()
. Any
ideas on the difference? Break c_sum()
in various ways.
(Advanced) Explore C headers provided with R
R.home("include")
## [1] "/Library/Frameworks/R.framework/Resources/include"
The .C
interface is useful but quite restricted, e.g., we had to
pass in the value that we returned, rather than creating it in C.
Add the following line at the begining of c_sum.c
#include <Rinternals.h>
This includes the C header file that defines function prototypes
and other information required to work with R. Explore the header
file (at R.home("include")
) at your leisure.
Add the following function after c_sum
:
SEXP call_sum(SEXP x)
{
SEXP answer = PROTECT(Rf_allocVector(REALSXP, 1));
int len = Rf_length(x);
c_sum(REAL(x), &len, REAL(answer));
UNPROTECT(1);
return answer;
}
This defines a function that takes an SEXP
(S-expression, i.e.,
any R object; an SEXP is a pointer to a C struct
) as an argument,
and returns an SEXP
.
The first line declares a new SEXP
, allocates memory to hold
information for a REALSXP
of length 1 (the equivalent of
numeric(1)
in R), and PROTECT
s the allocation from garbage
collection.
The second line creates an integer variable, and uses an accessor
to reach in to x
to discover its length.
The third line uses accessors to reach in to the various SEXP
to
retrieve the underlying data; these are passed to our previously
defined, pure C function c_sum
.
In the third line we prepare to hand the memory we've allocated
back to R for memory management; we don't need to PROTECT
it any
more.
The final line returns our allocated SEXP to R.
Compile as before
Write a short wrapper. Because we're able to determine the length of the object we've passed in, and because we can allocate and return memory from C, we just need to pass in the vector we're adding.
call_sum <- function(x) .Call("call_sum", as.numeric(x))
Load the DLL and invoke call_sum
as before. It may be necessary
to start an new R session to replace the previously loaded shared
object.
Compare the output and performance of sum()
, c_sum()
, and
call_sum()
.
Move the vector coercion as.numeric()
in R to Rf_coerceVector(x,
REALSXP)
in C. Do you need to PROTECT
the result of the coerion?
Any thoughts on whether it's better to coerce in R, or in C?
.Call()
is flexible, but exposes the details of R's internal
representation and memory management to the programmer. The Rcpp
package hides this complexity behind C++ classes, trading off
specialized understanding of R's internals for more general
understanding of C++.
Start a new file “rcpp_sum.cpp” with the following
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector rcpp_sum(NumericVector x)
{
NumericVector ans(1);
for (int i = 0; i < x.size(); ++i)
ans[0] += x[i];
return ans;
}
The file starts by including the headers for Rcpp. using namespace
Rcpp
is a C++ directive to manage where symbols (e.g.,
NumericVector
) will be discovered. // [[Rcpp::export]]
is an
Rcpp attribute that indicates that the following function should be
made available in R (not all functions need to be exposed to the R
user).
NumericVector
is a C++ class representing R's numeric()
type,
so the signature to rcpp_sum()
says that it expects a single
numeric()
vector argument, and will return a numeric()
vector
to R. Rcpp is wired so that type coercion and memory management are
automatic – if the user invokes rcpp_sum()
with an integer
vector (Rcpp type IntegerVector
), it will be coerced to type
NumericVector
. NumericVector
behaves like standard C++
containers, so has a size()
method and array subset operations
[]
that behave in predictable ways.
Rcpp implements NumericVector
and friends on top of R's
interface, but hides from us the need to explicitly allocate and
manage memory.
It's possible to compile this as before, but let's instead do this from within R
library(Rcpp)
sourceCpp("~/rcpp_sum.cpp")
This creates an object in R, rccp_sum()
, that can be invoked
immediately.
rcpp_sum(1:10)
Compare the output and performance of rcpp_sum()
with sum()
and
the other functions we've written. What issues have been addressed
by Rcpp? What problems remain?