stringfish

stringfish is a framework for string and sequence operations built on R's ALTREP system (introduced in R 3.5), which allows R objects to be represented with a custom memory layout.

This package has two primary goals:

stringfish currently provides two ALTREP backends with the same semantics: sf_vec, a simple vector of string objects, and slice_store, which stores strings within large contiguous blocks of memory. They make different storage tradeoffs, but the same stringfish operations work across both.

For text data, stringfish is intentionally UTF-8-centric outside of explicit byte mode, so conversions, comparisons, and ALTREP views stay consistent across normal R vectors and both backends.

Installation

install.packages("stringfish", type="source", configure.args="--with-simd=AVX2")

Benchmark

The simplest way to show the utility of the ALTREP framework is through a quick benchmark comparing stringfish and base R.

On favorable workloads, some functions in stringfish can be more than an order of magnitude faster than vectorized base R operations, and built-in multithreading can widen that gap further. On large text datasets, this can turn minutes of computation into seconds.

Currently implemented functions

A list of implemented stringfish functions and analogous base R functions:

Utility functions:

In addition, many operations in base R and other packages are already ALTREP-aware (i.e., they do not cause materialization). Functions that subset or index into string vectors generally do not materialize.

stringfish functions are not intended to exactly replicate their base R analogues. One difference is that the subject is always the first argument, which makes the functions easier to use with pipes. For example, gsub(pattern, replacement, subject) becomes sf_gsub(subject, pattern, replacement).

Extensibility

stringfish as a framework is intended to be easily extensible. Stringfish vectors can be worked into Rcpp scripts or even into other packages. The example below creates an sf_vec-backed output because it is simple and direct, but the same indexing semantics work across both backends.

Below is a detailed Rcpp script that defines a function alternating lower and upper case across the ASCII letters of each word in a string.

// [[Rcpp::depends(stringfish)]]
#include <Rcpp.h>
#include "sf_external.h"
using namespace Rcpp;

// [[Rcpp::export]]
SEXP sf_alternate_case(SEXP x) {
  // Iterate through a character vector using the RStringIndexer class
  // If the input vector x is a stringfish character vector it will do so without materialization
  RStringIndexer r(x);
  size_t len = r.size();
  
  // Create an output stringfish vector
  // Like all R objects, it must be protected from garbage collection
  SEXP output = PROTECT(sf_vector_create(len));
  
  // Obtain a reference to the underlying output data
  sf_vec_data & output_data = sf_vec_data_ref(output);
  
  // A range-based for loop works via an iterator class that yields RStringIndexer::rstring_info (e below)
  // rstring_info is a struct containing const char * ptr, int len, and an encoding flag
  // ptr should be treated as a byte pointer plus length, not as a null-terminated C string
  // An NA string is represented by a nullptr
  // Alternatively, access the data via the function r.getCharLenCE(i)
  size_t i = 0;
  for(auto e : r) {
    // check if string is NA and go to next if it is
    if(e.ptr == nullptr) {
      i++; // increment output index
      continue;
    }
    // Create a temporary output string and process the results.
    // This example intentionally toggles ASCII letters only.
    std::string temp(e.len, '\0');
    bool case_switch = false;
    for(int j=0; j<e.len; j++) {
      if((e.ptr[j] >= 'A') && (e.ptr[j] <= 'Z')) { // char j is upper case
        if((case_switch = !case_switch)) { // check if we should convert to lower case
          temp[j] = e.ptr[j] + ('a' - 'A');
          continue;
        }
      } else if((e.ptr[j] >= 'a') && (e.ptr[j] <= 'z')) { // char j is lower case
        if(!(case_switch = !case_switch)) { // check if we should convert to upper case
          temp[j] = e.ptr[j] - ('a' - 'A');
          continue;
        }
      } else if(e.ptr[j] == ' ') { // a space resets the alternation for the next word
        case_switch = false;
      }
      temp[j] = e.ptr[j];
    }
    
    // Create a new vector element sfstring and insert the processed string into the stringfish vector
    // sfstring has three constructors, 1) taking a std::string and encoding, 
    // 2) a char pointer and encoding, or 3) a CHARSXP object (e.g. sfstring(NA_STRING))
    output_data[i] = sfstring(temp, e.enc);
    i++; // increment output index
  }
  // Finally, call unprotect and return result
  UNPROTECT(1);
  return output;
}

Example function call:

sf_alternate_case("hello world") 
[1] "hElLo wOrLd"
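
The ASCII toggling rule inside sf_alternate_case can also be checked outside of R. The following is a minimal standalone C++ sketch and is not part of stringfish: the helper name alternate_case_ascii is hypothetical, and it assumes a non-NA input passed as a byte pointer plus length, mirroring the ptr and len fields described in the comments above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a stringfish function): applies the same
// ASCII-only alternating-case rule as sf_alternate_case to one string,
// given as a byte pointer plus length rather than a null-terminated C string.
std::string alternate_case_ascii(const char* ptr, std::size_t len) {
  assert(ptr != nullptr);          // NA strings must be handled by the caller
  std::string out(ptr, len);       // start from a byte-for-byte copy
  bool make_upper = false;         // the first letter of each word becomes lower case
  for (std::size_t j = 0; j < len; ++j) {
    const char c = ptr[j];
    const bool is_upper = (c >= 'A' && c <= 'Z');
    const bool is_lower = (c >= 'a' && c <= 'z');
    if (is_upper || is_lower) {
      if (make_upper && is_lower) {
        out[j] = static_cast<char>(c - ('a' - 'A'));  // lower -> upper
      } else if (!make_upper && is_upper) {
        out[j] = static_cast<char>(c + ('a' - 'A'));  // upper -> lower
      }
      make_upper = !make_upper;    // alternate on every ASCII letter
    } else if (c == ' ') {
      make_upper = false;          // a space restarts the pattern
    }
    // all other bytes (digits, punctuation, UTF-8 multi-byte sequences) stay as copied
  }
  return out;
}
```

Calling alternate_case_ascii("hello world", 11) gives "hElLo wOrLd", matching the R call above. Keeping the byte-level logic in a plain function like this makes it possible to test it without an R session before wiring it into an RStringIndexer loop.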