---
title: "Intermediate rtables - Identifying Required Faceting Behavior"
subtitle: Contributed by Johnson & Johnson Innovative Medicine
date: "2025-10-22"
author:
- Gabriel Becker
- Dan Hofstaedter
output:
    rmarkdown::html_document:
        theme: "spacelab"
        highlight: "kate"
        toc: true
        toc_float: true
        code_folding: show
vignette: >
  %\VignetteIndexEntry{Intermediate rtables - Identifying Required Faceting Behavior}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
      wrap: 72
  chunk_output_type: console
---

```{r, include = FALSE}
suggested_dependent_pkgs <- c("dplyr")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = all(vapply(
    suggested_dependent_pkgs,
    requireNamespace,
    logical(1),
    quietly = TRUE
  ))
)
```

# Introduction

```{r init, echo = FALSE, results = "hidden"}
suppressPackageStartupMessages(library(rtables))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))

## XXX put this somewhere else so everyone can share it
fixed_shell <- function(tt) {
  mystr <- table_shell_str(tt)
  regex_hits <- gregexpr("[(]N=[[:digit:]]+[)]", mystr)[[1]]
  hit_lens <- attr(regex_hits, "match.length")
  if (regex_hits[1] > 0) {
    for (i in seq_along(regex_hits)) {
      start <- regex_hits[i]
      len <- hit_lens[i]
      substr(mystr, start, start + len - 1) <- padstr("(N=xx)", len, just = "center")
    }
  }
  cat(mystr)
}


knitr::opts_chunk$set(comment = "")
```

`rtables` supports *generalized* faceting when declaring row and
column structure. In particular, it allows faceting behavior to
deviate from that seen in, e.g., `ggplot2`'s faceting support in four
crucial ways often required for tables:

1. Facets need not be mutually exclusive,
2. Facets need not be exhaustive,
3. Nested faceting behavior can depend on the parent facet it occurs
   within, and
4. Facets can be created that do not reflect a single categorical
   value in the data.

While this flexibility provides a cornerstone to `rtables`' power -
alongside the flexibility of analysis functions discussed in the
previous chapter - it also means we must actively think about faceting
when creating table layouts in a way simply not required of users of
`facet_grid` in `ggplot2`.

In this chapter we will cover identifying which aspects of a shell or
desired table should be achieved by specifying the correct split
function(s) in the layout. As with the previous chapter's handling of
analysis behavior, we will leave *implementation* of fully custom
split functions for the advanced portion of this guide and focus
solely on the identification of required behavior to prepare users to
choose between a selection of pre-existing non-default split functions
available to them.

# A Brief Review

Faceting serves three purposes within the `rtables` layouting
framework. It declares

1. The row- and column-labeling when the table is rendered,
2. The organization of the sets of cells that will make up the table's
   body, and
3. The data to be analyzed when calculating contents for each set of
   cells in the table.
   
In particular, (3) means that the data passed to analysis functions
is the intersection of the data associated with the row- and
column-facets that define the location of the cell(s) whose contents
are being calculated.
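
The intersection described in (3) can be sketched in base R. This is
only a conceptual illustration on a made-up data frame (`toy_cells` and
the other names are hypothetical), not how `rtables` computes cell data
internally:

```{r}
## Hypothetical toy data, used only to illustrate the idea
toy_cells <- data.frame(
  ARM = c("A", "A", "B", "B", "B"),
  SEX = c("F", "M", "F", "F", "M"),
  AGE = c(30, 45, 28, 39, 51)
)

## Rows belonging to the column facet (ARM == "B") and the row facet (SEX == "F")
in_col_facet <- toy_cells$ARM == "B"
in_row_facet <- toy_cells$SEX == "F"

## The data an analysis function would see for that cell is the intersection
cell_data <- toy_cells[in_col_facet & in_row_facet, ]
cell_data$AGE
```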

`rtables` is designed such that data should not need to be duplicated,
nor, e.g., levels of a factor restricted, in the dataset prior to
calling `build_table`. Things like adding combination levels and
restricting or reordering factor levels are all declared via faceting
in the layout and then performed automatically by the internal
`rtables` machinery during table creation.


## Split Function Basics

We will leave a detailed technical discussion of how split functions
work for when we implement our own custom split functions in the
advanced portion of this guide. For our purposes here, it suffices to
consider a split function to be a mapping from an incoming dataset
(the data associated with the parent facet) to a set of one or more
facets, each of which is associated with a (sub)set of that incoming
data.
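
To make this mental model concrete, here is a minimal base R sketch of
"a mapping from incoming data to facets". The function `toy_split` and
the data `toy_incoming` are hypothetical and deliberately ignore the
real `rtables` split function interface, which is covered in the
advanced portion of this guide:

```{r}
## Conceptual illustration only: NOT the rtables split function API
toy_split <- function(df, var) {
  lvls <- levels(factor(df[[var]]))
  facets <- lapply(lvls, function(lvl) df[df[[var]] == lvl, , drop = FALSE])
  setNames(facets, lvls)
}

toy_incoming <- data.frame(
  ARM = c("A", "A", "B", "C", "C", "C"),
  AGE = c(34, 41, 29, 52, 47, 38)
)

## Default-style behavior: one facet per level, forming a partition
toy_facets <- toy_split(toy_incoming, "ARM")

## A generalized facet need not be mutually exclusive with the others
toy_facets[["A+C"]] <- toy_incoming[toy_incoming$ARM %in% c("A", "C"), ]

vapply(toy_facets, nrow, integer(1))
```

Each element of the resulting list plays the role of one facet; the
last one overlaps with the `"A"` and `"C"` facets, which a simple
partition could not express.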

## Default Faceting

By default, faceting instructions:

1. Declare facets based on a *partition* of incoming data defined by a
   categorical variable, and
2. Nest within previously declared instructions in the same dimension
   (row/column).

The above behaviors combine to mean that sequential faceting
instructions (i.e., repeated calls to `split_cols_by` or
`split_rows_by`) result in *full factorial faceting*, where each
combination of levels from the variables faceted on is represented.

This is true with column faceting:

```{r}
lyt <- basic_table() |>
  split_cols_by("ARM") |>
  split_cols_by("SEX")

build_table(lyt, ex_adsl)
```

as well as with row faceting, with the caveat that row faceting does
not generate individual rows, and thus an `analyze` call is required:

```{r}
lyt2 <- basic_table() |>
  split_rows_by("STRATA1") |>
  split_rows_by("BMRKR2") |>
  analyze("AGE")

build_table(lyt2, ex_adsl)
```
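
As a quick sanity check of the "full factorial" claim, the sketch below
enumerates every combination of `ARM` and `SEX` levels with
`expand.grid` and compares that count to the number of columns in the
resulting table. It assumes, as in the examples above, that both
variables are factors in `ex_adsl`; the object names are illustrative
only:

```{r}
library(rtables)

## Every combination of levels, whether or not it is observed in the data
ff_combos <- expand.grid(
  SEX = levels(ex_adsl$SEX),
  ARM = levels(ex_adsl$ARM)
)

lyt_ff <- basic_table() |>
  split_cols_by("ARM") |>
  split_cols_by("SEX") |>
  analyze("AGE")

tbl_ff <- build_table(lyt_ff, ex_adsl)

## One column per combination of levels
nrow(ff_combos) == ncol(tbl_ff)
```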

# Recognizing Non-Full-Factorial Faceting

Any time we need faceting that does not represent a full factorial
combination of one or more variables (i.e., the full set of
combinations of levels from those variables), we will need to use
split functions to declare our desired structure.

The key, then, is to carefully consider how our desired faceting
structure deviates from the full factorial structure that default
faceting would generate. This will tell us what behaviors we need from
our split functions.

## Excluding Factor Levels

The simplest deviation from full-factorial faceting is to omit some
levels when faceting based on a single categorical variable. This can
come in two flavors:

1. Prescriptive - when the level(s) to be omitted are set a priori,
2. Empirical - when the level(s) to be omitted depend on the data.

Prescriptively omitting levels (or facets) is fairly straightforward: you
have a set of levels that, for whatever reason, you do not want facets
for in the resulting table. `rtables` provides the
`remove_split_levels` function factory to create split functions which
achieve this.

Empirically omitting levels(/facets) is more open ended, as
technically the logic determining what should be omitted can be
completely arbitrary. The most common version, however, is to omit
unobserved levels (which would result in facets whose associated data
subset is empty); the `drop_split_levels` split function does this.

We will use a slightly modified version of our synthetic data to
illustrate the difference:

```{r}
adsl <- subset(ex_adsl, as.character(SEX) %in% c("F", "M", "U"))
qtable(adsl, col_vars = "SEX")
```

First we declare faceting that omits the (rare but observed) `"U"`
level using `remove_split_levels`.

```{r}
lyt_pre <- basic_table() |>
  split_cols_by("SEX", split_fun = remove_split_levels("U")) |>
  analyze("STRATA1")

build_table(lyt_pre, adsl)
```

Next we will use `drop_split_levels`:

```{r}
lyt_emp <- basic_table() |>
  split_cols_by("SEX", split_fun = drop_split_levels) |>
  analyze("STRATA1")

build_table(lyt_emp, adsl)
```

Here we get exactly -- and only -- facets for the levels of `SEX`
observed in the data.

It is important to note that `drop_split_levels` omits facets for
levels not observed ***in the incoming data***, which is the data for
the parent facet. The incoming data is the full data being tabulated
only in the case of top-level faceting (not nested within anything)
and a few other special cases.

We can see this if we nest faceting using the empirical
`drop_split_levels` within another faceting instruction:


```{r}
lyt_bad_emp <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("RACE", split_fun = drop_split_levels) |>
  split_rows_by("SEX", split_fun = drop_split_levels) |>
  analyze("AGE")

build_table(lyt_bad_emp, adsl)
```

Here we see that different sets of `SEX` facets are generated within
different `RACE` facets, with the `"MULTIPLE"` and `"NATIVE HAWAIIAN
OR OTHER PACIFIC ISLANDER"` races each having only a (different)
single facet. This is sometimes the desired behavior, but often it is
not, so care should be taken when using `drop_split_levels` in
non-trivial faceting structures.
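
When the same set of `SEX` facets is wanted within every `RACE` facet,
one option is to declare the inner levels prescriptively. The sketch
below uses `keep_split_levels` for the nested split; the object names
(`adsl_fmu`, `lyt_presc_nested`) are illustrative, and the data subset
is recreated so the chunk stands on its own. Facets with no
observations will then appear with empty or missing analysis results
rather than being dropped:

```{r}
library(rtables)

## Same subset of the synthetic data as used above
adsl_fmu <- subset(ex_adsl, as.character(SEX) %in% c("F", "M", "U"))

lyt_presc_nested <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("RACE", split_fun = drop_split_levels) |>
  ## prescriptive: always create F, M and U facets, observed or not
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M", "U"))) |>
  analyze("AGE")

tbl_presc_nested <- build_table(lyt_presc_nested, adsl_fmu)
tbl_presc_nested
```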

## Adding Combination Levels

Some shells call for levels to be combined into new virtual
levels. For example, we might need an "All Drug X" category in our
table which represents both arms A (`"A: Drug X"`) and C (`"C:
Combination"`) as a single group of patients, either in addition to or
instead of those individual arms.

As with omitting defined factor levels, this is a deviation from the
default full factorial behavior. In this case we want a facet for a
level not present in the data and (assuming the individual arms are
left in alongside our combination arm) our desired facets are not
mutually exclusive.

`rtables` provides the `add_combo_levels` split function factory to
directly declare this behavior. It takes a "combination data.frame"
that declares the combination levels to add.

```{r}
combodf <- tribble(
  ~valname, ~label, ~levelcombo, ~exargs,
  "A_C", "Arms A+C", c("A: Drug X", "C: Combination"), list()
)

lyt_combo1 <- basic_table() |>
  split_cols_by("ARM", split_fun = add_combo_levels(combodf), show_colcounts = TRUE)

build_table(lyt_combo1, ex_adsl)
```
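
The example above shows the combination arm *in addition to* the
individual arms. For the "instead of" case, a sketch using the
`keep_levels` argument of `add_combo_levels` is given below; it retains
only the combination level and the placebo arm. The object names are
illustrative, and the combination data.frame is recreated so the chunk
stands on its own:

```{r}
library(rtables)
library(tibble)

combodf_only <- tribble(
  ~valname, ~label, ~levelcombo, ~exargs,
  "A_C", "Arms A+C", c("A: Drug X", "C: Combination"), list()
)

lyt_combo_only <- basic_table() |>
  split_cols_by("ARM",
    ## keep only the combination level and the placebo arm
    split_fun = add_combo_levels(combodf_only, keep_levels = c("A_C", "B: Placebo")),
    show_colcounts = TRUE
  )

tbl_combo_only <- build_table(lyt_combo_only, ex_adsl)
tbl_combo_only
```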

## Nested Faceting On Non-Independent Variables

Often when performing nested faceting, the inner variable
represents the same information as the outer variable in more
detail. Another way to view this is that the information represented
by the outer variable is implicitly included (or embedded) within the
information for the inner variable. When this occurs, most
combinations of levels from the pair of variables are not logically
consistent, can never occur in practice, and most importantly, should
not be represented in our resulting table. Whenever this is the case,
we cannot rely on the default splitting behavior.

A ubiquitous example of this in clinical trials is the pair of System
Organ Class (`AESOC`) and Preferred Term (`AEDECOD`) variables used
when describing adverse events. `AESOC` represents the broad category
an adverse event falls within (e.g., "SKELETOMUSCULAR" or
"GASTROINTESTINAL") while `AEDECOD` represents the specific type of
adverse event ("BACK PAIN", "VOMITING"). In this example, the
combination of `AESOC` being `"SKELETOMUSCULAR"` while `AEDECOD` is
`"VOMITING"` is not logically consistent. In our alternate framing we
would say that the `AEDECOD` value `"VOMITING"` implies that `AESOC`
*must* be `"GASTROINTESTINAL"`.

Note that our synthetic data does not contain realistic values for
`AESOC` and `AEDECOD`, but rather values of the form `"cl X"` (with X
a capital letter) and `"dcd X.m.n.o.p"` (with m through p individual
digits),
respectively. Note this makes the information embedding even more
explicit, as the X is the same between values of `AESOC` and the
values of `AEDECOD` they apply to.

As with omitting facets within a single faceting instruction, there
are broadly two ways to approach this type of nested faceting:

1. Prescriptively, and
2. Empirically.

In both cases, we can think about this in terms of *pairs of levels we
want to represent in our table*. The goal here is to preemptively
omit pairs which are not logically consistent (and thus which we can
assume have no observations in the data).

The empirical approach assumes that either:

- All valid pairs of levels have at least one observation, or
- We want to display *only* observed pairs, omitting any valid
  unobserved pairs.

To this end, `rtables` provides the `trim_levels_in_group` split
function factory. For each observed level of the variable being
split, the levels of a declared inner variable (the `innervar`
argument) are restricted to those observed *in combination with that
level of the split variable*. When we then split on or analyze the
inner variable, we get a table that contains only the observed pairs:

```{r}
lyt_tig <- basic_table() |>
  split_rows_by("AESOC", split_fun = trim_levels_in_group("AEDECOD")) |>
  analyze("AEDECOD")

build_table(lyt_tig, ex_adae)
```


`trim_levels_in_group` can be used in chains to further restrict the
displayed combinations of more than two variables, if desired:

```{r}
lyt_tig2 <- basic_table(title = "Observed Toxicity Grades") |>
  split_rows_by("AESOC", split_fun = trim_levels_in_group("AEDECOD")) |>
  split_rows_by("AEDECOD", split_fun = trim_levels_in_group("AETOXGR")) |>
  analyze("AETOXGR")

build_table(lyt_tig2, ex_adae)
```

Sometimes the above is the desired behavior; many times, however,
there are certain counts or values which are important to display
*even when they are not observed*. In such cases, we still want to
omit pairs of levels that are impossible/logically inconsistent, but
cannot rely on which combinations are observed in the data.

In such cases, we must *prescriptively* declare which combinations we
want to appear in our table. `rtables` provides the
`trim_levels_to_map` split function factory for this, which accepts a
pre-defined map of all combinations which should be included (in the
form of a data.frame). Any combinations which do not appear in the map
will be omitted *even if they are observed in the data*.

```{r}
map <- tribble(
  ~AESOC, ~AEDECOD,
  "cl A", "dcd A.1.1.1.2",
  "cl B", "dcd B.1.1.1.1",
  "cl B", "dcd B.2.2.3.1",
  "cl D", "dcd D.1.1.1.1"
)

lyt_ttm <- basic_table() |>
  split_rows_by("AESOC", split_fun = trim_levels_to_map(map)) |>
  analyze("AEDECOD")

build_table(lyt_ttm, ex_adae)
```

Note that because there were no pairs in the map with an `AESOC` of
`"cl C"`, that entire facet is omitted. This will be true in the case
of nested faceting as well:

```{r}
lyt_ttm2 <- basic_table() |>
  split_rows_by("AESOC", split_fun = trim_levels_to_map(map)) |>
  split_rows_by("AEDECOD", split_fun = trim_levels_in_group("AETOXGR")) |>
  analyze("AETOXGR")

build_table(lyt_ttm2, ex_adae)
```
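
When writing a prescriptive map by hand, it can help to start from the
pairs that are actually observed and then add any valid but unobserved
pairs. The base R sketch below lists the observed `AESOC`/`AEDECOD`
pairs in `ex_adae` as character columns; `obs_pairs` is an illustrative
name, and whether such a starting point is appropriate depends on the
shell being implemented:

```{r}
library(rtables)

## Observed pairs only; valid-but-unobserved pairs must still be added by hand
obs_pairs <- unique(data.frame(
  AESOC = as.character(ex_adae$AESOC),
  AEDECOD = as.character(ex_adae$AEDECOD),
  stringsAsFactors = FALSE
))

obs_pairs <- obs_pairs[order(obs_pairs$AESOC, obs_pairs$AEDECOD), ]
obs_pairs
```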

## Facets That Vary Meaning Instead of Data Subset

In our examples so far, faceting has translated to mapping the
incoming data to a set of distinct (if not necessarily mutually
exclusive or exhaustive) subsets of the data. This is the most common
form of faceting, but it is not the only one `rtables` supports.

In some cases, we want facets to be *semantically* distinct from
each other; in other words, instead of representing different subsets
of the data, we want them to represent different aspects of the same
data. This is most commonly useful in column space, where individual
columns are defined via faceting, unlike individual rows.


A toy example of this would be:

```{r, echo = FALSE}
library(tibble)

tpose_afun <- function(x, .var, .spl_context) {
  mycol <- tail(tail(.spl_context$cur_col_split_val, 1)[[1]], 1)
  cell <- switch(mycol,
    n = rcell(length(x), format = "xx"),
    mean = rcell(mean(x, na.rm = TRUE), format = "xx.x"),
    sd = rcell(sd(x, na.rm = TRUE), format = "xx.xx")
  )
  in_rows(.list = setNames(list(cell), .var))
}

combo_df <- tribble(
  ~valname, ~label, ~levelcombo, ~exargs,
  "n", "n", select_all_levels, list(),
  "mean", "mean", select_all_levels, list(),
  "sd", "sd", select_all_levels, list()
)


lyt_sem_cols <- basic_table() |>
  split_cols_by("ARM") |>
  split_cols_by("STUDYID", split_fun = add_combo_levels(combo_df, keep_levels = combo_df$valname)) |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze(c("AGE", "BMRKR1"), afun = tpose_afun, show_labels = "hidden")

fixed_shell(build_table(lyt_sem_cols, ex_adsl))
```

Here we have individual columns for ***different statistics calculated
using the same data*** (`n`, `mean` and `sd`), within a faceting
structure that splits on arm in column space and gender in row space,
and calculated for two different continuous numeric variables (age and
"biomarker 1" value).

To achieve this, we need faceting that creates three columns all of
whose "subsets" of the incoming (arm) data are identical: all of
it. We can achieve this with the `add_combo_levels` split function
factory we used above; the key is to use the `select_all_levels`
sentinel value provided by `rtables` to indicate that all levels in the
data should be combined when creating each of our new combination
levels.


We will turn on column counts at all levels to show that the faceting
is doing what we want, even though the repeated counts are redundant
and would not be suitable for any actual table output.


```{r}
my_combo_df <- tribble(
  ~valname, ~label, ~levelcombo, ~exargs,
  "n", "n", select_all_levels, list(),
  "mean", "mean", select_all_levels, list(),
  "sd", "sd", select_all_levels, list()
)

lyt_tpose_cols_only <- basic_table() |>
  split_cols_by("ARM", show_colcounts = TRUE) |>
  split_cols_by("STUDYID",
    split_fun = add_combo_levels(my_combo_df, keep_levels = my_combo_df$valname),
    show_colcounts = TRUE
  )

build_table(lyt_tpose_cols_only, ex_adsl)
```

We split on study id in the above code largely for convenience. Given
that we are defining combination levels using `select_all_levels`, we
could split on anything and have each of the facets represent the
entirety of the incoming data. This approach, however, is a
generalization of splitting on study id in order to create a single
facet representing all the incoming data, a trick worth having in our
back pocket.

Thus we've achieved the column structure we wanted. Now we need an
analysis function with the correct *column-conditional behavior* (see
[the previous chapter](./guided_intermediate_afun_reqs.html)) and we
will have our output.

Without discussing how we construct it (as that will be covered in the
advanced portion of this guide), assuming we have a `tpose_afun` which
meets our requirements, we can then fully create our table:


```{r}
lyt_tpose_full <- basic_table() |>
  split_cols_by("ARM", show_colcounts = TRUE) |>
  split_cols_by("STUDYID",
    split_fun = add_combo_levels(my_combo_df, keep_levels = my_combo_df$valname),
    show_colcounts = TRUE
  ) |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze(c("AGE", "BMRKR1"), afun = tpose_afun, show_labels = "hidden")

build_table(lyt_tpose_full, ex_adsl)
```

   
# Combining These Faceting Needs

For some table shells, we need to combine the types of needs we
explored above; we might need `trim_levels_to_map` type behavior, but
also need to include a virtual combination treatment/arm. The split
functions/function factories we discussed here generally cannot achieve
this, though our reasoning for ***how to think about the faceting we
need*** still applies. In such cases, we will construct fully custom
split functions which exactly meet our needs, which will be the topic
of an entire chapter in the advanced portion of this guide.