In this vignette, we would like to discuss the similarities and
differences between dplyr and rtable.
Much of the rtables framework focuses on
tabulation/summarizing of data and then the visualization of the table.
In this vignette, we focus on summarizing data using dplyr
and contrast it to rtables. We won’t pay attention to the
table visualization/markup and just derive the cell content.
Using dplyr to summarize data and gt to
visualize the table is a good way if the tabulation is of a certain
nature or complexity. However, there are tables such as the table
created in the introduction
vignette that take some effort to create with dplyr. Part
of the effort is due to fact that when using dplyr the
table data is stored in data.frames or tibbles
which is not the most natural way to represent a table as we will show
in this vignette.
If you know a more elegant way of deriving the table content with
dplyr please let us know and we will update the
vignette.
Here is the table and data used in the introduction
vignette:
n <- 400
set.seed(1)
df <- tibble(
  arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
  country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
  handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
  age = rchisq(n, 30) + 10
) %>% mutate(
  weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
)
lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
#                     Arm A                     Arm B         
#              Female        Male        Female        Male   
#              (N=96)      (N=105)       (N=92)      (N=107)  
# ————————————————————————————————————————————————————————————
# CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left     32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
#     mean      38.9         40.4         40.3         37.7   
#   Right    13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
#     mean      36.6         40.2         40.2         40.6   
# USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left     34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
#     mean      40.4         39.7         39.2         40.1   
#   Right    17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
#     mean      36.9         39.8         38.5         39.0We will start by deriving the first data cell on row 3 (note, row 1
and 2 have content cells, see the introduction
vignette). Cell 3,1 contains the mean age for left handed & female
Canadians in “Arm A”:
mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female" & df$handed == "Left"])
# [1] 38.86979or with dplyr:
df %>%
  filter(country == "CAN", arm == "Arm A", gender == "Female", handed == "Left") %>%
  summarise(mean_age = mean(age))
# # A tibble: 1 × 1
#   mean_age
#      <dbl>
# 1     38.9Further, dplyr gives us other verbs to easily get the
average age of left handed Canadians for each group defined by the 4
columns:
df %>%
  group_by(arm, gender) %>%
  filter(country == "CAN", handed == "Left") %>%
  summarise(mean_age = mean(age))
# `summarise()` has grouped output by 'arm'. You can override using the `.groups`
# argument.
# # A tibble: 4 × 3
# # Groups:   arm [2]
#   arm   gender mean_age
#   <fct> <fct>     <dbl>
# 1 Arm A Female     38.9
# 2 Arm A Male       40.4
# 3 Arm B Female     40.3
# 4 Arm B Male       37.7We can further get to all the average age cell values with:
average_age <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarise(mean_age = mean(age))
# `summarise()` has grouped output by 'arm', 'gender', 'country'. You can
# override using the `.groups` argument.
average_age
# # A tibble: 16 × 5
# # Groups:   arm, gender, country [8]
#    arm   gender country handed mean_age
#    <fct> <fct>  <fct>   <fct>     <dbl>
#  1 Arm A Female CAN     Left       38.9
#  2 Arm A Female CAN     Right      36.6
#  3 Arm A Female USA     Left       40.4
#  4 Arm A Female USA     Right      36.9
#  5 Arm A Male   CAN     Left       40.4
#  6 Arm A Male   CAN     Right      40.2
#  7 Arm A Male   USA     Left       39.7
#  8 Arm A Male   USA     Right      39.8
#  9 Arm B Female CAN     Left       40.3
# 10 Arm B Female CAN     Right      40.2
# 11 Arm B Female USA     Left       39.2
# 12 Arm B Female USA     Right      38.5
# 13 Arm B Male   CAN     Left       37.7
# 14 Arm B Male   CAN     Right      40.6
# 15 Arm B Male   USA     Left       40.1
# 16 Arm B Male   USA     Right      39.0In rtable syntax, we need the following code to get to
the same content:
lyt <- basic_table() %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  split_rows_by("handed") %>%
  analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
#                Arm A           Arm B    
#            Female   Male   Female   Male
# ————————————————————————————————————————
# CAN                                     
#   Left                                  
#     mean    38.9    40.4    40.3    37.7
#   Right                                 
#     mean    36.6    40.2    40.2    40.6
# USA                                     
#   Left                                  
#     mean    40.4    39.7    39.2    40.1
#   Right                                 
#     mean    36.9    39.8    38.5    39.0As mentioned in the introduction to this vignette, please ignore the
difference in arranging and formatting the data: it’s possible to
condense the rtable more and it is possible to make the
tibble look more like the reference table using the
gt R package.
In terms of tabulation for this example there was arguably not much
added by rtables over dplyr.
Unlike in rtables the different levels of summarization
are discrete computations in dplyr which we will then need
to combine
We first focus on the count and percentage information for handedness within each country (for each arm-gender pair), along with the analysis row mean values:
c_h_df <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarize(mean = mean(age), c_h_count = n()) %>%
  ## we need the sum below to *not* be by country, so that we're dividing by the column counts
  ungroup(country) %>%
  # now the `handed` grouping has been removed, therefore we can calculate percent now:
  mutate(n_col = sum(c_h_count), c_h_percent = c_h_count / n_col)
# `summarise()` has grouped output by 'arm', 'gender', 'country'. You can
# override using the `.groups` argument.
c_h_df
# # A tibble: 16 × 8
# # Groups:   arm, gender [4]
#    arm   gender country handed  mean c_h_count n_col c_h_percent
#    <fct> <fct>  <fct>   <fct>  <dbl>     <int> <int>       <dbl>
#  1 Arm A Female CAN     Left    38.9        32    96       0.333
#  2 Arm A Female CAN     Right   36.6        13    96       0.135
#  3 Arm A Female USA     Left    40.4        34    96       0.354
#  4 Arm A Female USA     Right   36.9        17    96       0.177
#  5 Arm A Male   CAN     Left    40.4        42   105       0.4  
#  6 Arm A Male   CAN     Right   40.2        22   105       0.210
#  7 Arm A Male   USA     Left    39.7        19   105       0.181
#  8 Arm A Male   USA     Right   39.8        22   105       0.210
#  9 Arm B Female CAN     Left    40.3        26    92       0.283
# 10 Arm B Female CAN     Right   40.2        20    92       0.217
# 11 Arm B Female USA     Left    39.2        25    92       0.272
# 12 Arm B Female USA     Right   38.5        21    92       0.228
# 13 Arm B Male   CAN     Left    37.7        37   107       0.346
# 14 Arm B Male   CAN     Right   40.6        25   107       0.234
# 15 Arm B Male   USA     Left    40.1        25   107       0.234
# 16 Arm B Male   USA     Right   39.0        20   107       0.187which has 16 rows (cells) like the average_age data
frame defined above. Next, we will derive the group information for
countries:
c_df <- df %>%
  group_by(arm, gender, country) %>%
  summarize(c_count = n()) %>%
  # now the `handed` grouping has been removed, therefore we can calculate percent now:
  mutate(n_col = sum(c_count), c_percent = c_count / n_col)
# `summarise()` has grouped output by 'arm', 'gender'. You can override using the
# `.groups` argument.
c_df
# # A tibble: 8 × 6
# # Groups:   arm, gender [4]
#   arm   gender country c_count n_col c_percent
#   <fct> <fct>  <fct>     <int> <int>     <dbl>
# 1 Arm A Female CAN          45    96     0.469
# 2 Arm A Female USA          51    96     0.531
# 3 Arm A Male   CAN          64   105     0.610
# 4 Arm A Male   USA          41   105     0.390
# 5 Arm B Female CAN          46    92     0.5  
# 6 Arm B Female USA          46    92     0.5  
# 7 Arm B Male   CAN          62   107     0.579
# 8 Arm B Male   USA          45   107     0.421Finally, we left_join() the two levels of summary to get
a data.frame containing the full set of values which make up the body of
our table (note, however, they are not in the same order):
full_dplyr <- left_join(c_h_df, c_df) %>% ungroup()
# Joining with `by = join_by(arm, gender, country, n_col)`Alternatively, we could calculate only the counts in
c_h_df, and use mutate() after the
left_join() to divide the counts by the n_col
values which are more naturally calculated within c_df.
This would simplify c_h_df’s creation somewhat by not
requiring the explicit ungroup(), but it prevents each
level of summarization from being a self-contained set of
computations.
The rtables call in contrast is:
lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
#                     Arm A                     Arm B         
#              Female        Male        Female        Male   
#              (N=96)      (N=105)       (N=92)      (N=107)  
# ————————————————————————————————————————————————————————————
# CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left     32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
#     mean      38.9         40.4         40.3         37.7   
#   Right    13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
#     mean      36.6         40.2         40.2         40.6   
# USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left     34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
#     mean      40.4         39.7         39.2         40.1   
#   Right    17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
#     mean      36.9         39.8         38.5         39.0We can now spot check that the values are the same
frm_rtables_h <- cell_values(
  tbl,
  rowpath = c("country", "CAN", "handed", "Right", "@content"),
  colpath = c("arm", "Arm B", "gender", "Female")
)[[1]]
frm_rtables_h
# [1] 20.0000000  0.2173913
frm_dplyr_h <- full_dplyr %>%
  filter(country == "CAN" & handed == "Right" & arm == "Arm B" & gender == "Female") %>%
  select(c_h_count, c_h_percent)
frm_dplyr_h
# # A tibble: 1 × 2
#   c_h_count c_h_percent
#       <int>       <dbl>
# 1        20       0.217
frm_rtables_c <- cell_values(
  tbl,
  rowpath = c("country", "CAN", "@content"),
  colpath = c("arm", "Arm A", "gender", "Male")
)[[1]]
frm_rtables_c
# [1] 64.0000000  0.6095238
frm_dplyr_c <- full_dplyr %>%
  filter(country == "CAN" & arm == "Arm A" & gender == "Male") %>%
  select(c_count, c_percent)
frm_dplyr_c
# # A tibble: 2 × 2
#   c_count c_percent
#     <int>     <dbl>
# 1      64     0.610
# 2      64     0.610Further, the rtable syntax has hopefully also become a
bit more straightforward to derive the cell values than with
dplyr for this particular table.
In this vignette learned that:
dplyr and
data.frame or tibble as data structure
dplyr keeps simple things simplertables streamlines the construction of complex
tablesWe recommend that you continue reading the clinical_trials
vignette where we create a number of more advanced tables using
layouts.