--- title: "2 - Post processing" format: html: toc: true vignette: > %\VignetteIndexEntry{2 - Post processing} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} knitr: opts_chunk: collapse: true comment: '#>' --- ## Introduction ::: callout-caution Before reading this, you should read `vignette("reading")` to learn how to import your database. ::: After importing your database with EDCimport, you end up with an `edc_database` object, which can be loaded to the global environment using `load_database()`. However, EDCimport provides a few functions to improve the database before loading it. ## Harmonize Subject ID across the database The Subject ID column, usually `SUBJID` for CDISC data, is the primary key, shared by almost all your datasets. Using `edc_unify_subjid()`, you can harmonize this column across the whole database, so that it becomes a `factor`, consistent for all datasets. With `preprocess`, you can even customize it. This is especially convenient for joining your data and checking for missing patients. ```{r} #| warning: false library(EDCimport) db1 = edc_example() load_database(db1) enrol$subjid %>% class() enrol$subjid %>% head() db2 = edc_example() %>% edc_unify_subjid(preprocess=~paste0("#", .x)) load_database(db2) enrol$subjid %>% class() enrol$subjid %>% head() #missing patients in table `ae` ae$subjid %>% forcats::fct_count() %>% dplyr::filter(n==0) ``` ::: callout-note If your SUBJID column is numeric and `preprocess` is empty, SUBJID will be cast to numeric. ::: ## Clean dataset names Is your database from a messy EDC software, filled with special characters or camelCase column names? Fear not! With `edc_clean_names()` you can clean all dataset names at once. By default, it converts names to lowercase letters, numbers, and underscores only. For this example, since `edc_example()` already provides clean column names, let's convert all columns to **uppercase**: ```{r} #| warning: false library(EDCimport) db = edc_example() %>% edc_clean_names(toupper) load_database(db) names(enrol) ``` ## Split some dataset to short+long ::: {.callout-note appearance="minimal"} This one is a bit more complex, but bear with me, I'll try to make is understandable. ::: When a CRF form contains both repeated and non-repeated measures, the export usually duplicates the non-repeated measure. This results in a "mixed" data format, combining both "long" and "short" structures. (You usually call the latter "wide", but in this case it is not really.) For example, in the dataset `long_mixed` from `edc_example()`, you have two long-format variables (one value per observation) and one wide-format variable (one value per subject). ```{r} head(long_mixed) ``` With complex CRFs and lengthy forms, this mixed structure can complicate analysis, as repeated and non-repeated data may be unrelated. With `edc_split_mixed()`, you can split this dataset into two, one `short` and one `long`: ```{r} #| warning: false db = edc_example() %>% edc_split_mixed(long_mixed) load_database(db) head(long_mixed_short) #one row per subject head(long_mixed_long) #one row per observation ``` ## You can combine! Obviously, these functions can be piped to one another: ``` r db = edc_example() %>% edc_split_mixed(long_mixed) %>% edc_unify_subjid(preprocess=~paste0("#", .x))%>% edc_clean_names(toupper) load_database(db) ``` Don't hesitate to [submit a feature request](https://github.com/DanChaltiel/EDCimport/issues/new?template=feature_request.md) if you think another function can be useful to others!