--- title: "3 - Data checking" format: html: toc: true vignette: > %\VignetteIndexEntry{3 - Data checking} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} knitr: opts_chunk: collapse: true comment: '#>' --- ## Introduction You imported your database, but now you need to check it for errors and inconsistencies. There are a lot of ways to do so, so EDCimport provides functions for a few concepts. As in previous vignettes, we will be using `edc_example()`, but in the real world you should use EDC reading functions. See `vignette("reading")` to see how. ```{r} #| warning: false #| message: true library(EDCimport) library(dplyr) db = edc_example(N=200) %>% edc_unify_subjid() %>% edc_clean_names() db load_database(db) ``` ## Data warning ### Data errors The primary and most valuable data-checking tool in EDCimport is `edc_data_warn()`. Simply use `dplyr::filter()` to identify problematic or inconsistent rows and pipe them into the function. For example, let's say that in our study: - Patients should be older than 25 - Adverse event grades should be between 1 and 5. - Patients in the treatment arm should not have `data1$date1` before 2010-04-10 Here's how you check these conditions: ```{r} enrol %>% filter(age<25) %>% edc_data_warn("Patients should be >25yo", issue_n=1) ae %>% filter(aegr<1 | aegr>5) %>% edc_data_warn("Incorrect adverse event grade", issue_n=2) data1 %>% edc_left_join(enrol) %>% filter(arm=="Trt") %>% filter(date1<"2010-04-10") %>% edc_data_warn("Treated patients should have been seen later", issue_n=3) ``` You can implement these checks according to your Data Validation Plan, ensuring they run after every export. Any failed check will produce a warning in your R console. Once the database is corrected, the warnings will no longer appear (e.g. `issue_n=2`). After running all your checks, you can use `edc_data_warnings()` to get a summary of all detected issues. ```{r} edc_data_warnings() ``` ### Fatal error If one check is so mandatory that you cannot work in a database where it fails, use `edc_data_stop()` instead. For example, you can use it to check that some variable construction didn't go wrong: ``` r df = mtcars %>% mutate( type = case_when( cyl==4 ~ "4 cylinders", cyl==6 ~ "6 cylinders", cyl==8 ~ "8 cylinders", .default="ERROR" ), ) df %>% filter(type=="ERROR") %>% edc_data_stop("Error on type construction") ``` ## Duplicate-free dataset assertion If you work with multiple datasets, your code probably include a lot of joins. As you may have painfully discovered, joining data carries a high risk of altering the data layout and resulting in multiple rows per patient. This is why you should always include `assert_no_duplicate()` in your pipeline if you expect only one row per patient. ```{r} #| error: true enrol %>% assert_no_duplicate() %>% count(arm) enrol %>% edc_left_join(data1) %>% #oopsie assert_no_duplicate() %>% count(arm) ``` ::: callout-tip This is the [Fail Fast principle](https://www.codereliant.io/fail-fast-pattern/): you'd better have an error in your R script than in your analysis report. ::: ## Last-news table If your analysis involves a survival endpoint, you likely have a follow-up dataset that includes the vital status as of the last visit date. However, in real-world scenarios, this dataset might not be accurately filled, and some patient can have other data after the date of last visit. The `lastnews_table()` function calculates the actual date of the last recorded information for each patient (`SUBJID`), based on all Date/Datetime columns across all datasets. Currently, `edc_example()` does not include an explicit table for this scenario, so let's consider the following example: - `data3$date10` is your last visit date in your followup dataset. It is the origin you should `prefer` if there is a tie and the reference to identify any inconsistency. - `data1` is a dataset containing scheduled protocol dates, such as planned medical visits. You should ignore all columns in this dataset, as they do not pertain to the patient's last known - `date8` and `date9` are dates when treatments were administered. They imply that the patient was alive at that time. If a date is after `date10`, it means that the survival time is underestimated. Here is how to parameterize `lastnews_table()` to fit this scenario: ```{r} lastnews_table(prefer="date10", except="data1", show_delta=TRUE) %>% mutate(delta=round(delta)) %>% arrange(desc(delta)) ``` ::: callout-warning This table is useful for Overall Survival and data checking. However, you should be very careful when using it for Event-Free Survival: a patient can be alive at that point without having experienced an event. :::