library(validata)
library(tidyselect)In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.
sample_data1 has 125 row with 3 ID type columns and 3 value columns.
head(sample_data1)
#> # A tibble: 6 x 6
#>   ID_COL1 ID_COL2 ID_COL3     VAL1   VAL2   VAL3
#>     <dbl>   <dbl>   <dbl>    <dbl>  <dbl>  <dbl>
#> 1    2413    1034    1014 -0.0639  -1.16  -0.302
#> 2    2413    1034    1322  0.363    1.62   0.165
#> 3    2413    1034    2999 -0.00466  1.23   0.819
#> 4    2413    1034    3544  1.83    -2.58  -0.525
#> 5    2413    1034    9901  0.837   -0.442 -0.341
#> 6    2413    1122    1014 -0.894   -1.11   0.768Let’s use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.
sample_data1 %>% 
  confirm_distinct(ID_COL1)
#> database has 120 duplicates at ID_COL1sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3Here we can conclude that the combination of 3 ID columns is the primary key for the data.
These steps can be automated with the wrapper function determine distinct.
sample_data1 %>% 
  determine_distinct(matches("ID"))confirm_mapping tells you the mapping between two columns in a data frame:
confirm_mapping gives the option to view which type of mapping is associated with each individual row.
sample_data1 %>% 
  confirm_mapping(ID_COL1, ID_COL2, view = F)
#> many - many mapping between ID_COL1 and ID_COL2sample_data1 %>% 
  determine_mapping(everything())The overlap functions give a venn style description of the values in 2 columns. This is especially useful before performing a join function, and you want to confirm that the dataframes have matching keys.
confirm_overlap is different from the other confirm functions in that it takes 2 vectors as arguments, instead of a data frame. This is to allow the user to test overlap between different dataframes, or arbitrary vectors if necessary
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 x 5
#>   only_in_iris_Sepal.W… only_in_iris_Petal.… shared_names total_names pct_shared
#>                   <int>                <int>        <int>       <int> <chr>     
#> 1                    12                   32           11          55 20%confirm_overlap returns a summary data frame invisibly allowing you to access individual elements using the helper functions.
print(iris_overlap)
#> # A tibble: 55 x 4
#>        x iris_Sepal.Width iris_Petal.Length both_flags
#>    <dbl>            <dbl>             <dbl>      <dbl>
#>  1   3.5                1                 1          2
#>  2   3                  1                 1          2
#>  3   3.2                1                 0          1
#>  4   3.1                1                 0          1
#>  5   3.6                1                 1          2
#>  6   3.9                1                 1          2
#>  7   3.4                1                 0          1
#>  8   2.9                1                 0          1
#>  9   3.7                1                 1          2
#> 10   4                  1                 1          2
#> # … with 45 more rowsFind the elements unique to the first column
iris_overlap %>% 
  co_find_only_in_1() %>% 
  head()
#> # A tibble: 6 x 1
#>   iris_Sepal.Width
#>              <dbl>
#> 1              3.2
#> 2              3.1
#> 3              3.4
#> 4              2.9
#> 5              2.3
#> 6              2.8Find the elements unique to the second column
iris_overlap %>% 
  co_find_only_in_2() %>% 
  head()
#> # A tibble: 6 x 1
#>   iris_Petal.Length
#>               <dbl>
#> 1               1.4
#> 2               1.3
#> 3               1.5
#> 4               1.7
#> 5               1.6
#> 6               1.1Find the elements shared by both columns
iris_overlap %>% 
  co_find_in_both() %>% 
  head()
#> # A tibble: 6 x 1
#>       x
#>   <dbl>
#> 1   3.5
#> 2   3  
#> 3   3.6
#> 4   3.9
#> 5   3.7
#> 6   4determine_overlap takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.
Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots
Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.
iris %>% 
  confirm_strlen(Species) -> species_len
#>  Species_chr_len  n percent
#>                6 50   33.3%
#>                9 50   33.3%
#>               10 50   33.3%output is a dataframe
head(species_len)
#> # A tibble: 6 x 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <int>
#> 1          5.1         3.5          1.4         0.2 setosa                6
#> 2          4.9         3            1.4         0.2 setosa                6
#> 3          4.7         3.2          1.3         0.2 setosa                6
#> 4          4.6         3.1          1.5         0.2 setosa                6
#> 5          5           3.6          1.4         0.2 setosa                6
#> 6          5.4         3.9          1.7         0.4 setosa                6A helped function for the output of confirm_strlen that filters the database for chosen string lengths.
species_len %>% 
  choose_strlen(len = 6) %>% 
  head()
#> # A tibble: 6 x 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <int>
#> 1          5.1         3.5          1.4         0.2 setosa                6
#> 2          4.9         3            1.4         0.2 setosa                6
#> 3          4.7         3.2          1.3         0.2 setosa                6
#> 4          4.6         3.1          1.5         0.2 setosa                6
#> 5          5           3.6          1.4         0.2 setosa                6
#> 6          5.4         3.9          1.7         0.4 setosa                6Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.
iris %>% 
  diagnose()
#> # A tibble: 5 x 6
#>   variables    types   missing_count missing_percent unique_count unique_rate
#>   <chr>        <chr>           <int>           <dbl>        <int>       <dbl>
#> 1 Sepal.Length numeric             0               0           35       0.233
#> 2 Sepal.Width  numeric             0               0           23       0.153
#> 3 Petal.Length numeric             0               0           43       0.287
#> 4 Petal.Width  numeric             0               0           22       0.147
#> 5 Species      factor              0               0            3       0.02