Compiled date: 2021-05-19

Last edited: 2021-05-14

License: GPL-3

1 Installation

Run the following code to install the Bioconductor version of the package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("fobitools")

2 Load packages

library(fobitools)

This vignette also uses the following CRAN packages for data manipulation and table formatting.

library(tidyverse)
library(kableExtra)

3 Load food items from sample food frequency questionnaire (FFQ) data

In nutritional studies, dietary data are usually collected with questionnaires such as FFQs (food frequency questionnaires) or 24h-DRs (24-hour dietary recalls). The text collected in these questionnaires commonly requires manual preprocessing before it can be analyzed.

This is an example of what an FFQ could look like in a typical nutritional study.

load("data/sample_ffq.rda")

sample_ffq %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
ID Name
ID_001 Beef: roast, steak, mince, stew casserole, curry or bolognese
ID_002 Beefburgers
ID_003 Pork: roast, chops, stew, slice or curry
ID_004 Lamb: roast, chops, stew or curry
ID_005 Chicken, turkey or other poultry: including fried, casseroles or curry
ID_006 Bacon
ID_007 Ham
ID_008 Corned beef, Spam, luncheon meats
ID_009 Sausages
ID_0010 Savoury pies, e.g. meat pie, pork pie, pasties, steak & kidney pie, sausage rolls, scotch egg

4 Automatic dietary text annotation

The fobitools::annotate_foods() function automatically annotates free nutritional text using the FOBI ontology (Castellano-Escuder et al. 2020). It returns a table with the food IDs, food names, FOBI IDs and FOBI names of the FOBI terms that match the input text. The input should be a two-column data frame with the food IDs in the first column and the food names in the second. Note that food names can be provided either as single words or as complex strings.
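As a minimal sketch of the expected input, a two-column data frame can be built by hand. The object name my_ffq and its contents are made up for illustration; only the column order (IDs first, names second) comes from the description above.

```r
# Hypothetical input: column 1 holds food IDs, column 2 holds free-text food names
my_ffq <- data.frame(
  ID   = c("F1", "F2", "F3"),
  Name = c("Bananas", "Tinned fruit", "Chicken, turkey or other poultry"),
  stringsAsFactors = FALSE
)

# Basic shape checks before annotation: two columns, no repeated IDs
stopifnot(ncol(my_ffq) == 2, !anyDuplicated(my_ffq$ID))

# The annotation call itself requires fobitools to be installed:
# annotated <- fobitools::annotate_foods(my_ffq)
```

Checking the shape up front makes it easier to tell an input problem apart from a genuinely unannotated food item.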

This function includes a text mining algorithm composed of five sequential layers. In this process, singulars and plurals are analyzed, irrelevant words are removed, each input string is tokenized so that each word is analyzed independently, and the semantic similarity between the input text and FOBI items is computed. Finally, the function reports the percentage of the input text that was annotated.
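To give a feel for what such preprocessing might involve, here is a rough base R sketch of lowercasing, tokenizing, removing irrelevant words and crudely handling plurals. It is not the fobitools implementation: the function name tokenize_food, the tiny stop-word list and the "strip a trailing s" rule are all assumptions made for illustration.

```r
# Illustrative preprocessing only; NOT the algorithm used inside fobitools.
# The stop-word list is a hypothetical, deliberately tiny example.
stop_words <- c("or", "and", "other", "including")

tokenize_food <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)           # replace punctuation and digits with spaces
  tokens <- strsplit(x, "\\s+")[[1]]     # split on runs of whitespace
  tokens <- tokens[nzchar(tokens)]       # drop empty strings
  tokens <- setdiff(tokens, stop_words)  # drop irrelevant words (also drops duplicates)
  sub("s$", "", tokens)                  # crude singularization: strip one trailing "s"
}

tokenize_food("Chicken, turkey or other poultry")
tokenize_food("Bananas")
```

Each resulting token could then be compared against ontology term names independently, which is the idea behind analyzing each word on its own.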

annotated_text <- fobitools::annotate_foods(sample_ffq)
#> 89.57% annotated
#> 3.277 sec elapsed

annotated_text$annotated %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID FOOD_NAME FOBI_ID FOBI_NAME
ID_00100 Oranges, satsumas, mandarins, tangerines, clementines FOODON:03309832 orange (whole, raw)
ID_00101 Grapefruit FOODON:03301702 grapefruit (whole, raw)
ID_00102 Bananas FOODON:03311513 banana (whole, ripe)
ID_00103 Grapes FOODON:03301123 grape (whole, raw)
ID_00104 Melon FOODON:03301593 melon (raw)
ID_00105 *Peaches, plums, apricots, nectarines FOODON:03301107 nectarine (whole, raw)
ID_00106 *Strawberries, raspberries, kiwi fruit FOODON:03305656 fruit (dried)
ID_00106 *Strawberries, raspberries, kiwi fruit FOODON:03414363 kiwi
ID_00106 *Strawberries, raspberries, kiwi fruit FOODON:00001057 plant fruit food product
ID_00107 Tinned fruit FOODON:03305656 fruit (dried)

4.1 The similarity argument

Additionally, the similarity argument sets the semantic similarity cutoff used at the last layer of the text mining pipeline. It is a numeric value between 0 (very poor match) and 1 (exact match). Users can adjust this value to obtain more or less accurate annotations. The authors do not recommend values below the default of 0.85.
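The effect of a cutoff can be illustrated with a stand-in measure. The sketch below uses one minus the normalized Levenshtein distance from utils::adist(); this is an assumption chosen for simplicity, not the similarity measure that fobitools computes, and string_similarity is a hypothetical helper.

```r
# Stand-in similarity: 1 - normalized Levenshtein distance (utils::adist).
# This is an assumption for illustration, not the measure used by fobitools.
string_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]      # edit distance between the two strings
  1 - d / max(nchar(a), nchar(b))    # scale to [0, 1], where 1 is an exact match
}

# A misspelled food name compared with the correctly spelled term
sim <- string_similarity("brocoli", "broccoli")

sim >= 0.85  # would pass a cutoff of 0.85
sim >= 0.95  # would fail a stricter cutoff of 0.95
```

Under this stand-in measure, a single missing letter in an eight-letter word is enough to fall between the two cutoffs, which shows how a stricter threshold trades coverage for precision.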

annotated_text_95 <- fobitools::annotate_foods(sample_ffq, similarity = 0.95)
#> 86.5% annotated
#> 2.927 sec elapsed

annotated_text_95$annotated %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID FOOD_NAME FOBI_ID FOBI_NAME
ID_00100 Oranges, satsumas, mandarins, tangerines, clementines FOODON:03309832 orange (whole, raw)
ID_00101 Grapefruit FOODON:03301702 grapefruit (whole, raw)
ID_00102 Bananas FOODON:03311513 banana (whole, ripe)
ID_00103 Grapes FOODON:03301123 grape (whole, raw)
ID_00104 Melon FOODON:03301593 melon (raw)
ID_00105 *Peaches, plums, apricots, nectarines FOODON:03301107 nectarine (whole, raw)
ID_00106 *Strawberries, raspberries, kiwi fruit FOODON:03305656 fruit (dried)
ID_00106 *Strawberries, raspberries, kiwi fruit FOODON:03414363 kiwi
ID_00106 *Strawberries, raspberries, kiwi fruit FOODON:00001057 plant fruit food product
ID_00107 Tinned fruit FOODON:03305656 fruit (dried)

Note that increasing the similarity value from 0.85 (the default) to 0.95 (a stricter, more accurate annotation) decreases the percentage of annotated terms from 89.57% to 86.5%. Let’s check the food items that were annotated with similarity = 0.85 but not with similarity = 0.95.

annotated_text$annotated %>%
  filter(!FOOD_ID %in% annotated_text_95$annotated$FOOD_ID) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID FOOD_NAME FOBI_ID FOBI_NAME
ID_00124 Beansprouts…130 FOODON:00002753 bean (whole)
ID_00127 Watercress FOODON:00002340 water food product
ID_00140 Beansprouts…171 FOODON:00002753 bean (whole)
ID_00143 Brocoli FOODON:03301713 broccoli floret (whole, raw)
ID_002 Beefburgers FOODON:00002737 beef hamburger (dish)

4.1.1 Network visualization of the annotated terms

The fobitools::fobi_graph() function can then be used to visualize the annotated food terms together with their corresponding FOBI relationships.

terms <- annotated_text$annotated %>%
  pull(FOBI_ID)

fobitools::fobi_graph(terms = terms,
                      get = NULL,
                      layout = "lgl",
                      labels = TRUE,
                      legend = TRUE,
                      labelsize = 6,
                      legendSize = 20)