---
title: "workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `heartbeatr` package is designed to streamline the processing of data generated with ElectricBlue's PULSE systems (https://electricblue.eu/pulse).  

```{r pulse, out.width = "250px", echo = FALSE, fig.align = "left"}
knitr::include_graphics("images/pulse.jpg")
```

This vignette explains the structure of the input data and demonstrates the `heartbeatr` workflow based on a simple and compliant dataset. It also provides important recommendations on how to effectively process large datasets.

```{r packages}
# library(tidyverse)
library(dplyr)
library(stringr)
library(tibble)
library(ggplot2)
library(heartbeatr)
```

# Input data

`heartbeatr` is ONLY designed to handle data generated by PULSE systems and from a single experiment. 
`heartbeatr` is unlikely to be able to process files generated with any other device.
The automated processing will also likely fail if the input data doesn't fully conform to certain specifications, namely:

- the formatting of each input file (see [PULSE files])
- the consistency of the parameters used to generate all input files (see [one experiment]).

## PULSE files

The term "*PULSE files*" refers to the `.CSV` files generated by PULSE systems.  
Such files follow a strict structure composed by a [header section] and a [data section].

```
Please note that simply opening a PULSE file using Microsoft's Excel and closing again can introduce modifications to the files (the date_time format will automatically be converted to the same format used by one's machine), potentially making it unreadable by heartbeatr.
If you want to check the data stored in a PULSE file, either use a basic text editor (such as BBEdit) or create a copy of the file before opening it.
```

### header section
The header follows a two-column `field,value` structure and includes all the PULSE system parameter options used when the file was generated. 
The number of fields may change in future PULSE system versions, but will always include:

- device
- rate_Hz
- utc
- time_zone_h
- daylight_saving_time
- local_time

```{r input_header}
# pulse example allows quick access to test files
fn   <- pulse_example()[1] 
head <- read.csv(fn, nrows = 50, col.names = c("field", "value"))
skip <- max(grep("----------------", head$field))
head[1:skip,]
```

A few notes about the header fields:

- `utc` will always be equal to `0` because PULSE systems always record data using UTC0, without exception; this is crucial as it ensures full protection from potential mistakes originating when daylight saving time changes or when aggregating data created by different PULSE systems that could be running different values for the parameters `time_zone_h` and ` daylight_saving_time`.
- the `local_time` field facilitates the quick conversion of the logged timestamps (always in UTC0) to local time; however, note that `local_time` is calculated based on the values provided by the user for `time_zone_h` and ` daylight_saving_time`, and therefore will be incorrect if those parameters aren't adjusted properly.

### data section

The data section is composed by a line with all logging channel ids, followed by data lines with a timestamp and the values read by the PULSE system for each of the channels.
Note that there's always a reading for each channel, even if not all were in fact used during data collection.
```{r input_data, results = "hide"}
read.csv(fn, skip = skip + 1)
```
```{r input_data_print, echo = FALSE}
read.csv(fn, skip = skip + 1, nrows = 5)
```

## One experiment

Crucially, in order to process the dataset, `heartbeatr` must join data from all the files it receives as input.
Therefore, all files must have been generated suing the same key PULSE system parameter options, to ensure full consistency across files.
This usually means that datasets should be processed one experiment at a time, even if data are to be merged later on.
Thus, within the context of `heartbeatr`, **one experiment** concerns data collected:

- by a single PULSE system
- using the same parameter options 
  - firmware version
  - device name
  - sampling frequency
  - recording time
  - sleep time
- with the same channel ids
- without time overlaps

In summary, even if there are four distinct treatments in a scientific experiment and different PULSE system is being used simultaneously to record data from each treatment, then data for each of the four treatments should initially be processed by `heartbeatr` independently, and only later be merged.  

<br>

<br>

# The `heartbeatr` workflow

If the dataset follows all the rules and structuring mentioned above, it can be processed as follows:

## import data into R with `pulse_read()`

`pulse_read()` takes paths to PULSE files, reads all data and binds into a single, continuous stream.  

```{r workflow_read1}
pulse_data <- pulse_read(
  paths = pulse_example(), 
  msg = FALSE
  )
```

For the purpose of this vignette, we will restrict the number of channels being analysed (this makes the code run faster and the visualizations less crowded).
```{r o}
channels <- paste0("c", formatC(1:10, width = 2, flag = "0"))
discard  <- c("c01", "c02", "c05", "c07", "c08", "c09", "c10")
pulse_data$data <- pulse_data$data[, c("time", setdiff(channels, discard))]

pulse_data$data
```

Note that over the short span of this example dataset (just `r {x = diff(range(pulse_data$data$time)); units(x) = "mins"; round(as.numeric(x), 1)}` minutes), the resulting dataframe already comprises `r nrow(pulse_data$data)` rows of raw data.

## split continuous data over time windows with `pulse_split()`

`pulse_split()` takes the output of `pulse_read()` and splits it into chunks. 

```{r workflow_split1}
pulse_data_split <- pulse_split(
  pulse_data = pulse_data, 
  window_width_secs = 30, 
  window_shift_secs = 60, 
  msg = FALSE
  )

pulse_data_split
```


In this case, the combination of `window_width_secs` and `window_shift_secs` chosen resulted in `r nrow(pulse_data_split)` time windows with raw data.

```{r workflow_split2}
pulse_data_split$data[[1]]
```

## interpolate and smooth data with `pulse_optimize()`

`pulse_optimize()` takes the output from `pulse_split()` and refines the signals in each time window in order to increase the likelihood that heart rates can be determined accurately. This typically involves applying linear interpolation (to ensure that each wave peak is defined by a sufficient number of points) and smoothing (to remove low-frequency noise) to the data.

```{r workflow_optimize1}
pulse_data_optimized <- pulse_optimize(
  pulse_data_split = pulse_data_split, 
  interpolation_freq = 40,
  bandwidth = 0.75,
  raw_v_smoothed = FALSE,
  multi = TRUE
  )
```

- before: *(`$smoothed = FALSE` and variable number of rows in each window)*

```{r workflow_optimize2}
pulse_data_split
```

- after: *(`$smoothed = TRUE` and same number of rows in each window)*

```{r workflow_optimize3}
pulse_data_optimized
```

```{r workflow_optimize4, echo = FALSE, fig.height = 1, fig.width = 6}
x <- bind_rows(
  pulse_data_split$data[[1]] |> tibble::add_column(opt = "raw") |> dplyr::select(t = time, opt, val = c04),
  pulse_data_optimized$data[[1]] |> tibble::add_column(opt = "opt") |> dplyr::select(t = time, opt, val = c04)
) |>
  dplyr::mutate(opt = factor(opt, levels = c("raw", "opt")))

ggplot(x) + 
  geom_line(aes(t, val, col = opt)) + 
  # facet_grid(rows = vars(opt)) + 
  # scale_color_manual(values = c("raw" = "black", "opt" = "red"), guide = "none") + 
  scale_color_manual(values = c("raw" = "grey20", "opt" = "red"), name = NULL) + 
  theme_void()
```

## compute heart rates with `pulse_heart()`

`pulse_heart()` takes the output of `pulse_optimize()` and estimates a single value of heart rate frequency for each channel/time window combination. 

```{r workflow_heart1}
heart_rates <- pulse_heart(
  pulse_data_split = pulse_data_optimized,
  msg = FALSE
  )

heart_rates
```

- `$hz` holds the estimates of heart rate
- `$n`  is the number of wave peaks identified and used to compute `hz`
- `$sd` is the standard deviation of the lengths of all intervals between consecutive wave peaks

## detect heart rate doublings with `pulse_doublecheck()`

Under certain conditions, the algorithm used in `pulse_heart()` may identify heartbeat wave signals that feature two peaks per heartbeat as two individual heartbeats. This has the adverse effect of leading to estimates of heart rates that are twice as high as the real value. To minimize this, the output of `pulse_heart()` can be run through `pulse_doublecheck()`, which will identify and correct the majority of those instances. 

```{r workflow_double1}
heart_rates <- pulse_doublecheck(
  heart_rates = heart_rates
  )

heart_rates
```

Two columns (`d_r` and `d_f`) were added to `heart_rates`, providing the necessary information to screen and handle heart rate doublings.  

```{r workflow_double3, echo = FALSE, fig.height = 1, fig.width = 6}
heart_rates_double <- PULSE(
  pulse_example(), window_width_secs = 30, window_shift_secs = 60,
  bandwidth = 0.55, discard_channels = discard, raw_v_smoothed = FALSE, correct = FALSE)
heart_rates_corr <- PULSE(
  pulse_example(), window_width_secs = 30, window_shift_secs = 60, 
  bandwidth = 0.75, discard_channels = discard, raw_v_smoothed = FALSE, correct = TRUE)

x <- bind_rows(
  heart_rates_double |> 
  dplyr::filter(id == "c03", i == 9) |> 
  dplyr::pull(data) |> 
  dplyr::first() |> 
  tibble::add_column(opt = "double peaks"),
  heart_rates_corr |> 
  dplyr::filter(id == "c03", i == 9) |> 
  dplyr::pull(data) |> 
  dplyr::first() |> 
  tibble::add_column(opt = "corrected")
) |> 
  dplyr::mutate(opt = factor(opt, levels = c("double peaks", "corrected")))

pulse_plot_one(x) +
  ggplot2::facet_grid(rows = vars(opt)) +
  ggplot2::theme_void()
```

Note that by default `pulse_doublecheck()` runs with `correct = TRUE `. When that is the case, values for `hz` in the output of `pulse_doublecheck()` have already been halved whenever heart rate doublings have been identified.  

## select which heart rate estimates to keep with `pulse_choose_keep()`

The final step involves matching key quality indicators against user-provided thresholds to assess whether or not to keep each heart rate estimate.  
This is achieved by running the output of `pulse_doublecheck()` through `pulse_choose_keep()`, resulting in the addition of the column `$keep`, which can be used to filter out poor estimates. 

```{r workflow_keep1}
heart_rates <- pulse_choose_keep(
  heart_rates = heart_rates
  )

heart_rates
```

Note that `r sum(!heart_rates$keep)` entries were classified as not meeting the selection criteria.  
If we filter those out, we're left with `r sum(heart_rates$keep)` entries that can be used for subsequent analyses.

```{r workflow_keep2}
dplyr::filter(heart_rates, keep)
```

<br>

<br>

# The simple version

Users can also execute the entire `heartbeatr` workflow with a single call to the wrapper function `PULSE()`.

```{r simple1}
# the code below takes approximately 30 seconds   
#  to run on a machine with the following specs:
# - MacBook Air 2022
# - chip:   Apple M2
# - memory: 16 GB RAM
# - macOS:  Sequoia 15.3.1 
heart_rates_PULSE <- PULSE(
  paths = pulse_example(),
  window_width_secs = 30,
  window_shift_secs = 60,
  bandwidth         = 0.75,
  raw_v_smoothed    = FALSE,
  discard_channels  = discard
  )

heart_rates_PULSE
```

Note that the final output is exactly the same and that there are no performance advantages from using a single call to `PULSE()` compared to using the step-by-step workflow. 

```{r simple2}
identical(heart_rates_PULSE, heart_rates)
```

Despite its convenience, there are still moments when the step-by-step may come handy:  

- when there's a need to access the output of a particular function (e.g., when debugging an error)  
- during initial runs aimed at fine tuning the PULSE function parameters  

<br>

<br>

# Visualizing

The most basic visualization of the output of the `heartbeatr` workflow is a plot heart rate over time.  
While not meant to provide the highest quality graphics, the `heartbeatr` package includes two main plotting functions to ensure that users can quickly inspect the results.  

## check all data using `pulse_plot()`

Inspecting the entire processed dataset can be done with a simple call.

```{r plot_all1, fig.height = 4, fig.width = 6}
pulse_plot(heart_rates)
```

It is usually important to filter out estimates that didn't meet the quality thresholds.

```{r plot_all1b, fig.height = 4, fig.width = 6}
pulse_plot(dplyr::filter(heart_rates, keep))
```

Function parameters and general `ggplot2` syntax can be used to further customize the output of `pulse_plot()`.  

```{r plot_all2, fig.height = 3, fig.width = 6}
pulse_plot(
  dplyr::filter(heart_rates, keep), 
  facets = FALSE, 
  bpm    = TRUE, 
  smooth = FALSE, 
  points = FALSE
  )
```

```{r plot_all3, fig.height = 4, fig.width = 6}
pulse_plot(filter(heart_rates, keep)) +
  ggplot2::geom_vline(ggplot2::aes(xintercept = mean(heart_rates$time))) +
  ggplot2::theme_bw() +
  ggplot2::ggtitle("example")
```

It is also possible to focus on a single `id`.

```{r plot_all4, fig.height = 3, fig.width = 6}
pulse_plot(dplyr::filter(heart_rates, keep), ID = "c03")
```


## inspect raw data with `pulse_plot_raw()`

Often, it is crucial to check more closely the data underlying a particular heart rate estimate.  

There are to ways to plot a specific channel + time window:

- use a value pointing to an index `$i`
- provide a timestamp

If we want to check channel "c03" at around 2024-10-01 10:58...  

```{r plot_raw1}
dplyr::filter(heart_rates, id == "c03")
```

... we can use `target_i`  

```{r plot_raw2, fig.height = 3, fig.width = 6}
pulse_plot_raw(heart_rates, ID = "c03", target_i = 7)
```

... or we can use `target_time`. 

```{r plot_raw3, fig.height = 3, fig.width = 6}
pulse_plot_raw(heart_rates, ID = "c03", target_time = "2024-10-01 10:58")
```

Note that both approaches yield the same result. Also, when using `target_time`, there's no need to provide a timestamp accurate to the second. Instead, `pulse_plot_raw()` will return the time window that is closest to the timestamp provided.  

It is also possible to use the `range` parameter to check the raw data for a group of estimates.  

```{r plot_raw4, fig.height = 6, fig.width = 6}
pulse_plot_raw(heart_rates, ID = "c03", target_time = "2024-10-01 10:58", range = 2)
```

<br>

<br>

# Optimizing parameters (`bandwidth`)

The `heartbeatr` is a generic workflow that is not optimized for any organism in particular. This makes it suitable to process datasets derived from distinct species with different basal heart rates and other circulatory system characteristics, all of which could negatively impact the performance of a highly specific algorithm. It also ensures that a wide range of experimental conditions can be accommodated. 

However, this also means that users should exercise caution when selecting key function parameters, chiefly among them the smoothing parameter `bandwidth`, as that can have important implications on the performance of the data processing. Other parameters such as `window_width_secs`, `window_shift_secs`, `min_data_points`, and `interpolation_freq` are also important, however in their case, reading through each function's manual should be sufficient to make good guessess.

Regarding the trickier `bandwidth` parameter, it establishes the radius used for the operation of the smoothing function (`ksmooth`, from the `stats` package). Higher values remove more noise from the signal, improving the ability of the algorithm to select "true" peaks. However, values too high will end up masking out real wave crests, eventually leading to a degradation of performance. 

Because of inter- and intra-specific variability in heart rates, variability in the placement of sensors, and the co-variation of these factors with the rate of the heart being recorded (more stressful treatments lead to higher heart rates, where each heartbeat wave encompasses a narrower span of time, a situation that often amplifies limitations of the algorithm), it is not possible to provide users with definitive advice of how to set the `bandwidth` parameter.

Furthermore, the `bandwidth` of a Kernel smoother is a free parameter, meaning that it cannot, by definition, be predicted precisely or constrained *a priori*, and must always be estimated experimentally for each dataset.

Therefore, the recommended approach is the following:

- Use the `subset` and `subset_seed` parameters of the `PULSE` function to process a subset of your data (so that the processing runs quicker) using multiple combinations of the  parameters you want to test (in the example below, `bandwidth` is set to `0.2` and `1.5`). 
  - It is advised to set `subset_seed` to a fixed numeric value (e.g., `1`) so that the subsets generated always contain the same data points, which will facilitate the comparison of performances. 
  - It is also advised to set `raw_v_smoothed` to `FALSE` so that only the smoothed version of the raw data is used.
```{r bw1}
x_02 <- PULSE(
  paths = pulse_example(), bandwidth = 0.2,
  raw_v_smoothed = FALSE, discard_channels = paste0("c0", 1:9), 
  subset = 10, subset_seed = 1, subset_reindex = TRUE, show_progress = FALSE)
x_15 <- PULSE(
  paths = pulse_example(), bandwidth = 1.5,
  raw_v_smoothed = FALSE, discard_channels = paste0("c0", 1:9), 
  subset = 10, subset_seed = 1, subset_reindex = TRUE, show_progress = FALSE)
```

- Inspect the performance of the analyses, paying special attention to how correctly heart beats have been identified.
```{r bw2, fig.height = 2, fig.width = 6}
# when bandwith = 0.2, too many peaks are identified
pulse_plot_raw(x_02, ID = "c10", target_i = 1)
# when bandwith = 0.5, only one peak is identified in each heartbeat, as it should
pulse_plot_raw(x_15, ID = "c10", target_i = 1)
```

- Identify the combination of parameters that best suits your data
- Run the analysis normally (`subset` set to `0`) to process the entire dataset
- Review the resulting estimates 

<br>

<br>

# Warning against HRV analyses

HRV (Heart Rate Variability) analyses focus on the variability of the intervals between consecutive heartbeats and are extremely important for assessing cardiac performance and heart conditions. There are numerous studies focusing on this metric, especially those addressing human health.

The authors of this package want to make clear that the SD provided by the `heartbeatr` workflow - in effect, a measure of heart rate variability - **SHOULD NOT** be used for HRV analyses on data collected from shelled invertebrates.

This recommendation is justified because, while photoplethysmographic data is often used for HRV analyses, that is typically the case when such data is collected in a controlled manner, such as from a person’s finger. In such cases, the quality of the raw signal tends to be sufficiently dependable, and therefore, analyses can go beyond the mere study of the heart rate frequency to also focus on variations of the timing of the intervals between consecutive heartbeats over short periods of time (a.k.a., HRV). 

However, while we recognize the research value of such analyses, we cannot recommend a similar approach based on data collected with the PULSE system from shelled invertebrates. The reason behind this is that many factors often impact the quality of the signal captured, including the placement of the sensor on the shell, which may be “looking” at different components of the circulatory system, the impact of movements of the animal’s body unrelated to cardiac performance, and many other quality-degrading factors. 

As a result, we believe that making assertions about an organism’s physiological state based on the variability of the timing between consecutive peaks is beyond the capabilities of the PULSE-heartbeatr workflow. That is why we **only recommend** the use of the SD of the intervals between peaks during each split window for assisting in **determining the quality of the computed heart rate estimates** and filtering out low-quality data points – a task that is far less consequential and that therefore can accept a greater deal of uncertainty. 

On top of that, even the quantification of the accuracy of the timing of individual heartbeats isn’t straightforward, as it would necessitate a reference dataset for which the timing of the heartbeats was determinatively known – something that very rarely would be available to researchers, as it would imply the acquisition of a simultaneous ECG, which is invasive and beyond most researchers' technical capabilities (and likely not feasible even with today’s technology, given that the implanted ECG electrodes would need to remain in place for the entire duration of the experiment).

<br>

<br>

# Post-processing

Before moving onto the proper analysis of the results, it is often important to perform two final tweeks to the output of the `heartbeatr` workflow.

## Normalizing

Even individuals from the same species have different basal heart rates. This means that it is often better to compare the rate of change of heart rate instead of the absolute values themselves. To do that, the heart rate estimates must first be normalized using a stable reference period, usually during acclimation.

```{r plot_norm1}
heart_rates_norm <- pulse_normalize(
  dplyr::filter(heart_rates, keep), 
  t0 = "2024-10-01 10:52", 
  span_mins = 5
  )

heart_rates_norm
```

```{r plot_norm2, fig.height = 3, fig.width = 6}
pulse_plot(heart_rates_norm, normalized = FALSE, facets = FALSE)
pulse_plot(heart_rates_norm, normalized = TRUE,  facets = FALSE)
```

Note how all channels begin with the same normalized heart rate (around 1) and the different trajectories become evident.  

## Summarising

Lastly, after filtering out the entries that didn't meet the quality thresholds, we were left with several gaps in the final dataset. One way to handle this is by binning data over wider time windows.

```{r plot_sum1}
heart_rates_sum <- pulse_summarise(
  heart_rates_norm, 
  FUN = mean, 
  span_mins = 5, 
  min_data_points = 0)

heart_rates_sum
```

```{r plot_sum2, fig.height = 3, fig.width = 6}
pulse_plot(heart_rates_sum, normalized = FALSE, facets = FALSE)
```

Now, there are estimates at regular intervals for all channels.